Big Data in Biology: Personal Genomes
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Personal Genomes #
Tuesday, March 25th, 2014 8:30am - 12:00pm
http://ks.eventmobi.com/14f2/agenda/35704/288359
Speaker list #
Lincoln D. Stein, Ontario Institute for Cancer Research, Canada
The International Cancer Genome Consortium Database -
[Abstract]
[Q&A]
Ajay Royyuru, IBM T.J. Watson Research Center, USA
Genome Analytics with IBM Watson -
[Abstract]
[Q&A]
Mark Gerstein, Yale University, USA
Human Genome Analysis -
[Abstract]
[Q&A]
[slides]
Stuart Young, Annai Systems Inc., USA
The BioCompute Farm: Colocated Compute for Cancer Genomics -
[Abstract]
[Q&A]
Adam Butler, Wellcome Trust Sanger Institute, UK
Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets -
[Abstract]
[Q&A]
Maya M. Kasowski, Yale University, USA
Short Talk: Extensive Variation in Chromatin States Across Humans -
[Abstract]
[Q&A]
Robert L. Grossman, University of Chicago, USA
Short Talk: An Overview of the Bionimbus Protected Data Cloud -
[Abstract]
[Q&A]
The International Cancer Genome Consortium Database #
Lincoln D. Stein, Ontario Institute for Cancer Research, Canada #
Abstract #
The International Cancer Genome Consortium (ICGC; http://www.icgc.org/) is a multinational effort to identify patterns of germline and somatic genomic variation in the major cancer types. Currently consisting of 71 cancer-specific projects spanning 18 different countries, ICGC has sequenced the tumor and normal genomes of over 10,000 donors (>20,000 genomes). When the current phase of the project is completed in 2018, we expect to have sequenced more than 25,000 donors.
All analyzed data from the project is available to the public, including clinical information about the donors, somatic mutations identified in the tumors, and the potential functional significance of these mutations. The raw sequencing data and other potentially-identifiable information is available to researchers who have signed an agreement promising not to attempt to identify the donors. The total data set is now 500 terabytes in size, but growing rapidly as the project switches from exome sequencing (sequencing just the transcribed regions of the genome) to whole-genome sequencing. We anticipate that the full data set will be on the order of 10 petabytes.
To maximize the utility of the data to the public, the analyzed data is available at the ICGC data portal (http://dcc.icgc.org/), where users can browse donors, mutations and genes using an attractive high-performance web application based on Elastic Search at the backend and AngularJS and D3.js on the front end. The portal uses faceted search as its dominant user interface metaphor. This allows researchers to pose general queries, such as “find all non-synonymous mutations”, and then successively refine them: “…affecting genes in the hedgehog pathway”, “…affecting donors with stage I disease.” A series of interactive graphics allows researchers to readily compare different sets of mutations, donors and genes.
A limitation of ICGC is that the raw sequencing data must still be downloaded from a static file repository. We are addressing this limitation by moving the data into the compute cloud, where software and data can be co-resident. In the Whole Genome Pan-Cancer Analysis Project, which began earlier this year, 2000 whole genome pairs from ICGC are being placed into several compute cloud analysis facilities to allow for uniform mutation-calling and data mining by ICGC researchers. In the “Cancer Genome Collaboratory”, a project just approved in March 2014, we will be placing the entire ICGC data set into two compute cloud centers for access by the general research community. I will talk about the challenges and solutions that we are working on in connection to these two projects.
Notes #
ICGC Project
- International Cancer Genome Sequencing Consortium
- 5th year of operation
- multi-national collaboration
- Includes all of the TCGA projects
- Goal: Identify the common patterns of mutation in all major cancer types
Simple experimental design:
- take normal (blood) and tumour (biopsy) samples from a series of donors
- sequence
- identify cancer-related mutations
- relate mutations to tumor bio
- translate this knowledge to improved diagnosis and treatment & make available
ICGC db growing in size - moved from exome sequencing to whole genome
- 10K+ donors
- 4M+ somatic mutations
- 49K CNVs
- 6K+ methylation profiles
Available to public - Website @ http://dcc.icgc.org
- very nice data browser
- faceted view of various data types and donor types
- changes in a context sensitive way
- updates list with dynamically updated graphs/summary
- links to raw data @ CGHub
- view most mutated genes in selected cancer subtype. Can keep drilling down through stats/projects. Or look at summary - transcript level / protein level.
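For a concrete (if simplified) picture of what a faceted query against an Elasticsearch backend looks like, here is a minimal Python sketch. The index name, field names and endpoint are my own placeholders, not the actual DCC schema:

```python
# A minimal sketch of the kind of faceted query a portal like the DCC could send
# to its Elasticsearch backend. Index name, field names, and the endpoint are
# assumptions for illustration only.
import json
import requests

ES_URL = "http://localhost:9200/mutations/_search"  # hypothetical index

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"consequence_type": "non_synonymous"}},  # "find all non-synonymous mutations"
                {"term": {"pathway": "hedgehog"}},                  # "...affecting genes in the hedgehog pathway"
                {"term": {"donor.stage": "I"}},                     # "...affecting donors with stage I disease"
            ]
        }
    },
    # Facet counts ("aggregations") drive the context-sensitive summaries in the UI
    "aggs": {
        "by_project": {"terms": {"field": "project_code"}},
        "by_gene": {"terms": {"field": "gene_symbol"}},
    },
    "size": 10,
}

resp = requests.post(ES_URL, data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
print(resp.json()["aggregations"]["by_gene"]["buckets"][:5])
```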
Original Database - based on BioMart
- MySQL-based data mart - developed and used by the Ensembl project
- de-normalized data schema (reverse-star schema)
- scaled well for human and other vertebrate genomes
- worked well until release 12
- One problem: as the data got larger, BioMart didn’t scale
- Release 8 & 9: three month release cycle (freeze, prep, load, QC)
- by release 11 - load phase taking 2-3 months! Missing the release window; were announcing a new freeze before the new db was released
September - complete rewrite of the entire DCC (Ferretti). Heavy use of distributed computing.
Process:
- genome centres submit flat files + meta
- validation (Hadoop cluster - HDFS distributed filesystem)
- loaded into MongoDB (on cluster)
- Combined w/ other info (gene annotation from Ensembl, uniprot, cosmic, etc)
- Indexed by ElasticSearch (another cluster)
- Indexed info stored in mongo - drives the portal
- Total time for loading for release 15: 42 hours (not yet optimized)
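A toy, single-machine sketch of the submit → validate → load → index flow described above. The real system runs on Hadoop, MongoDB and Elasticsearch clusters; the file layout and field names below are invented:

```python
# Simplified stand-in for the DCC loading pipeline: validate flat-file
# submissions, load valid rows into MongoDB, then index them into Elasticsearch
# so a faceted portal can query them. All names/fields are illustrative.
import csv
import json
import requests
from pymongo import MongoClient

REQUIRED = {"donor_id", "chromosome", "start", "mutation"}

def validate(path):
    """Yield rows that pass a trivial schema check (stand-in for the Hadoop QC step)."""
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if REQUIRED.issubset(row) and all(row[k] for k in REQUIRED):
                yield row

client = MongoClient("localhost", 27017)
coll = client["dcc"]["somatic_mutations"]

rows = list(validate("submission.tsv"))
if rows:
    coll.insert_many(rows)                      # load into MongoDB

    # index into Elasticsearch so the portal's faceted search can see it
    for row in rows:
        requests.post("http://localhost:9200/mutations/_doc",
                      data=json.dumps({k: row[k] for k in REQUIRED}),
                      headers={"Content-Type": "application/json"})
```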
What about raw read data?
- ~10 PB Genome data by 2018
- depositing all genome data in EGA. In theory, researchers go to EGA and download the data. In practice, the data is too large. Takes too long.
- will soon be completely inaccessible - except maybe for some large groups, or those located in the UK
- This is an important legacy dataset that can still be mined
- Current mutation-calling algorithms not perfect. Different groups have low overlap. Different filtering systems. Many false positives (e.g. titin). Our ability to predict gene rearrangements is quite poor.
- want to go back to the data to get more info as our algorithms improve
The solution => The Pan-Cancer Whole Genome Analysis Project (PAWG/Pan-Can) #
- Goal: understand what’s going on in the 95% of the cancer genome that isn’t protein-coding
- Resources: 2K whole genome tumor/normal pairs from ICGC
- Analytic issues: calling cancer mutations in non-coding regions is an evolving art. Need uniform pipeline. Dataset - 0.5PB.
- Cloud based approach - six cloud compute centres in USA, Europe, Asia
- Phase 1: Partition data among the data centres. Perform alignment and mutation calling in a distributed fashion
- Phase 2: Synchronize alignments and mutation calls. Each centre will have the complete set of alignments and mutation calls
- Phase 3: Open up a (subset) of the clouds to allow researchers to do analysis
Technologies: OpenStack (5 centers) and vCloud (EBI)
- Vagrant - vm abstraction layer (make clouds look similar)
- network transfer and metadata - GNOS / GeneTorrent (from Annai Systems Inc) - commercial solution
- Workflow management - SeqWare pipeline manager (developed by OICR & UNC - O'Connor); Synapse from Sage
Status
- Ethical approval, usage agreements signed - Legal
- OpenStack/VMware, vagrant SeqWare installed
- alignment workflows executed on some vms
Challenges
- Legal - regional differences have not gone away. Datasets from TCGA (US) can be hosted by certain US-based institutions trusted by the NIH. The NIH has not approved phase II of the project due to the way the consent was written: it can be interpreted as ‘not allowed to use on the cloud’ (but the cloud didn’t exist when the consent was written). Europe - some countries are sensitive about distributing their data to US-based data centres (Snowden & NSA).
- Technical - adapting grid based hpcs to use cloud-based technologies. Running 8 weeks behind
Why not a commercial cloud? Amazon, Google, MS
- legal and ethical issues
- preliminary ethics approval to ICGC. Some restrictions - can’t cross regulatory borders without notice
- NIH reviewing approval for TCGA sets
What happens when Pan-Can is done in ~1 year? The group has received funding from Canadian funders: The Cancer Genome Collaboratory
- long-lived private cloud compute centre, pre-populated with ICGC datasets
- any individual can create an account and access the data via api
- have an integrated benchmarking core, bioethics, community outreach
- Initially two physical data centres (w/ Grossman in Chicago) & Toronto. Connected by high speed link
- Funded as of March 1
Q & A #
Q: (Ware) Many of us have been using BioMart and the scalability - how portable is your new system as a replacement for BioMart?
A: on a scale of 0 - 100: -1. This is a highly specialized system designed just to work with our data. Biomart is alive and well in Italy
Q: What cancer types were chosen for the pan-cancer analysis? And why?
A: Our criteria for inclusion is at least 30x coverage for whole genome, tumor normal pair, proper consent from donor.
Of that, we have ovarian, breast, lung, pancreatic, liver, leukemias – about 13 in all
The final list of tumor types won’t be selected till we’ve QC'ed all the data and know what the distribution is
Q: The 10 PB of data that will be generated is going to be hard to handle - have you looked at quality compression and other approaches?
A: No chance that we’ll be storing and distributing the full uncompressed 10 PB. Actively benchmarking compression systems. Hopefully get it down to a few PB without loss of information
Q: What is the main objective of this project? Biological objective?
A: The main biological objective - focusing on patterns of alteration in non-coding regions. E.g. we know there are mutations in regulatory regions that we haven’t characterized.
groups looking at:
- Looking at regulatory networks - interactions with coding regions.
- Patterns of rearrangement
- Evidence of insertion of known and unknown pathogens / virus that may be driving the tumours
Looking at this in a uniform way we’ll learn common mechanisms and mechanisms that are distinct
Q: How willing are your users to get random samples in return as opposed to the full data? Plus confidence score
A: Key method of access - take slices of the raw data in the region that you’re interested in. Or extend and do random sampling - a feature available on CGHub and widely used. Not a feature of EGA - annoying deficit. One of the reasons we want to move away.
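For readers unfamiliar with "BAM slicing", here is a minimal pysam sketch of the access pattern being described: fetch only the reads overlapping a region of interest rather than downloading the whole genome's BAM. The file name and coordinates are placeholders:

```python
# Region-based BAM slicing: pull reads overlapping one locus instead of the
# whole file. Requires an indexed BAM (.bai); names/coordinates are illustrative.
import pysam

bam = pysam.AlignmentFile("tumour.bam", "rb")
for read in bam.fetch("chr17", 7_571_720, 7_590_868):   # e.g. the TP53 locus (GRCh37)
    if not read.is_duplicate and read.mapping_quality >= 30:
        print(read.query_name, read.reference_start, read.cigarstring)
bam.close()
```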
Q: Majority of researchers - don’t need to develop alignment algorithms. Are processed data available to researchers?
A: The interpreted data (still large, but much smaller - in GB not TB) is available for browsing, download and abstraction from http://dcc.icgc.org
Q: Curious how you are designing your APIs? APIs for visualization are different from tools
A: Start with the user interface, figure out what it needs to display, and work back to the API. A genome browser has a very different api than the faceted browser where you’re looking at a particular biological pathway. Specialized APIs and indexes for each of those.
Genome Analytics with IBM Watson (originally scheduled: The Genographic Project) #
Ajay Royyuru, IBM T.J. Watson Research Center, USA
Director of computational biology
Abstract #
// last minute topic change, no published abstract
Press release: http://www-03.ibm.com/press/us/en/pressrelease/43444.wss
Notes #
Research group at IBM - very focused on computational biology.
Intersection of everything IT and Life Sciences.
3 pillars of work (IBM computational biology)
- managing and analyzing the data explosion - makes biology more amenable to quantitative outcomes
- predicting biological outcomes with scale of computing
- dealing with complexity. DREAM - the IBM team is heavily involved with the community
Why:
- Intrigued by connections made yesterday (DH, JM)
- Sequencing is reaching a point where we have to look at the translational aspects
- beginning to make an impact in the clinic
- takes a community
- IBM Watson - can be used here
- On IBM’s cloud system - rapidly scale. The sorts of analytics capabilities - it begins to be scalable and accessible so it can have the impact on the clinic down the road
What are we up to: Gathering raw sequencing input, through a large number of steps, so that we will eventually get useful info that may lead to action
3 pillars in the journey of genomic medicine
- sequencing (includes downstream analysis - variant calling)
- translational medicine (have VCF) <– will focus on this piece (VCF to actionable)
- Actionable intelligence - Personalized healthcare. Something publishable is our goal
Translational Medicine: #
System that generates insights
Input:
- data coming from sequencing (VCF) - patient specific information
- Entirety of what you can point Watson to - All available biological knowledge (PubMed, NCI PDQ)
All this is ingested. Running on IBM’s cloud layer (SoftLayer) - large/global/scalable/acquired by IBM.
Generates some actionable insights.
Goal: this goes to tumor oncologists, look at data in context of decision trying to make. Hopefully make informed correct decision.
IBM Watson #
- began 2008 - research project
- Jeopardy - grand challenge (got attention)
- Added genomics capabilities!
Genomics - not just about genes. How we connect that knowledge #
The traditional way: read papers, develop hypotheses -> interpretation -> actionable output. Can we automate this? Can we come up with new research approaches from the literature?
p53 project example - ingest a lot - mine the literature. #
- lots of text, natural language, analytics happening
- specific to diseases, compounds (drug molecules)
- Human-readable sentences - use Watson-based technology to translate the information into machine-readable form. ‘The results show that EPK2 phosphorylated p53 at Thr55’ - extract that info with Watson
- Extraction is working
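To make the human-readable to machine-readable step concrete, here is a toy relation-extraction sketch. Real Watson NLP is vastly more sophisticated; the regex below is purely illustrative:

```python
# Toy relation extraction: turn a sentence into an (enzyme, action, substrate,
# site) tuple. A single regex stands in for a full NLP pipeline.
import re

PATTERN = re.compile(
    r"(?P<enzyme>\w+)\s+phosphorylat\w*\s+(?P<substrate>[\w-]+)\s+at\s+(?P<site>\w+)",
    re.IGNORECASE,
)

sentence = "The results show that EPK2 phosphorylated p53 at Thr55."
match = PATTERN.search(sentence)
if match:
    relation = (match.group("enzyme"), "phosphorylates",
                match.group("substrate"), match.group("site"))
    print(relation)   # ('EPK2', 'phosphorylates', 'p53', 'Thr55')
```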
Application to genomics:
on SoftLayer, a physician managing cases (biopsy samples) does a submission - uploading a VCF.
What analysis can be done -
- Circos representation - where the variants occur, what they translate to
- map to available info on pathways
- what more can you find in the literature, Watson? - adds links (to literature) from text mining. Can drill down and find out why links were generated
- Drugs targeting pathways: added in the data model
Summary: researcher can browse. print report for the record.
- see provenance of the data and keep a record of it
- see all visualizations, records, summary
- possible list of all possible drugs, status (approved?)
- this insight is available to the researcher
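A minimal sketch of the VCF-to-insight step being described, with an invented gene-to-drug table standing in for the mined knowledge base. The file name and the assumption that the INFO field carries a GENE= annotation are mine, not part of the talk:

```python
# Read variants from a VCF and look each affected gene up in a small knowledge
# table; a hand-made dict stands in for literature-mined drug associations.
GENE_TO_DRUGS = {            # hypothetical stand-in for mined knowledge
    "EGFR": ["erlotinib", "gefitinib"],
    "BRAF": ["vemurafenib"],
}

def variants(path):
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue                       # skip VCF header lines
            chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
            gene = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv).get("GENE")
            yield chrom, pos, ref, alt, gene

for chrom, pos, ref, alt, gene in variants("patient.vcf"):
    for drug in GENE_TO_DRUGS.get(gene, []):
        print(f"{chrom}:{pos} {ref}>{alt} in {gene} -> consider literature on {drug}")
```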
Looking for active collaborations - IBM doesn’t generate this data themselves
- last week: partnership with NY genome centre (collaboration of research centres in NY area). Can take this technology and apply it with them. Get practical use of this technology
- Not exclusive to NY genome, can open collaborations with others
Sample report- generated with early data
- TCGA GBM data - reshaped to put in system
- generated report (many pages long)
- list of drugs with reasons why the drug is contextually relevant
e.g. Lidocaine in report: not prepared to see this in here
- showed to oncologists - click through to evidence. Watson points to papers - Lidocaine assay on cancer cells (tongue, EGFR receptor). Lidocaine being tested in context of thyroid cancer cells
- so this is not out of the realm of what we should be thinking about
- helps us be current and comprehensive
Q & A #
Q: (Ouellette) Do you have any evidence on how Watson will do if it read full papers (not just abstracts)?
A: Not tested in this context. Watson does read full papers in a clinical context
Q: (Mesirov) -
1. Are you aiming with that package towards the practicing oncologists or the research physician?
2. To what extent have you compared what Watson is able to mine from the data with other approaches/algorithms/packages published and available to the community?
A:
- It’s a journey - early adopters, research clinicians who have the expertise and interest to be partners. A lot of learning. For example, Watson shows lots of evidence. You need a clinician researcher who understands the subtleties of the research and how to make decisions that will be useful
- Not whole scale comparison yet - still in ingest and build mode. Some benchmarking and testing - working on the baseline. Full scale comparison for later. Watson can also do chemical extraction - full scale comparison here.
Q:
- Is there any way to integrate other sources of information not text based? Images? Protein structures?
- human value added in human curation databases?
A:
- Image analytics is an interest to us. Study going on here. Working with some large medical institutions on this project.
- Melding between machine and human curation -> this accelerates the process. Makes it more usable.
Q: Doubts whether a practicing physician will know what a VCF is, or understand a Circos plot? Bring Watson to the user or the user to Watson?
A: Initial set of end users - clinician researchers. They got the sample, they know what a VCF is. This is the community that will find this useful. What can we simplify to make this more usable?
Right now, collaboration.
Human Genome Analysis #
Mark Gerstein, Yale University, USA #
Director: computational biology
ENCODE, 1000 genomes
Abstract #
Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in non-coding regions, is a central challenge for genomics.
One data science construct that is particularly useful for genome interpretation is networks. My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression.
Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).
http://networks.gersteinlab.org
http://tyna.gersteinlab.org
References:
- Architecture of the human regulatory network derived from ENCODE data. Gerstein et al. (2012). Nature 489: 91.
- Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. KY Yip et al. (2012). Genome Biol 13: R48.
- Understanding transcriptional regulation by integrative analysis of transcription factor binding data. C Cheng et al. (2012). Genome Res 22: 1658-67.
- The GENCODE pseudogene resource. B Pei et al. (2012). Genome Biol 13: R51.
- Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks. KK Yan et al. (2010). Proc Natl Acad Sci U S A 107: 9186-91.
Slides #
Notes #
My perspective on Big Data #
- buzz word, data science
- HBR - data science the sexiest job of the 21st century (http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1)
- transforming science
- explosion of data in genomics - sequencing price going down faster than Moore’s law. Cost is in management of data
- Current state of large sequencing datasets: TCGA is 910 TB in CGHub, plus smaller datasets
What do people do with big data? #
Take this data to answer a question, make a prediction, modelling
Two ways to approach:
- don’t care about structure, just want answer (google search)
- with explicit organization of dataset (google maps, google earth)
In science - search for Higgs boson - searching through many for a few needles (fits in #1)
In genomics - we’re in #2
- we want to make a map of the molecular world we have
- but we don’t have an immediate metaphor we can hang all our information on
- but we don’t know what the structure of that map is
- ENCODE - thought about the structure of the map. Layer information down
- genomics has been around for a while - one of the first big data disciplines. Inspired by Pandora - the Music Genome Project, which was itself inspired by how geneticists organize information. We should learn from other disciplines
How we can organize information in genomics - networks #
- regulatory networks as a hierarchy
- more connectivity - constraint
What is genome annotation? #
Tracks in genome browser - linear view of how to think of genome.
How will this scale to thousands of tracks? It won’t
What type of information do we want? Actually thinking of 3D molecules - but not quite possible
Network diagram - middle ground
- works for cancers/biology pathways
- compelling approach to big data
- Example: we started off with linear annotation (ChIP-Seq experiments)
- Then, created proximal edge at peaks.
Generated a hairball of 0.5 million edges, pared down to 25K edges.
Many edges far away from genes - distal sites.
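A toy sketch of how linear ChIP-seq annotation becomes network edges: a TF gets a proximal edge to a gene when one of its peaks falls within some window of the gene's TSS. The coordinates and the 2.5 kb window are invented for illustration:

```python
# Build proximal TF -> gene edges from peak and TSS coordinates (all invented).
import networkx as nx

WINDOW = 2_500  # bp around the TSS counted as "proximal" (assumed cutoff)

peaks = {                       # TF -> list of (chrom, peak midpoint)
    "MYC":  [("chr1", 1_000_200), ("chr2", 5_000_000)],
    "CTCF": [("chr1", 1_002_000)],
}
tss = {"GENE_A": ("chr1", 1_001_000), "GENE_B": ("chr2", 9_000_000)}

net = nx.DiGraph()
for tf, tf_peaks in peaks.items():
    for chrom, mid in tf_peaks:
        for gene, (g_chrom, g_tss) in tss.items():
            if chrom == g_chrom and abs(mid - g_tss) <= WINDOW:
                net.add_edge(tf, gene)          # proximal regulatory edge

print(list(net.edges()))   # [('MYC', 'GENE_A'), ('CTCF', 'GENE_A')]
```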
analyze networks - network science
- Hub - point with many neighbours
- bottleneck - node with the max # of shortest paths passing through it
- Identify bottlenecks & hubs (like roads, bridges can be bottlenecks)
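In code, hub-ness and bottleneck-ness on a toy directed network (degree for hubs, betweenness centrality for bottlenecks; the edges are invented):

```python
# Hubs vs. bottlenecks on a toy directed network with networkx.
import networkx as nx

g = nx.DiGraph([("TF1", "TF2"), ("TF2", "geneA"), ("TF2", "geneB"), ("TF3", "TF2")])

hubs = sorted(g.degree, key=lambda kv: kv[1], reverse=True)   # highest total degree first
bottlenecks = nx.betweenness_centrality(g)                    # shortest-path bottlenecks

print(hubs[0])                                 # ('TF2', 4): the hub
print(max(bottlenecks, key=bottlenecks.get))   # 'TF2': also the bottleneck here
```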
Directed entities - regulatory networks
- one thing regulates another
- Hierarchy - intuitive - people understand this
- optimally arrange transcription factors (ENCODE) into 3 levels by simulated annealing, maximizing downward-pointing edges (see the sketch after this list)
- higher bottleneck-ness in centre layer - information flow
- Can think about molecules - does this make sense for molecules.
Integration of TF hierarchy with other ‘omic information.
More connected and influential on top - same thing with miRNA networks (bi-directional)
- Can look at how transcription factors are working together. Pick two, can look at the degree they co-regulate the target
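The sketch promised above: assign each TF to one of three levels and use simulated annealing to maximize the number of downward-pointing edges. The edge list is a toy stand-in for the measured TF-TF regulatory network:

```python
# Simulated-annealing level assignment: 0 = top, 2 = bottom; maximize edges
# that point from a higher level to a lower one. Toy edges only.
import math
import random

edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "A")]
tfs = sorted({n for e in edges for n in e})

def downward(levels):
    """Count edges going from a higher level (smaller number) to a lower one."""
    return sum(1 for u, v in edges if levels[u] < levels[v])

random.seed(0)
levels = {tf: random.randint(0, 2) for tf in tfs}
best, temp = dict(levels), 1.0
for step in range(5_000):
    cand = dict(levels)
    cand[random.choice(tfs)] = random.randint(0, 2)          # propose moving one TF
    delta = downward(cand) - downward(levels)
    if delta >= 0 or random.random() < math.exp(delta / temp):
        levels = cand                                         # accept move
        if downward(levels) > downward(best):
            best = dict(levels)
    temp *= 0.999                                             # cool down

print(downward(best), best)   # expect 4 of the 5 toy edges pointing downward
```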
Other organisms: Yeast genome #
Similar, but has four levels. Multi-regulated network with bottlenecks
Different types of hierarchies
- autocratic (military)
- democratic (things at top mostly regulating, bottom mostly being regulated)
- intermediate - between the two. Ease some information bottlenecks
Developed a scheme to measure the degree of cross-linking structure. Degree of collaboration
- number of overlapping
- find over many organisms: get a lot more confidence that conclusions are true
- middle layer has highest degree of collaboration
Compare humans w/ E. coli & yeast & rat: humans more collaborative nodes
Yeast network similar structure to government hierarchy w/ middle managers: matches gov’t of Macao
Social science - there’s a literature studying how much you need middle managers talking to each other
Variation network
- map all SNPs from 1000 Genomes onto the network
- more SNPs at bottom
- higher parts of hierarchy more conserved, less variable
- Trend: more hub-like - less variation; more connectivity, more constraint.
Seen in many studies/organisms.
Human protein-protein interaction network - rapidly changing on the outskirts
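A sketch of the underlying correlation test: relate each gene's connectivity to how much variation (e.g. SNP density) it carries. The numbers below are invented, but a negative Spearman correlation mirrors the trend described:

```python
# Correlate connectivity with variation; invented values for illustration.
from scipy.stats import spearmanr

# gene -> (network degree, SNPs per kb), hypothetical values
genes = {
    "HUB1": (42, 0.8), "HUB2": (35, 1.1), "MID1": (12, 2.0),
    "MID2": (9, 1.7),  "PERIPH1": (2, 3.4), "PERIPH2": (1, 2.9),
}
degree, snp_density = zip(*genes.values())
rho, p = spearmanr(degree, snp_density)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")   # strongly negative for these toy numbers
```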
Analogy to understand more connectivity -> more constraint #
Comparison between the E. coli regulatory network and the Linux OS
- call graph in Linux compared to the E. coli regulatory network
- linux is top heavy in comparison
- E. coli: dominated by out-degree hubs - turn on a lot of molecules
- Linux: dominated by in-degree hubs - routines called by many programs
- the Linux OS evolves - we can watch it through each of its releases
- plot changes & compare.
E. coli: less change.
Linux: certain routines don’t change, some things change constantly. Some releases are coupled to hardware and have to change. In the biological system - negative correlation: more connectivity means less change
- In Linux - positive correlation: more connectivity means more change
- Perspectives on random change v. Intelligent Design.
Intelligent designer - they believe they can make changes where there is a lot of constraint and connectivity.
If changes are random - best to not put them in central points
Applications of “more connectivity leads to more constraint” - no time to talk about today. Building a practical workflow & tool for disease genomes.
Network stuff available - encodenets.gersteinlab.org
Q & A #
Q: (Stein) you showed this relationship between hub-ness and the kernel call graph. Have you looked at the evolution of the call signature? Highly connected subroutines do not have their call signature changed frequently - more similar to bio
A: No, very interested in that. Evolution - even package dependencies.
Q: Information flow: makes sense in regulatory networks. What’s your reasoning with protein-protein networks?
A: Sometimes for protein-protein interaction networks, but other times not so much. Key network params - for regulatory networks, focused on bottlenecks. Protein-protein - focus on hubs. When you do the correlations of connectivity with constraint - more on bottlenecks.
Q: Interested in E. coli v. linux - we compare a lot to engineering ideas
A: Maybe not a lot of engineering ideas apply to biology. Sometimes people look at biological networks to apply to engineering problems
Q: have you looked at hubs in organisms with recent genome duplications to see how they occur?
A: genome duplicates, suddenly have these two things interact with your hub or what’s there. Lots of network literature on scale free networks - plays into that.
Q: What do you think about the cell type specificity - do you think different cells depending on their needs will have different hierarchies?
A: Controversy in how I present this. Cell type non-specific hierarchy - this is a global wire diagram. In my mind, if you go to certain cell time, certain lights turn on. Other view - cell type specific hierarchies. I think this doesn’t make sense - no one talks about gene list
The BioCompute Farm: Colocated Compute for Cancer Genomics #
Stuart Young, Annai Systems Inc., USA #
Abstract #
Petabyte-scale genomic data repositories such as the Cancer Genomics Hub (CGHub) require collocated compute resources to fully leverage the value of the genomic data. The traditional model of data download from a repository to a research center followed by local computational analysis suffers from high file transfer costs, significant delays and file storage problems. The BioCompute Farm, a highly-scalable computing resource colocated with CGHub, provides a 99.9% reduction in data storage and 120 times reduction in time for analysis of all 40TB of the current Cancer Genome Atlas (TCGA) RNA-Seq data set. The BioCompute Farm combines high-speed BAM slicing for DNA analysis and the latest in bioinformatics tools and standardized pipelines with the flexibility to customize pipelines and rapidly scale up computational capacity to meet the needs of cancer researchers. As data growth continues to outpace the growth of Internet bandwidth, the BioCompute Farm can serve as a model for the emerging paradigm of colocated compute resources serving the users of large genomic databases.
Notes #
Motivation for talk: why colocated compute #
- '07/'08 - next gen suddenly became a viable product
- before this, fairly expensive Sanger sequencing
- soon - the cost of sequencing began to drop below the cost of storage and bandwidth
- only will become worse
- to address this: need a solution that provides capacity and service
Annai systems: director of bioinformatics #
- Software underpinning CGHub - Annai-GNOS
- server for GeneTorrent - download sequences
- BioCompute - colocated w/ CGHub
How big is this problem?
- TCGA data ~1 PB -> 2.5 PB in the next few years
- download rates: several months to download it all. Store it. Need infrastructure.
- researchers limited by financial and logistical constraints (IT)
Survey by NCI - wish list for cancer genomics researchers
- #1 Run workflows on data in cloud (13%)
- Annai covered about 50% of what they want. Maybe biased sample (online)
NCI’s colocation model
- Genomic Data Commons - integrate multiple datatypes, provide API
- Cloud Pilots - $20M, colocated compute. The successful bidders will provide workflows and be scalable
BioCompute Farm (TCGA data)
- what they’re doing with sequencing - shifts cost of sequencing to getting data and results out
- upstream costs: technology development, pipelines, bioinfo tools
- downstream costs: tools for sequence analysis, management of
/// LOST CONNECTIVITY FOR A WHILE ///
HIPAA Compliance
- holistic expectation - bookkeeping where access is controlled
- Physical security: Cage in SDSC - monitored, power, alarms
Provide farms with subscription based access
Provide custom analysis
- farm loaded with standard pipelines: broad GATK, PanCancer BWA alignment
- Custom Pipelines - latest versions
- Workflow tools: SeqWare (O’Connor), agua, synapse
- Use case: Baylor - BAM-slicing of TCGA RNA-Seq data
- would have taken 9 weeks of download time + storage (no capacity)
- They used the BioCompute Farm instead, BAM-slicing CGHub BAM files via Annai’s GTFuse
- Pipeline optimization - look at runtimes: will this benefit from parallelization or from throwing more CPU at it?
Collaborations #
PanCancer project
- prototype of global federated colocated compute
- setting up servers, SeqWare,
DREAM challenge
- variant calling
- Annai provides GNOS platform for data security and download
ShareSeq
- hosting ICGC- common free access to download free data
- provide colo-compute
Conclusion #
- colo compute is a no brainer
- useful functionalities - fast access, flexible use, tools for workflow, and custom analysis and scalability
Q & A #
Q: Only 5 or 10 labs in the world are interested in whole PB scale data. I think if we make the VCF file available - this should be sufficient for most researchers.
A: I think with the way things are going, the issue is not only going to be huge data access, but secure access, and how can we search through the data to find the datasets you want.
Q: Most of the pipelines are focused on variant calling, alignments - what are the priorities for what’s next?
A: Yes, it’s variant calling right now. One other area of interest- systems approach, pathways, integrating different types of data. Looking at different standards, read pathology or clinical data. Hospital data is very rich for researchers, but not very accessible. Looking at integrating with genomic data.
Short Talk: An Overview of the Bionimbus Protected Data Cloud #
Robert L. Grossman, University of Chicago, USA #
Abstract #
Bionimbus is a petabyte scale community cloud for managing, analyzing and sharing large genomics datasets that is operated by the not-for-profit Open Cloud Consortium. With a cloud computing model, large genomic datasets can be analyzed in place without the necessity of moving it to your local institution. Bionimbus contains a variety of open access datasets, including ENCODE and the 1000 Genomes dataset. In 2013, we updated Bionimbus so that researchers can analyze data from controlled access datasets, such as The Cancer Genome Atlas (TCGA) in a secure and compliant fashion. We describe some case studies using Bionimbus, some of the bioinformatics tools available with Bionimbus, some different ways of interoperating with Bionimbus, the Bionimbus architecture, and the security and compliance framework.
The Bionimbus Protected Data Cloud is supported in part by NIH/NCI (grant NIH/SAIC Contract 13XS021 / HHSN261200800001E), the Gordon and Betty Moore Foundation, and the National Science Foundation (Grants OISE-1129076 and CISE-1127316).
Notes #
I’m going to pose a few questions. In the next 10 min I will not try to answer them. Hopefully your answers will be more interesting than mine. I will give you a framework of how we think of big data.
Four questions #
- Is big data in bioinfo/biomed any different from big data in science? Is big data in science any different from big data in general?
- what instrument should we use to make discoveries over big biomed data?
- do we need new types of mathematical and stat models for big biomed data?
- how do we organize our data?
Bionimbus protected data cloud #
Supporting Pan-Can analysis - open source core
- interoperate with as much proprietary as they can
- log in with NIH/eRA credentials - immediate access to TCGA data
- pipelines, analysis, install your own software
Right now process of scaling up
- 10-20 projects a month
- contain TCGA data- operate on PB scale
- sometime next week, another PB of data & 16K cores, ICGC Pan-Can analysis
- question: how do we make sure, on this limited resource, we get the most science out? Traditionally handled by allocation committees
- this month, would have cost >$3K on amazon
Open science data cloud #
- support integrative analysis: can look at how disease is impacted by socio-economic factors and more. Text analytics & geospatial analytics
- 4 years old (Bionimbus 1 year)
biomedical commons cloud
- involves cancer centres, open source core but operates with proprietary software around it
- want to peer at scale with other providers (biomed commons providers)
- like how the internet was started with tier-one ISPs
- sometimes faster to get data over a high-performance network than from disk with certain protocols
New era #
- '05-'15: bioinformatic tools and integration (Galaxy, GenomeSpace, workflows, portals)
- '10-'20: data center scale science (Bionimbus, CGHub, cancer collaboratory). At that scale what changes and how do we build models
- '15-'25: new modelling techniques
What are the new models? In ’72 Phil Anderson wrote a piece: “More Is Different”
- http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf
- it’s up to us to decide whether more is different here, and if it is, how do we model that
- backlash on Google Flu
How do you scale machine learning to data centers?
- take large complex datasets and chop them up in small pieces you can analyze at scale
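A toy illustration of "chop it into pieces you can analyze at scale": split the data into chunks, analyze the chunks in parallel, then combine the partial results. A local process pool stands in for a data center here:

```python
# Map-reduce in miniature: per-chunk analysis in parallel, then a combine step.
from multiprocessing import Pool

def analyze_chunk(chunk):
    """Stand-in per-chunk analysis: count values above a threshold."""
    return sum(1 for value in chunk if value > 0.9)

if __name__ == "__main__":
    data = [i / 1000 for i in range(100_000)]                  # pretend dataset
    chunks = [data[i:i + 10_000] for i in range(0, len(data), 10_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(analyze_chunk, chunks)             # "map" across the pieces
    print(sum(partials))                                        # "reduce": combine partial results
```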
Is more different at this scale? And if so, how do we discover it?
Q & A #
Q: (Ware) as you see these data centres emerging, do you think they’ll focus on specific questions? How do you see the data centres forming?
A: The ones I mentioned are around cancer genomics. Sustainability and payment - putting small taxes on certain of our projects so that we can make larger amounts of our data available. Driven by some funding agencies. There’s a certain interest of private donors funding certain parts of this. Some economic incentives. Some combination of that is going to change the way we do science.
Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets #
Adam Butler, Wellcome Trust Sanger Institute, UK #
Abstract #
The advent of massively parallel sequencing technology has revolutionised the way we characterise cancer genomes and provided new insights in our understanding of the mechanisms of oncogenesis. The International Cancer Genome Consortium (ICGC) was instigated in 2007 with the aim to systematically screen hundreds of Cancer Genomes for 50 distinct tumour types and catalogue the somatic variation present. This endeavor aims to prevent duplication of effort, ensure rare tumours are included and generate large datasets for the scientific community. A similar project is underway in the USA, The Cancer Genome Atlas (TCGA).
In late 2013 at the ICGC conference in Toronto, Peter Campbell announced an ambitious plan to undertake a Pan-Cancer analysis of whole genome data available from ICGC and TCGA. This would provide a comprehensive dataset of somatic variant calls with standardised output for 2,000 cancer genomes, which will be available for subsequent downstream analyses.
The primary analysis will include detection of somatic point mutations, small insertions and deletions, copy number changes, rearrangements and retrotransposon/viral integration sites. To ensure integrity of the dataset, three independent analysis pipelines, provided by the Broad Institute, DKFZ and the Sanger Institute, will be utilised. The data will be generated and stored at 6 data centres around the world; Spain, Germany, Japan, UK, and two centres in the USA.
The Sanger Institute’s contribution to this initiative is to provide our analysis pipeline as one of three to be run over the data. Consequently our algorithms have been assessed via rigorous comparison with comparable software and their performance optimised. The pipeline is currently being ported into a VM (Virtual Machine), automated and the code adapted for running all variant detection analyses within a cloud environment.
The primary analysis will deliver a high-quality catalogue of somatic variants in a standardised VCF format and made available from the six centres for downstream investigation.
Notes #
Go over our part and experience with the Pan-Cancer analysis with large datasets
The Cancer Genome Project #
- 2000 - working through Sanger sequencing, then next gen '07
- In order to handle different datasets - build analysis tools and pipelines and system
- use them to this day to analyze
- heavily integrated into Sanger infrastructure. Now have to look at with bigger scale data
Pipeline:
- BWA alignment
- Tools: copy number caller - ASCAT - ins/del, rearrangements, transposon, RNA-Seq pipeline
- generate VCF, BAM, allow researchers to get useful parts of info and drill down
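For orientation, a minimal sketch of what the alignment step of such a pipeline can look like when driven from Python: align paired reads with BWA-MEM (the variant the talk later mentions switching to) and sort with samtools. Paths and sample names are placeholders, and a production pipeline would wrap steps like this in a workflow engine (SeqWare in the Pan-Cancer case) rather than raw subprocess calls:

```python
# Pipe bwa mem output into samtools sort; file names and thread count are
# illustrative placeholders.
import subprocess

REF = "GRCh37.fa"                     # assumed reference FASTA (bwa-indexed)
R1, R2 = "tumour_1.fastq.gz", "tumour_2.fastq.gz"

bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "8", REF, R1, R2],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-o", "tumour.sorted.bam", "-"],
    stdin=bwa.stdout,
    check=True,
)
bwa.stdout.close()
bwa.wait()
```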
PanCancer - large international collaboration #
- 2K genome pairs (4K genomes) from multiple tumour types, 30x coverage
- uniform dataset
- analysed using 3 pipelines (Broad, DKFZ, Sanger)
CGP -> PanCancer
- need to take out each part and make it Sanger-free
- optimize for different versions of the aligner
- pipeline the whole lot using SeqWare (O'Connor)
- Just a few seconds saved per step - but they add up over a few billion bps
Phase 1
- identify data for upload, align each sample pair
- using GeneTorrent to download data from CGHub - works very well. Personal concern was about getting data from where it was to where it needed to be. Getting astonishing transfer rates. Automatic data upload.
Useful outcomes #
- we moved over to using a version of BWA-MEM (from BWA) - significantly faster and a smaller memory footprint. May use it for in-house pipelines
optimized callers
- looked at where their code was spending time
- made huge steps forward - substitution caller is 50% faster
- indel caller 2x faster
- ICGC benchmarking exercise - invaluable. Allowed us to make much better judgements on how well we are doing
- new sequencing technologies go faster still…
Q & A #
Q: (Ware) interested in optimization for indels - can you push that any further? Many of our bottlenecks are in aligners built for human (work in plant)
A: What’s it written in? Perl/Java - eyes roll back in heads and they start shaking. Joking aside, with Caveman (substitution caller) - giving someone the time to go back and just re-code proved to give us a massive improvement. Recoded in C. Not glamorous or groundbreaking - C really is faster.
Short Talk: Extensive Variation in Chromatin States Across Humans #
Maya M. Kasowski, Yale University, USA #
Abstract #
The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.
Notes #
Chromatin variation among people
What makes people different? #
- Level of DNA sequence - SNPs
- But how do these variants translate to phenotypic differences
- Look at gene expression. Look at differences in chromatin
- Mapped NFkB
Do differences in histone marks lead to differences in gene expression? #
Aim:
- Characterize variation in chromatin state
- Genetic basis, functional consequences
Used HapMap populations - 19 individuals
- 9-13 histone marks - deeply sequenced data
- Convenient - powerful tool for functionally annotating genome
- Enhancers/promoters/ etc
How much variation in chromatin among individuals? #
There’s an enhancer that is active in the Caucasian and 2 Asian individuals, but not the African individuals - a SNP in an NFkB motif
Striking variation - more than 30% variation at some marks
Combinatorial - chromatin states based on combinations of the marks
- promoter states
- transcribed states
- variety of enhancer states
- repressed states
Found that it was more meaningful to ask whether a particular mark varies in the context of a particular state than overall
- looking at active enhancer mark - varied more in enhancer state than promoter state
- state specific variability
- enhancer states more variable than transcribed or promoter states
- repressed mark - varies more in combination with active marks than on its own
Do states switch among individuals?
- not the case, enhancer is an enhancer across individuals.
- some reciprocal states
Genetic basis of variation
- Active enhancer mark - evidence of a strong genetic basis. Stronger correlation to genotype for variable than for non-variable regions
- Family trios: heritability. Found that the extent of variance in the daughter correlates with the parents
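A sketch of the kind of genotype-vs-chromatin test behind the "strong genetic basis" point: group individuals by genotype at a nearby SNP and ask whether enhancer-mark signal differs between groups. All numbers below are invented; the study itself used 19 lymphoblastoid lines and real ChIP-seq signal:

```python
# Does H3K27ac signal at one enhancer track genotype? Kruskal-Wallis across
# genotype groups; the values are illustrative only.
from scipy.stats import kruskal

signal_by_genotype = {
    0: [1.2, 0.9, 1.4, 1.1],   # homozygous reference individuals
    1: [3.1, 2.7, 3.5],        # heterozygous
    2: [5.0, 4.6, 5.4],        # homozygous alternate
}
stat, p = kruskal(*signal_by_genotype.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")   # small p: signal differs by genotype
```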
Possible mechanism - differences in TF binding motifs
- Strong evidence of this
- Link variation to specific motif disruption
- Looked at peaks, ENCODE
Functional consequences:
- There’s a strong correlation with gene expression (active enhancer mark vs. RNA-Seq data), for known enhancer-gene links (but these are imperfectly known)
- Not all enhancer variation influences expression (but most of it does). Why not? - the enhancers are buffering each other. Non-consequential enhancer variation
- Chromatin variation is likely to influence phenotypes. Variant regions enriched in eQTLs and GWAS SNPs
Q & A #
Q: (Ware) epigenetic change- were you able to use those as biomarkers and retest GWAS? Uncover hidden variation?
A: Haven’t looked at that. This study had 19 individuals. But as we up the scale, perhaps.
Q: Did you look at the trios to see if there’s more concordance among their epigenetic marks than you would have expected on the basis of shared SNPs?
A: Didn’t look at that, we had two trios.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology