Big Data in Biology: Large-scale Cancer Genomics
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Opening Remarks #
Lincoln Stein, Ontario Institute for Cancer Research, Canada
Tremendous number of things happening in the big data world.
It has even become somewhat cliché - Time magazine has ‘Big Data’ on its cover. Google, Amazon jumping in on Big Data. Biological world - immense opportunities.
Positives:
- ICGC (International Cancer Genome Consortium), TCGA (The Cancer Genome Atlas) - sequencing kicked into high gear - whole genome sequencing (not just exome)
- NIH - launched initiatives to collect clinically relevant variants
- Google Flu Trends - predicting flu outbreaks with data analytics https://www.google.org/flutrends/us/
- 23andMe - popular
Not so nice things:
- 23andMe's health reports shut down by the FDA
- Snowden - big data can be sinister
- increased number of data thefts (e.g. disclosure of Target's credit card files)
- digital currency lost - Bitcoin
Risks - need to trade off risks against benefits across genetics, genomics, imaging, and pharma
Goals of this meeting: network, share experience, start collaborations, and understand how to exploit the opportunities big data has given us while avoiding the pitfalls of handling this data inappropriately
Introducing Keynote speaker David Haussler
HHMI, UCSC, and more
http://en.wikipedia.org/wiki/David_Haussler
Large-scale Cancer Genomics - Keynote #
David Haussler, UCSC
Abstract #
Large-scale Cancer Genomics
UCSC has built the Cancer Genomics Hub (CGHub) for the US National Cancer Institute, designed to hold up to 5 petabytes of research genomics data (up to 50,000 whole genomes), including data for all major NCI projects. To date it has served more than 10 petabytes of data to more than 320 research labs. Cancer is exceedingly complex, with thousands of subtypes involving an immense number of different combinations of mutations. The only way we will understand it is to gather together DNA data from many thousands of cancer genomes so that we have the statistical power to distinguish between recurring combinations of mutations that drive cancer progression and “passenger” mutations that occur by random chance. Currently, with the exception of a few projects such as ICGC and TCGA, most cancer genomics research is taking place in research silos, with little opportunity for data sharing. If this trend continues, we lose an incredible opportunity.
Soon cancer genome sequencing will be widespread in clinical practice, making it possible in principle to study as many as a million cancer genomes. For these data to also have impact on understanding cancer, we must begin soon to move data into a global cloud storage and computing system, and design mechanisms that allow clinical data to be used in research with appropriate patient consent. A global alliance for sharing genomic and clinical data is emerging to address this problem. This is an opportunity we cannot turn away from, but involves both social and technical challenges.
Reference: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-211.html
Notes #
Motivation for cancer: obvious
- soon to be #1 killer surpassing heart disease
- mutations in DNA - most are acquired during the lifetime of the individual
BRAF V600 mutation - targeted drugs can give spectacular results
- precision is possible
- but all it takes is a single cell that becomes resistant to the therapy and grows again
- need to understand the pathways to beat the disease
- cancer is defeatable - the mutations are not passed down
How to sequence genomes on massive worldwide scale?
- Centres not designed for PBs of data
Solution: CGHub - Cancer Genomic Hub #
- 1M total files downloaded
- 13PB of data transferred
- 1.4PB data
- 3GB/s typical download rate (fast!)
Limitations
- Can’t store the world’s data in one database (political/security reasons)
- Need many CGHubs - distributed across continents
- Bob Grossman (University of Chicago) runs a 2nd hub trusted by the NIH, offering cloud computing alongside the data
- Want to bring in commercial cloud providers + home grown services
- Huge problem to communicate between all these! => Global Alliance for Genomics and Health: Enabling Responsible Sharing of Genomic and Clinical Data
Global Alliance: http://genomicsandhealth.org/ #
- Partner with others to work through the issues of global genomic data sharing
- The Global Alliance doesn't run projects itself, it just helps run them. Example projects it helps:
ICGC, Pan-Can (see Nature Genetics papers).
BRCA project - unite the world's groups working on BRCA1 & 2; no reason why we can't exchange this information
Task Teams in the Global Alliance #
Existing File Formats Task Team #
- spawned off by Data Working Group (Haussler & Durbin). http://genomicsandhealth.org/files/public/Priorities%20-%20without%20membership%20DWG_0.pdf
- BAM/CRAM/VCF: Will address clinical use of these formats/data
- Current formats are not efficient, scalable systems for storing, exchanging, and using DNA sequence data
- File formats don't scale well; an API (Application Programming Interface) scales much better
- Includes EBI, NCBI, Google, Microsoft, Amazon + several academic centers. Everyone will benefit from a central API
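The point about APIs scaling better than file formats can be illustrated with a toy sketch. All names here are hypothetical (this is not the Global Alliance's actual interface): instead of a lab downloading a whole multi-gigabyte BAM file, a client asks a server for just the reads overlapping one region, and the server does the heavy I/O.

```python
from typing import NamedTuple

class Read(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

# Server-side store standing in for petabytes of BAM files.
STORE = [
    Read("chr7", 100, 150),
    Read("chr7", 140, 190),
    Read("chr9", 500, 550),
]

def search_reads(chrom: str, start: int, end: int) -> list[Read]:
    """What a hypothetical reads-search endpoint might do: return only
    the reads overlapping [start, end) on one chromosome, so the client
    never has to transfer or parse the whole file."""
    return [r for r in STORE
            if r.chrom == chrom and r.start < end and r.end > start]

hits = search_reads("chr7", 145, 160)  # both chr7 reads overlap this window
```

The design point is that the storage format behind `search_reads` can change freely (BAM today, something better tomorrow) without breaking any client, which is exactly what a fixed on-disk file format cannot offer.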
Reference Variation Task Team #
Problem: We have a new reference genome GRCh38.
Need to revamp all of the data :( => time + $$
Goal: develop next gen human genetic reference that includes known variation
- resolve incompleteness and inconsistency
- identify known variants
- standardized format for novel variants
Sequence graphs - representations of the genome where every base instance has a stable identifier (an rsid or UUID)
- a side is a pair composed of a member of {left, right} and a base instance
- edges are unordered pairs of sides; edges are bidirectional, and their endpoints are sides, not nodes
- a simple DNA sequence is a thread: a sequence graph in which each side of each base instance belongs to at most one edge
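A minimal sketch of the side/edge representation described above (the class and function names are my own, not from the talk):

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Base:
    uid: str      # stable identifier (an rsid or UUID in the real scheme)
    letter: str   # A, C, G, or T

@dataclass(frozen=True)
class Side:
    base: Base
    end: str      # "left" or "right"

# An edge is an unordered pair of sides, so it is inherently bidirectional.
Edge = FrozenSet[Side]

def thread(sequence: str, prefix: str = "b") -> tuple[list[Base], set[Edge]]:
    """Build the graph for a plain DNA string (a 'thread'): consecutive
    bases are joined right side to left side, so each side of each base
    instance appears in at most one edge."""
    bases = [Base(f"{prefix}{i}", ch) for i, ch in enumerate(sequence)]
    edges: set[Edge] = {
        frozenset({Side(bases[i], "right"), Side(bases[i + 1], "left")})
        for i in range(len(bases) - 1)
    }
    return bases, edges

bases, edges = thread("GATT")
# 4 base instances joined in a line by 3 bidirectional edges
```

Because edges join sides rather than bases, the same machinery also represents inversions and other rearrangements (e.g. a right side joined to another right side), which a linear reference cannot.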
Every DNA string can be uniquely mapped to the reference:
- each position q in the reference has unique left and right context sets
- any position p in an input DNA string maps to q if it has a left context equivalent to q's, an equivalent right context, or both; mapping is fast
- // see diagram in presentation… very hard to describe
- can be thought of hierarchically
- multi-mapping situations need to be organized and dealt with in the same way
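The context-based mapping can be sketched as follows. This is a toy, left-context-only version under my own naming; the real scheme uses context sets on both sides and (per the Q&A below) considers schemes that tolerate mismatches:

```python
from collections import Counter

def left_contexts(seq: str, k: int) -> dict[int, str]:
    """Position i -> the k bases ending at i (its left context, inclusive)."""
    return {i: seq[i - k + 1 : i + 1] for i in range(k - 1, len(seq))}

def map_by_left_context(query: str, ref: str, k: int) -> dict[int, int]:
    """Map query positions to the reference positions whose k-base left
    context occurs exactly once in the reference (toy context mapping)."""
    ref_ctx = left_contexts(ref, k)
    counts = Counter(ref_ctx.values())
    # keep only contexts that identify a reference position uniquely
    unique = {ctx: i for i, ctx in ref_ctx.items() if counts[ctx] == 1}
    q_ctx = left_contexts(query, k)
    return {p: unique[c] for p, c in q_ctx.items() if c in unique}

mapping = map_by_left_context("ACGG", "ACGTACGGT", k=4)
# position 3 of the query maps to position 7 of the reference
```

The `k` parameter is the length/robustness tradeoff raised in the Q&A: a longer context maps fewer positions but maps them more accurately.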
Berkeley Data Analytics Stack (BDAS)
- Mesos
- Hadoop
- Spark
- GraphX
- Sequence graph API -> Written to sit on top of this stack
Pilot to be released in ~4 months
- open source, portable
- 1st implementation: remapping VCF. Even with our most accurate variant calling, there are multiple ways of expressing the same variant in VCF
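The multiple-representations problem can be seen in miniature with a simplified trim-based normalizer (my own sketch; full normalization, as in standard variant-normalization tools, also left-aligns against the reference genome):

```python
def normalize(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Collapse equivalent REF/ALT encodings by trimming shared bases
    (simplified; real left-alignment also consults the reference)."""
    # trim shared trailing bases
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # trim shared leading bases, advancing the position
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# Two different VCF encodings of the same deletion in a CA repeat:
a = normalize(100, "GCACA", "GCA")
b = normalize(100, "GCAC", "GC")
# both collapse to (100, "GCA", "G")
```

Without a canonical form like this, the same somatic deletion called by two pipelines looks like two distinct variants, which is exactly what a shared warehouse cannot tolerate.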
Driver projects and benchmarks
- task teams key driving projects
- task teams coming up with benchmarks
- providers would like benchmarks and can maintain them (Google, Amazon, etc.)
- Berkeley SMASH platform
Will work with ICGC to learn and apply principles.
ICGC: 2K whole cancer genomes
Million Cancer Genome Warehouse #
- support research globally
- cloud compute
- privacy for patients
- support 3rd party tools w/ common api
- APIs - not file formats - 3rd party can build on
- harmonized portable consents
- benchmarks
Possible Genome Commons Architecture. 3 layers of databases:
- Bottom: BAM (largest)
- Middle: VCF
- Top: interpreted data & clinical data
Cost:
- $50/genome/year to store and analyze 1M whole genomes
- ~100 PB = 2mo of YouTube growth
- Why cancer? It's the high-water mark for medical genomics: if we can do it in cancer, we can do it in any disease
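The back-of-envelope numbers above work out as follows (the $50/genome/year figure is from the talk; ~100 GB of compressed data per whole genome is my assumption, consistent with the ~100 PB total):

```python
genomes = 1_000_000
cost_per_genome_year = 50      # USD/genome/year to store and analyze (from the talk)
bytes_per_genome = 100e9       # ~100 GB compressed per whole genome (assumption)

annual_cost = genomes * cost_per_genome_year   # $50M per year for the warehouse
total_storage = genomes * bytes_per_genome     # 1e17 bytes = 100 PB
```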
What do we have now?
- We don't have a uniform mutation-calling pipeline; ICGC is going to redo it all
- Discover breakpoints - DREAM competition
- Whole cancer genomes - don't settle for exomes/point variation; need to think about how we can look at these from a full-genome-structure point of view
Example: Glioblastoma (GBM) #
- looked at 16 whole genomes (somatic, mapped against normal)
- gene CDKN2A - loss
- one copy is lost by a focal deletion, the other by a complex event
- why focal & complex?
- Many cancers show evidence of an event where the genome shattered & was repaired (chromothripsis) - massive genomic rearrangement. Oncogenes, double minutes
Highlights from GBM analysis
- list of recurrent mutations - candidate drivers. Can correlate with subtypes (classical, proneural, neural, mesenchymal)
- one of the deadliest cancers - clinically important distinctions (LGG, others)
// Left to notify KS ppl of a speaker switch. Missing notes… //
How do we get the statistical power of aggregated information? We need to overcome the social bottlenecks. Infrastructure issue.
Q & A #
Q: Is there a similar thinking/working group around the contextual data/metadata?
A: Charles Sawyers & Kathryn North - clinical working group
- capture clinical phenotypes/data in a standard way
- do not believe clinical data and genomic data need to be in the same database
- just have UUIDs - have a way to link info. Clinical data can stay at center
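The UUID-linkage idea in this answer can be sketched with toy databases (the record fields are hypothetical; the BRAF V600E/melanoma pairing is the example from earlier in the talk):

```python
import uuid

# Genomic and clinical records live in separate databases at separate
# institutions; the only shared key is an opaque UUID.
patient = uuid.uuid4()

genomic_db = {patient: {"variants": ["BRAF V600E"]}}   # at the genome hub
clinical_db = {patient: {"diagnosis": "melanoma"}}     # stays at the clinic

# A consented researcher joins the two records on the UUID alone:
joined = {**genomic_db[patient], **clinical_db[patient]}
```

Because the UUID carries no personal information, neither database needs to hold (or leak) what the other knows, yet the link survives for any researcher granted access to both.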
Q: Concept of UUIDs - can be arbitrary length?
A: Length is an issue for the community - contexts need to be long enough to uniquely identify positions
- how robust do you want this to be? Too long and fewer positions map, but very accurately; vice versa
- thinking about schemes that tolerate mismatches
Q: BAM files - move to 100s PBs - need a new scale. In imaging, we generate TBs of legacy image files. We have the same problem, we can’t pull large files off disks.
A: We need to talk! Same boat. Locally optimized stores come with tradeoffs - need to optimize.
Q: (Doreen Ware) Any working groups looking at the downstream statistical analysis? Population genetics - different ways of integrating?
A: The goal of global alliance is to provide a framework to other folks who can do the analysis. Meant to be a library/interface clever people use to do the analysis (3rd party apps)
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology