Big Data in Biology: Large-scale Cancer Genomics
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Opening Remarks #
Lincoln Stein, Ontario Institute for Cancer Research, Canada
Tremendous number of things happening in the big data world.
It has even become somewhat cliché - Time magazine has ‘Big Data’ on its cover. Google, Amazon jumping in on Big Data. Biological world - immense opportunities.
Positives:
- ICGC (International Cancer Genome Consortium), TCGA (The Cancer Genome Atlas) - sequencing kicked into high gear - whole genome sequencing (not just exome)
- NIH - launched initiatives to collect clinically relevant variants
- Google Flu Trends - predicting flu outbreaks with data analytics https://www.google.org/flutrends/us/
- 23andMe - popular
Not so nice things:
- 23andMe's health reports shut down by the FDA
- Snowden - big data can be sinister
- increased number of data thefts (e.g. disclosure of Target's credit card files)
- digital currency lost - Bitcoin
Risks - need to trade off risks against benefits across genetics, genomics, imaging, and pharma
Goals of this meeting: network, share experience, start collaborations, and understand how to exploit the opportunities big data has given us while avoiding the pitfalls of handling this data inappropriately
Introducing Keynote speaker David Haussler
HHMI, UCSC, and more
http://en.wikipedia.org/wiki/David_Haussler
Large-scale Cancer Genomics - Keynote #
David Haussler, UCSC
Abstract #
Large-scale Cancer Genomics
UCSC has built the Cancer Genomics Hub (CGHub) for the US National Cancer Institute, designed to hold up to 5 petabytes of research genomics data (up to 50,000 whole genomes), including data for all major NCI projects. To date it has served more than 10 petabytes of data to more than 320 research labs. Cancer is exceedingly complex, with thousands of subtypes involving an immense number of different combinations of mutations. The only way we will understand it is to gather together DNA data from many thousands of cancer genomes so that we have the statistical power to distinguish between recurring combinations of mutations that drive cancer progression and “passenger” mutations that occur by random chance. Currently, with the exception of a few projects such as ICGC and TCGA, most cancer genomics research is taking place in research silos, with little opportunity for data sharing. If this trend continues, we lose an incredible opportunity.
Soon cancer genome sequencing will be widespread in clinical practice, making it possible in principle to study as many as a million cancer genomes. For these data to also have impact on understanding cancer, we must begin soon to move data into a global cloud storage and computing system, and design mechanisms that allow clinical data to be used in research with appropriate patient consent. A global alliance for sharing genomic and clinical data is emerging to address this problem. This is an opportunity we cannot turn away from, but involves both social and technical challenges.
Reference: http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-211.html
Notes #
Motivation for cancer: obvious
- soon to be #1 killer surpassing heart disease
- mutations in DNA - most are acquired during the lifetime of the individual
BRAF V600 mutation - targeted drugs can give spectacular results
- precision is possible
- but all it takes is a single cell that becomes resistant to the therapy and grows again
- need to understand the pathways to beat the disease
- cancer is defeatable - the mutations are not passed down
How to sequence genomes on massive worldwide scale?
- Centres not designed for PBs of data
Solution: CGHub - Cancer Genomic Hub #
- 1M total files downloaded
- 13PB of data transferred
- 1.4PB data
- 3GB/s typical download rate (fast!)
Limitations
- Can’t store the world’s data in one database (political/security reasons)
- Need many CGHubs - distributed across continents
- Bob Grossman (University of Chicago) runs a 2nd hub trusted by the NIH, offering cloud computing alongside the data
- Want to bring in commercial cloud providers + home grown services
- Huge problem to communicate between all these! => Global Alliance for Genomics and Health: Enabling Responsible Sharing of Genomic and Clinical Data
Global Alliance: http://genomicsandhealth.org/ #
- Partner with others to work through the issues of global genomic data sharing
- The Global Alliance doesn't run projects itself, it just helps run them. Example projects it helps:
ICGC, Pan-Can (see Nature Genetics papers).
BRCA project - unite the world's groups working on BRCA1 & 2; no reason why we can't exchange this information
Task Teams in the Global Alliance #
Existing File Formats Task Team #
- spawned off by Data Working Group (Haussler & Durbin). http://genomicsandhealth.org/files/public/Priorities%20-%20without%20membership%20DWG_0.pdf
- BAM/CRAM/VCF: Will address clinical use of these formats/data
- Current formats are not efficient, scalable systems for storing, exchanging, and using DNA sequence data
- File formats don't scale well; an API (Application Programming Interface) scales much better
- Includes EBI, NCBI, Google, Microsoft, Amazon + several academic centers. Everyone will benefit from a central API
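The point about APIs scaling better than file formats can be illustrated with a toy sketch. All names here are hypothetical (this is not the Global Alliance's actual interface): instead of a lab downloading a whole multi-gigabyte BAM file, a client asks a server for just the reads overlapping one region, and the server does the heavy I/O.

```python
from typing import NamedTuple

class Read(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

# Server-side store standing in for petabytes of BAM files.
STORE = [
    Read("chr7", 100, 150),
    Read("chr7", 140, 190),
    Read("chr9", 500, 550),
]

def search_reads(chrom: str, start: int, end: int) -> list[Read]:
    """What a hypothetical reads-search endpoint might do: return only
    the reads overlapping [start, end) on one chromosome, so the client
    never has to transfer or parse the whole file."""
    return [r for r in STORE
            if r.chrom == chrom and r.start < end and r.end > start]

hits = search_reads("chr7", 145, 160)  # both chr7 reads overlap this window
```

The design point is that the storage format behind `search_reads` can change freely (BAM today, something better tomorrow) without breaking any client, which is exactly what a fixed on-disk file format cannot offer.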
Reference Variation Task Team #
Problem: We have a new reference genome GRCh38.
Need to revamp all of the data :( => time + $$
Goal: develop next gen human genetic reference that includes known variation
- resolve incompleteness and inconsistency
- identify known variants
- standardized format for novel variants
Sequence graphs - representations of the genome where every base instance has a stable identifier (an rsid or UUID)
- a side is a pair composed of a member of {left, right} and a base instance
- edges are unordered pairs of sides; edges are bidirectional, and their endpoints are sides, not nodes
- a simple DNA sequence is a thread: a sequence graph in which each side of each base instance belongs to at most one edge
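A minimal sketch of the side/edge representation described above (the class and function names are my own, not from the talk):

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Base:
    uid: str      # stable identifier (an rsid or UUID in the real scheme)
    letter: str   # A, C, G, or T

@dataclass(frozen=True)
class Side:
    base: Base
    end: str      # "left" or "right"

# An edge is an unordered pair of sides, so it is inherently bidirectional.
Edge = FrozenSet[Side]

def thread(sequence: str, prefix: str = "b") -> tuple[list[Base], set[Edge]]:
    """Build the graph for a plain DNA string (a 'thread'): consecutive
    bases are joined right side to left side, so each side of each base
    instance appears in at most one edge."""
    bases = [Base(f"{prefix}{i}", ch) for i, ch in enumerate(sequence)]
    edges: set[Edge] = {
        frozenset({Side(bases[i], "right"), Side(bases[i + 1], "left")})
        for i in range(len(bases) - 1)
    }
    return bases, edges

bases, edges = thread("GATT")
# 4 base instances joined in a line by 3 bidirectional edges
```

Because edges join sides rather than bases, the same machinery also represents inversions and other rearrangements (e.g. a right side joined to another right side), which a linear reference cannot.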
Every DNA string can be uniquely mapped to the reference:
- each position q in the reference has unique left and right context sets
- any position p in an input DNA string maps to q if it has a left context equivalent to q's, an equivalent right context, or both; mapping is fast
- // see diagram in presentation… very hard to describe
- can be thought of hierarchically
- multi-mapping situations need to be organized and dealt with in the same way
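The context-based mapping can be sketched as follows. This is a toy, left-context-only version under my own naming; the real scheme uses context sets on both sides and (per the Q&A below) considers schemes that tolerate mismatches:

```python
from collections import Counter

def left_contexts(seq: str, k: int) -> dict[int, str]:
    """Position i -> the k bases ending at i (its left context, inclusive)."""
    return {i: seq[i - k + 1 : i + 1] for i in range(k - 1, len(seq))}

def map_by_left_context(query: str, ref: str, k: int) -> dict[int, int]:
    """Map query positions to the reference positions whose k-base left
    context occurs exactly once in the reference (toy context mapping)."""
    ref_ctx = left_contexts(ref, k)
    counts = Counter(ref_ctx.values())
    # keep only contexts that identify a reference position uniquely
    unique = {ctx: i for i, ctx in ref_ctx.items() if counts[ctx] == 1}
    q_ctx = left_contexts(query, k)
    return {p: unique[c] for p, c in q_ctx.items() if c in unique}

mapping = map_by_left_context("ACGG", "ACGTACGGT", k=4)
# position 3 of the query maps to position 7 of the reference
```

The `k` parameter is the length/robustness tradeoff raised in the Q&A: a longer context maps fewer positions but maps them more accurately.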
Berkeley Data Analytics Stack (BDAS)
- Mesos
- Hadoop
- Spark
- GraphX
- Sequence graph API -> Written to sit on top of this stack
Pilot to be released in ~4 months
- open source, portable
- 1st implementation: remapping VCF. Even with our most accurate variant calling, there are multiple ways of expressing the same variant in VCF
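The multiple-representations problem can be seen in miniature with a simplified trim-based normalizer (my own sketch; full normalization, as in standard variant-normalization tools, also left-aligns against the reference genome):

```python
def normalize(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Collapse equivalent REF/ALT encodings by trimming shared bases
    (simplified; real left-alignment also consults the reference)."""
    # trim shared trailing bases
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # trim shared leading bases, advancing the position
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

# Two different VCF encodings of the same deletion in a CA repeat:
a = normalize(100, "GCACA", "GCA")
b = normalize(100, "GCAC", "GC")
# both collapse to (100, "GCA", "G")
```

Without a canonical form like this, the same somatic deletion called by two pipelines looks like two distinct variants, which is exactly what a shared warehouse cannot tolerate.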
Driver projects and benchmarks
- task teams key driving projects
- task teams coming up with benchmarks
- providers would like benchmarks and can maintain them (Google, Amazon, etc.)
- Berkeley SMASH platform
Will work with ICGC to learn and apply principles.
ICGC: 2K whole cancer genomes
Million Cancer Genome Warehouse #
- support research globally
- cloud compute
- privacy for patients
- support 3rd party tools w/ common api
- APIs - not file formats - 3rd party can build on
- harmonized portable consents
- benchmarks
Possible Genome Commons Architecture. 3 layers of databases:
- Bottom: BAM (largest)
- Middle: VCF
- Top: interpreted data & clinical data
Cost:
- $50/genome/year to store and analyze 1M whole genomes
- ~100 PB = 2mo of YouTube growth
- Why cancer? It's the high-water mark for medical genomics: if we can do it in cancer, we can do it in any disease
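The back-of-envelope numbers above work out as follows (the $50/genome/year figure is from the talk; ~100 GB of compressed data per whole genome is my assumption, consistent with the ~100 PB total):

```python
genomes = 1_000_000
cost_per_genome_year = 50      # USD/genome/year to store and analyze (from the talk)
bytes_per_genome = 100e9       # ~100 GB compressed per whole genome (assumption)

annual_cost = genomes * cost_per_genome_year   # $50M per year for the warehouse
total_storage = genomes * bytes_per_genome     # 1e17 bytes = 100 PB
```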
What do we have now?
- We don't have a uniform mutation-calling pipeline; ICGC is going to redo it all
- Discover breakpoints - DREAM competition
- Whole cancer genomes - don't settle for exomes/point variation; need to think about how we can look at these from a full-genome-structure point of view
Example: Glioblastoma (GBM) #
- looked at 16 whole genomes (somatic, mapped against normal)
- gene CDKN2A - loss
- one copy is lost by a focal deletion, the other by a complex event
- why focal & complex?
- Many cancers show evidence of an event where the genome shattered & was repaired (chromothripsis) - massive genomic rearrangement. Oncogenes, double minutes
Highlights from GBM analysis
- list of recurrent mutations - candidate drivers. Can correlate with subtypes (classical, proneural, neural, mesenchymal)
- one of the deadliest cancers - clinically important distinctions (LGG, others)
// Left to notify KS ppl of a speaker switch. Missing notes… //
How do we get the statistical power of aggregated information? We need to overcome the social bottlenecks. Infrastructure issue.
Q & A #
Q: Is there a similar thinking/working group around the contextual data/metadata?
A: Charles Sawyers & Kathryn North - clinical working group
- capture clinical phenotypes/data in a standard way
- do not believe clinical data and genomic data need to be in the same database
- just have UUIDs - have a way to link info. Clinical data can stay at center
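The UUID-linkage idea in this answer can be sketched with toy databases (the record fields are hypothetical; the BRAF V600E/melanoma pairing is the example from earlier in the talk):

```python
import uuid

# Genomic and clinical records live in separate databases at separate
# institutions; the only shared key is an opaque UUID.
patient = uuid.uuid4()

genomic_db = {patient: {"variants": ["BRAF V600E"]}}   # at the genome hub
clinical_db = {patient: {"diagnosis": "melanoma"}}     # stays at the clinic

# A consented researcher joins the two records on the UUID alone:
joined = {**genomic_db[patient], **clinical_db[patient]}
```

Because the UUID carries no personal information, neither database needs to hold (or leak) what the other knows, yet the link survives for any researcher granted access to both.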
Q: Concept of UUIDs - can be arbitrary length?
A: Length is an issue for the community - contexts need to be long enough to uniquely identify positions
- how robust do you want this to be? Too long and fewer positions map, but very accurately; vice versa
- thinking about schemes that tolerate mismatches
Q: BAM files - move to 100s PBs - need a new scale. In imaging, we generate TBs of legacy image files. We have the same problem, we can’t pull large files off disks.
A: We need to talk! Same boat. Locally optimized stores come with tradeoffs - need to optimize.
Q: (Doreen Ware) Any working groups looking at the downstream statistical analysis? Population genetics - different ways of integrating?
A: The goal of global alliance is to provide a framework to other folks who can do the analysis. Meant to be a library/interface clever people use to do the analysis (3rd party apps)
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology