Big Data in Biology: Large-scale Cancer Genomics

Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.

Warning: These notes are somewhat incomplete and mostly written in broken English.

 Opening Remarks

Lincoln Stein, Ontario Institute for Cancer Research, Canada

Tremendous number of things happening in the big data world.
It has even become somewhat cliché - Time magazine has ‘Big Data’ on its cover. Google, Amazon jumping in on Big Data. Biological world - immense opportunities.


Not so nice things:

Risks - need to trade off risks and benefits. Genetics, genomics, imaging, pharma

Goals of this meeting: network, share experience, build collaborations, and understand how to exploit the opportunities big data has given us while avoiding the pitfalls of handling this data inappropriately

Introducing Keynote speaker David Haussler
HHMI, UCSC, and more

 Large-scale Cancer Genomics - Keynote

David Haussler, UCSC


Large-scale Cancer Genomics

UCSC has built the Cancer Genomics Hub (CGHub) for the US National Cancer Institute, designed to hold up to 5 petabytes of research genomics data (up to 50,000 whole genomes), including data for all major NCI projects. To date it has served more than 10 petabytes of data to more than 320 research labs. Cancer is exceedingly complex, with thousands of subtypes involving an immense number of different combinations of mutations. The only way we will understand it is to gather together DNA data from many thousands of cancer genomes so that we have the statistical power to distinguish between recurring combinations of mutations that drive cancer progression and “passenger” mutations that occur by random chance. Currently, with the exception of a few projects such as ICGC and TCGA, most cancer genomics research is taking place in research silos, with little opportunity for data sharing. If this trend continues, we lose an incredible opportunity.

Soon cancer genome sequencing will be widespread in clinical practice, making it possible in principle to study as many as a million cancer genomes. For these data to also have impact on understanding cancer, we must begin soon to move data into a global cloud storage and computing system, and design mechanisms that allow clinical data to be used in research with appropriate patient consent. A global alliance for sharing genomic and clinical data is emerging to address this problem. This is an opportunity we cannot turn away from, but involves both social and technical challenges.



Motivation for cancer: obvious

BRAF V600 mutation - targeted drugs can give spectacular results

How to sequence genomes on massive worldwide scale?

 Solution: CGHub - Cancer Genomics Hub


 Global Alliance:
 Task Teams in the Global Alliance

Problem: We have a new reference genome GRCh38.
Need to revamp all of the data :( => time + $$

Goal: develop next gen human genetic reference that includes known variation

Sequence graphs - representations of the genome where every base has a stable identifier (an rsID or UUID)

Every DNA string uniquely mapped to the reference
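The idea above can be sketched in a few lines of code. This is an illustrative toy, not any real implementation: each base is a node carrying a stable UUID, and a known variant appears as an alternative branch (a "bubble") off the reference path, so both the reference allele and the variant allele have permanent identifiers. All class and function names here are assumptions made up for the sketch.

```python
import uuid

class SequenceGraph:
    """Toy sequence graph: nodes are single bases with stable ids."""

    def __init__(self):
        self.nodes = {}   # node_id -> base character
        self.edges = {}   # node_id -> list of successor node_ids

    def add_base(self, base):
        node_id = str(uuid.uuid4())  # stable, globally unique id per base
        self.nodes[node_id] = base
        self.edges[node_id] = []
        return node_id

    def link(self, src, dst):
        self.edges[src].append(dst)

def build_linear_path(graph, seq):
    """Add a plain DNA string as a linear path; return its node ids."""
    ids = [graph.add_base(b) for b in seq]
    for a, b in zip(ids, ids[1:]):
        graph.link(a, b)
    return ids

g = SequenceGraph()
ref = build_linear_path(g, "ACGT")  # reference path A-C-G-T
alt = g.add_base("T")               # a known SNP: C>T at the second base
g.link(ref[0], alt)                 # bubble: A -> T -> G
g.link(alt, ref[2])
```

A read containing either allele can now be mapped to a path through the graph, and the variant itself is addressable by a permanent id rather than by a coordinate that changes with each reference release.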

Berkeley Data Analysis Stack (BDAS)

Pilot released in ~4mo

Driver projects and benchmarks

Will work with ICGC to learn and apply principles.
ICGC: 2K whole cancer genomes

 Million Cancer Genome Warehouse

Possible Genome Commons Architecture. 3 layers of databases:


What do we have now?

 Example: Glioblastoma (GBM)

gene CDKN2A - loss

Highlights from GBM analysis

// Left to notify KS ppl of a speaker switch. Missing notes… //

How do we get the statistical power of aggregated information? We need to overcome the social bottlenecks. Infrastructure issue.

 Q & A

Q: Is there a similar thinking/working group around the contextual data/metadata?

A: Charles Sawyers & Kathryn North - clinical working group

Q: Concept of UUIDs - can be arbitrary length?

A: Length is an issue for the community - IDs need to be long enough to uniquely identify positions
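One way to get fixed-length identifiers that are still "long enough" is to derive them deterministically, so any party computing the id for the same position gets the same 128-bit value. The scheme below is purely hypothetical (the namespace string and the `chrom:pos:base` key format are my assumptions, not anything from the talk), but it illustrates the tradeoff: a name-based UUID is always 16 bytes, regardless of how the position is described.

```python
import uuid

# Hypothetical scheme: a fixed namespace for one reference release.
# The domain name here is illustrative, not a real registry.
GENOME_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "grch38.example.org")

def position_id(chrom, pos, base):
    """Deterministic 128-bit id for a (chromosome, position, base) triple."""
    return uuid.uuid5(GENOME_NS, f"{chrom}:{pos}:{base}")

a = position_id("chr7", 140753336, "A")  # coordinates are illustrative
b = position_id("chr7", 140753336, "A")
assert a == b                  # deterministic: same input, same id
assert len(a.bytes) == 16      # fixed 128-bit length
```

Because the id is a hash of its inputs rather than a serial number, no central authority has to hand out identifiers, which matters when many labs annotate the same genome independently.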

Q: BAM files are moving to hundreds of petabytes - we need a new scale. In imaging, we generate TBs of legacy image files. We have the same problem: we can’t pull large files off disks.

A: We need to talk! We’re in the same boat. Optimized stores - local stores. There are tradeoffs; we need to optimize.

Q: (Doreen Ware) Any working groups looking at the downstream statistical analysis? Population genetics - different ways of integrating?

A: The goal of global alliance is to provide a framework to other folks who can do the analysis. Meant to be a library/interface clever people use to do the analysis (3rd party apps)

