Big Data in Biology: Personal Genomes

Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.

Warning: These notes are somewhat incomplete and mostly written in broken English.

Personal Genomes #

Tuesday, March 25th, 2014 8:30am - 12:00pm

Speaker list #

Lincoln D. Stein, Ontario Institute for Cancer Research, Canada

The International Cancer Genome Consortium Database

Ajay Royyuru, IBM T.J. Watson Research Center, USA

Genome Analytics with IBM Watson

Mark Gerstein, Yale University, USA

Human Genome Analysis

Stuart Young, Annai Systems Inc., USA

The BioCompute Farm: Colocated Compute for Cancer Genomics

Adam Butler, Wellcome Trust Sanger Institute, UK

Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets

Maya M. Kasowski, Yale University, USA

Short Talk: Extensive Variation in Chromatin States Across Humans

Robert L. Grossman, University of Chicago, USA

Short Talk: An Overview of the Bionimbus Protected Data Cloud

The International Cancer Genome Consortium Database #

Lincoln D. Stein, Ontario Institute for Cancer Research, Canada #

Abstract #

The International Cancer Genome Consortium (ICGC) is a multinational effort to identify patterns of germline and somatic genomic variation in the major cancer types. Currently consisting of 71 cancer-specific projects spanning 18 different countries, ICGC has sequenced the tumor and normal genomes of over 10,000 donors (>20,000 genomes). When the current phase of the project is completed in 2018, we expect to have sequenced more than 25,000 donors.

All analyzed data from the project is available to the public, including clinical information about the donors, somatic mutations identified in the tumors, and the potential functional significance of these mutations. The raw sequencing data and other potentially-identifiable information is available to researchers who have signed an agreement promising not to attempt to identify the donors. The total data set is now 500 terabytes in size, but growing rapidly as the project switches from exome sequencing (sequencing just the transcribed regions of the genome) to whole-genome sequencing. We anticipate that the full data set will be on the order of 10 petabytes.

To maximize the utility of the data to the public, the analyzed data is available at the ICGC data portal, where users can browse donors, mutations and genes using an attractive high-performance web application based on Elasticsearch at the backend and AngularJS and D3.js on the front end. The portal uses faceted search as its dominant user interface metaphor. This allows researchers to pose general queries, such as “find all non-synonymous mutations” and then successively refine them “…affecting genes in the hedgehog pathway”, “…affecting donors with stage I disease.” A series of interactive graphics allows researchers to readily compare different sets of mutations, donors and genes.
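The successive refinement described above can be sketched in a few lines of Python. This is only an illustration of the faceted-search idea, not the portal's implementation: the records, field names (`consequence`, `pathway`, `stage`) and the helpers `refine` and `facet_counts` are all hypothetical, and a real portal would delegate this work to the search backend.

```python
from collections import Counter

# Hypothetical mutation records; field names and values are illustrative only.
MUTATIONS = [
    {"id": "M1", "consequence": "non-synonymous", "pathway": "hedgehog", "stage": "I"},
    {"id": "M2", "consequence": "non-synonymous", "pathway": "hedgehog", "stage": "III"},
    {"id": "M3", "consequence": "synonymous", "pathway": "hedgehog", "stage": "I"},
    {"id": "M4", "consequence": "non-synonymous", "pathway": "wnt", "stage": "I"},
]


def refine(records, **facets):
    """Keep only the records whose fields match every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


def facet_counts(records, field):
    """Count the remaining records per value of one field (what a facet sidebar shows)."""
    return Counter(r[field] for r in records)


# "Find all non-synonymous mutations" ...
step1 = refine(MUTATIONS, consequence="non-synonymous")
# "... affecting genes in the hedgehog pathway" ...
step2 = refine(step1, pathway="hedgehog")
# "... affecting donors with stage I disease."
step3 = refine(step2, stage="I")
```

Each step narrows the previous result set, and `facet_counts` can be called at any step to show how many records would remain under each further choice.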

A limitation of ICGC is that the raw sequencing data must still be downloaded from a static file repository. We are addressing this limitation by moving the data into the compute cloud, where software and data can be co-resident. In the Whole Genome Pan-Cancer Analysis Project, which began earlier this year, 2000 whole genome pairs from ICGC are being placed into several compute cloud analysis facilities to allow for uniform mutation-calling and data mining by ICGC researchers. In the “Cancer Genome Collaboratory”, a project just approved in March 2014, we will be placing the entire ICGC data set into two compute cloud centers for access by the general research community. I will talk about the challenges and solutions that we are working on in connection to these two projects.

Notes #

ICGC Project

Simple experimental design:

ICGC db growing in size - moved from exome sequencing to whole genome

Available to the public through the ICGC data portal website

Original Database - based on BioMart

September - complete rewrite of the entire DCC (Ferretti). Heavy use of distributed computing.


What about raw read data?

The solution => The Pan-Cancer Whole Genome Analysis Project (PCAWG/Pan-Can) #

Technologies: OpenStack (5 centers) and vCloud (EBI)



Challenges:

  1. Legal - regional differences have not gone away. Datasets from TCGA (US) can be hosted by certain US-based institutions trusted by NIH. NIH has not approved phase II of the project due to the way the consent was written. It can be interpreted as ‘not allowed to use on cloud’ (but the cloud didn’t exist when the consent was written). Europe - some countries are sensitive about distributing their data to US-based data centres (Snowden & NSA).
  2. Technical - adapting grid-based HPC systems to use cloud-based technologies. Running 8 weeks behind

Why not a commercial cloud? Amazon, Google, MS

What happens when Pan-Can is done ~ 1 year? The group has received funding from Canadian funders: The Cancer Genome Collaboratory

Q & A #

Q: (Ware) Many of us have been using BioMart and the scalability - how portable is your new system as a replacement for BioMart?

A: On a scale of 0 - 100: -1. This is a highly specialized system designed just to work with our data. BioMart is alive and well in Italy.

Q: What cancer types were chosen for the pan-cancer analysis? And why?

A: Our criteria for inclusion are at least 30x coverage for whole genome, a tumor-normal pair, and proper consent from the donor.
Of that, we have ovarian, breast, lung, pancreatic, liver, leukemias - about 13 in all.
The final list of tumor types won’t be selected till we’ve QC'ed all the data and know what the distribution is.

Q: If the 10PB of data that will be generated will be harmful - look at quality compression and other

A: No chance that we’ll be storing and distributing the full uncompressed 10PB. Actively benchmarking compression systems. Hopefully we will get it down to a few PB without loss of information.

Q: What is the main objective of this project? Biological objective?

A: The main biological objective - focusing on patterns of alteration in non-coding regions. E.g. we know there are mutations in regulatory regions that we haven’t characterized.
Groups are looking at:

  1. Regulatory networks - interactions with coding regions.
  2. Patterns of rearrangement
  3. Evidence of insertion of known and unknown pathogens / viruses that may be driving the tumours

Looking at this in a uniform way, we’ll learn common mechanisms and mechanisms that are distinct

Q: How willing are your users to get random samples in return as opposed to the full data? Plus confidence score

A: Key method of access - take slices of the raw data in the region that you’re interested in. Or extend and do a random sampling - a feature available in CGHub and widely used. Not a feature of EGA - an annoying deficit. One of the reasons we want to move away.
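To make the two access patterns in this answer concrete, here is a minimal sketch of region slicing and random sampling over a list of reads. It is an assumption-laden toy: reads are plain `(chromosome, start, end)` tuples with half-open coordinates, and `slice_region` and `sample_reads` are hypothetical helpers. Real repositories slice indexed BAM files on the server side; this is not the CGHub or EGA interface.

```python
import random

# Hypothetical aligned reads as (chromosome, start, end), half-open coordinates.
READS = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 300, 400),
    ("chr2", 100, 200),
]


def slice_region(reads, chrom, start, end):
    """Return the reads on `chrom` that overlap the half-open interval [start, end)."""
    return [r for r in reads if r[0] == chrom and r[1] < end and r[2] > start]


def sample_reads(reads, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of the reads."""
    rng = random.Random(seed)  # fixed seed so the same subset can be re-derived
    k = round(len(reads) * fraction)
    return rng.sample(reads, k)


region_reads = slice_region(READS, "chr1", 200, 300)
subset = sample_reads(READS, 0.5)
```

Either call returns far less data than the full read set, which is the point of offering them instead of whole-file download.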

Q: Majority of researchers - don’t need to develop alignment algorithms. Are processed data available to researchers?

A: The interpreted data (still large, but much smaller - in GB not TB) is available for browsing, download and abstraction.

Q: Curious how you are designing your APIs? APIs for visualization are different from tools

A: Start with the user interface, figure out what it needs to display, and work back to the API. A genome browser has a very different API than the faceted browser where you’re looking at a particular biological pathway. Specialized APIs and indexes for each of those.


The Genographic Project #

Genome Analytics with IBM Watson #

Ajay Royyuru, IBM T.J. Watson Research Center, USA

Director of computational biology

Abstract #

// last minute topic change, no published abstract

Press release:

Notes #

Research group at IBM - very focused on computational biology.

Intersection of everything IT and Life Sciences.

3 pillars of work (IBM computational biology)

  1. managing and analyzing the data explosion - makes biology more amenable to quantitative outcomes
  2. predicting biological outcomes with scale of computing
  3. dealing with complexity. DREAM - IBM team with community is heavily involved


What are we up to: gathering raw sequencing input and taking it through a large number of steps so that we will eventually get useful info that may lead to action

3 pillars in the journey of genomic medicine

  1. sequencing (includes downstream analysis - variant calling)
  2. translational medicine (have VCF) <– will focus on this piece (VCF to actionable)
  3. Actionable intelligence - Personalized healthcare. Something publishable is our goal

Translational Medicine: #

System that generates insights


  1. data coming from sequencing (VCF) - patient specific information
  2. Entirety of what you can point Watson to - All available biological knowledge (PubMed, NCI PDQ)

All this is ingested. Running on IBM’s cloud layer (SoftLayer) - large/global/scalable/acquired by IBM.
Generates some actionable insights.
Goal: this goes to oncologists, who look at the data in the context of the decision they are trying to make. Hopefully they make an informed, correct decision.

IBM Watson #
Genomics - not just about genes. How we connect that knowledge #

The traditional way: read papers, develop hypotheses -> interpretation -> actionable output. Can we automate this? Can we come up with new research approaches from the literature?

p53 project example - ingest a lot - mine the literature. #

Application to genomics:
On SoftLayer, the physician managing cases (biopsy samples) handles submission by uploading a VCF.
What analysis can be done -

Summary: researcher can browse. print report for the record.

Looking for active collaborations - they don’t generate this data themselves

Sample report- generated with early data

e.g. Lidocaine in report: not prepared to see this in here

Q & A #

Q: (Ouellette) Do you have any evidence on how Watson will do if it read full papers (not just abstracts)?

A: Not tested in this context. Watson does read full papers in a clinical context

Q: (Mesirov) -

1. Are you aiming with that package towards the practicing oncologists or the research physician?

2. To what extent have you compared what Watson is able to mine from the data with other approaches/algorithms/packages published and available to the community?


A:

  1. It’s a journey - early adopters, research clinicians who have the expertise and interest to be partners. A lot of learning. For example, Watson shows lots of evidence. You need a clinician researcher who understands the subtleties of the research and how to make decisions that will be useful
  2. No full-scale comparison yet - still in ingest and build mode. Some benchmarking and testing - working on the baseline. Full-scale comparison for later. Watson can also do chemical extraction - full-scale comparison here.


Q:

  1. Is there any way to integrate other sources of information that are not text based? Images? Protein structures?
  2. What is the human value added in human-curated databases?


A:

  1. Image analytics is of interest to us. Study going on here. Working with some large medical institutions on this project.
  2. Melding between machine and human curation -> this accelerates the process. Makes it more usable.

Q: Doubts whether a practicing physician will know what a VCF is or understand a Circos plot. Watson to user or user to Watson?

A: Initial set of end users - clinician researchers. They got the sample, they know what VCF is. This is the community that will find this useful. What can we simplify to make this more useable.
Right now, collaboration.


Human Genome Analysis #

Mark Gerstein, Yale University, USA #

Director: computational biology
ENCODE, 1000 genomes

Abstract #

Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in non-coding regions, is a central challenge for genomics.

One data science construct that is particularly useful for genome interpretation is networks. My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression.

Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).

Architecture of the human regulatory network derived from ENCODE data.

M Gerstein et al. (2012). Nature 489: 91.

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.

KY Yip et al. (2012). Genome Biol 13: R48.

Understanding transcriptional regulation by integrative analysis of transcription factor binding data.

C Cheng et al. (2012). Genome Res 22: 1658-67.

The GENCODE pseudogene resource.

B Pei et al. (2012). Genome Biol 13: R51.

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.

KK Yan et al. (2010). Proc Natl Acad Sci U S A 107:9186-91.

Slides #–20140325-i0keybdata/

Notes #

My perspective on Big Data #
What do people do with big data? #

Take this data to answer a question, make a prediction, modelling

Two ways to approach:

  1. don’t care about structure, just want answer (google search)
  2. with explicit organization of dataset (google maps, google earth)

In science - search for Higgs boson - searching through many for a few needles (fits in #1)

In genomics - we’re in #2

How we can organize information in genomics - networks #
What is genome annotation? #

Tracks in genome browser - linear view of how to think of genome.
Will this scale with thousands of tracks? No.

What type of information do we want? Actually thinking of 3D molecules - but not quite possible

Network diagram - middle ground

analyze networks - network science

Directed entities - regulatory networks

Other organisms: Yeast genome #

Similar, but has four levels. Multi-regulated network with bottlenecks

Different types of hierarchies

  1. autocratic (military)
  2. democratic (things at top mostly regulating, bottom mostly being regulated)
  3. intermediate - between the two. Ease some information bottlenecks

Developed a scheme to measure the degree of cross-linking structure. Degree of collaboration

Compare humans w/ E. coli & yeast & rat: humans have more collaborative nodes

Yeast network similar structure to government hierarchy w/ middle managers: matches gov’t of Macao

Social science - there is literature studying how important it is to have middle managers talking to each other

Variation network

Analogy to understand more connectivity -> more constraint #

Comparison between E. coli regulatory network and Linux OS

Applications of more connectivity leads to more constraint - no time to talk today. Building a practical workflow & tool for disease genomes.

Network stuff available -

Q & A #

Q: (Stein) you showed this relationship between Hub-ness and Kernel call graph. Have you looked at the evolution of the call signature? Highly connected subroutines do not have their call signature called frequently - more similar to bio

A: No, very interested in that. Evolution - even package dependencies.

Q: Information flow: makes sense in regulatory networks. What’s your reasoning with protein-protein networks?

A: It makes sense for some types of protein-protein interaction networks, but for others not so much. Key network params - regulatory, focused on bottlenecks. Protein-protein - focus on hubs. When you do the correlations of connectivity with constraint - more on bottlenecks.
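The hub versus bottleneck distinction can be illustrated on a toy directed network. The sketch below assumes the usual network-science reading, which the talk did not spell out: a hub is a node with high degree, and a bottleneck is a node with high betweenness (many shortest paths pass through it). The graph, node names and both helper functions are hypothetical; betweenness is computed with Brandes' algorithm for unweighted directed graphs.

```python
from collections import deque

# Toy regulatory network (regulator -> targets); names are illustrative only.
GRAPH = {
    "T1": ["M"],
    "T2": ["M"],
    "M": ["G1", "G2"],
    "H": ["G3", "G4", "G5", "G6", "G7"],
}


def degree(graph):
    """Total degree (in-degree plus out-degree) of every node."""
    deg = {}
    for src, targets in graph.items():
        deg[src] = deg.get(src, 0) + len(targets)
        for t in targets:
            deg[t] = deg.get(t, 0) + 1
    return deg


def betweenness(graph):
    """Unnormalised betweenness centrality via Brandes' algorithm (unweighted, directed)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {n: 0.0 for n in nodes}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


deg = degree(GRAPH)
btw = betweenness(GRAPH)
hub = max(deg, key=deg.get)
bottleneck = max(btw, key=btw.get)
```

In this toy graph the highest-degree node is `H`, which regulates many targets directly, while the highest-betweenness node is the "middle manager" `M`, because every path from `T1` or `T2` to `G1` or `G2` passes through it.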

Q: Interested in E. coli v. linux - we compare a lot to engineering ideas

A: Maybe not a lot of engineering ideas apply to biology. Sometimes people look at biological networks to apply to engineering problems

Q: have you looked at hubs in organisms with recent genome duplications to see how they occur?

A: genome duplicates, suddenly have these two things interact with your hub or what’s there. Lots of network literature on scale free networks - plays into that.

Q: What do you think about the cell type specificity - do you think different cells depending on their needs will have different hierarchies?

A: There is controversy in how I present this. Cell type non-specific hierarchy - this is a global wiring diagram. In my mind, if you go to a certain cell type, certain lights turn on. Other view - cell type specific hierarchies. I think this doesn’t make sense - no one talks about gene list


The BioCompute Farm: Colocated Compute for Cancer Genomics #

Stuart Young, Annai Systems Inc., USA #

Abstract #

Petabyte-scale genomic data repositories such as the Cancer Genomics Hub (CGHub) require colocated compute resources to fully leverage the value of the genomic data. The traditional model of data download from a repository to a research center followed by local computational analysis suffers from high file transfer costs, significant delays and file storage problems. The BioCompute Farm, a highly-scalable computing resource colocated with CGHub, provides a 99.9% reduction in data storage and 120 times reduction in time for analysis of all 40TB of the current Cancer Genome Atlas (TCGA) RNA-Seq data set. The BioCompute Farm combines high-speed BAM slicing for DNA analysis and the latest in bioinformatics tools and standardized pipelines with the flexibility to customize pipelines and rapidly scale up computational capacity to meet the needs of cancer researchers. As data growth continues to outpace the growth of Internet bandwidth, the BioCompute Farm can serve as a model for the emerging paradigm of colocated compute resources serving the users of large genomic databases.
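As a rough worked example of what the quoted figures imply, the arithmetic can be written out as below. The abstract does not say what baseline the reductions are measured against, so this assumes the 99.9% storage reduction applies to the full 40TB and uses a purely hypothetical 240-hour baseline for the download-then-analyze workflow; the function names are illustrative as well.

```python
TCGA_RNASEQ_GB = 40_000           # 40TB in decimal gigabytes, from the abstract
STORAGE_REDUCTION_PERMILLE = 999  # 99.9% reduction, from the abstract
SPEEDUP = 120                     # "120 times reduction in time", from the abstract


def remaining_storage_gb(total_gb, reduction_permille):
    """Storage still needed locally after the stated reduction."""
    return total_gb * (1000 - reduction_permille) / 1000


def colocated_time_hours(baseline_hours, speedup):
    """Analysis time with colocated compute, given a baseline and a speed-up factor."""
    return baseline_hours / speedup


local_gb = remaining_storage_gb(TCGA_RNASEQ_GB, STORAGE_REDUCTION_PERMILLE)
# ASSUMPTION: 240 hours is a made-up baseline, not a figure from the talk.
hours = colocated_time_hours(240, SPEEDUP)
```

Under these assumptions a researcher would keep about 40GB locally instead of 40TB, and a 240-hour job would take about 2 hours.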

Notes #

Motivation for talk: why colocated compute #
Annai Systems: director of bioinformatics #

How big is this problem?

Survey by NCI - wish list for cancer genomics researchers

NCI’s colocation model

BioCompute Farm (TCGA data)


HIPAA Compliance

Provide farms with subscription based access

Provide custom analysis

Collaborations #

PanCancer project

DREAM challenge


Conclusion #

Q & A #

Q: Only 5 or 10 labs in the world are interested in whole PB scale data. I think if we make the VCF file available - this should be sufficient for most researchers.

A: I think with the way things are going, the issue is not only going to be huge data access, but secure access, and how can we search through the data to find the datasets you want.

Q: Most of the pipelines are focused on variant calling, alignments - what are the priorities for what’s next?

A: Yes, it’s variant calling right now. One other area of interest- systems approach, pathways, integrating different types of data. Looking at different standards, read pathology or clinical data. Hospital data is very rich for researchers, but not very accessible. Looking at integrating with genomic data.


Short Talk: An Overview of the Bionimbus Protected Data Cloud #

Robert L. Grossman, University of Chicago, USA #

Abstract #

Bionimbus is a petabyte scale community cloud for managing, analyzing and sharing large genomics datasets that is operated by the not-for-profit Open Cloud Consortium. With a cloud computing model, large genomic datasets can be analyzed in place without the necessity of moving it to your local institution. Bionimbus contains a variety of open access datasets, including ENCODE and the 1000 Genomes dataset. In 2013, we updated Bionimbus so that researchers can analyze data from controlled access datasets, such as The Cancer Genome Atlas (TCGA) in a secure and compliant fashion. We describe some case studies using Bionimbus, some of the bioinformatics tools available with Bionimbus, some different ways of interoperating with Bionimbus, the Bionimbus architecture, and the security and compliance framework.

The Bionimbus Protected Data Cloud is supported in part by NIH/NCI (grant NIH/SAIC Contract 13XS021 / HHSN261200800001E), the Gordon and Betty Moore Foundation, and the National Science Foundation (Grants OISE - 1129076 and CISE 1127316).

Notes #

I’m going to pose a few questions. In the next 10 min I will not try to answer them. Hopefully your answers will be more interesting than mine. I will give you a framework of how we think of big data.

Four questions #

  1. Is big data in bioinfo/biomed any different from big data in science? Is big data in science any different from big data in general?
  2. What instrument should we use to make discoveries over big biomed data?
  3. Do we need new types of mathematical and stat models for big biomed data?
  4. How do we organize our data?

Bionimbus protected data cloud #

Supporting Pan-Can analysis - open source core

Right now process of scaling up

Open science data cloud #

biomedical commons cloud

New era #

What are the new models? In '72 Phil Anderson wrote a piece, “More Is Different”: is more different?

How do you scale machine learning to data centers?

Is more different at this scale? And if so, how do we discover it?

Q & A #

Q: (Ware) as you see these data centres emerging, do you think they’ll focus on specific questions? How do you see the data centres forming?

A: The ones I mentioned are around cancer genomics. Sustainability and payment - putting small taxes on certain of our projects so that we can make larger amounts of our data available. Driven by some funding agencies. There’s a certain interest of private donors funding certain parts of this. Some economic incentives. Some combination of that is going to change the way we do science.


Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets #

Adam Butler, Wellcome Trust Sanger Institute, UK #

Abstract #

The advent of massively parallel sequencing technology has revolutionised the way we characterise cancer genomes and provided new insights in our understanding of the mechanisms of oncogenesis. The International Cancer Genome Consortium (ICGC) was instigated in 2007 with the aim to systematically screen hundreds of Cancer Genomes for 50 distinct tumour types and catalogue the somatic variation present. This endeavor aims to prevent duplication of effort, ensure rare tumours are included and generate large datasets for the scientific community. A similar project is underway in the USA, The Cancer Genome Atlas (TCGA).

In late 2013 at the ICGC conference in Toronto, Peter Campbell announced an ambitious plan to undertake a Pan-Cancer analysis of whole genome data available from ICGC and TCGA. This would provide a comprehensive dataset of somatic variant calls with standardised output for 2,000 cancer genomes, which will be available for subsequent downstream analyses.

The primary analysis will include detection of somatic point mutations, small insertions and deletions, copy number changes, rearrangements and retrotransposon/viral integration sites. To ensure integrity of the dataset, three independent analysis pipelines, provided by the Broad Institute, DKFZ and the Sanger Institute, will be utilised. The data will be generated and stored at 6 data centres around the world; Spain, Germany, Japan, UK, and two centres in the USA.

The Sanger Institute's contribution to this initiative is to provide our analysis pipeline as one of three to be run over the data. Consequently our algorithms have been assessed via rigorous comparison with comparable software and their performance optimised. The pipeline is currently being ported into a VM (Virtual Machine), automated and the code adapted for running all variant detection analyses within a cloud environment.

The primary analysis will deliver a high-quality catalogue of somatic variants in a standardised VCF format, which will be made available from the six centres for downstream investigation.
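For readers who have not worked with VCF, the sketch below parses one data line into its eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a minimal illustration only: `parse_vcf_line` is a hypothetical helper, the example record is made up, and header lines, genotype columns and type conversion of INFO values are deliberately ignored. It is not the Sanger pipeline's code.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict of its 8 fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    return {
        "chrom": chrom,
        "pos": int(pos),                 # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),           # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Made-up example record: a C>T substitution flagged as somatic.
EXAMPLE = "1\t12345\t.\tC\tT\t50\tPASS\tSOMATIC;DP=30"
record = parse_vcf_line(EXAMPLE)
```

A standardised output format matters here because calls from three independent pipelines can then be compared field by field.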

Notes #

Go over our part and experience with the Pan-Cancer analysis with large datasets

The Cancer Genome Project #


PanCancer - large international collaboration #

CGP -> PanCancer

Phase 1

Useful outcomes #

optimized callers

Q & A #

Q: (Ware) interested in optimization for indels - can you push that any further? Many of our bottlenecks are in aligners built for human (work in plant)

A: What’s it written in? Perl/Java - eyes roll back in heads and they start shaking. Joking aside, with CaVEMan (substitution caller) - giving someone the time to go back and just re-code it proved to give us a massive improvement. Recoded in C. Not glamorous or groundbreaking - C really is faster.


Short Talk: Extensive Variation in Chromatin States Across Humans #

Maya M. Kasowski, Yale University, USA #

Abstract #

The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

Notes #

Chromatin variation among people

What makes people different? #
Differences in histone marks -> differences in gene expression? #


Used HapMap populations - 19 individuals

How much variation in chromatin among individuals? #

There’s an enhancer that is active in the Caucasian and 2 Asian individuals, but not in the African individuals - SNP in NFkB motif

Striking variation - more than 30% variation at some marks

Combinatorial - chromatin states based on combinations of the marks

Found that it was more meaningful to ask whether a particular mark varies in the context of a particular state than overall

Do states switch among individuals?

Genetic basis of variation

Possible mechanism - differences in TF binding motifs

Functional consequences:

Q & A #

Q: (Ware) epigenetic change- were you able to use those as biomarkers and retest GWAS? Uncover hidden variation?

A: Haven’t looked at that. This study had 19 individuals. But as we up the scale, perhaps.

Q: Did you look at the trios to see if there’s more concordance among their epigenetic marks than you would have expected on the basis of shared SNPs?

A: Didn’t look at that, we had two trios.


