Big Data in Biology: Databases and Clouds

Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.

Warning: These notes are somewhat incomplete and mostly written in broken English.

 Databases and Clouds

Monday, March 24th, 2014 9:30am - 2:15pm

 Speaker list

Laura Clarke, European Bioinformatics Institute, UK

The 1000 Genomes Project, Community Access and Management for Large Scale Public Data

Dan Stanzione, University of Texas at Austin, USA

The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology

Jill P. Mesirov, Broad Institute, USA

GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools

Ronald C. Taylor, Pacific Northwest National Laboratory, USA (replaced by Francis Ouellette)

FGED: The Functional Genomics Data Society

Andrew Carroll, DNAnexus, USA

Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific Results

Michael Schatz, Cold Spring Harbor Laboratory, USA

The Next 10 Years of Quantitative Biology

 The 1000 Genomes Project, Community Access and Management for Large Scale Public Data

 Laura Clarke, European Bioinformatics Institute, UK


The 1000 Genomes data continues to be the largest public variation resource available to the community. Providing coherent and useful resources based on this data continues to be a key goal for the project Data Coordination Center (DCC).

The resource now stands at more than 500 Tbytes in size and nearly 500,000 files on the FTP site. This presents challenges both for us to manage and for users to discover what data we have available.

Here I both describe these challenges and present the solutions and tools the project has created to enable the widest level of usefulness for the 1000 Genomes Project data.


 1000 genomes project


Analysis Approach

In final phase now


Data Transfer

Within Consortium Data Exchange

Public Accessibility

Tools to Assist Data Use

 Q & A

Q: 1000 genomes project - many 340bp all deletions without insertions?

A: Quality - false discovery rate <5%. Structural variants are very difficult. Wasn’t sufficiently confident in structural variations that aren’t deletions - did not include them in the db. Structural variations will always be more limited.

Q: Idea of a data freeze and recall - uuid, public key trust network - possible route?

A: Sounds like a good idea.

 The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology

 Dan Stanzione, University of Texas at Austin, USA


iPlant is a new kind of virtual organization, a cyberinfrastructure (CI) collaborative created to catalyze progress in computationally-based discovery in plant biology. iPlant has created a comprehensive and widely used CI, driven by community needs, and adopted by a number of large-scale informatics projects and thousands of individual users. iPlant holds more than 1.5 petabytes of user data comprising several hundred million files today, and is thus deeply involved in the “Big Data” challenges of biologists, from storing to analyzing to sharing rapidly growing amounts of data.

This talk will outline the iPlant CI, and discuss what iPlant is doing today to address data challenges, as well as plans for the future. The talk will also address trends the project sees in how users are handling data, and the potential technological solutions on the horizon to address them.

iPlant is supported by the National Science Foundation via Award #DBI-1265383.


iPlant - was co-director until 8 weeks ago. Passed the co-director role to Matthew W. Vaughn.

What is iPlant:
A community-driven organization building cyberinfrastructure for the plant (and animal) sciences.


A combination of computing, data storage, networking and humans to achieve some scientific goal.


Achievements through iPlant’s open infrastructure

 iPlant Services

Powered by iPlant

Workflow Optimization and Consulting

Democratizing access to high-throughput genome annotation

 Data store:
 Big Data Observations

Soap Box

Speed things up

 Q & A

Q: (Illumina) Are there tools to analyze applications to determine their inefficiencies?

A: Yes, there are, with caveats. Some tools, such as PerfExpert (tooling and analysis), are low-level performance tools. They are not as useful with higher-level languages and not great for Python.
They also build job stats on the system, which can tell you the efficiency of your code on their system.

Q: (Mesirov) What’s your process on who gets to use it, who doesn’t?

A: iPlant: all resources are NSF funded. Some are XSEDE. XRAC - any open-science funded researcher. Must be US and published.
iPlant - will open up under 10K hours. Tiers on higher use, compared with other users.

 GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools

 Jill P. Mesirov, CIO at Broad Institute, USA


Over the last two decades genomics has accelerated at an exponential pace, driven by new sequencing and other genomic technologies, promising to transform biomedical research. These data offer a new era of potential for the understanding of the basic mechanisms of disease and identification of novel treatments. Concurrently, there has been a growing emphasis on integrating all of the available data types to better inform scientific discovery. There are now thousands of bioinformatic analysis and visualization tools for this wealth of data. To leverage these tools to make biomedical discoveries, biologists must be empowered to access them and combine them in creative ways to explore their data. However, this vision has been out of reach for almost all biomedical researchers.

We will describe and give example applications of GenomeSpace, an open environment that brings together a community of 14 diverse computational genomics tools and data sources, and enables scientists to easily combine their capabilities without the need to write scripts or programs. Begun as a collaboration of six core bioinformatics tools - Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), the Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC) - the GenomeSpace community continues to grow. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analytic workflows, automatic conversion of data formats, and ease of connecting new tools to the environment.
Funding provided by NHGRI and Amazon Web Services.


 GenomeSpace - fairly recent project


 What do we need to take advantage?

integrate large data sets and multiple data types.
data management/identification - how do I find what helps me?

more complex workflows and algorithms

visualize large integrated data sets:
viewers, help us look at reads and see if that call makes sense

validate computational results

 Will focus on -> More complex workflows/algorithms

Integrative genomics

Difficulty of getting these tools to work together - need to develop infrastructure.
Challenge: flood of data & proliferation of tools

One monolithic tool OR a cooperative approach

 How do I use it?

Go to cookbook for: how to build a more complex analysis,
How to leverage these different tools

GenomeSpace recipe collection

 Join the community!

open source, on bitbucket

 Q & A

Q: (Stein) loved the recipes. Regular recipes still work 50 years later (broccoli doesn’t change). Bioinformatics paper 10 years ago will not work. How much time and effort is required to create a recipe in an environment where tools will be updated? Will it work in 5 years?

A: Tried to limit the scope of the recipes - not beginning to end paper. More simple - just 2 or 3 tools. Committed to setting up steering committee for recipe collection to keep them honest.
RNASeq - many are beginning to use it in their work. Yet methods for analyzing RNASeq haven’t been settled. This is a challenge they recognize. Community resource - users can report when recipes aren’t working. Go to the forum.

Q: (Illumina) Data from different sources - does GenomeSpace provide info on the challenges of combining different data?


A: Can do: put warnings. Watch out for the follow… etc. People who develop these recipes must understand the workflow fairly well so they know the gotchas.

Can’t do: cannot anticipate all the ways in which a biologist will misuse the resource.
People misuse tools. Try to give enough info and warning to keep the probability low.

Q: Follow-up: Account for differences in platforms?

A: Don’t have funding for all, but we do contact vendors.

Q: Thank you for making something more user friendly!

Q: Clinical data - do you have the security to handle this?

A: Security that Amazon Cloud provides. New round of funding: agreed to put warnings for people who are uploading data. If you have data that needs to be kept private - can use your own Amazon S3/Dropbox.

GenomeSpace does not do analysis - it’s on the tools.

Q: (IBM - Royyuru) Reproducibility - read about a tool in a paper, but can’t reproduce. Can GenomeSpace add machine readable script to run the tool?

A: Can’t go into tools themselves - lightweight. Will talk offline.

 FGED: The Functional Genomics Data Society

 Francis Ouellette, Ontario Institute for Cancer Research, Canada

Selected on merit - not invited talk. Ron has laryngitis - Francis Ouellette is presenting slides.


The Functional Genomics Data (FGED) Society, founded in 1999 as the MGED Society, is a registered International Society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. Our work on defining minimum information specifications for reporting data in functional genomics papers (e.g., MIAME) has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by biological research efforts in data integration and meta-analysis.


Spirit of openness - share everything

 Functional Genomics Data Society & Its Mission

In the beginning there were microarrays - MGED

MIAME - standard for exchanging raw microarray data

MINSEQE - minimal standards for a nucleotide sequencing experiment.
General description of the aim, metadata, raw reads, processed data

FGED Standards: big data needs standards, FGED creates and aids the development of such

FGED is an open society and welcomes feedback, input and volunteers

 Q & A

Q: (Stein) What is the journal policy in the continued evolution of this effort?

A: Publishers in general have very great interest and support. They are looking for things like this. PLoS - new data release policy. Publishers keen to see what community agreed upon standards are.

 Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific Results

 Andrew Carroll, DNAnexus, USA


As one of five institutions participating in the global CHARGE Consortium, the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine needed a compute and data management infrastructure solution to handle the massive amount of data (3,751 whole genomes and 10,940 exomes) they would be processing for this project. The large burst computational demands for this project would have unacceptably taxed existing resources, requiring either many months of using spare capacity or forcing other users off the cluster for 4-5 weeks to complete it faster. To address this challenge, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra large-scale genomic analysis project quickly and flexibly, with zero capital investment. At the project’s peak, HGSC was able to spin up more than 20,000 cores on-demand in order to run the analysis pipeline of the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world.


DNAnexus - 2009 spin-out from Stanford. Darling of successful startups. Apply the Cloud at scale

Two parts:

  1. Philosophy of the Cloud
  2. Application to large project (10-11K exomes)
 What is DNAnexus
 Scientific Vision:

Challenges looming over data @ scale

Science like driving

Tool development - profile runtimes and cost


Tool Optimization

Benchmark sets


 DNAnexus - HGSC-CHARGE Collaboration

Analysis of 11K exomes and 4K whole genomes for CHARGE consortium.
Compute scale and distribution of results across 300 investigators

Baylor: 20 HiSeqs ~25TB of sequence per month

 Q & A

Q: (Schatz) On projects like this the first half is well structured, but gets very ad-hoc by the end. How is this structured in DNAnexus for ad-hoc queries?

A: We take advantage of the expertise of the people working with us. Relying on the CHARGE consortium in collaboration. Directed hypotheses generated by partners.

Q: Can you elaborate on the datasets you’re using as benchmarks?

A: An opportunity for the community to come together - benchmarking sets are the way to go, DNAnexus gives us an opportunity to go in this space. Not curators of benchmark sets.

 The Next 10 Years of Quantitative Biology

 Michael Schatz, Cold Spring Harbor Laboratory, USA


Topic change, no abstract



 Questions in Biology - some broad, some focused

Interesting things about these questions - there is no single instrument that answers each of these questions

Answer these questions:

 Bottom tier - sensors : Cost per Genome - drives much of the talks today. need scalability

DNA data tsunami - growth of sequencing around 3x per year
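
To get a feel for what roughly 3x growth per year implies, the compounding can be worked out directly. This is a back-of-the-envelope sketch: the 3x rate is from the talk, while the horizons shown (including the ten years borrowed from the talk title) and the function name are assumptions for illustration. At that rate, output grows about 243-fold in five years and about 59,000-fold in ten.

```python
def projected_growth(rate_per_year: float, years: int) -> float:
    """Return the cumulative growth factor after `years` of compounding."""
    return rate_per_year ** years


# Illustrative horizons only; the talk quoted the rate, not these time spans.
for years in (1, 5, 10):
    factor = projected_growth(3.0, years)
    print(f"after {years:2d} years: {factor:,.0f}x")
```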

Sequencing Centres map - will be roughly the same

compression will help - need to be aggressive about throwing out data

major applications:

 Next phase - compute, algorithms

compute - parallel algorithm spectrum

Better hardware:

Crossbow - algorithm on map reduce

PanGenome alignment and assembly

See major informatics centers on topics

 Top of slice: Results: work at CSHL - genetics of autism

Sample set: 3K families - simplex families

SCALPEL - find indels from short read sequencing data

Experimental analysis and validation

de novo genetics of autism - same number of mutations as siblings

available on bioRxiv, code available on SourceForge

 Potential for big data

 Power from data aggregation - champion ourselves and the future

What is a data scientist? Many fields. To be really successful, you need strengths, experience and expertise in these fields.

 Q & A

Q: Observation: Talking about sequencing coming down in price - what happens when sequencing becomes so cheap and democratized that anyone can do this? How do we as a community get the legislature to start thinking about these privacy concerns? We need to look at this data.

A: No simple answer. Part of it will come through scientific discoveries - congressmen pay attention when there are big breakthroughs. Lobby - we need to talk to the rest of the world. Part of it is going to come in response when there are outbreaks - when data is abused. There’s already some legislation in place so you can’t get discriminated against for, say, insurance. But there are implicit discriminations. Don’t know how to fix this outside of education and reaching out to the next gen.

Q: (Mesirov)

  1. Congratulations: terrific meeting!
  2. 30+ years ago I heard Grace Murray Hopper speak - made a comment about how we are all going to be drowning in data. All kinds of data. I appreciated your comment on what we keep. Important: we have some kind of metric of utility - huge amounts of it not touched for long periods of time. Think about what happens with this data that is never used again. Otherwise we’re all going to drown

A: The utility of data is certainly something to be considered. We’re bad at estimating it. We’re all hoarders. A system was failing recently - couldn’t copy off a PB of data fast enough. Trying to assess the preciousness of data and time. Some metrics are hard to measure. I anticipate the storage vendors will get better at providing tools to assess what is on a filesystem. Tools today are crude; I hope these will improve. At the very least we can identify if there are big datasets we haven’t accessed in years.
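
The last point, finding data that has not been accessed in years, can be approximated with a short script. This is a minimal sketch, not something the speaker described: the function name, the two-year default threshold and the size cutoff are all illustrative assumptions. It also assumes the filesystem records access times reliably, which is often not the case when volumes are mounted with options such as noatime.

```python
import os
import stat
import time


def find_stale_files(root, min_age_days=730, min_size_bytes=0, now=None):
    """Yield (path, size_bytes, days_since_access) for regular files under
    `root` whose last access time is older than `min_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, device files
            if st.st_atime < cutoff and st.st_size >= min_size_bytes:
                yield path, st.st_size, (now - st.st_atime) / 86400
```

Sorting the results by size, for example with `sorted(find_stale_files("/data"), key=lambda r: -r[1])` (the path is a placeholder), puts the largest untouched files first. That is only a crude first pass at the "utility" question raised above.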

Q: (Swedlow) At Dundee, hierarchical filesystems backed up by tape. Primary data is images and proteomics - 95% of it is not touched again 3 months later. Graph representations of sequences - we will be doing the same thing with images. Concerned with the computational cost of recalculating these graphs. How expensive will recalculation be?

A: Today it’s expensive - but this is an opportunity for research. For example: at the level of suffix trees - construction methods. We can dust those off and improve the algorithms.


