April 1, 2014

Big Data in Biology: Personal Genomes

Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.

Warning: These notes are somewhat incomplete and mostly written in broken English.

Personal Genomes #

Tuesday, March 25th, 2014 8:30am - 12:00pm

http://ks.eventmobi.com/14f2/agenda/35704/288359

Speaker list #

Lincoln D. Stein, Ontario Institute for Cancer Research, Canada

The International Cancer Genome Consortium Database -
[Abstract]
[Q&A]

Ajay Royyuru, IBM T.J. Watson Research Center, USA

Genome Analytics with IBM Watson -
[Abstract]
[Q&A]

Mark Gerstein, Yale University, USA

Human Genome Analysis -
[Abstract]
[Q&A]
[slides]

Stuart Young, Annai Systems Inc., USA

The BioCompute Farm: Colocated Compute for Cancer Genomics -
[Abstract]
[Q&A]

Adam Butler, Wellcome Trust Sanger Institute, UK

Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets -
[Abstract]
[Q&A]

Maya M. Kasowski, Yale University, USA

Short Talk: Extensive Variation in Chromatin States Across Humans -
[Abstract]
[Q&A]

Robert L. Grossman, University of Chicago, USA

Short Talk: An Overview of the Bionimbus Protected Data Cloud -
[Abstract]
[Q&A]

The International Cancer Genome Consortium Database #

Lincoln D. Stein, Ontario Institute for Cancer Research, Canada #

Abstract #

The International Cancer Genome Consortium (ICGC; http://www.icgc.org/) is a multinational effort to identify patterns of germline and somatic genomic variation in the major cancer types. Currently consisting of 71 cancer-specific projects spanning 18 different countries, ICGC has sequenced the tumor and normal genomes of over 10,000 donors (>20,000 genomes). When the current phase of the project is completed in 2018, we expect to have sequenced more than 25,000 donors.

All analyzed data from the project is available to the public, including clinical information about the donors, somatic mutations identified in the tumors, and the potential functional significance of these mutations. The raw sequencing data and other potentially-identifiable information is available to researchers who have signed an agreement promising not to attempt to identify the donors. The total data set is now 500 terabytes in size, but growing rapidly as the project switches from exome sequencing (sequencing just the transcribed regions of the genome) to whole-genome sequencing. We anticipate that the full data set will be on the order of 10 petabytes.

To maximize the utility of the data to the public, the analyzed data is available at the ICGC data portal (http://dcc.icgc.org/), where users can browse donors, mutations and genes using an attractive high-performance web application based on Elasticsearch at the backend and AngularJS and D3.js on the front end. The portal uses faceted search as its dominant user interface metaphor. This allows researchers to pose general queries, such as “find all non-synonymous mutations” and then successively refine them: “…affecting genes in the hedgehog pathway”, “…affecting donors with stage I disease.” A series of interactive graphics allows researchers to readily compare different sets of mutations, donors and genes.

A limitation of ICGC is that the raw sequencing data must still be downloaded from a static file repository. We are addressing this limitation by moving the data into the compute cloud, where software and data can be co-resident. In the Whole Genome Pan-Cancer Analysis Project, which began earlier this year, 2000 whole genome pairs from ICGC are being placed into several compute cloud analysis facilities to allow for uniform mutation-calling and data mining by ICGC researchers. In the “Cancer Genome Collaboratory”, a project just approved in March 2014, we will be placing the entire ICGC data set into two compute cloud centers for access by the general research community. I will talk about the challenges and solutions that we are working on in connection to these two projects.

Notes #

ICGC Project

International Cancer Genome Consortium
5th year of operation
multi-national collaboration
Includes all of the TCGA projects
Goal: Identify the common patterns of mutation in all major cancer types

Simple experimental design:

take normal (blood) and tumour (biopsy) samples from a series of donors
sequence
identify cancer-related mutations
relate mutations to tumor bio
translate this knowledge to improved diagnosis and treatment & make available

ICGC db growing in size - moved from exome sequencing to whole genome

10K+ donors
4M+ somatic mutations
49K CNVs
6K+ methylation profiles

Available to public - Website @ http://dcc.icgc.org

very nice data browser
faceted view of various data types and donor types
changes in a context sensitive way
updates list with dynamically updated graphs/summary
links to raw data @ CGHub
view most mutated genes in selected cancer subtype. Can keep drilling down through stats/projects. Or look at summary - transcript level / protein level.
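
The drill-down behaviour described above can be sketched in a few lines. This is not the portal's code (the abstract says the real thing runs on Elasticsearch); the records, field names and the `refine` / `facet_counts` helpers are all made up for illustration.

```python
from collections import Counter

# Hypothetical mutation records; field names are illustrative only.
MUTATIONS = [
    {"gene": "PTCH1", "consequence": "missense", "pathway": "hedgehog", "stage": "I"},
    {"gene": "SMO", "consequence": "synonymous", "pathway": "hedgehog", "stage": "I"},
    {"gene": "TP53", "consequence": "missense", "pathway": "p53", "stage": "II"},
    {"gene": "GLI1", "consequence": "nonsense", "pathway": "hedgehog", "stage": "III"},
]

def refine(records, **filters):
    """Keep records whose fields satisfy every filter (a value or a set of values)."""
    def ok(rec):
        for field, wanted in filters.items():
            allowed = wanted if isinstance(wanted, (set, frozenset)) else {wanted}
            if rec[field] not in allowed:
                return False
        return True
    return [r for r in records if ok(r)]

def facet_counts(records, field):
    """Count how many records fall under each value of a facet."""
    return Counter(r[field] for r in records)

# "find all non-synonymous mutations" ...
non_syn = refine(MUTATIONS, consequence={"missense", "nonsense"})
# "... affecting genes in the hedgehog pathway" ...
hedgehog = refine(non_syn, pathway="hedgehog")
# "... affecting donors with stage I disease"
stage_one = refine(hedgehog, stage="I")
```

Calling `facet_counts(hedgehog, "stage")` after each refinement is what would drive the dynamically updated summaries next to the result list.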

Original Database - based on BioMart

MySQL-based data mart - developed and used by the Ensembl project
de-normalized data schema (reverse-star schema)
scaled well for human and other vertebrate genomes
worked well until release 12
One problem: as the data got larger, BioMart didn’t scale
Release 8 & 9: three month release cycle (freeze, prep, load, QC)
by release 11 - load phase taking 2-3 months! Missing release window. Were announcing new freeze before new db released

September - complete rewrite of entire dcc (Ferretti). Heavy use of distributed computing.

Process:

genome centres submit flat files + meta
validation (Hadoop cluster - HDFS distributed filesystem)
loaded into MongoDB (on cluster)
Combined w/ other info (gene annotation from Ensembl, uniprot, cosmic, etc)
Indexed by ElasticSearch (another cluster)
Indexed info stored in mongo - drives the portal
Total time for loading for release 15: 42 hours (not yet optimized)
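
The shape of that load process (validate, load, combine with annotation, index) can be sketched with plain in-memory structures. The real system uses Hadoop, MongoDB and Elasticsearch; everything below, including the field names and toy rows, is a hypothetical stand-in meant only to show the order of the steps.

```python
def validate(rows):
    """Drop submissions missing required fields (stand-in for the validation step)."""
    required = {"donor_id", "gene", "mutation"}
    return [r for r in rows if required.issubset(r)]

def annotate(rows, gene_info):
    """Combine each row with external gene annotation."""
    return [{**r, "annotation": gene_info.get(r["gene"], "unknown")} for r in rows]

def build_index(rows, field):
    """Inverted index: field value -> positions of matching rows in the store."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[field], []).append(pos)
    return index

# Toy submission; the second row is deliberately incomplete.
submitted = [
    {"donor_id": "D1", "gene": "TP53", "mutation": "R175H"},
    {"donor_id": "D2", "gene": "KRAS"},
    {"donor_id": "D3", "gene": "TP53", "mutation": "R248Q"},
]
gene_info = {"TP53": "tumor suppressor"}

store = annotate(validate(submitted), gene_info)   # "loaded and combined"
by_gene = build_index(store, "gene")               # "indexed"
```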

What about raw read data?

~10 PB Genome data by 2018
depositing all genome data in EGA. In theory, researchers go to EGA and dl data. In practice, data too large. Takes too long.
will soon be completely inaccessible - except maybe for some large groups, or those located in the UK
This is an important legacy dataset that can still be mined
Current mutation calling algorithms not perfect. Different groups have low overlap. Different filtering systems. Many false positives (e.g. titin). Our ability to predict gene rearrangements is quite poor.
want to go back to the data to get more info as our algorithms improve
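
A back-of-the-envelope calculation shows why downloading is impractical at this scale. The link speeds below are assumptions chosen for illustration, not figures from the talk, and the function ignores protocol overhead.

```python
def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move size_bytes over a link at the given sustained rate."""
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400

PB = 10**15  # decimal petabyte

days_1g = transfer_days(10 * PB, 1e9)    # assumed sustained 1 Gbit/s
days_10g = transfer_days(10 * PB, 10e9)  # assumed sustained 10 Gbit/s
```

Under these assumptions 10 PB takes roughly 926 days at 1 Gbit/s and about 93 days at 10 Gbit/s, which is the motivation for moving the compute to the data.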

The solution => The Pan-Cancer Whole Genome Analysis Project (PAWG/Pan-Can) #

Goal: understand what’s going on in the 95% of the cancer genome that isn’t protein-coding
Resources: 2K whole genome tumor/normal pairs from ICGC
Analytic issues: calling cancer mutations in non-coding regions is an evolving art. Need uniform pipeline. Dataset - 0.5PB.
Cloud based approach - six cloud compute centres in USA, Europe, Asia
Phase 1: Partition data among the data centres. Perform alignment and mutation calling in a distributed fashion
Phase 2: Synchronize alignments and mutation calls. Each centre will have a complete set of alignments and mutation calls
Phase 3: Open up a subset of the clouds to allow researchers to do analysis
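
Phases 1 and 2 amount to a partition step followed by a synchronize step. A minimal sketch, assuming round-robin assignment (the talk did not say how samples are actually divided) and with hypothetical centre and sample names:

```python
def partition(sample_ids, centres):
    """Phase 1: assign samples to centres round-robin."""
    assignment = {c: [] for c in centres}
    for i, sample in enumerate(sample_ids):
        assignment[centres[i % len(centres)]].append(sample)
    return assignment

def process(samples):
    """Placeholder for alignment + mutation calling at one centre."""
    return {s: f"calls-for-{s}" for s in samples}

def synchronize(results_by_centre):
    """Phase 2: every centre ends up with the union of all results."""
    merged = {}
    for results in results_by_centre.values():
        merged.update(results)
    return {centre: dict(merged) for centre in results_by_centre}

centres = ["centre-A", "centre-B", "centre-C"]
samples = [f"pair-{n}" for n in range(7)]

assigned = partition(samples, centres)
local = {c: process(s) for c, s in assigned.items()}
synced = synchronize(local)
```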

Technologies: OpenStack (5 centers) and vCloud (EBI)

Vagrant - vm abstraction layer (make clouds look similar)
network transfer and metadata - GNOS / GeneTorrent (from Annai Systems Inc.) - commercial solution
Workflow management - SeqWare pipeline manager (OICR & UNC developed - O'Connor), Synapse from Sage

Status

Ethical approval, usage agreements signed - Legal
OpenStack/VMware, vagrant SeqWare installed
alignment workflows executed on some vms

Challenges

Legal - regional differences have not gone away. Datasets from TCGA (US) can be hosted by certain US-based institutions trusted by NIH. NIH has not approved phase II of the project due to the way the consent was written. It can be interpreted as ‘not allowed to use on cloud’ (but the cloud didn’t exist when the consent was written). Europe - some countries are sensitive about distributing their data to US-based data centres (Snowden & NSA).
Technical - adapting grid based hpcs to use cloud-based technologies. Running 8 weeks behind

Why not a commercial cloud? Amazon, Google, MS

legal and ethical issues
preliminary ethics approval to ICGC. Some restrictions - can’t cross regulator borders without notice
NIH reviewing approval for TCGA sets

What happens when Pan-Can is done ~ 1 year? The group has received funding from Canadian funders: The Cancer Genome Collaboratory

long-lived private cloud compute centre, pre-populated with ICGC datasets
any individual can create an account and access the data via api
have an integrated benchmarking core, bioethics, community outreach
Initially two physical data centres (w/ Grossman in Chicago) & Toronto. Connected by high speed link
Funded as of March 1

Q & A #

Q: (Ware) Many of us have been using BioMart and the scalability - how portable is your new system as a replacement for BioMart?

A: on a scale of 0 - 100: -1. This is a highly specialized system designed just to work with our data. BioMart is alive and well in Italy

Q: What cancer types were chosen for the pan-cancer analysis? And why?

A: Our criteria for inclusion are at least 30x coverage for whole genome, a tumor/normal pair, and proper consent from the donor.
Of that, we have ovarian, breast, lung, pancreatic, liver, leukemias – about 13 in all
The final list of tumor types won’t be selected till we’ve QC'ed all the data and know what the distribution is

Q: If the 10PB of data that will be generated will be harmful - look at quality compression and other

A: No chance that we’ll be storing and distributing the full uncompressed 10PB. Actively benchmarking compression systems. Hopefully get it down to a few PB without loss of information

Q: What is the main objective of this project? Biological objective?

A: The main biological objective - focusing on patterns of alteration in non-coding regions. E.g. we know there are mutations in regulatory regions that we haven’t characterized.
groups looking at:

Looking at regulatory networks - interactions with coding regions.
Patterns of rearrangement
Evidence of insertion of known and unknown pathogens / virus that may be driving the tumours

Looking at this in a uniform way we’ll learn common mechanism and mechanisms that are distinct

Q: How willing are your users to get random samples in return as opposed to the full data? Plus confidence score

A: Key method of access - take slices of the raw data in the region that you’re interested in. Or extend and do a random sampling - a feature available on CGHub and widely used. Not a feature of EGA - annoying deficit. One of the reasons we want to move away.
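
The two access patterns mentioned in this answer, slicing by region and random sampling, reduce to simple interval logic. The sketch below works on toy `(chrom, start, end)` tuples; real services do this against indexed BAM files, and the function names and coordinates here are hypothetical.

```python
import random

def slice_region(reads, chrom, start, end):
    """Return reads overlapping the half-open interval [start, end) on chrom."""
    return [r for r in reads if r[0] == chrom and r[1] < end and r[2] > start]

def sample_reads(reads, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of the reads."""
    rng = random.Random(seed)
    k = round(len(reads) * fraction)
    return rng.sample(reads, k)

# Toy reads as (chromosome, start, end); coordinates are made up.
reads = [
    ("chr17", 100, 200),
    ("chr17", 190, 290),
    ("chr17", 300, 400),
    ("chr1", 150, 250),
]

hits = slice_region(reads, "chr17", 195, 300)
subset = sample_reads(reads, 0.5)
```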

Q: Majority of researchers - don’t need to develop alignment algorithms. Are processed data available to researchers?

A: The interpreted data (still large, but much smaller - in GB not TB) is available for browsing, download and abstraction from http://dcc.icgc.org

Q: Curious how you are designing your APIs? APIs for visualization are different from tools

A: Start with the user interface, figure out what it needs to display, and work back to the API. A genome browser has a very different api than the faceted browser where you’re looking at a particular biological pathway. Specialized APIs and indexes for each of those.

The Genographic Project #

Genome Analytics with IBM Watson #

Ajay Royyuru, IBM T.J. Watson Research Center, USA

Director of computational biology

Abstract #

// last minute topic change, no published abstract

Press release: http://www-03.ibm.com/press/us/en/pressrelease/43444.wss

Notes #

Research group at IBM - very focused on computational biology.

Intersection of everything IT and Life Sciences.

3 pillars of work (IBM computational biology)

managing and analyzing the data explosion - makes biology more amenable to quantitative outcomes
predicting biological outcomes with scale of computing
dealing with complexity. DREAM - IBM team with community is heavily involved

Why:

Intrigued by connections made yesterday (DH, JM)
Sequencing is reaching a point where we have to look at the translational aspects
beginning to make an impact in the clinic
takes a community
IBM Watson - can be used here
On IBM’s cloud system - rapidly scale. The sorts of analytics capabilities - it begins to be scalable and accessible so it can have the impact on the clinic down the road

What are we up to: Gathering raw sequencing input and taking it through a large number of steps so that we will eventually get useful info that may lead to action

3 pillars in the journey of genomic medicine

sequencing (includes downstream analysis - variant calling)
translational medicine (have VCF) <– will focus on this piece (VCF to actionable)
Actionable intelligence - Personalized healthcare. Something publishable is our goal

Translational Medicine: #

System that generates insights

Input:

data coming from sequencing (VCF) - patient specific information
Entirety of what you can point Watson to - All available biological knowledge (PubMed, NCI PDQ)

All this is ingested. Running on IBM’s cloud layer (SoftLayer) - large/global/scalable/acquired by IBM.
Generates some actionable insights.
Goal: this goes to tumor oncologists, look at data in context of decision trying to make. Hopefully make informed correct decision.

IBM Watson #

began 2008 - research project
Jeopardy - grand challenge (got attention)
Added genomics capabilities!

Genomics - not just about genes. How we connect that knowledge #

The traditional way: read papers, develop hypotheses -> interpretation -> actionable output. Can we automate this? Can we come up with new research approaches from the literature?

p53 project example - ingest a lot - mine the literature. #

lots of text, natural language, analytics happening
specific to diseases, compounds (drug molecules)
Human readable sentences - use Watson based technology to translate the information into machine readable form. ‘the results show that ERK2 phosphorylated p53 at Thr55’ - extract info with Watson
Extraction is working
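
To make "human readable to machine readable" concrete, here is a deliberately naive sketch that pulls a structured fact out of one sentence shape with a regular expression. This is not how Watson works (the talk gave no implementation details); the pattern and function name are assumptions, and real literature mining has to cope with far more varied language.

```python
import re

# Hypothetical pattern covering a single sentence shape only.
PATTERN = re.compile(
    r"(?P<kinase>\w+) phosphorylated (?P<substrate>\w+) at (?P<site>[A-Z][a-z]{2}\d+)"
)

def extract_phosphorylation(sentence):
    """Return a (kinase, relation, substrate, site) tuple, or None if no match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m["kinase"], "phosphorylates", m["substrate"], m["site"])

fact = extract_phosphorylation(
    "The results show that ERK2 phosphorylated p53 at Thr55"
)
```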

Application to genomics:
on SoftLayer, physician managing cases (biopsy samples) submission - uploading VCF.
What analysis can be done -

circos representation, where they occur, where translate to
map to available info on pathways
what more can you find in the literature, Watson? - adds links (to literature) from text mining. Can drill down and find out why links were generated
Drugs - targeting pathways: added in the data model

Summary: researcher can browse. print report for the record.

see provenance of the data and keep a record of it
see all visualizations, records, summary
possible list of all possible drugs, status (approved?)
this insight is available to the researcher

Looking for active collaborations - don’t generate this data themselves

last week: partnership with NY genome centre (collaboration of research centres in NY area). Can take this technology and apply it with them. Get practical use of this technology
Not exclusive to NY genome, can open collaborations with others

Sample report- generated with early data

TCGA GBM data - reshaped to put in system
generated report (many pages long)
list of drugs with reasons why the drug is contextually relevant

e.g. Lidocaine in report: not prepared to see this in here

showed to oncologists - click through to evidence. Watson points to papers - Lidocaine assay on cancer cells (tongue, EGFR receptor). Lidocaine being tested in context of thyroid cancer cells
so this is not out of the realm of what we should be thinking about
helps us be current and comprehensive

Q & A #

Q: (Ouellette) Do you have any evidence on how Watson will do if it read full papers (not just abstracts)?

A: Not tested in this context. Watson does read full papers in a clinical context

Q: (Mesirov) -

1. Are you aiming with that package towards the practicing oncologists or the research physician?

2. To what extent have you compared what Watson is able to mine from the data with other approaches/algorithms/packages published and available to the community?

A: 1. It’s a journey - early adopters are research clinicians who have the expertise and interest to be partners. A lot of learning. For example, Watson shows lots of evidence. You need a clinician researcher who understands the subtleties of the research and how to make decisions that will be useful
2. No whole-scale comparison yet - still in ingest and build mode. Some benchmarking and testing - working on the baseline. Full-scale comparison for later. Watson can also do chemical extraction - full-scale comparison here.

Q: Is there any way to integrate other sources of information that are not text based? Images? Protein structures?
Is there value added by human-curated databases?

A: Image analytics is of interest to us. Study going on here. Working with some large medical institutions on this project.
Melding machine and human curation accelerates the process and makes it more usable.

Q: Doubts whether a practicing physician will know what VCF is or understand a Circos plot. Watson to user or user to Watson?

A: Initial set of end users - clinician researchers. They got the sample, they know what VCF is. This is the community that will find this useful. What can we simplify to make this more useable.
Right now, collaboration.

Human Genome Analysis #

Mark Gerstein, Yale University, USA #

Director: computational biology
ENCODE, 1000 genomes

Abstract #

Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in non-coding regions, is a central challenge for genomics.

One data science construct that is particularly useful for genome interpretation is networks. My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression.

Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).

http://networks.gersteinlab.org

http://tyna.gersteinlab.org

Architecture of the human regulatory network derived from ENCODE data.

Gerstein et al. Nature 489: 91

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.

KY Yip et al. (2012). Genome Biol 13: R48.

Understanding transcriptional regulation by integrative analysis of transcription factor binding data.

C Cheng et al. (2012). Genome Res 22: 1658-67.

The GENCODE pseudogene resource.

B Pei et al. (2012). Genome Biol 13: R51.

Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.

KK Yan et al. (2010). Proc Natl Acad Sci U S A 107:9186-91.

Slides #

http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks–20140325-i0keybdata/

Notes #

My perspective on Big Data #

buzz word, data science
HBR - data science the sexiest job of the 21st century (http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1)
transforming science
explosion of data in genomics - sequencing price going down faster than Moore’s law. Cost is in management of data
Current state of large sequencing dataset TCGA 910 TB in CGHub, + smaller datasets

What do people do with big data? #

Take this data to answer a question, make a prediction, modelling

Two ways to approach:

don’t care about structure, just want answer (google search)
with explicit organization of dataset (google maps, google earth)

In science - search for Higgs boson - searching through many for a few needles (fits in #1)

In genomics - we’re in #2

we want to make a map of the molecular world we have
but we don’t have an immediate metaphor we can hang all our information on
but we don’t know what the structure of that map is
ENCODE - thought about the structure of the map. Layer information down
genomics has been around for a while - one of the first big data disciplines. Inspired by Pandora - the Music Genome Project, which was inspired by how geneticists organize information. We should learn from other disciplines

How we can organize information in genomics - networks #

regulatory networks as a hierarchy
more connectivity - constraint

What is genome annotation? #

Tracks in genome browser - linear view of how to think of genome.
Will this scale with thousands of tracks? No

What type of information do we want? Actually thinking of 3D molecules - but not quite possible

Network diagram - middle ground

works for cancers/biology pathways
compelling approach to big data
Example: we started off with linear annotation (ChIP-Seq experiments)
Then, created proximal edge at peaks.
Generated a hairball of 0.5 million edges, pared down to 25K edges.
Many edges far away from genes - distal sites.

analyze networks - network science

Hub - point with many neighbours
bottleneck - max # of shortest paths
Identify bottlenecks & hubs (like roads, bridges can be bottlenecks)
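
The two definitions above map onto degree and betweenness centrality. Below is a small self-contained sketch using Brandes' algorithm on an unweighted, undirected toy graph; the graph is invented to show that a bottleneck does not have to be a hub.

```python
from collections import deque

def degree(graph):
    """Number of neighbours per node (hub-ness)."""
    return {node: len(nbrs) for node, nbrs in graph.items()}

def betweenness(graph):
    """Brandes' algorithm: how many shortest paths pass through each node."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        n_paths = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # each undirected pair was counted once from each endpoint
    return {v: s / 2 for v, s in score.items()}

# Two triangles joined through a single bridge node "m".
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "m"],
    "m": ["c", "d"],
    "d": ["m", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
deg = degree(graph)
btw = betweenness(graph)
```

In this toy graph the bridge node `m` has only two neighbours yet carries the most shortest paths, like a bridge on a road network.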

Directed entities - regulatory networks

one thing regulates another
Hierarchy - intuitive - people understand this
optimally arrange transcription factors (ENCODE) into 3 levels by simulated annealing, maximizing downward pointing edges
higher bottleneck-ness in centre layer - information flow
Can think about molecules - does this make sense for molecules.
Integration of TF hierarchy with other ‘omic information.
More connected and influential on top
Same thing with miRNA networks (bi directional)
Can look at how transcription factors are working together. Pick two, can look at the degree they co-regulate the target
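
The idea of arranging factors into levels by simulated annealing so that as many edges as possible point downward can be sketched generically. The notes do not record the actual objective, cooling schedule or move set used for the ENCODE network, so everything below (linear cooling, single-node moves, the toy edges) is an assumption.

```python
import math
import random

def downward_edges(edges, level):
    """Count edges pointing from a higher layer (smaller index) to a lower one."""
    return sum(1 for src, dst in edges if level[src] < level[dst])

def anneal_levels(nodes, edges, n_levels=3, steps=5000, seed=0):
    """Search for a level assignment that maximizes downward-pointing edges."""
    rng = random.Random(seed)
    level = {n: rng.randrange(n_levels) for n in nodes}
    score = downward_edges(edges, level)
    best, best_score = dict(level), score
    for step in range(steps):
        temperature = max(1e-3, 1.0 - step / steps)   # assumed linear cooling
        node = rng.choice(nodes)
        old = level[node]
        level[node] = rng.randrange(n_levels)          # propose moving one node
        new_score = downward_edges(edges, level)
        delta = new_score - score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            score = new_score                          # accept the move
            if score > best_score:
                best, best_score = dict(level), score
        else:
            level[node] = old                          # reject and restore
    return best, best_score

# Toy regulatory edges (regulator, target); the last one points back up.
nodes = ["TF1", "TF2", "TF3", "TF4", "TF5"]
edges = [("TF1", "TF2"), ("TF1", "TF3"), ("TF2", "TF4"),
         ("TF3", "TF4"), ("TF3", "TF5"), ("TF4", "TF2")]

levels, n_down = anneal_levels(nodes, edges)
```

Because TF2 and TF4 regulate each other in this toy example, no assignment can make every edge point downward; the search can only minimize the exceptions.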

Other organisms: Yeast genome #

Similar, but has four levels. Multi-regulated network with bottlenecks

Different types of hierarchies

autocratic (military)
democratic (things at top mostly regulating, bottom mostly being regulated)
intermediate - between the two. Ease some information bottlenecks

Developed a scheme to measure the degree of cross-linking structure. Degree of collaboration

number of overlapping
find over many organisms: get a lot more confidence that inclusions are true
middle layer has highest degree of collaboration
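
One simple way to put a number on "overlapping" between two regulators is the Jaccard overlap of their target sets. The notes do not say which measure the talk actually used, so treat the choice of Jaccard, the function name and the toy targets as assumptions.

```python
from itertools import combinations

def co_regulation(targets):
    """Jaccard overlap of target sets for every pair of regulators."""
    scores = {}
    for a, b in combinations(sorted(targets), 2):
        union = targets[a] | targets[b]
        shared = targets[a] & targets[b]
        scores[(a, b)] = len(shared) / len(union) if union else 0.0
    return scores

# Toy regulator -> target-gene sets, invented for illustration.
targets = {
    "TF_A": {"g1", "g2", "g3"},
    "TF_B": {"g2", "g3", "g4"},
    "TF_C": {"g5"},
}
pair_scores = co_regulation(targets)
```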

Compare humans w/ E. coli & yeast & rat: humans more collaborative nodes

Yeast network similar structure to government hierarchy w/ middle managers: matches gov’t of Macao

Social science - there is literature studying how important it is to have middle managers talking to each other

Variation network

map all SNPs in 1000 genomes on network
more SNPs at bottom
higher parts of hierarchy more conserved, less variable
Trend: more hubs - less variation/ more connectivity, more constraint.
Seen in many studies/organisms.
Human protein-protein interaction network - rapidly changing on the outskirts

Analogy to understand more connectivity -> more constraint #

Comparison between E. coli regulatory network and Linux OS

call graph in Linux compared to E. coli regulatory network
Linux is top heavy in comparison
E. coli: dominated by out-degree hubs - turn on a lot of molecules
Linux: dominated by in-degree hubs - routines called by many programs
Linux OS evolves - we can watch it through each of its releases
plot changes & compare.
E. coli: less change.
Linux: certain things don’t change, some things change constantly. Some releases are coupled to hardware, which has to change
In biological systems - negative correlation - more connectivity means less change
In Linux - positive correlation - more connectivity means more change
Perspectives on random change v. Intelligent Design.
Intelligent designer - they believe they can make changes where there is a lot of constraint and connectivity.
If changes are random - best to not put them in central points
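
The in-hub versus out-hub distinction is easy to see in code. The two toy graphs below are invented to mimic the shapes described (one regulator driving many targets, many routines calling one shared function); they are not data from the talk.

```python
from collections import Counter

def in_out_degree(edges):
    """Return (in_degree, out_degree) counters for a list of directed edges."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return in_deg, out_deg

# Toy graphs, for illustration only.
regulator_like = [("tf", g) for g in ("g1", "g2", "g3", "g4")]     # one node drives many
call_graph_like = [(f, "util") for f in ("f1", "f2", "f3", "f4")]  # many nodes call one

reg_in, reg_out = in_out_degree(regulator_like)
call_in, call_out = in_out_degree(call_graph_like)
```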

Applications of more connectivity leads to more constraint - no time to talk today. Building a practical workflow & tool for disease genomes.

Network stuff available - encodenets.gersteinlab.org

Q & A #

Q: (Stein) you showed this relationship between hub-ness and the kernel call graph. Have you looked at the evolution of the call signature? Highly connected subroutines do not have their call signature changed frequently - more similar to bio

A: No, very interested in that. Evolution - even package dependencies.

Q: Information flow: makes sense in regulatory networks. What’s your reasoning with protein-protein networks?

A: Sometimes for protein-protein interaction networks, but other times not so much. Key network params - regulatory, focused on bottlenecks. Protein-protein - focus on hubs. When you do the correlations of connectivity with constraint - more on bottlenecks.

Q: Interested in E. coli v. linux - we compare a lot to engineering ideas

A: Maybe not a lot of engineering ideas apply to biology. Sometimes people look at biological networks to apply to engineering problems

Q: have you looked at hubs in organisms with recent genome duplications to see how they occur?

A: genome duplicates, suddenly have these two things interact with your hub or what’s there. Lots of network literature on scale free networks - plays into that.

Q: What do you think about the cell type specificity - do you think different cells depending on their needs will have different hierarchies?

A: Controversy in how I present this. Cell type non-specific hierarchy - this is a global wire diagram. In my mind, if you go to a certain cell type, certain lights turn on. Other view - cell type specific hierarchies. I think this doesn’t make sense - no one talks about gene list

The BioCompute Farm: Colocated Compute for Cancer Genomics #

Stuart Young, Annai Systems Inc., USA #

Abstract #

Petabyte-scale genomic data repositories such as the Cancer Genomics Hub (CGHub) require colocated compute resources to fully leverage the value of the genomic data. The traditional model of data download from a repository to a research center followed by local computational analysis suffers from high file transfer costs, significant delays and file storage problems. The BioCompute Farm, a highly-scalable computing resource colocated with CGHub, provides a 99.9% reduction in data storage and 120 times reduction in time for analysis of all 40TB of the current Cancer Genome Atlas (TCGA) RNA-Seq data set. The BioCompute Farm combines high-speed BAM slicing for DNA analysis and the latest in bioinformatics tools and standardized pipelines with the flexibility to customize pipelines and rapidly scale up computational capacity to meet the needs of cancer researchers. As data growth continues to outpace the growth of Internet bandwidth, the BioCompute Farm can serve as a model for the emerging paradigm of colocated compute resources serving the users of large genomic databases.

Notes #

Motivation for talk: why colocated compute #

'07/'08 - next gen suddenly became a viable product
before this, fairly expensive Sanger sequencing
soon - began to overshoot the cost of storage and bandwidth
only will become worse
to address this: need to provide a solution to provide capacity and service

Annai systems: director of bioinformatics #

Software underpinning CGHub - Annai-GNOS
server to genetorrent - dl sequences
bioCompute -colocated w/ CGHub

How big is this problem?

TCGA data ~ 1PB, -> 2.5PB in the next few years
download rates: several months to download it all. Store it. Need infrastructure.
researchers limited by financial and logistical constraints (IT)

Survey by NCI - wish list for cancer genomics researchers

#1 Run workflows on data in cloud (13%)
Annai covered about 50% of what they want. Maybe biased sample (online)

NCI’s colocation model

Genomic Data Commons - integrate multiple datatypes, provide API
Cloud Pilots - $20M, colocated compute. The successful bidders will provide workflows and be scalable

BioCompute Farm (TCGA data)

what they’re doing with sequencing - shifts cost of sequencing to getting data and results out
upstream costs: technology development, pipelines, bioinfo tools
downstream costs: tools for sequence analysis, management of

// lost connectivity for a while

HIPAA Compliance

holistic expectation - bookkeeping where access is controlled
Physical security: Cage in SDSC - monitored, power, alarms

Provide farms with subscription based access

Provide custom analysis

farm loaded with standard pipelines: Broad GATK, PanCancer BWA alignment
Custom Pipelines - latest versions
Workflow tools: SeqWare (O’Connor), Agua, Synapse
Use Case Baylor - BAM slicing of TCGA RNA-Seq data
- would have taken 9 weeks of dl time + storage (no capacity)
- They used the BioCompute Farm, with BAM slicing of CGHub BAM files on Annai’s GTFuse
Pipeline Optimization - look at runtimes, will this benefit w/ parallelization or throwing more cpu?

Collaborations #

PanCancer project

prototype of global federated colocated compute
setting up servers, SeqWare,

DREAM challenge

variant calling
Annai provides GNOS platform for data security and download

ShareSeq

hosting ICGC- common free access to download free data
provide colo-compute

Conclusion #

colo compute is a no brainer
useful functionalities - fast access, flexible use, tools for workflow, and custom analysis and scalability

Q & A #

Q: Only 5 or 10 labs in the world are interested in whole PB scale data. I think if we make the VCF file available - this should be sufficient for most researchers.

A: I think with the way things are going, the issue is not only going to be huge data access, but secure access, and how can we search through the data to find the datasets you want.

Q: Most of the pipelines are focused on variant calling, alignments - what are the priorities for what’s next?

A: Yes, it’s variant calling right now. One other area of interest- systems approach, pathways, integrating different types of data. Looking at different standards, read pathology or clinical data. Hospital data is very rich for researchers, but not very accessible. Looking at integrating with genomic data.

Short Talk: An Overview of the Bionimbus Protected Data Cloud #

Robert L. Grossman, University of Chicago, USA #

Abstract #

Bionimbus is a petabyte scale community cloud for managing, analyzing and sharing large genomics datasets that is operated by the not-for-profit Open Cloud Consortium. With a cloud computing model, large genomic datasets can be analyzed in place without the necessity of moving it to your local institution. Bionimbus contains a variety of open access datasets, including ENCODE and the 1000 Genomes dataset. In 2013, we updated Bionimbus so that researchers can analyze data from controlled access datasets, such as The Cancer Genome Atlas (TCGA) in a secure and compliant fashion. We describe some case studies using Bionimbus, some of the bioinformatics tools available with Bionimbus, some different ways of interoperating with Bionimbus, the Bionimbus architecture, and the security and compliance framework.

The Bionimbus Protected Data Cloud is supported in part by NIH/NCI (grant NIH/SAIC Contract 13XS021 / HHSN261200800001E), the Gordon and Betty Moore Foundation, and the National Science Foundation (Grants OISE - 1129076 and CISE 1127316).

Notes #

I’m going to pose a few questions. In the next 10 min I will not try to answer them. Hopefully your answers will be more interesting than mine. I will give you a framework of how we think of big data.

Four questions #

Is big data in bioinfo/biomed any different than big data in science? Is big data in science any different from big data in general?
what instrument should we use to make discoveries over big biomed data?
do we need new types of mathematical and stat models for big biomed data?
how do we organize our data?

Bionimbus protected data cloud #

Supporting Pan-Can analysis - open source core

interoperate with as much proprietary as they can
log in with NIH/eRA credentials - immediate access to TCGA data
pipelines, analysis, install your own software

Right now in the process of scaling up

10-20 projects a month
contain TCGA data- operate on PB scale
sometime next week, another PB of data & 16K cores, ICGC Pan-Can analysis
question: how do we make sure, on this limited resource, we get the most science out? Traditionally handled by allocation committees
this month, would have cost >$3K on Amazon

Open science data cloud #

support integrative analysis: Can look at how disease is impacted by socio-economic factors and more. Text analytics & geospatial analytics
4 years old (Bionimbus 1 year)

biomedical commons cloud

involves cancer centres, open source core but operates with proprietary software around it
want to peer at scale with other providers (biomed commons providers)
like how internet was started with tier one ISPs
sometimes faster to get data over a high-performance network than from disk, with certain protocols

New era #

'05-'15: bioinformatic tools and integration (Galaxy, GenomeSpace, workflows, portals)
'10-'20: data center scale science (Bionimbus, CGHub, cancer collaboratory). At that scale what changes and how do we build models
'15-'25: new modelling techniques

What are the new models? In '72 Phil Anderson wrote a piece: "More Is Different"

http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf
up to us to decide whether more is the same or different, and if it is different, how do we model that
backlash on Google Flu Trends

How do you scale machine learning to data centers?

take large complex datasets and chop them up in small pieces you can analyze at scale
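
A minimal sketch of that chop-and-combine idea, in Python. The notes don't say which framework or statistics the speaker had in mind, so the function names (`chunks`, `summarize`, `merge`, `mean_and_variance`) and the choice of mean/variance are assumptions for illustration only. Each piece is reduced to a small summary that can be merged in any order, which is what allows the pieces to be analyzed independently.

```python
from functools import reduce
from typing import Iterator, List, Sequence, Tuple

# (count, sum, sum of squares): a summary small enough to ship between machines
Summary = Tuple[int, float, float]


def chunks(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive pieces of at most `size` values."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield values[start:start + size]


def summarize(chunk: Sequence[float]) -> Summary:
    """Reduce one piece to its count, sum and sum of squares."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(values: Sequence[float], size: int) -> Tuple[float, float]:
    """Population mean and variance computed piece by piece."""
    if not values:
        raise ValueError("values must not be empty")
    partials: List[Summary] = [summarize(c) for c in chunks(values, size)]
    n, total, total_sq = reduce(merge, partials, (0, 0.0, 0.0))
    mean = total / n
    return mean, total_sq / n - mean * mean
```

In a real deployment the `summarize` step would run where each piece of data lives and only the small summaries would travel. The sum-of-squares formula is used here for brevity; it can lose precision on very large values.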

Is more different at this scale? And if so, how do we discover it?

Q & A #

Q: (Ware) as you see these data centres emerging, do you think they’ll focus on specific questions? How do you see the data centres forming?

A: The ones I mentioned are around cancer genomics. Sustainability and payment - putting small taxes on certain of our projects so that we can make larger amounts of our data available. Driven by some funding agencies. There’s a certain interest of private donors funding certain parts of this. Some economic incentives. Some combination of that is going to change the way we do science.

back to the speaker list →

Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets #

Adam Butler, Wellcome Trust Sanger Institute, UK #

Abstract #

The advent of massively parallel sequencing technology has revolutionised the way we characterise cancer genomes and provided new insights in our understanding of the mechanisms of oncogenesis. The International Cancer Genome Consortium (ICGC) was instigated in 2007 with the aim to systematically screen hundreds of Cancer Genomes for 50 distinct tumour types and catalogue the somatic variation present. This endeavor aims to prevent duplication of effort, ensure rare tumours are included and generate large datasets for the scientific community. A similar project is underway in the USA, The Cancer Genome Atlas (TCGA).

In late 2013 at the ICGC conference in Toronto, Peter Campbell announced an ambitious plan to undertake a Pan-Cancer analysis of whole genome data available from ICGC and TCGA. This would provide a comprehensive dataset of somatic variant calls with standardised output for 2,000 cancer genomes, which will be available for subsequent downstream analyses.

The primary analysis will include detection of somatic point mutations, small insertions and deletions, copy number changes, rearrangements and retrotransposon/viral integration sites. To ensure integrity of the dataset, three independent analysis pipelines, provided by the Broad Institute, DKFZ and the Sanger Institute, will be utilised. The data will be generated and stored at 6 data centres around the world: Spain, Germany, Japan, UK, and two centres in the USA.

The Sanger Institute's contribution to this initiative is to provide our analysis pipeline as one of three to be run over the data. Consequently our algorithms have been assessed via rigorous comparison with comparable software and their performance optimised. The pipeline is currently being ported into a VM (Virtual Machine), automated and the code adapted for running all variant detection analyses within a cloud environment.

The primary analysis will deliver a high-quality catalogue of somatic variants in a standardised VCF format, which will be made available from the six centres for downstream investigation.

Notes #

Go over our part and experience with the Pan-Cancer analysis with large datasets

The Cancer Genome Project #

2000 - working through Sanger sequencing, then next gen '07
In order to handle different datasets - build analysis tools and pipelines and system
use them to this day to analyze
heavily integrated into Sanger infrastructure. Now have to look at it with bigger-scale data

Pipeline:

BWA alignment
Tools: copy number caller (ASCAT), ins/del, rearrangements, transposon, RNA-Seq pipeline
generate VCF, BAM, allow researchers to get useful parts of info and drill down

PanCancer - large international collaboration #

2K genome pairs (4K genomes) from multiple tumour types, 30x coverage
uniform dataset
analysed using 3 pipelines (Broad, DKFZ, Sanger)
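
The notes don't say how the outputs of the three pipelines are compared or reconciled. As a rough sketch of one simple option, the Python below counts how many callers report each variant and keeps those seen by at least a chosen number. The `Variant` tuple layout, the function names and the threshold are all assumptions, not the project's actual method.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def caller_support(calls: Dict[str, Set[Variant]]) -> Counter:
    """Count how many pipelines reported each variant."""
    support: Counter = Counter()
    for variants in calls.values():
        # Each pipeline contributes at most one vote per variant (sets are unique)
        support.update(variants)
    return support


def supported_by(calls: Dict[str, Set[Variant]], min_callers: int) -> Set[Variant]:
    """Variants reported by at least `min_callers` pipelines."""
    return {v for v, n in caller_support(calls).items() if n >= min_callers}


# Made-up calls, for illustration only (not real data)
toy_calls: Dict[str, Set[Variant]] = {
    "broad": {("1", 100, "A", "T"), ("2", 200, "G", "C")},
    "dkfz": {("1", 100, "A", "T"), ("3", 300, "C", "A")},
    "sanger": {("1", 100, "A", "T"), ("2", 200, "G", "C")},
}
```

With the toy input, `supported_by(toy_calls, 3)` keeps only the variant all three callers agree on, while a threshold of 2 also keeps the one seen by two of them.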

CGP -> PanCancer

need to take out each part and make it Sanger free
optimize for different versions of aligner
pipeline the whole lot using SeqWare (O'Connor)
Just a few seconds - but they add up over a few billion bp

Phase 1

identify data for upload, align each sample pair
using GeneTorrent to download data from CGHub - works very well. Personal concern was getting data from where it was to where it needed to be. Getting astonishing transfer rates. Automatic data upload.

Useful outcomes #

we moved over to using a version of BWA-MEM (from BWA)- significantly faster and smaller memory footprint. May use for in-house pipelines

optimized callers

looked at where their code was spending time
made huge steps forward - substitution caller is 50% faster
indel caller 2x faster
ICGC benchmarking exercise - invaluable. Allowed us to make much better judgements on how well we are doing
new sequencing technologies go faster still…
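
The notes don't record which profiling tools the team used (their callers were in Perl/Java and later C). Purely as an illustration of "looking at where the code spends time", here is a sketch with Python's standard-library `cProfile` and `pstats`. The `gc_fraction` function is a made-up stand-in for a real caller, and `profile_call` is a hypothetical helper name.

```python
import cProfile
import io
import pstats
from typing import Any, Callable, Tuple


def gc_fraction(seq: str) -> float:
    """Toy workload: fraction of G/C bases in a sequence."""
    gc = 0
    for base in seq:
        if base in "GC":
            gc += 1
    return gc / len(seq)


def profile_call(func: Callable[..., Any], *args: Any) -> Tuple[Any, str]:
    """Run `func(*args)` under the profiler; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call paths come first
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()
```

Calling `profile_call(gc_fraction, "GGCCAT" * 1000)` returns the function's result together with a report listing the five entries with the highest cumulative time, which is the kind of view that points at code worth rewriting.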

Q & A #

Q: (Ware) interested in optimization for indels - can you push that any further? Many of our bottlenecks are in aligners built for human (work in plant)

A: What’s it written in? Perl/Java - eyes roll back in heads and they start shaking. Joking aside, with CaVEMan (substitution caller) - giving someone the time to go back and just re-code proved to give us a massive improvement. Recoded in C. Not glamorous or groundbreaking - C really is faster.

back to the speaker list →

Short Talk: Extensive Variation in Chromatin States Across Humans #

Maya M. Kasowski, Yale University, USA #

Abstract #

The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

Notes #

Chromatin variation among people

What makes people different? #

Level of DNA sequence - SNPs
But how do these variants translate to phenotypic differences
Look at gene expression. Look at differences in chromatin
Mapped NFkB

Do differences in histone marks lead to differences in gene expression? #

Aim:

Characterize variation in chromatin state
Genetic basis, functional consequences

Used HapMap populations - 19 individuals

9-13 histone marks - deeply sequenced data
Convenient - powerful tool for functionally annotating genome
Enhancers/promoters/ etc

How much variation in chromatin among individuals? #

There’s an enhancer that is active in caucasian and 2 asians, but not africans - SNP in NFkB motif

Striking variation - more than 30% variation at some marks

Combinatorial - chromatin states based on combinations of the marks

promoter states
transcribed states
variety of enhancer states
repressed states

Found that it was more meaningful to ask whether a particular mark varies in the context of a particular state than overall

looking at active enhancer mark - varied more in enhancer state than promoter state
state specific variability
enhancer states more variable than transcribed or promoter states
repressed mark - varies more in combination with active marks than on its own
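
To make "state-specific variability" concrete, here is a small Python sketch. The notes don't say which variability measure the study used; the coefficient of variation across individuals, the function names and the input layout are all assumptions. Each region carries a chromatin state label and one signal value per individual for a given mark, and the variability is averaged per state.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

# Hypothetical input: (chromatin state, signal of one mark in each individual)
Region = Tuple[str, Sequence[float]]


def coefficient_of_variation(signals: Sequence[float]) -> float:
    """Standard deviation across individuals, scaled by the mean signal."""
    mu = mean(signals)
    if mu == 0:
        return 0.0  # no signal in any individual: treat as not variable
    return pstdev(signals) / mu


def variability_by_state(regions: Sequence[Region]) -> Dict[str, float]:
    """Average per-region variability of the mark within each chromatin state."""
    per_state: Dict[str, List[float]] = defaultdict(list)
    for state, signals in regions:
        per_state[state].append(coefficient_of_variation(signals))
    return {state: mean(cvs) for state, cvs in per_state.items()}
```

Comparing the value returned for an enhancer state with the one for a promoter state is the kind of question the bullets above describe: the same mark can look stable overall yet vary strongly within one state.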

Do states switch among individuals?

not the case, enhancer is an enhancer across individuals.
some reciprocal states

Genetic basis of variation

Active enhancer mark - evidence of strong genetic basis. Stronger correlation to genotype in variable than non-variable regions
Family trios: heritability. Found that the extent of variance in daughter correlates to parents

Possible mechanism - differences in TF binding motifs

Strong evidence of this
Link variation to specific motif disruption
Looked at peaks, ENCODE

Functional consequences:

There’s a strong correlation with gene expression (active enhancer - RNA-Seq data). For known enhancer-gene links (but imperfectly known)
Not all enhancer variation influences expression (but most of it did). Why? - the enhancers are buffering each other. Non-consequential enhancer variation
Chromatin variation is likely to influence phenotypes. Variant regions enriched in eQTLs and GWAS SNPs

Q & A #

Q: (Ware) epigenetic change- were you able to use those as biomarkers and retest GWAS? Uncover hidden variation?

A: Haven’t looked at that. This study, 19 individuals. But as we up the scale, perhaps.

Q: Did you look at the trios to see if there’s more concordance among their epigenetic marks than you would have expected on the basis of shared SNPs?

A: Didn’t look at that, we had two trios.

back to the speaker list →

Other posts in this series: #

Big Data in Biology: Databases and Clouds