Big Data in Biology: Databases and Clouds
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Databases and Clouds #
Monday, March 24th, 2014 9:30am - 2:15pm
http://ks.eventmobi.com/14f2/agenda/35704/288348
Speaker list #
Laura Clarke, European Bioinformatics Institute, UK
The 1000 Genomes Project, Community Access and Management for Large Scale Public Data -
[Abstract]
[Q&A]
Dan Stanzione, University of Texas at Austin, USA
The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology -
[Abstract]
[Q&A]
Jill P. Mesirov, Broad Institute, USA
GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools -
[Abstract]
[Q&A]
Ronald C. Taylor, Pacific Northwest National Laboratory, USA (replaced by Francis Ouellette)
FGED: The Functional Genomics Data Society -
[Abstract]
[Q&A]
Andrew Carroll, DNAnexus, USA
Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes: Demystifying Running at Scale and the Scientific Results -
[Abstract]
[Q&A]
Michael Schatz, Cold Spring Harbor Laboratory, USA
The Next 10 Years of Quantitative Biology -
[Abstract]
[Q&A]
[slides]
The 1000 Genomes Project, Community Access and Management for Large Scale Public Data #
Laura Clarke, European Bioinformatics Institute, UK #
Abstract #
The 1000 genomes data continues to be the largest public variation resource available to the community. Providing coherent and useful resources based on this data continues to be a key goal for the project Data Coordination Center (DCC).
The resource now stands at more than 500 terabytes in size and nearly 500,000 files on the FTP site; this presents challenges both for us to manage and for users to discover what data we have available.
Here I describe these challenges and present the solutions and tools the project has created to enable the widest level of usefulness for the 1000 Genomes Project data.
Notes #
1000 genomes project #
- Largest human project
- Aims:
- complete a baseline of human variation
- all variation - at 1% MAF or higher genome-wide.
- 0.1%-0.5% MAF in exonic regions
- structural variations as well as SNVs
- BAM and VCF formats started on this project
- 99% of all variation in an individual is already present in the public catalogue
- sequenced 26 populations around the globe. Started with HapMap; NHGRI helped get more
- collaboration - 10 different sequencing centres. many analysis groups
Strategy
- collect shotgun reads, align to reference
- detect variations based on alignment from all samples. statistical issues for allowing errors in sampling
- in 2008 this was impossible at scale
Analysis Approach
- final phase 70bp+ Illumina. take much more complicated variations and create phased genomes
- multiple centres, multiple technologies
In final phase now
- technologies progressed so rapidly, can change aims in the duration of the project
- 0.5 PB of data
Challenges #
Data Transfer
- FTP site growing
- 20 TB in 2009 to 580 TB today
- synchronizing is challenging
- download speeds. Aspera (proprietary) download and upload clients
Within Consortium Data Exchange
- Data Freezes
- stable release of sequence data
- dated sequence index file (see the sketch after this list)
- alignments based on this index
- variant set calls created from these BAMs
- Machine Readable FTP Site: a text file which points to the files on the FTP site
- Standardized naming formats: used sample and population names and what programs/technologies used
- Regular communication
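The dated sequence index is just a tab-delimited text file, so both consortium members and outside users can script against it. A minimal sketch (the file name and the FASTQ_FILE / POPULATION column names are assumptions based on the public 1000 Genomes index files, not details given in the talk):

```python
# Sketch: list the FASTQ paths recorded in a dated sequence index freeze for one population.
# The column names used here (FASTQ_FILE, POPULATION) are assumptions; check the header row
# of whichever index file you actually download from the FTP site.
import csv

with open("20130502.sequence.index") as fh:            # hypothetical dated freeze file
    rows = csv.DictReader(fh, delimiter="\t")
    gbr_fastqs = [row["FASTQ_FILE"] for row in rows if row.get("POPULATION") == "GBR"]

print(f"{len(gbr_fastqs)} FASTQ files listed for GBR")
```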
Public Accessibility
- FTP site - raw data files ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
- AWS Amazon Cloud
- web site
- ensembl browser
Tools to Assist Data Use
- Data slicer (see the sketch after this list)
- slicing remote BAM or VCF files
- web front end of samtools
- returns subsection of given file - subset by population, individual
- Variant Pattern Finder
- VCF to PED converter: produces Haploview-compatible PED files
- Ensembl Variant Effect Predictor
- Predicts functional consequences of variants - SNPs, Indels, Structural Variation
- Web & API based
- Can provide SIFT and PolyPhen predictions, HGVS notation, RefSeq gene names
- Population Allele Frequency Tool (coming soon!): range of variations
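For the Data Slicer item above, the same kind of remote slicing can be scripted directly against the FTP/HTTP mirrors. A rough sketch, assuming pysam is installed; the release path below is illustrative (not guaranteed current) and the tabix index must sit next to the VCF for remote fetching to work:

```python
import pysam

# Illustrative release path; substitute a real file listed under the 1000 Genomes FTP root above.
vcf_url = ("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/"
           "ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

vcf = pysam.VariantFile(vcf_url)                    # htslib fetches only the slices it needs
for rec in vcf.fetch("20", 1_000_000, 1_010_000):   # subset by region, like the Data Slicer
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
```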
Q & A #
Q: 1000 genomes project - many 340bp all deletions without insertions?
A: Quality - false discovery rate <5%. Structural variants are very difficult. Wasn't sufficiently confident in structural variations that aren't deletions - did not include them in the db. Structural variations will always be more limited.
Q: Idea of a data freeze and recall - uuid, public key trust network - possible route?
A: sounds like a good idea
The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology #
Dan Stanzione, University of Texas at Austin, USA #
Abstract: #
iPlant is a new kind of virtual organization, a cyberinfrastructure (CI) collaborative created to catalyze progress in computationally-based discovery in plant biology. iPlant has created a comprehensive and widely used CI, driven by community needs, and adopted by a number of large-scale informatics projects and thousands of individual users. iPlant holds more than 1.5 petabytes of user data comprising several hundred million files today, and is thus deeply involved in the “Big Data” challenges of biologists, from storing to analyzing to sharing rapidly growing amounts of data.
This talk will outline the iPlant CI, and discuss what iPlant is doing today to address data challenges, as well as plans for the future. The talk will also address trends the project sees in how users are handling data, and the potential technological solution on the horizon to address them.
iPlant is supported by the National Science Foundation via Award #DBI-1265383.
Notes #
iPlant - co-director (until 8 weeks ago); passed the co-director role to Matthew W. Vaughn
What is iPlant:
community-driven organization building cyberinfrastructure for the plant (and animal) sciences
cyberinfrastructure #
combination of computing, data storage, networking, and humans to achieve a scientific goal
iPlant #
- 6th year
- 14K researchers access services or data - from ecology to epigenomics
Achievements through iPlant's open infrastructure
- BIEN - generate range maps for species
- 1KP project - 100M sequence reads - richer tree of plant data. BLAST annotation
- animal mandate - cattle/buffalo pipelines
- GWAS and more
iPlant Services #
- Atmosphere - on demand cloud computing: friendly front end for cloud - web interface. pick images. can log in via shell to image
- iPlant data store
- discovery environment. rich catalog of bioinformatics machines/tools you can choose from. put together pipelines - gui
- iPlant APIs: embed iplant CI capabilities
- foundation of computation by TACC
- TACC: one of the world's largest data providers. provides a comprehensive cyberinfrastructure ecosystem - not just machines: tools, APIs, team
Powered by iPlant
- build your own informatics project!
- rPlant - r project built on iPlant
- Araport - uses iPlant services
Workflow Optimization and Consulting
- a 12-year analysis brought down to 3 days on a cluster, working with iPlant
- Code optimization: PINT - code written in R, rewritten and done in 4h
Democratizing access to high-throughput genome annotation
Data store: #
- federated sources iRODS (DFC) - AWS
- geographic replication - University of Arizona and TACC
- 600 TB user data and growing
- 700 TB Galaxy
- 200 TB special projects - community collections
- 100GB in 27min - UC Berkeley to UA
- Evolving the Data Strategy: open file storage, few roles. iDS - some filetype detection, manual metadata tagging, Elasticsearch
- Scaling for team science: easy scaling when too large for laptop to open
Big Data Observations #
- About 5B files at TACC - 3.5x more than Jan 2013
- We delete at least 300M files per month
- About 30PB in use
- file count and size increasing rapidly
- 95% of I/O operations don’t actually move data
Soap Box
- Average practice is getting worse in data transfer, file i/o and programming
- best practice - amazing! - 1,024-core job, generates 1PB in 2h, reanalyzed a dozen times in under a day. good users, they know what they're doing
- worst practice - 128-core job - generated 80x the metadata traffic of the above job and crashed the filesystem.
moving 1PB over a 10GB/s network via HTTP will take about 1.4 years
C: f = fopen("file.txt", "w"); // 3 metadata writes
Python: f = open('file.txt', 'w') # 17 metadata writes
The Cloud lets us take the stupid things we do in software and run them at large scale
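The exact counts above are from the talk; a rough way to see this effect yourself is to trace the file-metadata syscalls behind a single open. A sketch, assuming a Linux box with strace installed (the numbers will vary by interpreter, libc, and filesystem):

```python
import subprocess

# Trace metadata-flavoured syscalls made while Python opens one file for writing.
cmd = [
    "strace", "-f", "-e", "trace=open,openat,stat,lstat,newfstatat,access",
    "python3", "-c", "open('file.txt', 'w')",
]
proc = subprocess.run(cmd, capture_output=True, text=True)

# strace writes its trace to stderr, one syscall per line; skip exit/signal markers.
calls = [l for l in proc.stderr.splitlines() if "(" in l and not l.startswith(("+++", "---"))]
print(f"{len(calls)} metadata-related syscalls traced")
```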
Speed things up
- Technological solutions are coming that can meet demand
- machine learning, data transfer can help speed things up. But we still need good software
Q & A #
Q: (illumina) Are there tools to analyze applications to determine their lack of efficiencies?
A: Yes, there are. Caveats: some tools - PerfExpert (tooling and analysis) - are low-level performance tools. Not as useful with non-low-level languages. Not great for Python.
Build job stats on system - can tell you efficiencies of your code on their system.
Q: (Mesirov) What’s your process on who gets to use it, who doesn’t?
A: iPlant: all resources are NSF funded. some XSEDE. XRAC - any open-science funded researcher. Must be US-based and published.
iPlant - will open up under 10K hours. tiers on higher use, compare with other users.
GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools #
Jill P. Mesirov, CIO at Broad Institute, USA #
Abstract #
Over the last two decades genomics has accelerated at an exponential pace, driven by new sequencing and other genomic technologies, promising to transform biomedical research. These data offer a new era of potential for the understanding of the basic mechanisms of disease and identification of novel treatments. Concurrently, there has been a growing emphasis on integrating all of the available data types to better inform scientific discovery. There are now thousands of bioinformatic analysis and visualization tools for this wealth of data. To leverage these tools to make biomedical discoveries, biologists must be empowered to access them and combine them in creative ways to explore their data. However, this vision has been out of reach for almost all biomedical researchers.
We will describe and give example applications of GenomeSpace, http://www.genomespace.org, an open environment that brings together a community of 14 diverse computational genomics tools and data sources, and enables scientists to easily combine their capabilities without the need to write scripts or programs. Begun as a collaboration of six core bioinformatics tools - Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), the Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC) - the GenomeSpace community continues to grow. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analytic workflows, automatic conversion of data formats, and ease of connecting new tools to the environment.
Funding provided by NHGRI and Amazon Web Services
Notes #
GenomeSpace - fairly recent project #
Background
- accelerated rate at which biological data acquired. enabled us to do all sorts of global analysis projects
- Swamped by development of next gen sequencing technologies
- availability of this data has led to progress towards goals to understand disease at the molecular level and understand the genetic basis and mechanisms for disease
- now know 3K Mendelian disease genes, and 5K loci have been associated with over 6K common diseases and traits
- ENCODE- all functional elements of genome and dark matter
- ICGC/TCGA tumour types
New Trends #
- cost down, methods up
- more types of data are acquired
- miRNA, copy number, epigenetics (methylation), RNAi. more sensitive and less messy data
- increase in integrative approaches. leveraging all these kinds of data
- more large-scale projects (x-lab, x-institution)
- moved from single gene analysis -> pathway/network view. how genes really work
What do we need to take advantage? #
integrate large data sets and multiple data types.
data management/identification - how do I find what helps me?
more complex workflows and algorithms
- increasing computational complexity
- compute power demands
- need to interoperate methods and tools
- available and accessible to biologists: in a more friendly way. can't be just the computational cadre - but the whole community
visualize large integrated data sets:
viewers, help us look at reads and see if that call makes sense
validate computational results
Will focus on -> More complex workflows/algorithms #
- interoperate methods and tools
- available to all
Integrative genomics
- tremendous advances last 10 years
- by integrating lots of different kinds of data
Difficulty of getting these tools to work together - need to develop infrastructure.
Challenge: flood of data & proliferation of tools
- tools don’t always play well together, want to use them all in one place
- 2012: 7-10K bioinformatics tools on the web. just Broad - 60-70 tools. not counting internal tools
- 5K public databases
- use case (breast cancer): 12 steps, 6 tools, 7 transitions
- transitions -> data formats different between tools
- how can we democratize this data analysis and bring to the rest of the community?
One monolithic tool OR a cooperative approach
- lightweight layer for interoperability with automatic data transfer. lightest weight possible - do data transfer for the users
- leverage multiple groups and existing tools
- access to familiar tools with usual look and feel. so users don’t have to learn how to use them again
GenomeSpace: #
- shared vision of 6 bioinformatics tools. get them to talk to each other very easily
- have it live in the cloud - server in cloud. talks to GS data sources or components
- 14 tools right now (4 or 5 on the way). infrastructure is at a point where new tools can be enabled in ~1 programmer-day. portals: can access portals from GenomeSpace (e.g. IM)
- Use GenomeSpace S3 storage or add your own Amazon account. Dropbox can be connected. in development: OpenStack & Google Drive
How do I use it? #
Go to the cookbook for: how to build a more complex analysis, how to leverage these different tools
GenomeSpace recipe collection
- summary of what the recipe does & high level steps and tools
- summary of workflow and steps in recipe
- video of someone going through the recipe
- more detail on recipe - real biological use case
- walk through a protocol of all detailed steps
- easy to use!
Join the community! http://www.genomespace.org/ #
open source, on bitbucket https://bitbucket.org/genomespace/
Q & A #
Q: (Stein) loved the recipes. Regular recipes still work 50 years later (broccoli doesn't change). A bioinformatics paper from 10 years ago will not work. How much time and effort is required to create a recipe in an environment where tools will be updated? Will it work in 5 years?
A: Tried to limit the scope of the recipes - not a beginning-to-end paper. More simple - just 2 or 3 tools. Committed to setting up a steering committee for the recipe collection to keep them honest.
RNASeq - many are beginning to use it in their work. Yet the methods for analyzing RNASeq haven't been settled. A challenge they recognize. Community resource - users can report when recipes aren't working. Go to the forum.
Q: (illumina) Data from different sources, does GenomeSpace provide info on challenges on combining different data?
A: Can do: put warnings. Watch out for the following, etc. People who develop these recipes must understand the workflow fairly well so they know the gotchas.
Can’t do: cannot anticipate all the ways in which a biologist will misuse resource
People mis-use tools. Try to give enough info and warning to keep the probability low.
Q: followup: Account for differences in platforms?
A: Don’t have funding for all, but we do contact vendors.
Q: Thank you for making something more user friendly!
Q: Clinical data - do you have the security to handle this?
A: The security that the Amazon Cloud provides. New round of funding: agreed to put warnings for people who are uploading data. If you have data that needs to be kept private - you can use your own Amazon S3/Dropbox.
GenomeSpace does not do analysis - it’s on the tools.
Q: (IBM - Royyuru) Reproducibility - read about a tool in a paper, but can’t reproduce. Can GenomeSpace add machine readable script to run the tool?
A: Can’t go into tools themselves - lightweight. Will talk offline.
FGED: The Functional Genomics Data Society #
Francis Ouellette, Ontario Institute for Cancer Research, Canada #
- (Replaced: Ronald C. Taylor, Pacific Northwest National Laboratory, USA)
Selected on merit - not an invited talk. Ron has laryngitis, so Francis Ouellette is presenting his slides.
Abstract #
The Functional Genomics Data Society (FGED), founded in 1999 as the MGED Society, is a registered International Society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. Our work on defining minimum information specifications for reporting data in functional genomics papers (e.g., MIAME) has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by biological research efforts in data integration and meta-analysis.
Notes #
Spirit of openness - share everything
Functional Genomics Data Society & Its Mission #
In the beginning there were microarrays - MGED
MIAME - standard for exchanging raw microarray data
- too much to ask - researchers should publish fully documented code
- do reviewers check these?
- ArrayExpress and GEO have >6M high throughput assays from 30K functional genomic studies. use MIAME, so it’s working for this group
- Many studies have shown the reusability of these data
MINSEQE - minimal standards for high-throughput nucleotide sequencing experiments.
General description of the aim, metadata, raw reads, processed data
FGED Standards: big data needs standards; FGED creates and aids the development of such
FGED is an open society, welcome feedback, input and volunteers
Q & A #
Q: (Stein) What is the journal policy in the continued evolution of this effort?
A: Publishers in general have very great interest and support. They are looking for things like this. PLoS - new data release policy. Publishers keen to see what community agreed upon standards are.
Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes: Demystifying Running at Scale and the Scientific Results #
Andrew Carroll, DNAnexus, USA #
Abstract #
As one of five institutions participating in the global CHARGE Consortium, the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine needed a compute and data management infrastructure solution to handle the massive amount of data (3,751 whole genomes and 10,940 exomes) they would be processing for this project. The large burst computational demands for this project would have unacceptably taxed existing resources, requiring either many months of using spare capacity or forcing other users off the cluster for 4-5 weeks to complete it faster. To address this challenge, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra large-scale genomic analysis project quickly and flexibly, with zero capital investment. At the project’s peak, HGSC was able to spin up more than 20,000 cores on-demand in order to run the analysis pipeline of the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world.
Notes #
DNAnexus - 2009 spin-out from Stanford. Darling of successful startups. Apply the Cloud at scale
Two parts:
- Philosophy of the Cloud
- Application to large project (10-11K exomes)
What is DNAnexus #
- scalable solution deploys on AWS (Amazon Web Services) cloud
- handles spinning up lots of nodes, sharing data across users
- publish own tools - external or internal
Scientific Vision: #
Challenges looming over data @ scale
Science like driving
- car = bioinfo tool
- as these come out, we can do things we couldn't do before
- car accidents (user error, car itself)
- improving tools is important -> need to think about the infrastructure used to make these run
Tool development - profile runtimes and cost (see the sketch after this list)
- optimize for resources (cpu, memory, bandwidth)
- now: your tools don’t work on all platforms - configuration headaches
- cloud: configure once, run where you want it to run
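As a generic illustration of the "profile runtimes and cost" point (this is not DNAnexus tooling, just standard-library profiling of a stand-in pipeline step on a Unix machine):

```python
import cProfile
import pstats
import resource   # Unix-only: used here for peak memory

def variant_filter(records):
    """Stand-in for a pipeline step whose CPU and memory footprint we want to know."""
    return [r for r in records if r % 7 == 0]

profiler = cProfile.Profile()
profiler.enable()
variant_filter(range(2_000_000))
profiler.disable()

# Top call sites by cumulative time: where the runtime (and therefore the cost) goes.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # kilobytes on Linux
print(f"peak RSS: {peak_kb / 1024:.1f} MB")
```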
Benchmarking
- Need good benchmark sets - prevent scientific degradation (unit tests). Know that you are correct
- drive scientific innovation
- extend visualizations to reach to more basic biologists. expert bioinformaticians working with basic biologists
- deploy at scale
- collaboration - prevent data duplication, contribution
Tool Optimization
- resource optimization - profile through
- DNAnexus - waterfall view of tools! see parallelization
Benchmark sets
- compile benchmarks and tools in a single place. can run all tools and benchmark sets. see differences between sets
- Configure workflow ui - run 6 variant callers and compare.
visualization - how basic biologists will access the data
Collaborations
- managing access to data - admin, viewer, collaborator (roles). can restrict
- delivering the data - physically shipping large-scale data will always be faster and more robust than network transfer. local sftp works for small projects. likely true forever
DNAnexus - HGSC-CHARGE Collaboration #
Analysis of 11K exomes and 4K whole genomes for the CHARGE consortium.
Compute scale and distribution of results across 300 investigators
Baylor: 20 HiSeqs ~25TB of sequence per month
- growth at an exponential rate
- load on cluster - pretty much fully booked (w/ some planned down time)
- Mercury DNAseq pipeline
- BWA + GATK realign + variant calling
- They took out the most computationally intensive parts of the pipeline and put in DNAnexus
- 10K exomes in 5 days
- 2K nodes, 3.5M cpu hours over 10 days
- How much more do you get as you increase the scale?
- new variants as you increase the exome scale - plots as roughly sqrt(x) (see the toy sketch after this list)
- as we continue to sequence more and more we are going to find more and more rare variants
- compared with variants found in the first exome, more likely to be synonymous. variants found in the latest 5K+ - less synonymous
- SIFT - tolerant at first, damaging later
- Novel - exome 1, most found in dbSNP; exome 5K+ - not found in dbSNP
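As a toy illustration of the sqrt(x) point above (synthetic numbers, not the CHARGE data; assumes numpy and scipy are available):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
exomes = np.array([500, 1000, 2000, 4000, 8000, 10940], dtype=float)
# Synthetic "novel variant" counts generated to follow a noisy sqrt curve.
novel = 40_000 * np.sqrt(exomes) * rng.normal(1.0, 0.05, exomes.size)

def sqrt_model(n, a):
    return a * np.sqrt(n)

(a_hat,), _ = curve_fit(sqrt_model, exomes, novel)
print(f"fitted: novel(n) ~ {a_hat:,.0f} * sqrt(n)")
```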
Q & A #
Q: (Schatz) On projects like this the first half is well structured, but gets very ad-hoc by the end. How is this structured in DNAnexus for ad-hoc queries?
A: We take advantage of the expertise of the people working with us. Relying on the CHARGE consortium in collaboration. Directed hypotheses generated by partners.
Q: Can you elaborate on the datasets you’re using as benchmarks?
A: An opportunity for the community to come together - benchmark sets are the way to go, and DNAnexus gives us an opportunity to move into this space. We are not curators of benchmark sets.
The Next 10 Years of Quantitative Biology #
Michael Schatz, Cold Spring Harbor Laboratory, USA #
Abstract #
Topic change, no abstract
Slides #
http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf
Notes #
Questions in Biology - some broad, some focused #
The interesting thing about these questions - there is no single instrument that answers any of them
Answer these questions:
- big stack of technologies
- raw sensors at the bottom
- then systems, compute systems, algorithms, machine learning, > results
- Will walk through this pyramid and see what major trends
Bottom tier - sensors : Cost per Genome - drives much of the talks today. need scalability #
- map of where the major sequencing instruments are across the planet
- interesting thing: how widely distributed they are (not like other fields)
- worldwide capacity exceeds 15 Pbp/year… 25 Pbp/year on Jan 15 (Illumina X10 systems announcement)
- How much is a PB: sequence human genomes to 30x - 10K genomes - stacked up on DVDs, 787 feet of DVDs (1/6 of a mile tall). 500 2 TB drives, $500K
DNA data tsunami - growth of sequencing around 3x per year
- not too distant future: ~1 exabyte by 2018
- ~1 zettabyte by 2024.
- How big? zettabyte is 1M PB
- stack of DVDs = 10B genomes = halfway to moon
- YouTube and astronomy datasets - roughly ~100PB today, growing exponentially
Sequencing Centres map - will be roughly the same
- see widespread network of sequencing networks across the planet
- biological sensor network: nanopore - @ewanbirney https://twitter.com/ewanbirney/status/448423540472422400 - mobile, can embed in many remote locations (hospitals, schools, …)
- the rise of a digital immune system - Schatz. http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf
compression will help - need to be aggressive about throwing out data
- particle physics - strength here. massive amount of data produced is discarded
- resequencing will be negligible
- preciousness of the data/sample: cancer is the high watermark of complexity. in principle we may want to hold on to every read
major applications:
- human health - where $$ available
- widespread distributed mobile sensors
- digital immune system - constantly monitoring what’s coming up (microbes, etc)
Next phase - compute, algorithms #
- the compute will be everywhere - Cloud
- I had the distinction of having the first paper in PubMed that ever used AWS for sequence analysis
- will be multi-cloud - specialized for geographic or political reasons, or centered on the model organism or disease of study. makes sense to have concentrated systems
compute - parallel algorithm spectrum (see the sketch after this list)
- better parallelization
- embarrassingly parallel: problems most easy to run on cluster. building a city? hire 100s of crews, build in parallel
- loosely coupled algorithms: MapReduce. building a skyscraper - can't build every floor at the same time. a lot of the work is independent but then is aggregated together
- Tightly coupled: graphs and MD simulations. growing one massive tree - more farmers will not help. “nine women cannot make a baby in one month”
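A small sketch contrasting the first two points on this spectrum (a generic example, not from the talk): an embarrassingly parallel map over independent reads, followed by a MapReduce-style reduce that aggregates the partial results:

```python
from collections import Counter
from multiprocessing import Pool

def count_bases(read: str) -> Counter:
    """Map step: each read is processed completely independently."""
    return Counter(read)

if __name__ == "__main__":
    reads = ["ACGTACGT", "GGGTTTAA", "ACGTTTTT"]   # stand-in for millions of reads
    with Pool() as pool:
        partials = pool.map(count_bases, reads)    # embarrassingly parallel map
    totals = sum(partials, Counter())              # reduce: aggregate the partial counts
    print(totals)
```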
Better hardware:
- MUMmerGPU
- specialized hardware (GPU)
Crossbow - algorithm on map reduce
- using many commodity computers - run algorithm in parallel (map reduce)
- use Bowtie and SOAPsnp
- compelling example of cloud computing in genomics. transfer time and cost – improving
- challenge: requires more applications!
- each algorithm requires customization - need skilled developers
PanGenome alignment and assembly
- shifting to paradigm where raw input is set of complete genomes
- emerging long read sequencing technologies
- can assemble entire microbial / yeast genomes into complete assemblies
- could be the case we have complete human genomes - get started now
- start with a set of individual genomes - segments of genomes in a graph. get context from the graph - De Bruijn graph (see the toy sketch below)
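A toy sketch of the De Bruijn graph idea referenced above (not the speaker's code): nodes are (k-1)-mers and each k-mer observed in the input genomes contributes one edge. Pan-genome and assembly tools build on far more engineered versions of this structure:

```python
from collections import defaultdict

def de_bruijn(sequences, k=4):
    """Build a De Bruijn graph: (k-1)-mer -> list of successor (k-1)-mers."""
    graph = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

genomes = ["ACGTACGTGA", "ACGTACGAGA"]     # two toy "complete genomes"
for node, successors in sorted(de_bruijn(genomes).items()):
    print(node, "->", successors)
```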
See major informatics centers on topics
- moving code to data
- driven by parallel algorithms/hardware
- shift to large populations
- applications: read mapping will fade out, new problems (at population level) will replace it
Top of the stack: Results: work at CSHL - genetics of autism #
Sample set: 3K families - simplex families
- one child has autism but rest of siblings not autistic
- sequence exomes of all individuals across families
- what do we observe relative to siblings/parents?
- focus: gene killing mutations. loss of function/ specific to autistic children to find genes associated with the disease
- identifying SNPs is quite mature - GATK (Broad), handles biases
SCALPEL - find indels from short read sequencing data
- combine best of alignment and assembly
- use standard aligner to map reads to genome. purpose of this alignment is to localize the problem (locally, not globally - one exon/region at a time)
- extract out reads that localize to a particular part
- on the fly assembly with de Bruijn graph
- find end to end haplotype paths spanning graph
- align assembled sequence to region
Experimental analysis and validation
- selected one deep coverage exome for deep analysis
- GATK, SCALPEL, SOAPindel
- 99% accuracy where all overlap
- specific to SCALPEL - 77% (more than others)
de novo genetics of autism - same number of mutations as siblings
- but gene killers - enrichment in autistic kids
- 2:1 enrichment in nonsense
- 2:1 enrichment of frameshift
- 4:1 splice site mutations
- correlation to age of father
available on bioRxiv, code available on SourceForge
Potential for big data #
- folks from Google: Flu Trends paper in Nature - 2009
- Google searches for flu-like symptoms - then outbreaks occur
- Fallacy of big data? They've gotten it wrong. 'big data hubris' - the assumption that big data are a substitute for traditional data collection and analysis. pipelines are extremely important
- risks of big data - given birthday and hometown - can predict SSN with good accuracy
Power from data aggregation - champion ourselves and the future #
- mindful of risks - over-fitting, reproducibility
- caution is prudent
- data aggregation isn’t going to solve anything- being critical - does this make sense? continuous feedback loop
What is a data scientist? Many fields. To be really successful, you need strengths, experience and expertise in these fields.
Q & A #
Q: Observation: Talking about sequencing coming down in price - What happens when sequencing becomes so cheap and democratized that anyone can do this? How do we as a community get the legislature to start thinking about these privacy concerns? We need to look at this data
A: No simple answer. Part of it will come through scientific discoveries - congressmen pay attention when there are big breakthroughs. Lobby - we need to talk to the rest of the world. Part of it is going to come in response to outbreaks - when data is abused. There's already some legislation in place so you can't get discriminated against for, say, insurance. But there are implicit discriminations. Don't know how to fix this outside of education and reaching out to the next gen.
Q: (Mesirov)
- Congratulations: terrific meeting!
- 30+ years ago I heard Grace Murray Hopper speak - she made a comment about how we are all going to be drowning in data. All kinds of data. I appreciated your comment on what we keep. Important: that we have some kind of metric of utility - huge amounts of data go untouched for long periods of time. Think about what happens with this data that is never used again. Otherwise we're all going to drown
A: The utility of data is certainly something to be considered. We're bad at estimating it. We're all hoarders. Systems are failing recently - we can't copy off a PB of data fast enough. Trying to assess the preciousness of data and time. Some metrics are hard to measure. I anticipate the storage vendors will get better at providing tools to assess what is on a filesystem. Tools today are crude; I hope these will improve. At the very least we can identify if there are big datasets we haven't accessed in years
Q: (Swedlow) At Dundee, hierarchical filesystems backed up by tape. Primary data is images and proteomics - 95% of it is not touched again 3 months later. Graph representations of sequences - we will be doing the same thing with images. Concerned with the computational cost of recalculating these graphs. How expensive will recalculation be?
A: today it’s expensive - but this is an opportunity for research. For example: at level of suffix trees - construction methods. We can dust those off and improve algorithms.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology