Big Data in Biology: Databases and Clouds
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Databases and Clouds #
Monday, March 24th, 2014 9:30am - 2:15pm
http://ks.eventmobi.com/14f2/agenda/35704/288348
Speaker list #
Laura Clarke, European Bioinformatics Institute, UK
The 1000 Genomes Project, Community Access and Management for Large Scale Public Data -
[Abstract]
[Q&A]
Dan Stanzione, University of Texas at Austin, USA
The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology -
[Abstract]
[Q&A]
Jill P. Mesirov, Broad Institute, USA
GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools -
[Abstract]
[Q&A]
Ronald C. Taylor, Pacific Northwest National Laboratory, USA (replaced by Francis Ouellette)
FGED: The Functional Genomics Data Society -
[Abstract]
[Q&A]
Andrew Carroll, DNAnexus, USA
Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes: Demystifying Running at Scale and the Scientific Results -
[Abstract]
[Q&A]
Michael Schatz, Cold Spring Harbor Laboratory, USA
The Next 10 Years of Quantitative Biology -
[Abstract]
[Q&A]
[slides]
The 1000 Genomes Project, Community Access and Management for Large Scale Public Data #
Laura Clarke, European Bioinformatics Institute, UK #
Abstract #
The 1000 genomes data continues to be the largest public variation resource available to the community. Providing coherent and useful resources based on this data continues to be a key goal for the project Data Coordination Center (DCC).
The resource now stands at more than 500 terabytes in size and nearly 500,000 files on the FTP site; this presents challenges both for us to manage and for users to discover what data we have available.
Here I describe these challenges and present the solutions and tools the project has created to enable the widest level of usefulness for the 1000 Genomes Project data.
Notes #
1000 genomes project #
- Largest human project
- Aims:
- complete a baseline of human variation
- all variation - at 1% MAF or higher genome-wide.
- 0.1%-0.5% MAF in exonic regions
- structural variations as well as SNVs
- BAM and VCF formats started on this project
- 99% of all variation in an individual is already present in the public catalogue
- sequenced 26 populations around the globe. Started with HapMap; NHGRI helped get more
- collaboration - 10 different sequencing centres. many analysis groups
Strategy
- collect shotgun reads, align to reference
- detect variations based on alignment from all samples. statistical issues for allowing errors in sampling
- in 2008 this was impossible at scale
Analysis Approach
- final phase 70bp+ Illumina. take much more complicated variations and create phased genomes
- multiple centres, multiple technologies
In final phase now
- technologies progressed so rapidly, can change aims in the duration of the project
- 0.5 PB of data
Challenges #
Data Transfer
- FTP site growing
- 20 TB in 2009 to 580 TB today
- synchronizing is challenging
- download speeds. Aspera (proprietary) download and upload clients
Within Consortium Data Exchange
- Data Freezes
- stable release of sequence data
- dated sequence index file (see the sketch after this list)
- alignments based on this index
- variant set calls created from these BAMs
- Machine Readable FTP Site: a text file which points to the files on the FTP site
- Standardized naming formats: used sample and population names and what programs/technologies used
- Regular communication
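The dated sequence index is just a tab-delimited text file, so both consortium members and outside users can script against it. A minimal sketch (the file name and the FASTQ_FILE / POPULATION column names are assumptions based on the public 1000 Genomes index files, not details given in the talk):

```python
# Sketch: list the FASTQ paths recorded in a dated sequence index freeze for one population.
# The column names used here (FASTQ_FILE, POPULATION) are assumptions; check the header row
# of whichever index file you actually download from the FTP site.
import csv

with open("20130502.sequence.index") as fh:            # hypothetical dated freeze file
    rows = csv.DictReader(fh, delimiter="\t")
    gbr_fastqs = [row["FASTQ_FILE"] for row in rows if row.get("POPULATION") == "GBR"]

print(f"{len(gbr_fastqs)} FASTQ files listed for GBR")
```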
Public Accessibility
- FTP site - raw data files ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
- AWS Amazon Cloud
- web site
- ensembl browser
Tools to Assist Data Use
- Data slicer (see the sketch after this list)
- slicing remote BAM or VCF files
- web front end of samtools
- returns subsection of given file - subset by population, individual
- Variant Pattern Finder
- VCF to PED converter: produces Haploview-compatible PED files
- Ensembl Variant Effect Predictor
- Predicts functional consequences of variants - SNPs, Indels, Structural Variation
- Web & API based
- Can provide SIFT and PolyPhen predictions, HGVS notation, RefSeq gene names
- Population Allele Frequency Tool (coming soon!): range of variations
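For the Data Slicer item above, the same kind of remote slicing can be scripted directly against the FTP/HTTP mirrors. A rough sketch, assuming pysam is installed; the release path below is illustrative (not guaranteed current) and the tabix index must sit next to the VCF for remote fetching to work:

```python
import pysam

# Illustrative release path; substitute a real file listed under the 1000 Genomes FTP root above.
vcf_url = ("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/"
           "ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

vcf = pysam.VariantFile(vcf_url)                    # htslib fetches only the slices it needs
for rec in vcf.fetch("20", 1_000_000, 1_010_000):   # subset by region, like the Data Slicer
    print(rec.chrom, rec.pos, rec.ref, rec.alts)
```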
Q & A #
Q: 1000 genomes project - many 340bp all deletions without insertions?
A: Quality - false discovery rate <5%. Structural variants are very difficult. Wasn't sufficiently confident in structural variations that aren't deletions - did not include them in the db. Structural variations will always be more limited.
Q: Idea of a data freeze and recall - uuid, public key trust network - possible route?
A: sounds like a good idea
The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology #
Dan Stanzione, University of Texas at Austin, USA #
Abstract: #
iPlant is a new kind of virtual organization, a cyberinfrastructure (CI) collaborative created to catalyze progress in computationally-based discovery in plant biology. iPlant has created a comprehensive and widely used CI, driven by community needs, and adopted by a number of large-scale informatics projects and thousands of individual users. iPlant holds more than 1.5 petabytes of user data comprising several hundred million files today, and is thus deeply involved in the “Big Data” challenges of biologists, from storing to analyzing to sharing rapidly growing amounts of data.
This talk will outline the iPlant CI, and discuss what iPlant is doing today to address data challenges, as well as plans for the future. The talk will also address trends the project sees in how users are handling data, and the potential technological solution on the horizon to address them.
iPlant is supported by the National Science Foundation via Award #DBI-1265383.
Notes #
iPlant - co-director (until 8 weeks ago); passed the co-director role to Matthew W. Vaughn
What is iPlant:
community-driven organization building cyberinfrastructure for the plant (and animal) sciences
cyberinfrastructure #
combination of computing, data storage, networking, and humans to achieve a scientific goal
iPlant #
- 6th year
- 14K researchers access services or data - from ecology to epigenomics
Achievements through iPlant's open infrastructure
- BIEN - generate range maps for species
- 1KP project - 100M sequence reads - richer tree of plant data. BLAST annotation
- animal mandate - cattle/buffalo pipelines
- GWAS and more
iPlant Services #
- Atmosphere - on demand cloud computing: friendly front end for cloud - web interface. pick images. can log in via shell to image
- iPlant data store
- discovery environment. rich catalog of bioinformatics machines/tools you can choose from. put together pipelines - gui
- iPlant APIs: embed iplant CI capabilities
- foundation of computation by TACC
- TACC: one of the world's largest data providers. provides a comprehensive cyberinfrastructure ecosystem - not just machines: tools, APIs, team
Powered by iPlant
- build your own informatics project!
- rPlant - r project built on iPlant
- Araport - uses iPlant services
Workflow Optimization and Consulting
- a 12-year analysis brought down to 3 days on a cluster, working with iPlant
- Code optimization: PINT - code written in R, rewritten and done in 4h
Democratizing access to high-throughput genome annotation
Data store: #
- federated sources iRODS (DFC) - AWS
- geographic replication - University of Arizona and TACC
- 600 TB user data and growing
- 700 TB Galaxy
- 200 TB special projects - community collections
- 100GB in 27min - UC Berkeley to UA
- Evolving the Data Strategy: open file storage, few roles. iDS - some filetype detection, manual metadata tagging, Elasticsearch
- Scaling for team science: easy scaling when too large for laptop to open
Big Data Observations #
- About 5B files at TACC - 3.5x more than Jan 2013
- We delete at least 300M files per month
- About 30PB in use
- file count and size increasing rapidly
- 95% of I/O operations don’t actually move data
Soap Box
- Average practice is getting worse in data transfer, file i/o and programming
- best practice - amazing! - 1,024-core job, generates 1PB in 2h, reanalyzed a dozen times in under a day. good users, they know what they're doing
- worst practice - 128-core job - generated 80x the metadata traffic of the above job and crashed the filesystem.
moving 1PB over a 10GB/s network via HTTP will take about 1.4 years
C: f = fopen("file.txt", "w"); // 3 metadata writes
Python: f = open('file.txt', 'w') # 17 metadata writes
The Cloud lets us take the stupid things we do in software and run them at large scale
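The exact counts above are from the talk; a rough way to see this effect yourself is to trace the file-metadata syscalls behind a single open. A sketch, assuming a Linux box with strace installed (the numbers will vary by interpreter, libc, and filesystem):

```python
import subprocess

# Trace metadata-flavoured syscalls made while Python opens one file for writing.
cmd = [
    "strace", "-f", "-e", "trace=open,openat,stat,lstat,newfstatat,access",
    "python3", "-c", "open('file.txt', 'w')",
]
proc = subprocess.run(cmd, capture_output=True, text=True)

# strace writes its trace to stderr, one syscall per line; skip exit/signal markers.
calls = [l for l in proc.stderr.splitlines() if "(" in l and not l.startswith(("+++", "---"))]
print(f"{len(calls)} metadata-related syscalls traced")
```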
Speed things up
- Technological solutions are coming that can meet demand
- machine learning, data transfer can help speed things up. But we still need good software
Q & A #
Q: (illumina) Are there tools to analyze applications to determine their lack of efficiencies?
A: Yes, there are. Caveats: some tools - PerfExpert (tooling and analysis) - are low-level performance tools. Not as useful with non-low-level languages. Not great for Python.
Build job stats on system - can tell you efficiencies of your code on their system.
Q: (Mesirov) What’s your process on who gets to use it, who doesn’t?
A: iPlant: all resources are NSF funded. some XSEDE. XRAC - any open-science funded researcher. Must be US-based and published.
iPlant - will open up under 10K hours. tiers on higher use, compare with other users.
GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools #
Jill P. Mesirov, CIO at Broad Institute, USA #
Abstract #
Over the last two decades genomics has accelerated at an exponential pace, driven by new sequencing and other genomic technologies, promising to transform biomedical research. These data offer a new era of potential for the understanding of the basic mechanisms of disease and identification of novel treatments. Concurrently, there has been a growing emphasis on integrating all of the available data types to better inform scientific discovery. There are now thousands of bioinformatic analysis and visualization tools for this wealth of data. To leverage these tools to make biomedical discoveries, biologists must be empowered to access them and combine them in creative ways to explore their data. However, this vision has been out of reach for almost all biomedical researchers.
We will describe and give example applications of GenomeSpace, http://www.genomespace.org, an open environment that brings together a community of 14 diverse computational genomics tools and data sources, and enables scientists to easily combine their capabilities without the need to write scripts or programs. Begun as a collaboration of six core bioinformatics tools - Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), the Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC) - the GenomeSpace community continues to grow. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analytic workflows, automatic conversion of data formats, and ease of connecting new tools to the environment.
Funding provided by NHGRI and Amazon Web Services
Notes #
GenomeSpace - fairly recent project #
Background
- accelerated rate at which biological data acquired. enabled us to do all sorts of global analysis projects
- Swamped by development of next gen sequencing technologies
- availability of this data has led to progress towards goals to understand disease at the molecular level and understand the genetic basis and mechanisms for disease
- now know 3K Mendelian disease genes, and 5K loci have been associated with over 6K common diseases and traits
- ENCODE- all functional elements of genome and dark matter
- ICGC/TCGA tumour types
New Trends #
- cost down, methods up
- more types of data are acquired
- miRNA, copy number, epigenetics (methylation), RNAi. more sensitive and less messy data
- increase in integrative approaches. leveraging all these kinds of data
- more large-scale projects (x-lab, x-institution)
- moved from single gene analysis -> pathway/network view. how genes really work
What do we need to take advantage? #
integrate large data sets and multiple data types.
data management/identification - how do I find what helps me?
more complex workflows and algorithms
- increasing computational complexity
- compute power demands
- need to interoperate methods and tools
- available and accessible to biologists: in a more friendly way. can't be just the computational cadre - but the whole community
visualize large integrated data sets:
viewers, help us look at reads and see if that call makes sense
validate computational results
Will focus on -> More complex workflows/algorithms #
- interoperate methods and tools
- available to all
Integrative genomics
- tremendous advances last 10 years
- by integrating lots of different kinds of data
Difficulty of getting these tools to work together - need to develop infrastructure.
Challenge: flood of data & proliferation of tools
- tools don’t always play well together, want to use them all in one place
- 2012: 7-10K bioinformatics tools on the web. just Broad - 60-70 tools. not counting internal tools
- 5K public databases
- use case (breast cancer): 12 steps, 6 tools, 7 transitions
- transitions -> data formats different between tools
- how can we democratize this data analysis and bring to the rest of the community?
One monolithic tool OR a cooperative approach
- lightweight layer for interoperability with automatic data transfer. lightest weight possible - do data transfer for the users
- leverage multiple groups and existing tools
- access to familiar tools with usual look and feel. so users don’t have to learn how to use them again
GenomeSpace: #
- shared vision of 6 bioinformatics tools. get them to talk to each other very easily
- have it live in the cloud - server in cloud. talks to GS data sources or components
- 14 tools right now (4 or 5 on the way). infrastructure is at a point where new tools can be enabled in ~1 programmer-day. portals: can access portals from GenomeSpace (e.g. IM)
- Use GenomeSpace S3 storage or add your own Amazon account. Dropbox can be connected. in development: OpenStack & Google Drive
How do I use it? #
Go to the cookbook for: how to build a more complex analysis, how to leverage these different tools
GenomeSpace recipe collection
- summary of what the recipe does & high level steps and tools
- summary of workflow and steps in recipe
- video of someone going through the recipe
- more detail on recipe - real biological use case
- walk through a protocol of all detailed steps
- easy to use!
Join the community! http://www.genomespace.org/ #
open source, on bitbucket https://bitbucket.org/genomespace/
Q & A #
Q: (Stein) loved the recipes. Regular recipes still work 50 years later (broccoli doesn't change). A bioinformatics paper from 10 years ago will not work. How much time and effort is required to create a recipe in an environment where tools will be updated? Will it work in 5 years?
A: Tried to limit the scope of the recipes - not a beginning-to-end paper. More simple - just 2 or 3 tools. Committed to setting up a steering committee for the recipe collection to keep them honest.
RNASeq - many are beginning to use it in their work. Yet the methods for analyzing RNASeq haven't been settled. A challenge they recognize. Community resource - users can report when recipes aren't working. Go to the forum.
Q: (illumina) Data from different sources, does GenomeSpace provide info on challenges on combining different data?
A: Can do: put warnings. Watch out for the following, etc. People who develop these recipes must understand the workflow fairly well so they know the gotchas.
Can’t do: cannot anticipate all the ways in which a biologist will misuse resource
People mis-use tools. Try to give enough info and warning to keep the probability low.
Q: followup: Account for differences in platforms?
A: Don’t have funding for all, but we do contact vendors.
Q: Thank you for making something more user friendly!
Q: Clinical data - do you have the security to handle this?
A: The security that the Amazon Cloud provides. New round of funding: agreed to put warnings for people who are uploading data. If you have data that needs to be kept private - you can use your own Amazon S3/Dropbox.
GenomeSpace does not do analysis - it’s on the tools.
Q: (IBM - Royyuru) Reproducibility - read about a tool in a paper, but can’t reproduce. Can GenomeSpace add machine readable script to run the tool?
A: Can’t go into tools themselves - lightweight. Will talk offline.
FGED: The Functional Genomics Data Society #
Francis Ouellette, Ontario Institute for Cancer Research, Canada #
- (Replaced: Ronald C. Taylor, Pacific Northwest National Laboratory, USA)
Selected on merit - not an invited talk. Ron has laryngitis, so Francis Ouellette is presenting his slides.
Abstract #
The Functional Genomics Data Society (FGED), founded in 1999 as the MGED Society, is a registered International Society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. Our work on defining minimum information specifications for reporting data in functional genomics papers (e.g., MIAME) has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by biological research efforts in data integration and meta-analysis.
Notes #
Spirit of openness - share everything
Functional Genomics Data Society & Its Mission #
In the beginning there were microarrays - MGED
MIAME - standard for exchanging raw microarray data
- too much to ask - researchers should publish fully documented code
- do reviewers check these?
- ArrayExpress and GEO have >6M high throughput assays from 30K functional genomic studies. use MIAME, so it’s working for this group
- Many studies have shown the reusability of these data
MINSEQE - minimal standards for high-throughput nucleotide sequencing experiments.
General description of the aim, metadata, raw reads, processed data
FGED Standards: big data needs standards; FGED creates and aids the development of such
FGED is an open society, welcome feedback, input and volunteers
Q & A #
Q: (Stein) What is the journal policy in the continued evolution of this effort?
A: Publishers in general have very great interest and support. They are looking for things like this. PLoS - new data release policy. Publishers keen to see what community agreed upon standards are.
Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes: Demystifying Running at Scale and the Scientific Results #
Andrew Carroll, DNAnexus, USA #
Abstract #
As one of five institutions participating in the global CHARGE Consortium, the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine needed a compute and data management infrastructure solution to handle the massive amount of data (3,751 whole genomes and 10,940 exomes) they would be processing for this project. The large burst computational demands for this project would have unacceptably taxed existing resources, requiring either many months of using spare capacity or forcing other users off the cluster for 4-5 weeks to complete it faster. To address this challenge, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra large-scale genomic analysis project quickly and flexibly, with zero capital investment. At the project’s peak, HGSC was able to spin up more than 20,000 cores on-demand in order to run the analysis pipeline of the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world.
Notes #
DNAnexus - 2009 spin-out from Stanford. Darling of successful startups. Apply the Cloud at scale
Two parts:
- Philosophy of the Cloud
- Application to large project (10-11K exomes)
What is DNAnexus #
- scalable solution deploys on AWS (Amazon Web Services) cloud
- handles spinning up lots of nodes, sharing data across users
- publish own tools - external or internal
Scientific Vision: #
Challenges looming over data @ scale
Science like driving
- car = bioinfo tool
- as these come out, we can do things we couldn't do before
- car accidents (user error, car itself)
- improving tools is important -> need to think about the infrastructure used to make these run
Tool development - profile runtimes and cost (see the sketch after this list)
- optimize for resources (cpu, memory, bandwidth)
- now: your tools don’t work on all platforms - configuration headaches
- cloud: configure once, run where you want it to run
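As a generic illustration of the "profile runtimes and cost" point (this is not DNAnexus tooling, just standard-library profiling of a stand-in pipeline step on a Unix machine):

```python
import cProfile
import pstats
import resource   # Unix-only: used here for peak memory

def variant_filter(records):
    """Stand-in for a pipeline step whose CPU and memory footprint we want to know."""
    return [r for r in records if r % 7 == 0]

profiler = cProfile.Profile()
profiler.enable()
variant_filter(range(2_000_000))
profiler.disable()

# Top call sites by cumulative time: where the runtime (and therefore the cost) goes.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # kilobytes on Linux
print(f"peak RSS: {peak_kb / 1024:.1f} MB")
```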
Benchmarking
- Need good benchmark sets - prevent scientific degradation (unit tests). Know that you are correct
- drive scientific innovation
- extend visualizations to reach to more basic biologists. expert bioinformaticians working with basic biologists
- deploy at scale
- collaboration - prevent data duplication, contribution
Tool Optimization
- resource optimization - profile through
- DNAnexus - waterfall view of tools! see parallelization
Benchmark sets
- compile benchmarks and tools in a single place. can run all tools and benchmark sets. see differences between sets
- Configure workflow ui - run 6 variant callers and compare.
visualization - how basic biologists will access the data
Collaborations
- managing access to data - admin, viewer, collaborator (roles). can restrict
- delivering the data - physically shipping large-scale data will always be faster and more robust than network transfer. local sftp works for small projects. likely true forever
DNAnexus - HGSC-CHARGE Collaboration #
Analysis of 11K exomes and 4K whole genomes for the CHARGE consortium.
Compute scale and distribution of results across 300 investigators
Baylor: 20 HiSeqs ~25TB of sequence per month
- growth at an exponential rate
- load on cluster - pretty much fully booked (w/ some planned down time)
- Mercury DNAseq pipeline
- BWA + GATK realign + variant calling
- They took out the most computationally intensive parts of the pipeline and put in DNAnexus
- 10K exomes in 5 days
- 2K nodes, 3.5M cpu hours over 10 days
- How much more do you get as you increase the scale?
- new variants as you increase the exome scale - plots as roughly sqrt(x) (see the toy sketch after this list)
- as we continue to sequence more and more we are going to find more and more rare variants
- compared with variants found in the first exome, more likely to be synonymous. variants found in the latest 5K+ - less synonymous
- SIFT - tolerant at first, damaging later
- Novel - exome 1, most found in dbSNP; exome 5K+ - not found in dbSNP
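As a toy illustration of the sqrt(x) point above (synthetic numbers, not the CHARGE data; assumes numpy and scipy are available):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
exomes = np.array([500, 1000, 2000, 4000, 8000, 10940], dtype=float)
# Synthetic "novel variant" counts generated to follow a noisy sqrt curve.
novel = 40_000 * np.sqrt(exomes) * rng.normal(1.0, 0.05, exomes.size)

def sqrt_model(n, a):
    return a * np.sqrt(n)

(a_hat,), _ = curve_fit(sqrt_model, exomes, novel)
print(f"fitted: novel(n) ~ {a_hat:,.0f} * sqrt(n)")
```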
Q & A #
Q: (Schatz) On projects like this the first half is well structured, but gets very ad-hoc by the end. How is this structured in DNAnexus for ad-hoc queries?
A: We take advantage of the expertise of the people working with us. Relying on the CHARGE consortium in collaboration. Directed hypotheses generated by partners.
Q: Can you elaborate on the datasets you’re using as benchmarks?
A: An opportunity for the community to come together - benchmark sets are the way to go, and DNAnexus gives us an opportunity to move into this space. We are not curators of benchmark sets.
The Next 10 Years of Quantitative Biology #
Michael Schatz, Cold Spring Harbor Laboratory, USA #
Abstract #
Topic change, no abstract
Slides #
http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf
Notes #
Questions in Biology - some broad, some focused #
The interesting thing about these questions - there is no single instrument that answers any of them
Answer these questions:
- big stack of technologies
- raw sensors at the bottom
- then systems, compute systems, algorithms, machine learning, > results
- Will walk through this pyramid and see what major trends
Bottom tier - sensors : Cost per Genome - drives much of the talks today. need scalability #
- map of where the major sequencing instruments are across the planet
- interesting thing: how widely distributed they are (not like other fields)
- worldwide capacity exceeds 15 Pbp/year… 25 Pbp/year on Jan 15 (Illumina X10 systems announcement)
- How much is a PB: sequence human genomes to 30x - 10K genomes - stacked up on DVDs, 787 feet of DVDs (1/6 of a mile tall). 500 2 TB drives, $500K
DNA data tsunami - growth of sequencing around 3x per year
- not too distant future: ~1 exabyte by 2018
- ~1 zettabyte by 2024.
- How big? zettabyte is 1M PB
- stack of DVDs = 10B genomes = halfway to moon
- YouTube and astronomy datasets - roughly ~100PB today, growing exponentially
Sequencing Centres map - will be roughly the same
- see widespread network of sequencing networks across the planet
- biological sensor network: nanopore - @ewanbirney https://twitter.com/ewanbirney/status/448423540472422400 - mobile, can embed in many remote locations (hospitals, schools, …)
- the rise of a digital immune system - Schatz. http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf
compression will help - need to be aggressive about throwing out data
- particle physics - strength here. massive amount of data produced is discarded
- resequencing will be negligible
- preciousness of the data/sample: cancer is the high watermark of complexity. in principle we may want to hold on to every read
major applications:
- human health - where $$ available
- widespread distributed mobile sensors
- digital immune system - constantly monitoring what’s coming up (microbes, etc)
Next phase - compute, algorithms #
- the compute will be everywhere - Cloud
- I had the distinction of having the first paper in PubMed that ever used AWS for sequence analysis
- will be multi-cloud - specialized for geographic or political reasons, or centered on the model organism or disease of study. makes sense to have concentrated systems
compute - parallel algorithm spectrum (see the sketch after this list)
- better parallelization
- embarrassingly parallel: problems most easy to run on cluster. building a city? hire 100s of crews, build in parallel
- loosely coupled algorithms: MapReduce. building a skyscraper - can't build every floor at the same time. a lot of the work is independent but then is aggregated together
- Tightly coupled: graphs and MD simulations. growing one massive tree - more farmers will not help. “nine women cannot make a baby in one month”
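A small sketch contrasting the first two points on this spectrum (a generic example, not from the talk): an embarrassingly parallel map over independent reads, followed by a MapReduce-style reduce that aggregates the partial results:

```python
from collections import Counter
from multiprocessing import Pool

def count_bases(read: str) -> Counter:
    """Map step: each read is processed completely independently."""
    return Counter(read)

if __name__ == "__main__":
    reads = ["ACGTACGT", "GGGTTTAA", "ACGTTTTT"]   # stand-in for millions of reads
    with Pool() as pool:
        partials = pool.map(count_bases, reads)    # embarrassingly parallel map
    totals = sum(partials, Counter())              # reduce: aggregate the partial counts
    print(totals)
```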
Better hardware:
- MUMmerGPU
- specialized hardware (GPU)
Crossbow - algorithm on map reduce
- using many commodity computers - run algorithm in parallel (map reduce)
- use Bowtie and SOAPsnp
- compelling example of cloud computing in genomics. transfer time and cost – improving
- challenge: requires more applications!
- each algorithm requires customization - need skilled developers
PanGenome alignment and assembly
- shifting to paradigm where raw input is set of complete genomes
- emerging long read sequencing technologies
- can assemble entire microbial / yeast genomes into complete assemblies
- could be the case we have complete human genomes - get started now
- start with a set of individual genomes - segments of genomes in a graph. get context from the graph - De Bruijn graph (see the toy sketch below)
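A toy sketch of the De Bruijn graph idea referenced above (not the speaker's code): nodes are (k-1)-mers and each k-mer observed in the input genomes contributes one edge. Pan-genome and assembly tools build on far more engineered versions of this structure:

```python
from collections import defaultdict

def de_bruijn(sequences, k=4):
    """Build a De Bruijn graph: (k-1)-mer -> list of successor (k-1)-mers."""
    graph = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

genomes = ["ACGTACGTGA", "ACGTACGAGA"]     # two toy "complete genomes"
for node, successors in sorted(de_bruijn(genomes).items()):
    print(node, "->", successors)
```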
See major informatics centers on topics
- moving code to data
- driven by parallel algorithms/hardware
- shift to large populations
- applications: read mapping will fade out, new problems (at population level) will replace it
Top of the stack: Results: work at CSHL - genetics of autism #
Sample set: 3K families - simplex families
- one child has autism but rest of siblings not autistic
- sequence exomes of all individuals across families
- what do we observe relative to siblings/parents?
- focus: gene killing mutations. loss of function/ specific to autistic children to find genes associated with the disease
- identifying SNPs is quite mature - GATK (Broad), handles biases
SCALPEL - find indels from short read sequencing data
- combine best of alignment and assembly
- use standard aligner to map reads to genome. purpose of this alignment is to localize the problem (locally, not globally - one exon/region at a time)
- extract out reads that localize to a particular part
- on the fly assembly with de Bruijn graph
- find end to end haplotype paths spanning graph
- align assembled sequence to region
Experimental analysis and validation
- selected one deep coverage exome for deep analysis
- GATK, SCALPEL, SOAPindel
- 99% accuracy where all overlap
- specific to SCALPEL - 77% (more than others)
de novo genetics of autism - same number of mutations as siblings
- but gene killers - enrichment in autistic kids
- 2:1 enrichment in nonsense
- 2:1 enrichment of frameshift
- 4:1 splice site mutations
- correlation to age of father
available on bioRxiv, code available on SourceForge
Potential for big data #
- folks from Google: Flu Trends paper in Nature - 2009
- Google searches for flu-like symptoms - then outbreaks occur
- Fallacy of big data? They've gotten it wrong. 'big data hubris' - the assumption that big data are a substitute for traditional data collection and analysis. pipelines are extremely important
- risks of big data - given birthday and hometown - can predict SSN with good accuracy
Power from data aggregation - champion ourselves and the future #
- mindful of risks - over-fitting, reproducibility
- caution is prudent
- data aggregation isn’t going to solve anything- being critical - does this make sense? continuous feedback loop
What is a data scientist? Many fields. To be really successful, you need strengths, experience and expertise in these fields.
Q & A #
Q: Observation: Talking about sequencing coming down in price - What happens when sequencing becomes so cheap and democratized that anyone can do this? How do we as a community get the legislature to start thinking about these privacy concerns? We need to look at this data
A: No simple answer. Part of it will come through scientific discoveries - congressmen pay attention when there are big breakthroughs. Lobby - we need to talk to the rest of the world. Part of it is going to come in response to outbreaks - when data is abused. There's already some legislation in place so you can't get discriminated against for, say, insurance. But there are implicit discriminations. Don't know how to fix this outside of education and reaching out to the next gen.
Q: (Mesirov)
- Congratulations: terrific meeting!
- 30+ years ago I heard Grace Murray Hopper speak - she made a comment about how we are all going to be drowning in data. All kinds of data. I appreciated your comment on what we keep. Important: that we have some kind of metric of utility - huge amounts of data go untouched for long periods of time. Think about what happens with this data that is never used again. Otherwise we're all going to drown
A: The utility of data is certainly something to be considered. We're bad at estimating it. We're all hoarders. Systems are failing recently - we can't copy off a PB of data fast enough. Trying to assess the preciousness of data and time. Some metrics are hard to measure. I anticipate the storage vendors will get better at providing tools to assess what is on a filesystem. Tools today are crude; I hope these will improve. At the very least we can identify if there are big datasets we haven't accessed in years
Q: (Swedlow) At Dundee, hierarchical filesystems backed up by tape. Primary data is images and proteomics - 95% of it is not touched again 3 months later. Graph representations of sequences - we will be doing the same thing with images. Concerned with the computational cost of recalculating these graphs. How expensive will recalculation be?
A: today it’s expensive - but this is an opportunity for research. For example: at level of suffix trees - construction methods. We can dust those off and improve algorithms.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology