Big Data in Biology: Imaging/Pharmacogenomics
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Imaging/Pharmacogenomics #
Tuesday, March 25th, 2014 1:00pm - 3:00pm
http://ks.eventmobi.com/14f2/agenda/35704/288362
Speaker list #
Susan Sunkin, Allen Institute for Brain Science, USA
Allen Brain Atlas: An Integrated Neuroscience Resource -
[Abstract]
[Q&A]
Jason R. Swedlow, University of Dundee, Scotland
The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences -
[Abstract]
[Q&A]
Douglas P. W. Russell, University of Oxford, UK
Short Talk: Decentralizing Image Informatics -
[Abstract]
[Q&A]
John Overington, European Molecular Biology Laboratory, UK
Spanning Molecular and Genomic Data in Drug Discovery -
[Abstract]
[Q&A]
Allen Brain Atlas: An Integrated Neuroscience Resource #
Susan Sunkin, Allen Institute for Brain Science, USA #
Abstract #
The Allen Brain Atlas (www.brain-map.org) is a collection of open public resources (2 PB of raw data, >3,000,000 images) integrating high-resolution gene expression, structural connectivity, and neuroanatomical data with annotated brain structures, offering whole-brain and genome-wide coverage. The eight major resources currently available span across species (mouse, monkey and human) and development. In mouse, gene expression data covers the entire brain and spinal cord at multiple developmental time points through adult. Mouse data also includes brain-wide long-range axonal projections in the adult mouse as part of the Allen Mouse Brain Connectivity Atlas.
Complementing the mouse atlases, there are four human and non-human primate atlases. The Allen Human Brain Atlas, the NIH-funded BrainSpan Atlas of the Developing Human Brain, and the NIH Blueprint NHP Atlas contain genome-wide gene expression data (microarray and/or RNA sequencing) and high-resolution in situ hybridization (ISH) data for selected sets of genes and brain regions across human and non-human primate development and/or in adult. In addition, the Ben and Catherine Ivy Foundation-funded Ivy Glioblastoma Atlas Project contains gene expression data in human glioblastoma.
While the Allen Brain Atlas data portal serves as the entry point and enables searches across data sets, each atlas has its own web application and specialized search and visualization tools that maximize the scientific value of those data sets. Tools include gene searches; ISH image viewers and graphical displays; microarray and RNA sequencing data viewers; Brain Explorer® software for 3D navigation and visualization of gene expression, connectivity and anatomy; and an interactive reference atlas viewer. For the mouse, integrated search and visualization is through automated signal quantification and mapping to a common reference framework. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously.
Notes #
10 years of work and contributions from >200 people.
Allen Institute: primarily studying mouse & human #
- largest publicly available neuroscience resource
- gene expression to connectivity, cell type and circuitry
- RNA-Seq
- generated in a standardized manner, then mapped to a common reference framework
- generated 3 PB of data
- atlases: mouse brain, mouse spinal cord, developing mouse brain, then human brain and developing human brain
- all data accessed through data portal http://www.brain-map.org/
Allen Mouse Brain Atlas #
- genome wide cellular resolution atlas of gene expression in adult mouse brain - in situ hybridization
- 20K genes surveyed
- informatics goals: aid search, navigation and visualization (make it easy to find what you’re looking for)
Informatics pipeline, broken down into (see the sketch after this list):
- preprocessing
- detection
- alignment: mapped to 3D space -> where expression is in the brain and how much
- gridding
- search
- production - very product focused. Publicly available. Mine data and ask biological questions. Ends with an expression data matrix
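To make the gridding step concrete, here is a toy sketch of computing per-voxel "expression energy" by block-averaging segmented ISH signal onto a coarse grid. It only illustrates the idea, not the Allen Institute's pipeline; the block size and the exact energy definition are assumptions.

```python
import numpy as np

def expression_energy(signal, expressing_mask, block=10):
    """Collapse a registered ISH signal volume onto a coarser grid.
    Here 'expression energy' per grid voxel is taken as the sum of
    expressing-pixel intensity divided by the number of pixels in the voxel.
    Assumes each axis length is a multiple of `block`."""
    z, y, x = signal.shape
    s = (signal * expressing_mask).reshape(
        z // block, block, y // block, block, x // block, block)
    return s.sum(axis=(1, 3, 5)) / block ** 3

# Toy example: a 100x100x100 registered volume gridded down to 10x10x10
vol = np.random.rand(100, 100, 100)
mask = vol > 0.8                       # stand-in for the detection step
grid = expression_energy(vol, mask)    # shape (10, 10, 10), feeds the search step
```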
Tools to harness data generated from the pipeline
- 3d viewing tool to view neuro-anatomy and 3d gene expression for one or multiple experiments
- gene expression summaries
- synchronization feature - same location across different experiments
- image tool etv - higher-resolution image viewer. interactive 3D representation. probe and gene data available. histogram of expression energy. nice snapshot of expression, to decide if they'll do a deeper dive into the info
- Reference atlas -
- structure ontology
- annotated reference atlas plates
- can look at experimental image and look up regions
- grid data search - users can search over 25K datasets to find genes with a specific expression pattern
- differential search: high expression in one set (target) compared to a contrast set
- correlative search: find genes with a similar spatial expression profile (see the sketch below)
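A hedged sketch of what the differential and correlative searches reduce to once each gene's grid data is flattened into a vector of expression-energy values over reference-space voxels (a simplification of the real service, not its actual code):

```python
import numpy as np

def correlative_search(seed, grids):
    """Rank genes by Pearson correlation of their expression-energy vectors
    with a seed gene. seed: (n_voxels,), grids: (n_genes, n_voxels)."""
    s = (seed - seed.mean()) / seed.std()
    g = (grids - grids.mean(axis=1, keepdims=True)) / grids.std(axis=1, keepdims=True)
    r = g @ s / s.size
    return np.argsort(-r)          # gene indices, most similar first

def differential_search(grids, target_voxels, contrast_voxels):
    """Rank genes by mean expression energy in target voxels minus contrast voxels.
    target_voxels / contrast_voxels are boolean masks over voxels."""
    score = grids[:, target_voxels].mean(axis=1) - grids[:, contrast_voxels].mean(axis=1)
    return np.argsort(-score)
```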
Developing mouse brain atlas #
- build on allen mouse brain atlas
- pick genes for neural development
- use reference atlas
- creation of 3D and 4D tools and data analysis
- high-quality specimens selected, stained, images generated, regions annotated, 2D and 3D output made (Adobe Illustrator)
- Search and analysis tools - pick 2d images and get extrapolated 3d expression
- Image synchronization feature - variety of transcription factor targets
- select a location as the seed object
- will synchronize all the images you are looking at to the same location
Allen Mouse Brain Connectivity Atlas #
- high-res map of neural connections in the whole mouse brain. generate a comprehensive db of neural projections. generate 140 images per specimen at 100 micron intervals
- after injection, one mouse brain is embedded and placed on a stage; two-photon images are taken, then the brain is moved up and a section is sliced off, then another image is taken. block-face imaging throughout the entire brain
- looking at fluorescent projections
- spatially map the brain to a 3D reference model
- comprehensive coverage for projection mapping - wt mouse, but interested in cell type. projection profiling with Cre-driver mice
- can look at trajectory and topography
Other tools - brain-wide data - can pinpoint a region of interest and dive deeper
Allen Human Brain Atlas #
- all genes - all structures. classical histology and neuroanatomy
- cellular resolution data - scale. only looked at a subset of genes on a subset of structures (very question driven, autism, schizophrenia, etc)
- not possible to process the whole brain at genome scale in one piece. generate large slabs - create a jigsaw puzzle and assemble at the end
- generate histology data, neuroanatomical regions of interest generated
- LIMS system to assemble the puzzle
- structural ontology - to generate summary stats
- Search: search by gene or structure, NeuroBlast correlative search, differential search
- 3D brain explorer
- Tissue acquisition and processing. postmortem brains. no neuropsychiatric disorders
- MR registration volume renderings: rigid and non-rigid registration had to be done
- tissue sampling: slabs partitioned, sectioned and mapped back into MR space
- tissue block to MR registration: place landmarks on scans, matched with the corresponding image in 3D space (see the sketch below)
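For the landmark step, a minimal sketch of the standard least-squares rigid alignment (Kabsch/Procrustes) from matched landmark pairs. This is generic illustration code, not the Allen pipeline, and the non-rigid refinement mentioned above would follow separately.

```python
import numpy as np

def rigid_from_landmarks(moving, fixed):
    """Least-squares rigid transform (R, t) such that fixed ~= moving @ R.T + t,
    from matched 3-D landmark pairs (Kabsch algorithm). Both arrays: (n, 3)."""
    mu_m, mu_f = moving.mean(axis=0), fixed.mean(axis=0)
    H = (moving - mu_m).T @ (fixed - mu_f)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_f - R @ mu_m
    return R, t

# Toy check: landmarks on a tissue-block image vs. the same points in MR space
moving = np.random.rand(6, 3)
fixed = moving + np.array([1.0, -2.0, 0.5])        # pure translation for the demo
R, t = rigid_from_landmarks(moving, fixed)
print(np.allclose(moving @ R.T + t, fixed))        # True
```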
Developing Human Brain project #
four main components
- developmental transcriptome
- prenatal microarray: high-res, 300 distinct structures
- ISH: just a subset of regions/genes
- reference atlases: few generated for this project (prenatal and adults), include histology and imaging data
Prenatal - LMD Microarray Data
- fresh tissue frozen and slabbed
- histology determines regions of interest
- sent for hybridization to Agilent microarrays. same as adult data, for cross-comparison
- display with online tool: anatomical view and heat map view
Q & A #
Q: (Stein) interested in how labour intensive human tissue blocks were- were the markers placed by hand?
A: Not for every Z level of the MRI, but yes labour intensive. Many steps in order to use the automated pipeline.
Q: (Schatz) at CSHL big study in exome sequencing - which of these genes are expressed in brain at various levels of development?
A: Use our API to pull out data from different datasets to produce that.
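For context, a minimal sketch of that kind of API call, assuming the public RMA query endpoint at api.brain-map.org; the exact criteria string (gene acronym, product filter, row count) is an illustrative guess rather than something given in the talk.

```python
import requests

# Query the Allen Brain Atlas RMA API for ISH section data sets of one gene.
URL = "http://api.brain-map.org/api/v2/data/query.json"
criteria = ("model::SectionDataSet,"
            "rma::criteria,genes[acronym$eq'Pvalb'],products[abbreviation$eq'Mouse'],"
            "rma::options[num_rows$eq25]")

resp = requests.get(URL, params={"criteria": criteria}, timeout=30)
resp.raise_for_status()
for dataset in resp.json().get("msg", []):
    print(dataset.get("id"), dataset.get("plane_of_section_id"))
```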
Q: Different imaging methods and approaches - what's the Allen's approach to presenting the information in some way that could be queried across different domains and at the cell level?
A: The level of registration is not down to cell - it’s domains.
The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences #
Jason R. Swedlow, University of Dundee, Scotland #
Abstract #
Despite significant advances in cell and tissue imaging instrumentation and analysis algorithms, major informatics challenges remain unsolved: file formats are proprietary, facilities to store, analyze and query numerical data or analysis results are not routinely available, integration of new algorithms into proprietary packages is difficult at best, and standards for sharing image data and results are lacking. We have developed an open-source software framework to address these limitations called the Open Microscopy Environment (http://openmicroscopy.org). OME has three components—an open data model for biological imaging, standardised file formats and software libraries for data file conversion and software tools for image data management and analysis.
The OME Data Model (http://openmicroscopy.org/site/support/ome-model/) provides a common specification for scientific image data and has recently been updated to more fully support fluorescence filter sets, the requirement for unique identifiers, and screening experiments using multi-well plates.
The OME-TIFF file format (http://openmicroscopy.org/site/support/ome-model/ome-tiff) and the Bio-Formats file format library (http://openmicroscopy.org/site/products/bio-formats) provide an easy-to-use set of tools for converting data from proprietary file formats. These resources enable access to data by different processing and visualization applications, sharing of data between scientific collaborators and interoperability in third party tools like Fiji/ImageJ.
The Java-based OMERO platform (http://openmicroscopy.org/site/products/omero) includes server and client applications that combine an image metadata database, a binary image data repository and visualization and analysis by remote access. The current stable release of OMERO (OMERO-4.4; http://openmicroscopy.org/site/support/omero4/downloads) includes a single mechanism for accessing image data of all types – regardless of original file format – via Java, C/C++ and Python and a variety of applications and environments (e.g., ImageJ, Matlab and CellProfiler). This version of OMERO includes a number of new functions, including SSL-based secure access, distributed compute facility, filesystem access for OMERO clients, and a scripting facility for image processing. An open script repository allows users to share scripts with one another. A permissions system controls access to data within OMERO and enables sharing of data with users in a specific group or even publishing of image data to the worldwide community. Several applications that use OMERO are now released by the OME Consortium, including a FLIM analysis module, an object tracking module, two image-based search applications, and an automatic image tagging tool.
Notes #
Representing a consortium of 10 different groups across the US, UK, and Europe
Outline:
- Problem,
- 2 possible solutions,
- sharing and publishing data,
- directions,
- imaging community,
- publishing large imaging datasets
Problem #
- image: cancer cell preparing to divide in mitosis.
- In the early days, taking such an image was a big deal - huge improvements since in detectors and computation power
- we take these images and work hard to get them on journal covers
BUT - the most important thing to understand:
- every one of these pixels is a quantitative measurement
- this is a temporally resolved measurement
- easy to generate 50G of data in an afternoon. biologists are enterprise data generators
- trying to use these images as measurements. this data should be a resource - collaboration, release the data to the community
- the image problem is ubiquitous, electron microscopy, physiology, cells, in vivo, pathology, and more -> all major enterprise data generators
- the scientists that use these technologies are not data scientists. they need these kinds of technologies and have the ambition to make measurements at scale, but not the tools
2 Possible Solutions #
- aspire to build solutions that address all these domains
OME - towards image informatics
- does not create new imaging or visualization tools
- all about interoperability:
- some new imaging modality is developed and can be accessed by existing tools
- new method for image analysis can be run on existing modalities
- modalities are changing so quickly - standards are useless
- no matter what’s coming off this imaging system, some tool will be able to interact
OME - founded over lunch with cell biologists
- well plates becoming popular
- people making microscopes, chemical libraries and cell lines -> no one is doing anything about the data coming off
- partner with other institutions - open source work (GPL license)
- public road mapping, GitHub, continuous integration, Kanban
- release:
- specification for data - OME-TIFF - open image data file
- bio-formats
- OMERO, image data management platform
Open data formats: spend time worrying about the OME data model (XML-based specs for datatypes).
Around the image acquisition event itself: model the status of the detector, lens, etc.
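As a downstream illustration of the data model: an OME-TIFF carries its acquisition metadata as OME-XML inside the TIFF, so any reader can recover it. A small sketch using the third-party tifffile library (not an OME tool; the file name is a placeholder):

```python
import tifffile

# Read the pixel data and the embedded OME-XML from an OME-TIFF file.
with tifffile.TiffFile("sample.ome.tiff") as tif:
    pixels = tif.asarray()         # the image planes as a numpy array
    ome_xml = tif.ome_metadata     # OME-XML string (None if not an OME-TIFF)

print(pixels.shape)
print(ome_xml[:200] if ome_xml else "no OME-XML block found")
```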
Bio-Formats
- simple and tedious: reverse engineer proprietary formats; Java lib; read each one and convert to the common model
- doing this for 10 years
- we get data from the community
- best collection of imaging files in the world: don’t have facilities to do anything other than hold this privately
- installed at 65K sites worldwide
- 2 FTEs working on this project
- standardized interface to all formats (see the sketch below)
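A sketch of what that standardized interface looks like from Python, via the community python-bioformats/javabridge wrappers around the Java library (an assumption on my part; the proprietary file name is a placeholder):

```python
import javabridge
import bioformats

# Bio-Formats runs on the JVM, so start it with the bundled jars first.
javabridge.start_vm(class_path=bioformats.JARS)
try:
    # Metadata from a proprietary vendor file, translated to OME-XML
    ome_xml = bioformats.get_omexml_metadata("scan.lif")
    # Pixel data through the same reader interface, one plane at a time
    with bioformats.ImageReader("scan.lif") as reader:
        first_plane = reader.read(z=0, t=0, series=0)
    print(first_plane.shape, len(ome_xml))
finally:
    javabridge.kill_vm()
```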
OMERO #
- clients on top, servers on bottom
- storage of images - relational DB for metadata; HDF5-based structures
- text search
- building OMERO to solve a problem in a lab, at an institute repository, a journal, a national repo
- idea is that we have to support as many client architectures as we can. Ice - middleware, used by Skype, great for large data graphs/binary data http://www.zeroc.com/ice.html
- rich Java client - OMERO.insight
- tree based files, thumbnail, region views
- client-server architecture - 300G of data viewed across the wire
- remote access (see the sketch after this list)
- web based view (x-platform)
- high content assays - modelled in data model
- digital pathology - tile-based viewer. web-based and Java-based on the same API
results
- treat result outputs as annotations
- text-based, indexed with Lucene
- large tabular results - relational, HDF5-backed
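To make the remote-access, client-server idea concrete, a minimal sketch using OMERO's Python gateway (BlitzGateway); the host and credentials are placeholders:

```python
from omero.gateway import BlitzGateway

# Browse projects/datasets/images on a remote OMERO server without copying
# any raw files locally; only metadata crosses the wire here.
conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
if conn.connect():
    try:
        for project in conn.getObjects("Project"):
            for dataset in project.listChildren():
                for image in dataset.listChildren():
                    print(project.getName(), dataset.getName(),
                          image.getId(), image.getName())
    finally:
        conn.close()
```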
// accidentally closed my browser…//
Sharing and Publishing data #
- sharing data: e.g. lab web page, few lines of js, embed viewer
- institutional repo: publish paper, release data from an OMERO-based system
- public resources: compiling dynamic data
- PDB, EMDataBank - publishing with OMERO
Directions #
how do we build an application that can work in a rapidly changing field like imaging?
- leverage the OME model
- meta-compute -
example: using Galaxy, clinical data set - need a metadata management system
- uses OMERO underneath to store metadata
- problem: every time there's a new gene release it needs to recalculate. changed the data model to handle the metadata. also used OMERO for histological images
Uses of OMERO
- OMERO and ImageJ - plugins
- MATLAB and OMERO
- OMERO & u-track (custom object tracking software - MATLAB based)
- OMERO & FLIMfit - fluorescence lifetime
- OMERO.searcher
- OMERO & auto-tagging
- user trying to access data - scan data and pick up tags
- figures: when we submit figures to journals, we wrestle with Adobe Illustrator
- always removed from the original data structure to create a JPEG - lose the original context
- JS-based viewer - to keep the linkage between the representation of the data and the data itself. figure = JS, not TIFF
- OMERO and Bio-Formats
- data import and access
- digital pathology and high-content screening
- data will be written once (at multi-TB scales); use OMERO and pull images off directly - don't copy the data
Imaging Community #
- Annual user meetings
- active community of open source projects
- working towards progress
Publishing Large Imaging Datasets #
publishing image data: PerkinElmer's Columbus - OMERO in a box
Journal of Cell Biology - built the JCB viewer - JS
- large image data
- digital pathology to scale
phenotypic screening - high-content screens
- many TB of data
- published data, all authors call, genomic information
- authors listed free text of phenotypes they saw
- cell phenotype database @ EBI
- combines all published high-content screens
- take manual author annotations
- create ontology: common way to annotate this data
More datatypes, more storage, more analysis
Q & A #
Q: (Schatz) a number of the image formats are copyrighted, etc. What is your experience as you reverse engineer these formats? Legal problems?
A: Almost every commercial vendor, when they build a new imaging system they build a new image format. Just changing now. In general, if you look at the end user license - it will forbid you from reverse engineering. It does not forbid you from uploading it to us so we can reverse engineer it. That's what we do. Last few years - vendors coming to us - please make sure that this file format is supported on the date that we release it. Sometimes they take our metadata specs and drop them into theirs. A lot is opening up and people are more willing to work with us.
Q: From a CS lab that does open source dev: you said you release everything GPL. We release everything apache - a lot of people in industry like it better. Why choose GPL? Feedback?
A: Short version: when we started, there wasn't the richness in licenses. To be blunt, we want people to contribute. As the guy who has to pay an enormous number of salaries, we're fine when a company wants to use our software, but we need some way to keep the project going and feed everyone. We get a licensing fee from PerkinElmer (closed) to help development.
Short Talk: Decentralizing Image Informatics #
Douglas P.W. Russell, University of Oxford, UK #
Department of Biochemistry
Member of the Open Microscopy consortium
Abstract #
The Open Microscopy Environment (OME; http://openmicroscopy.org) builds software tools that facilitate image informatics. An open file format (OME-TIFF) and software library (Bio-Formats) enable the free access to multidimensional (5D+) image data regardless of software or platform. A data management server (OMERO) provides an image data management solution for labs and institutes by centralizing the storage of image data and providing the biologist a means to manage that data remotely through a multi-platform API. This is made possible by the Bio-Formats library, extracting image metadata into a PostgreSQL database for fast lookup, and multi-zoom image previews enable visual inspection without the cost of transmitting the actual raw data to the user. In addition to the convenience for individual biologists, sharing data with collaborators becomes simpler and avoids data duplication.
Addressing the next scale of data challenges, e.g. at the national or international level, has brought the OME platform up against some hard barriers. Already, the data output of individual imaging systems has grown to the multi-TB level. Integrating multi-TB datasets from dispersed locations, and integrating analysis workflows will soon challenge the basic assumptions that underlie a system like OMERO. This is particularly true for automated processing: OMERO.scripts provides a facility for running executables in the locality of the data. The use of ZeroC's IceGrid permits farming out such tasks in Python, C++, Java, and in OMERO5 even ImageJ2 tasks to nodes which all use the same remote API. However, OMERO does not yet provide a solution for decentralised data and workflow management.
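A toy example of the OMERO.scripts idea mentioned above: a script registered on the server runs next to the data and only ships small results back to the caller. The script name, parameter and output here are invented for illustration.

```python
import omero.scripts as scripts
from omero.gateway import BlitzGateway
from omero.rtypes import rstring

# Declared parameters become a form in the OMERO clients; the body runs
# server-side, close to the binary data.
client = scripts.client(
    "Count_Planes.py",
    "Toy server-side script: report the number of Z sections for one image.",
    scripts.Long("Image_ID", optional=False),
)
try:
    conn = BlitzGateway(client_obj=client)
    image = conn.getObject("Image", client.getInput("Image_ID", unwrap=True))
    client.setOutput("Message", rstring("SizeZ=%d" % image.getSizeZ()))
finally:
    client.closeSession()
```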
A logical next step for OMERO is to decentralize the data by increasing the proximity of data storage to processing resources, reducing bottlenecks through redundancy, and enabling vast data storage on commodity hardware rather than expensive, enterprise storage.
Notes #
How OMERO can scale with big data, higher demand #
1) as scope and # of users increase, total data increases
- one end: 1 user or small group of users
- a user with a minimal amount of sysadmin effort can install it and get it working
- other end: national resources, institute: need a serious sysadmin team
- tradeoffs:
2) Data set size: high-content screen
- many images, each well, many dimensions
- phenotypic data attached to each well
- links to external genomic resources
- all of this is a huge amount of data. One screen can be TBs in size
Once data is in OMERO - excellent data management tool
- until you get it in there - need to make choices on how to put it in
- smaller scale: input data and archive original image. extract metadata for search
- when analysis needs pixel data - extract at runtime
- in reality - users need access to the filesystem where the raw data is. Moving data around is infeasible. now, extract metadata and keep a reference to where the raw file is. helps with the data duplication problem
- preferable to store data in read optimized format. trade some operation efficiency for some possible data loss
OMERO services #
- all run on Ice
- http://www.zeroc.com/ice.html
- process, indexer, and more - all on ice
Ice gives us the capability to distribute some services to other hosts
- pretty seamless - can take advantage of local compute
- can do this multiple times to access more compute resources
- but then each has to communicate back to original
- >> decentralizing omero
Decentralized
- access data directly - both servers can access resources (filesystem) directly
- once we have that, we can scale - more servers
- this has the potential to address image management on scaled level
- can deploy many OMERO components on many hosts - make it more powerful, absorb volumes of data
- can take advantage of cloud computing - can scale permanently or temporarily - spin up more hosts
- it will be necessary to augment OMERO's resources with distributed filesystems - store huge amounts of pixel or image data
- can also make use of Cassandra clusters - caching frequently accessed data. much bigger scale (see the sketch below)
That's how we'd like to cope with big data in OMERO while still making it accessible for a single user who wants to install it locally
Q & A #
Q: (Schatz) are you considering map-reduce or just storage?
A: we could definitely use them both, yes
Spanning Molecular and Genomic Data in Drug Discovery #
John Overington, European Molecular Biology Laboratory, UK #
Abstract #
The link between biological and chemical worlds is of critical importance in many fields, not least that of healthcare and chemical safety assessment. A major focus in the integrative understanding of biology is genes/proteins and the networks and pathways describing their interactions and functions; similarly, within chemistry there is much interest in efficiently identifying drug-like, cell-penetrant compounds that specifically interact with and modulate these targets. The number of genes of interest is of the range of 10^5 to 10^6, which is modest with respect to plausible drug-like chemical space - 10^20 to 10^60. We have built a public database linking chemical structures (10^6) to molecular targets (10^4), covering molecular interactions and pharmacological activities and Absorption, Distribution, Metabolism and Excretion (ADME) properties (http://www.ebi.ac.uk/chembl) in an attempt to map the general features of molecular properties and features important for both small molecule and protein targets in drug discovery. We have then used this empirical kernel of data to extend analysis across the human genome, and to large virtual databases of compound structures - we have also integrated these data with genomics datasets, such as the GWAS catalogue.
Notes #
Chemistry. Mapping of Chemistry - interface of chemistry with genomic and drug discovery data.
Background #
chemical space: how big is the chemical space? GDB-13 - all possible (stable) molecules with up to 13 heavy atoms
- 1B structures
- largest database of small organic molecules
- GDB-17 - 166B structures - not available. Intellectual property issues
not all molecules can be drugs - needs to be bioactive
- physical properties access to ‘target’
- ADMET - absorption, distribution, metabolism, excretion & toxicity
Lipinski - a molecule within these parameters was likely to have good oral drug properties (see the sketch after this list). http://en.wikipedia.org/wiki/Lipinski’s_rule_of_five
- different for topical and parenterally dosed drugs
- pretty good guide
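A quick, hedged illustration of the rule of five used as a filter, written with RDKit (my own example, not something shown in the talk):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles):
    """Lipinski's rule of five: MW <= 500, logP <= 5,
    H-bond donors <= 5, H-bond acceptors <= 10."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

print(passes_rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> True
```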
10^19-10^23 Lipinski-like small molecules - potential drugs
around 21-23 peak in curve - size of heavy atom counts for drugs.
drug discovery - making molecules slightly larger than they need to be
GDB - 30% of all known drugs ?
Targets: Homo sapiens, ~21K genes.
Only 1% of the genome is a drug target - i.e., genes we've been able to develop drugs against.
we’ve tried many many more
Chemogenomics = chemistry + genome derived objects #
- exploration of small-molecule bioactivity space at genomic scale
- possible space: 10^6 (targets), drug target proteins 10^2
- drugs: all reasonable 10^22, screened: 10^7
- similar compound structures have similar functions
ChEMBL - training set; largest db of medicinal chemistry data 1.4M compounds #
- adding plant data later this year
- open
- download/access - db dumps, semantic web rdf - SPARQL, virtualization (ChEMBL appliances)
- ChEMpi - raspberry pi
- data comes from the literature - extract structures from the text, link to assays, link to sequence, store functional data. allows chaining targets to phenotypic effects
- quantitative data
- target types: single gene - all the way to - organisms
- compound searching - matching structure space ("2D BLAST"); see the sketch below
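For the access routes and the "2D BLAST"-style structure searching above, a hedged sketch using the chembl_webresource_client Python package, one of several ways into ChEMBL; the similarity threshold and target ID below are placeholders.

```python
from chembl_webresource_client.new_client import new_client

aspirin = "CC(=O)Oc1ccccc1C(=O)O"

# 2-D similarity search around a query structure (threshold in percent)
for hit in new_client.similarity.filter(smiles=aspirin, similarity=85)[:5]:
    print(hit["molecule_chembl_id"], hit["similarity"])

# Quantitative bioactivity records for one target (placeholder ChEMBL ID)
activities = new_client.activity.filter(target_chembl_id="CHEMBL279",
                                        standard_type="IC50")
print(len(activities), "IC50 measurements")
```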
different drug structures - ligand efficiency
- drugs are efficient, every atom counts - avoid lipophilicity
- interested in balance between binding efficiency and molecular size
- target class data
assay organism data
- differences between animal model and the effects of compounds in humans
- failure in pre-clinical development - works in animal models, but not humans
- trying to understand systematic reasons
SureChEMBL - acquired SureChem
- new public chemistry
- extends coverage of chemical structures from full-text patents; 15M structures
- add target, sequences, disease, animal model, cell-line
Compound Integration
- ChEMBL - literature
- SureChEMBL- patent
Different Types of Drugs
- 2/3 drugs are small molecules
- in late stage development - majority are small molecules
- Therefore, focus on small molecules for drug discovery
Visualizations #
- Polypharmacology via binding sites: majority of pharmacological activity focused on brain
- Affinity of drugs for ‘Targets’: drugs are weaker than we think - penalty for tight binding drugs
- Clinical Candidates: coverage of clinical development candidates -
- Selectivity - circos plot: map promiscuity across tree
Pharma Productivity problem #
- biotech boom
- productivity has fallen off the cliff
how many compounds does a company need to make before they develop a compound
- 100K compounds synthesized to develop drug
- now 32x that to get a potential drug
- Now: pharma needs on average to synthesize and test 250K compounds for each launched drug. not sustainable
- Trying to be smarter, use db, to help with this
Cancer Drugs and Targets #
- taking ChEMBL and thinking of drug discovery in a cancer setting
- huge investment in genomic studies looking for genomic variation - causes of cancer. sequencing, find driver genes, look at other datasets, find overlaps
come out with a set of potential targets
- how do you select from these?
- we can compare against things we had in the past
- majority of the success from the past we would not have discovered using genomic sequencing techniques
- canSAR - large-scale integration of public and proprietary data built on top of ChEMBL - select compounds likely to be good https://cansar.icr.ac.uk/
Q & A #
Q: (Ouellette) finding out the chemical structures of various organisms; What about Micro-biome space?
A: Different animals have different physical space for the drugs they like. Controversy in the literature - physical space for antibiotics. Microbiome - fascinating - orally dosed drugs also meet bacteria in the gut. Effects of gut bacteria on compounds - sometimes needed to activate the substance
Q: (Stein) Curious about the 1B+ compounds in GDB-17. Can't release because of IP? Algorithm or structures?
A: Just too big. Drug discovery community -
If you publish the structures of all possible drugs => they can't be patented - so it would destroy all possible intellectual property.
Q: For compounds w/ rich sequence information (transcriptome wide/proteomic) is it integrated?
A: yes and no, transcript microarray data goes into GEO or ArrayExpress. Links - compounds in ChEMBL. Reality - very small numbers right now. ChEMBL is part of a suite of resources at EBI, linked to other resources.
Q: Is there a way through ChEMBL to discover drugs that are potentially synergistic? Drugs with same structures and hit same targets. Connectivity map? X-ref between ChEMBL and connectivity map?
A: One of the most common uses of ChEMBL. combine drugs against the same targets. No links to connectivity map - people have done that.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: The Next 10 Years of Quantitative Biology