Big Data in Biology: Imaging/Pharmacogenomics
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Imaging/Pharmacogenomics #
Tuesday, March 25th, 2014 1:00pm - 3:00pm
http://ks.eventmobi.com/14f2/agenda/35704/288362
Speaker list #
Susan Sunkin, Allen Institute for Brain Science, USA
Allen Brain Atlas: An Integrated Neuroscience Resource -
[Abstract]
[Q&A]
Jason R. Swedlow, University of Dundee, Scotland
The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences -
[Abstract]
[Q&A]
Douglas P. W. Russell, University of Oxford, UK
Short Talk: Decentralizing Image Informatics -
[Abstract]
[Q&A]
John Overington, European Molecular Biology Laboratory, UK
Spanning Molecular and Genomic Data in Drug Discovery -
[Abstract]
[Q&A]
Allen Brain Atlas: An Integrated Neuroscience Resource #
Susan Sunkin, Allen Institute for Brain Science, USA #
Abstract #
The Allen Brain Atlas (www.brain-map.org) is a collection of open public resources (2 PB of raw data, >3,000,000 images) integrating high-resolution gene expression, structural connectivity, and neuroanatomical data with annotated brain structures, offering whole-brain and genome-wide coverage. The eight major resources currently available span across species (mouse, monkey and human) and development. In mouse, gene expression data covers the entire brain and spinal cord at multiple developmental time points through adult. Mouse data also includes brain-wide long-range axonal projections in the adult mouse as part of the Allen Mouse Brain Connectivity Atlas.
Complementing the mouse atlases, there are four human and non-human primate atlases. The Allen Human Brain Atlas, the NIH-funded BrainSpan Atlas of the Developing Human Brain, and the NIH Blueprint NHP Atlas contain genome-wide gene expression data (microarray and/or RNA sequencing) and high-resolution in situ hybridization (ISH) data for selected sets of genes and brain regions across human and non-human primate development and/or in adult. In addition, the Ben and Catherine Ivy Foundation-funded Ivy Glioblastoma Atlas Project contains gene expression data in human glioblastoma.
While the Allen Brain Atlas data portal serves as the entry point and enables searches across data sets, each atlas has its own web application and specialized search and visualization tools that maximize the scientific value of those data sets. Tools include gene searches; ISH image viewers and graphical displays; microarray and RNA sequencing data viewers; Brain Explorer® software for 3D navigation and visualization of gene expression, connectivity and anatomy; and an interactive reference atlas viewer. For the mouse, integrated search and visualization is through automated signal quantification and mapping to a common reference framework. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously.
Notes #
10 years of work and contributions from >200 people.
Allen Institute: primarily studying mouse & human #
- largest publicly available neuroscience resource
- gene expression to connectivity, cell type and circuitry
- RNA-Seq
- generated in a standardized manner, then mapped to a common reference framework
- generated 3 PB of data
- atlases: mouse brain, mouse spinal cord, developing mouse brain, then human brain and developing human brain
- all data accessed through data portal http://www.brain-map.org/
Allen Mouse Brain Atlas #
- genome wide cellular resolution atlas of gene expression in adult mouse brain - in situ hybridization
- 20K genes surveyed
- informatics goals: aid search, navigation and visualization (make it easy to find what you’re looking for)
Informatics pipeline, broken down into (see the sketch after this list):
- preprocessing
- detection
- alignment: mapped to 3D space -> where expression is in the brain and how much
- gridding
- search
- production - very product focused. Publicly available. Mine data and ask biological questions. Ends with an expression data matrix
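To make the gridding step concrete, here is a toy sketch of computing per-voxel "expression energy" by block-averaging segmented ISH signal onto a coarse grid. It only illustrates the idea, not the Allen Institute's pipeline; the block size and the exact energy definition are assumptions.

```python
import numpy as np

def expression_energy(signal, expressing_mask, block=10):
    """Collapse a registered ISH signal volume onto a coarser grid.
    Here 'expression energy' per grid voxel is taken as the sum of
    expressing-pixel intensity divided by the number of pixels in the voxel.
    Assumes each axis length is a multiple of `block`."""
    z, y, x = signal.shape
    s = (signal * expressing_mask).reshape(
        z // block, block, y // block, block, x // block, block)
    return s.sum(axis=(1, 3, 5)) / block ** 3

# Toy example: a 100x100x100 registered volume gridded down to 10x10x10
vol = np.random.rand(100, 100, 100)
mask = vol > 0.8                       # stand-in for the detection step
grid = expression_energy(vol, mask)    # shape (10, 10, 10), feeds the search step
```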
Tools to harness data generated from the pipeline
- 3d viewing tool to view neuro-anatomy and 3d gene expression for one or multiple experiments
- gene expression summaries
- synchronization feature - same location across different experiments
- image tool etv - higher-resolution image viewer. interactive 3D representation. probe and gene data available. histogram of expression energy. nice snapshot of expression, to decide if they'll do a deeper dive into the info
- Reference atlas -
- structure ontology
- annotated reference atlas plates
- can look at experimental image and look up regions
- grid data search - users can search over 25K datasets to find genes with a specific expression pattern
- differential search: high expression in one set (target) compared to a contrast set
- correlative search: find genes with a similar spatial expression profile (see the sketch below)
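A hedged sketch of what the differential and correlative searches reduce to once each gene's grid data is flattened into a vector of expression-energy values over reference-space voxels (a simplification of the real service, not its actual code):

```python
import numpy as np

def correlative_search(seed, grids):
    """Rank genes by Pearson correlation of their expression-energy vectors
    with a seed gene. seed: (n_voxels,), grids: (n_genes, n_voxels)."""
    s = (seed - seed.mean()) / seed.std()
    g = (grids - grids.mean(axis=1, keepdims=True)) / grids.std(axis=1, keepdims=True)
    r = g @ s / s.size
    return np.argsort(-r)          # gene indices, most similar first

def differential_search(grids, target_voxels, contrast_voxels):
    """Rank genes by mean expression energy in target voxels minus contrast voxels.
    target_voxels / contrast_voxels are boolean masks over voxels."""
    score = grids[:, target_voxels].mean(axis=1) - grids[:, contrast_voxels].mean(axis=1)
    return np.argsort(-score)
```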
Developing mouse brain atlas #
- build on allen mouse brain atlas
- pick genes for neural development
- use reference atlas
- creation of 3D and 4D tools and data analysis
- high-quality specimens selected, stained, images generated, regions annotated, 2D and 3D output made (Adobe Illustrator)
- Search and analysis tools - pick 2d images and get extrapolated 3d expression
- Image synchronization feature - variety of transcription factor targets
- select a location as the seed object
- will synchronize all the images you are looking at to the same location
Allen Mouse Brain Connectivity Atlas #
- high-res map of neural connections in the whole mouse brain. generate a comprehensive db of neural projections. generate 140 images per specimen at 100 micron intervals
- after injection, one mouse brain is embedded and placed on a stage; two-photon images are taken, then the brain is moved up and a section is sliced off, then another image is taken. block-face imaging throughout the entire brain
- looking at fluorescent projections
- spatially map the brain to a 3D reference model
- comprehensive coverage for projection mapping - wt mouse, but interested in cell type. projection profiling with Cre-driver mice
- can look at trajectory and topography
Other tools - brain-wide data - can pinpoint a region of interest and dive deeper
Allen Human Brain Atlas #
- all genes - all structures. classical histology and neuroanatomy
- cellular resolution data - scale. only looked at a subset of genes on a subset of structures (very question driven, autism, schizophrenia, etc)
- not possible to process the whole brain at genome scale in one piece. generate large slabs - create a jigsaw puzzle and assemble at the end
- generate histology data, neuroanatomical regions of interest generated
- LIMS system to assemble the puzzle
- structural ontology - to generate summary stats
- Search: search by gene or structure, NeuroBlast correlative search, differential search
- 3D brain explorer
- Tissue acquisition and processing. postmortem brains. no neuropsychiatric disorders
- MR registration volume renderings: rigid and non-rigid registration had to be done
- tissue sampling: slabs partitioned, sectioned and mapped back into MR space
- tissue block to MR registration: place landmarks on scans, matched with the corresponding image in 3D space (see the sketch below)
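For the landmark step, a minimal sketch of the standard least-squares rigid alignment (Kabsch/Procrustes) from matched landmark pairs. This is generic illustration code, not the Allen pipeline, and the non-rigid refinement mentioned above would follow separately.

```python
import numpy as np

def rigid_from_landmarks(moving, fixed):
    """Least-squares rigid transform (R, t) such that fixed ~= moving @ R.T + t,
    from matched 3-D landmark pairs (Kabsch algorithm). Both arrays: (n, 3)."""
    mu_m, mu_f = moving.mean(axis=0), fixed.mean(axis=0)
    H = (moving - mu_m).T @ (fixed - mu_f)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_f - R @ mu_m
    return R, t

# Toy check: landmarks on a tissue-block image vs. the same points in MR space
moving = np.random.rand(6, 3)
fixed = moving + np.array([1.0, -2.0, 0.5])        # pure translation for the demo
R, t = rigid_from_landmarks(moving, fixed)
print(np.allclose(moving @ R.T + t, fixed))        # True
```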
Developing Human Brain project #
four main components
- developmental transcriptome
- prenatal microarray: high-res, 300 distinct structures
- ISH: just a subset of regions/genes
- reference atlases: few generated for this project (prenatal and adults), include histology and imaging data
Prenatal - LMD Microarray Data
- fresh tissue frozen and slabbed
- histology determines regions of interest
- sent for hybridization to Agilent microarrays. same as adult data, for cross-comparison
- display with online tool: anatomical view and heat map view
Q & A #
Q: (Stein) interested in how labour intensive human tissue blocks were- were the markers placed by hand?
A: Not for every Z level of the MRI, but yes labour intensive. Many steps in order to use the automated pipeline.
Q: (Schatz) at CSHL big study in exome sequencing - which of these genes are expressed in brain at various levels of development?
A: Use our API to pull out data from different datasets to produce that.
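For context, a minimal sketch of that kind of API call, assuming the public RMA query endpoint at api.brain-map.org; the exact criteria string (gene acronym, product filter, row count) is an illustrative guess rather than something given in the talk.

```python
import requests

# Query the Allen Brain Atlas RMA API for ISH section data sets of one gene.
URL = "http://api.brain-map.org/api/v2/data/query.json"
criteria = ("model::SectionDataSet,"
            "rma::criteria,genes[acronym$eq'Pvalb'],products[abbreviation$eq'Mouse'],"
            "rma::options[num_rows$eq25]")

resp = requests.get(URL, params={"criteria": criteria}, timeout=30)
resp.raise_for_status()
for dataset in resp.json().get("msg", []):
    print(dataset.get("id"), dataset.get("plane_of_section_id"))
```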
Q: Different imaging methods and approaches - what's the Allen's approach to presenting the information in some way that could be queried across different domains and at the cell level?
A: The level of registration is not down to cell - it’s domains.
The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences #
Jason R. Swedlow, University of Dundee, Scotland #
Abstract #
Despite significant advances in cell and tissue imaging instrumentation and analysis algorithms, major informatics challenges remain unsolved: file formats are proprietary, facilities to store, analyze and query numerical data or analysis results are not routinely available, integration of new algorithms into proprietary packages is difficult at best, and standards for sharing image data and results are lacking. We have developed an open-source software framework to address these limitations called the Open Microscopy Environment (http://openmicroscopy.org). OME has three components—an open data model for biological imaging, standardised file formats and software libraries for data file conversion and software tools for image data management and analysis.
The OME Data Model (http://openmicroscopy.org/site/support/ome-model/) provides a common specification for scientific image data and has recently been updated to more fully support fluorescence filter sets, the requirement for unique identifiers, and screening experiments using multi-well plates.
The OME-TIFF file format (http://openmicroscopy.org/site/support/ome-model/ome-tiff) and the Bio-Formats file format library (http://openmicroscopy.org/site/products/bio-formats) provide an easy-to-use set of tools for converting data from proprietary file formats. These resources enable access to data by different processing and visualization applications, sharing of data between scientific collaborators and interoperability in third party tools like Fiji/ImageJ.
The Java-based OMERO platform (http://openmicroscopy.org/site/products/omero) includes server and client applications that combine an image metadata database, a binary image data repository and visualization and analysis by remote access. The current stable release of OMERO (OMERO-4.4; http://openmicroscopy.org/site/support/omero4/downloads) includes a single mechanism for accessing image data of all types – regardless of original file format – via Java, C/C++ and Python and a variety of applications and environments (e.g., ImageJ, Matlab and CellProfiler). This version of OMERO includes a number of new functions, including SSL-based secure access, distributed compute facility, filesystem access for OMERO clients, and a scripting facility for image processing. An open script repository allows users to share scripts with one another. A permissions system controls access to data within OMERO and enables sharing of data with users in a specific group or even publishing of image data to the worldwide community. Several applications that use OMERO are now released by the OME Consortium, including a FLIM analysis module, an object tracking module, two image-based search applications, and an automatic image tagging tool.
Notes #
Representing a consortium of 10 different groups across the US, UK, and Europe
Outline:
- Problem,
- 2 possible solutions,
- sharing and publishing data,
- directions,
- imaging community,
- publishing large imaging datasets
Problem #
- image: cancer cell preparing to divide in mitosis.
- In the early days, taking such an image was a big deal - huge improvements since in detectors and computation power
- we take these images and work hard to get them on journal covers
BUT - the most important thing to understand:
- every one of these pixels is a quantitative measurement
- this is a temporally resolved measurement
- easy to generate 50G of data in an afternoon. biologists are enterprise data generators
- trying to use these images as measurements. this data should be a resource - collaboration, release the data to the community
- the image problem is ubiquitous, electron microscopy, physiology, cells, in vivo, pathology, and more -> all major enterprise data generators
- the scientists that use these technologies are not data scientists. they need these kinds of technologies and have the ambition to make measurements at scale, but not the tools
2 Possible Solutions #
- aspire to build solutions that address all these domains
OME - towards image informatics
- does not create new imaging or visualization tools
- all about interoperability:
- some new imaging modality is developed and can be accessed by existing tools
- new method for image analysis can be run on existing modalities
- modalities are changing so quickly - standards are useless
- no matter what’s coming off this imaging system, some tool will be able to interact
OME - founded over lunch with cell biologists
- well plates becoming popular
- people making microscopes, chemical libraries and cell lines -> no one is doing anything about the data coming off
- partner with other institutions - open source work (GPL license)
- public road mapping, GitHub, continuous integration, Kanban
- release:
- specification for data - OME-TIFF - open image data file
- bio-formats
- OMERO, image data management platform
Open data formats: spend time worrying about the OME data model (XML-based specs for datatypes).
Around the image acquisition event itself: model the status of the detector, lens, etc.
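As a downstream illustration of the data model: an OME-TIFF carries its acquisition metadata as OME-XML inside the TIFF, so any reader can recover it. A small sketch using the third-party tifffile library (not an OME tool; the file name is a placeholder):

```python
import tifffile

# Read the pixel data and the embedded OME-XML from an OME-TIFF file.
with tifffile.TiffFile("sample.ome.tiff") as tif:
    pixels = tif.asarray()         # the image planes as a numpy array
    ome_xml = tif.ome_metadata     # OME-XML string (None if not an OME-TIFF)

print(pixels.shape)
print(ome_xml[:200] if ome_xml else "no OME-XML block found")
```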
Bio-Formats
- simple and tedious: reverse engineer proprietary formats; Java lib; read each one and convert to the common model
- doing this for 10 years
- we get data from the community
- best collection of imaging files in the world: don’t have facilities to do anything other than hold this privately
- installed at 65K sites worldwide
- 2 FTEs working on this project
- standardized interface to all formats (see the sketch below)
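A sketch of what that standardized interface looks like from Python, via the community python-bioformats/javabridge wrappers around the Java library (an assumption on my part; the proprietary file name is a placeholder):

```python
import javabridge
import bioformats

# Bio-Formats runs on the JVM, so start it with the bundled jars first.
javabridge.start_vm(class_path=bioformats.JARS)
try:
    # Metadata from a proprietary vendor file, translated to OME-XML
    ome_xml = bioformats.get_omexml_metadata("scan.lif")
    # Pixel data through the same reader interface, one plane at a time
    with bioformats.ImageReader("scan.lif") as reader:
        first_plane = reader.read(z=0, t=0, series=0)
    print(first_plane.shape, len(ome_xml))
finally:
    javabridge.kill_vm()
```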
OMERO #
- clients on top, servers on bottom
- storage of images - relational DB for metadata; HDF5-based structures
- text search
- building OMERO to solve a problem in a lab, at an institute repository, a journal, a national repo
- idea is that we have to support as many client architectures as we can. Ice - middleware, used by Skype, great for large data graphs/binary data http://www.zeroc.com/ice.html
- rich Java client - OMERO.insight
- tree based files, thumbnail, region views
- client-server architecture - 300G of data viewed across the wire
- remote access (see the sketch after this list)
- web based view (x-platform)
- high content assays - modelled in data model
- digital pathology - tile-based viewer. web-based and Java-based on the same API
results
- treat result outputs as annotations
- text-based, indexed with Lucene
- large tabular results - relational, HDF5-backed
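To make the remote-access, client-server idea concrete, a minimal sketch using OMERO's Python gateway (BlitzGateway); the host and credentials are placeholders:

```python
from omero.gateway import BlitzGateway

# Browse projects/datasets/images on a remote OMERO server without copying
# any raw files locally; only metadata crosses the wire here.
conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
if conn.connect():
    try:
        for project in conn.getObjects("Project"):
            for dataset in project.listChildren():
                for image in dataset.listChildren():
                    print(project.getName(), dataset.getName(),
                          image.getId(), image.getName())
    finally:
        conn.close()
```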
// accidentally closed my browser…//
Sharing and Publishing data #
- sharing data: e.g. lab web page, few lines of js, embed viewer
- institutional repo: publish paper, release data from an OMERO-based system
- public resources: compiling dynamic data
- PDB, EMDataBank - publishing with OMERO
Directions #
how do we build an application that can work in a rapidly changing field like imaging?
- leverage the OME model
- meta-compute -
example: using Galaxy, clinical data set - need a metadata management system
- uses OMERO underneath to store metadata
- problem: every time there's a new gene release it needs to recalculate. changed the data model to handle the metadata. also used OMERO for histological images
Uses of OMERO
- OMERO and ImageJ - plugins
- MATLAB and OMERO
- OMERO & u-track (custom object tracking software - MATLAB based)
- OMERO & FLIMfit - fluorescence lifetime
- OMERO.searcher
- OMERO & auto-tagging
- user trying to access data - scan data and pick up tags
- figures: when we submit figures to journals, we wrestle with Adobe Illustrator
- always removed from the original data structure to create a JPEG - lose the original context
- JS-based viewer - to keep the linkage between the representation of the data and the data itself. figure = JS, not TIFF
- OMERO and Bio-Formats
- data import and access
- digital pathology and high-content screening
- data will be written once (at multi-TB scales); use OMERO and pull images off directly - don't copy the data
Imaging Community #
- Annual user meetings
- active community of open source projects
- working towards progress
Publishing Large Imaging Datasets #
publishing image data: PerkinElmer's Columbus - OMERO in a box
Journal of Cell Biology - built the JCB viewer - JS
- large image data
- digital pathology to scale
phenotypic screening - high-content screens
- many TB of data
- published data, all authors call, genomic information
- authors listed free text of phenotypes they saw
- cell phenotype database @ EBI
- combines all published high-content screens
- take manual author annotations
- create ontology: common way to annotate this data
More datatypes, more storage, more analysis
Q & A #
Q: (Schatz) a number of the image formats are copyrighted, etc. What is your experience as you reverse engineer these formats? Legal problems?
A: Almost every commercial vendor, when they build a new imaging system they build a new image format. Just changing now. In general, if you look at the end user license - it will forbid you from reverse engineering. It does not forbid you from uploading it to us so we can reverse engineer it. That's what we do. Last few years - vendors coming to us - please make sure that this file format is supported on the date that we release it. Sometimes they take our metadata specs and drop them into theirs. A lot is opening up and people are more willing to work with us.
Q: From a CS lab that does open source dev: you said you release everything GPL. We release everything apache - a lot of people in industry like it better. Why choose GPL? Feedback?
A: Short version: when we started, there wasn't the richness in licenses. To be blunt, we want people to contribute. As the guy who has to pay an enormous number of salaries, we're fine when a company wants to use our software, but we need some way to keep the project going and feed everyone. We get a licensing fee from PerkinElmer (closed) to help development.
Short Talk: Decentralizing Image Informatics #
Douglas P.W. Russell, University of Oxford, UK #
Department of Biochemistry
Member of the Open Microscopy consortium
Abstract #
The Open Microscopy Environment (OME; http://openmicroscopy.org) builds software tools that facilitate image informatics. An open file format (OME-TIFF) and software library (Bio-Formats) enable the free access to multidimensional (5D+) image data regardless of software or platform. A data management server (OMERO) provides an image data management solution for labs and institutes by centralizing the storage of image data and providing the biologist a means to manage that data remotely through a multi-platform API. This is made possible by the Bio-Formats library, extracting image metadata into a PostgreSQL database for fast lookup, and multi-zoom image previews enable visual inspection without the cost of transmitting the actual raw data to the user. In addition to the convenience for individual biologists, sharing data with collaborators becomes simpler and avoids data duplication.
Addressing the next scale of data challenges, e.g. at the national or international level, has brought the OME platform up against some hard barriers. Already, the data output of individual imaging systems has grown to the multi-TB level. Integrating multi-TB datasets from dispersed locations, and integrating analysis workflows will soon challenge the basic assumptions that underlie a system like OMERO. This is particularly true for automated processing: OMERO.scripts provides a facility for running executables in the locality of the data. The use of ZeroC's IceGrid permits farming out such tasks in Python, C++, Java, and in OMERO5 even ImageJ2 tasks to nodes which all use the same remote API. However, OMERO does not yet provide a solution for decentralised data and workflow management.
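A toy example of the OMERO.scripts idea mentioned above: a script registered on the server runs next to the data and only ships small results back to the caller. The script name, parameter and output here are invented for illustration.

```python
import omero.scripts as scripts
from omero.gateway import BlitzGateway
from omero.rtypes import rstring

# Declared parameters become a form in the OMERO clients; the body runs
# server-side, close to the binary data.
client = scripts.client(
    "Count_Planes.py",
    "Toy server-side script: report the number of Z sections for one image.",
    scripts.Long("Image_ID", optional=False),
)
try:
    conn = BlitzGateway(client_obj=client)
    image = conn.getObject("Image", client.getInput("Image_ID", unwrap=True))
    client.setOutput("Message", rstring("SizeZ=%d" % image.getSizeZ()))
finally:
    client.closeSession()
```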
A logical next step for OMERO is to decentralize the data by increasing the proximity of data storage to processing resources, reducing bottlenecks through redundancy, and enabling vast data storage on commodity hardware rather than expensive, enterprise storage.
Notes #
How OMERO can scale with big data, higher demand #
1) as scope and # of users increase, total data increases
- one end: 1 user or small group of users
- a user with a minimal amount of sysadmin effort can install it and get it working
- other end: national resources, institute: need a serious sysadmin team
- tradeoffs:
2) Data set size: high-content screen
- many images, each well, many dimensions
- phenotypic data attached to each well
- links to external genomic resources
- all of this is a huge amount of data. One screen can be TBs in size
Once data is in OMERO - excellent data management tool
- until you get it in there - need to make choices on how to put it in
- smaller scale: input data and archive original image. extract metadata for search
- when analysis needs pixel data - extract at runtime
- in reality - users need access to the filesystem where the raw data is. Moving data around is infeasible. now, extract metadata and keep a reference to where the raw file is. helps with the data duplication problem
- preferable to store data in read optimized format. trade some operation efficiency for some possible data loss
OMERO services #
- all run on Ice
- http://www.zeroc.com/ice.html
- process, indexer, and more - all on ice
Ice gives us the capability to distribute some services to other hosts
- pretty seamless - can take advantage of local compute
- can do this multiple times to access more compute resources
- but then each has to communicate back to original
- >> decentralizing omero
Decentralized
- access data directly - both servers can access resources (filesystem) directly
- once we have that, we can scale - more servers
- this has the potential to address image management on scaled level
- can deploy many OMERO components on many hosts - make it more powerful, absorb volumes of data
- can take advantage of cloud computing - can scale permanently or temporarily - spin up more hosts
- it will be necessary to augment OMERO's resources with distributed filesystems - store huge amounts of pixel or image data
- can also make use of Cassandra clusters - caching frequently accessed data. much bigger scale (see the sketch below)
That's how we'd like to cope with big data in OMERO while still making it accessible for a single user who wants to install it locally
Q & A #
Q: (Schatz) are you considering map-reduce or just storage?
A: we could definitely use them both, yes
Spanning Molecular and Genomic Data in Drug Discovery #
John Overington, European Molecular Biology Laboratory, UK #
Abstract #
The link between biological and chemical worlds is of critical importance in many fields, not least that of healthcare and chemical safety assessment. A major focus in the integrative understanding of biology is genes/proteins and the networks and pathways describing their interactions and functions; similarly, within chemistry there is much interest in efficiently identifying drug-like, cell-penetrant compounds that specifically interact with and modulate these targets. The number of genes of interest is of the range of 10^5 to 10^6, which is modest with respect to plausible drug-like chemical space - 10^20 to 10^60. We have built a public database linking chemical structures (10^6) to molecular targets (10^4), covering molecular interactions and pharmacological activities and Absorption, Distribution, Metabolism and Excretion (ADME) properties (http://www.ebi.ac.uk/chembl) in an attempt to map the general features of molecular properties and features important for both small molecule and protein targets in drug discovery. We have then used this empirical kernel of data to extend analysis across the human genome, and to large virtual databases of compound structures - we have also integrated these data with genomics datasets, such as the GWAS catalogue.
Notes #
Chemistry. Mapping of Chemistry - interface of chemistry with genomic and drug discovery data.
Background #
chemical space: how big is the chemical space? GDB-13 - all possible (stable) molecules with up to 13 heavy atoms
- 1B structures
- largest database of small organic molecules
- GDB-17 - 166B structures - not available. Intellectual property issues
not all molecules can be drugs - needs to be bioactive
- physical properties access to ‘target’
- ADMET - absorption, distribution, metabolism, excretion & toxicity
Lipinski - a molecule within these parameters was likely to have good oral drug properties (see the sketch after this list). http://en.wikipedia.org/wiki/Lipinski’s_rule_of_five
- different for topical and parenterally dosed drugs
- pretty good guide
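A quick, hedged illustration of the rule of five used as a filter, written with RDKit (my own example, not something shown in the talk):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_rule_of_five(smiles):
    """Lipinski's rule of five: MW <= 500, logP <= 5,
    H-bond donors <= 5, H-bond acceptors <= 10."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

print(passes_rule_of_five("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin -> True
```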
10^19-10^23 Lipinski-like small molecules - potential drugs
around 21-23 peak in curve - size of heavy atom counts for drugs.
drug discovery - making molecules slightly larger than they need to be
GDB - 30% of all known drugs ?
Targets: Homo sapiens, ~21K genes.
Only 1% of the genome is a drug target - i.e., genes we've been able to develop drugs against.
we’ve tried many many more
Chemogenomics = chemistry + genome derived objects #
- exploration of small-molecule bioactivity space at genomic scale
- possible space: 10^6 (targets), drug target proteins 10^2
- drugs: all reasonable 10^22, screened: 10^7
- similar compound structures have similar functions
ChEMBL - training set; largest db of medicinal chemistry data 1.4M compounds #
- adding plant data later this year
- open
- download/access - db dumps, semantic web rdf - SPARQL, virtualization (ChEMBL appliances)
- ChEMpi - raspberry pi
- data comes from the literature - extract structures from the text, link to assays, link to sequence, store functional data. allows chaining targets to phenotypic effects
- quantitative data
- target types: single gene - all the way to - organisms
- compound searching - matching structure space ("2D BLAST"); see the sketch below
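For the access routes and the "2D BLAST"-style structure searching above, a hedged sketch using the chembl_webresource_client Python package, one of several ways into ChEMBL; the similarity threshold and target ID below are placeholders.

```python
from chembl_webresource_client.new_client import new_client

aspirin = "CC(=O)Oc1ccccc1C(=O)O"

# 2-D similarity search around a query structure (threshold in percent)
for hit in new_client.similarity.filter(smiles=aspirin, similarity=85)[:5]:
    print(hit["molecule_chembl_id"], hit["similarity"])

# Quantitative bioactivity records for one target (placeholder ChEMBL ID)
activities = new_client.activity.filter(target_chembl_id="CHEMBL279",
                                        standard_type="IC50")
print(len(activities), "IC50 measurements")
```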
different drug structures - ligand efficiency
- drugs are efficient, every atom counts - avoid lipophilicity
- interested in balance between binding efficiency and molecular size
- target class data
assay organism data
- differences between animal model and the effects of compounds in humans
- failure in pre-clinical development - works in animal models, but not humans
- trying to understand systematic reasons
SureChEMBL - acquired SureChem
- new public chemistry
- extends coverage of chemical structures from full-text patents; 15M structures
- add target, sequences, disease, animal model, cell-line
Compound Integration
- ChEMBL - literature
- SureChEMBL- patent
Different Types of Drugs
- 2/3 drugs are small molecules
- in late stage development - majority are small molecules
- Therefore, focus on small molecules for drug discovery
Visualizations #
- Polypharmacology via binding sites: majority of pharmacological activity focused on brain
- Affinity of drugs for ‘Targets’: drugs are weaker than we think - penalty for tight binding drugs
- Clinical Candidates: coverage of clinical development candidates -
- Selectivity - circos plot: map promiscuity across tree
Pharma Productivity problem #
- biotech boom
- productivity has fallen off the cliff
how many compounds does a company need to make before they develop a compound
- 100K compounds synthesized to develop drug
- now 32x that to get a potential drug
- Now: pharma needs on average to synthesize and test 250K compounds for each launched drug. not sustainable
- Trying to be smarter, use db, to help with this
Cancer Drugs and Targets #
- taking ChEMBL and thinking of drug discovery in a cancer setting
- huge investment in genomic studies looking for genomic variation - causes of cancer. sequencing, find driver genes, look at other datasets, find overlaps
come out with a set of potential targets
- how do you select from these?
- we can compare against things we had in the past
- majority of the success from the past we would not have discovered using genomic sequencing techniques
- canSAR - large-scale integration of public and proprietary data built on top of ChEMBL - select compounds likely to be good https://cansar.icr.ac.uk/
Q & A #
Q: (Ouellette) finding out the chemical structures of various organisms; What about Micro-biome space?
A: Different animals have different physical space for the drugs they like. Controversy in the literature - physical space for antibiotics. Microbiome - fascinating - orally dosed drugs also meet bacteria in the gut. Effects of gut bacteria on compounds - sometimes needed to activate the substance
Q: (Stein) Curious about the 1B+ compounds in GDB-17. Can't release because of IP? Algorithm or structures?
A: Just too big. Drug discovery community -
If you publish the structures of all possible drugs => they can't be patented - so it would destroy all possible intellectual property.
Q: For compounds w/ rich sequence information (transcriptome wide/proteomic) is it integrated?
A: yes and no, transcript microarray data goes into GEO or ArrayExpress. Links - compounds in ChEMBL. Reality - very small numbers right now. ChEMBL is part of a suite of resources at EBI, linked to other resources.
Q: Is there a way through ChEMBL to discover drugs that are potentially synergistic? Drugs with same structures and hit same targets. Connectivity map? X-ref between ChEMBL and connectivity map?
A: One of the most common uses of ChEMBL. combine drugs against the same targets. No links to connectivity map - people have done that.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
- Big Data in Biology: Personal Genomes
- Big Data in Biology: The Next 10 Years of Quantitative Biology