Big Data in Biology: Imaging/Pharmacogenomics

Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.

Warning: These notes are somewhat incomplete and mostly written in broken English.

Imaging/Pharmacogenomics #

Tuesday, March 25th, 2014 1:00pm - 3:00pm

Speaker list #

Susan Sunkin, Allen Institute for Brain Science, USA

Allen Brain Atlas: An Integrated Neuroscience Resource

Jason R. Swedlow, University of Dundee, Scotland

The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences

Douglas P. W. Russell, University of Oxford, UK

Short Talk: Decentralizing Image Informatics

John Overington, European Molecular Biology Laboratory, UK

Spanning Molecular and Genomic Data in Drug Discovery

Allen Brain Atlas: An Integrated Neuroscience Resource #

Susan Sunkin, Allen Institute for Brain Science, USA #

Abstract #

The Allen Brain Atlas is a collection of open public resources (2 PB of raw data, >3,000,000 images) integrating high-resolution gene expression, structural connectivity, and neuroanatomical data with annotated brain structures, offering whole-brain and genome-wide coverage. The eight major resources currently available span species (mouse, monkey and human) and development. In mouse, gene expression data covers the entire brain and spinal cord at multiple developmental time points through adult. Mouse data also includes brain-wide long-range axonal projections in the adult mouse as part of the Allen Mouse Brain Connectivity Atlas.

Complementing the mouse atlases, there are four human and non-human primate atlases. The Allen Human Brain Atlas, the NIH-funded BrainSpan Atlas of the Developing Human Brain, and the NIH Blueprint NHP Atlas contain genome-wide gene expression data (microarray and/or RNA sequencing) and high-resolution in situ hybridization (ISH) data for selected sets of genes and brain regions across human and non-human primate development and/or in adult. In addition, the Ben and Catherine Ivy Foundation-funded Ivy Glioblastoma Atlas Project contains gene expression data in human glioblastoma.

While the Allen Brain Atlas data portal serves as the entry point and enables searches across data sets, each atlas has its own web application and specialized search and visualization tools that maximize the scientific value of those data sets. Tools include gene searches; ISH image viewers and graphical displays; microarray and RNA sequencing data viewers; Brain Explorer® software for 3D navigation and visualization of gene expression, connectivity and anatomy; and an interactive reference atlas viewer. For the mouse, integrated search and visualization is through automated signal quantification and mapping to a common reference framework. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously.

Notes #

10 years of work and >200 ppl contribution.

Allen Institute: primarily studying mouse & human #

Allen mouse brain Atlas #

informatics pipeline: broken down to

Tools to harness data generated from the pipeline

Developing mouse brain atlas #

Allen mouse connectivity atlas

Other tools - brain-wide data - can pinpoint a region of interest and dive deeper

Allen Human Brain Atlas #

Developing human Brain project #

four main components

  1. developmental transcriptome
  2. prenatal microarray: hi res, 300 distinct structures
  3. ISH: just a subset of regions/genes
  4. reference atlases: few generated for this project (prenatal and adults), include histology and imaging data

Prenatal - LMD Microarray Data

Q & A #

Q: (Stein) interested in how labour intensive human tissue blocks were- were the markers placed by hand?

A: Not for every Z level of the MRI, but yes labour intensive. Many steps in order to use the automated pipeline.

Q: (Schatz) at CSHL big study in exome sequencing - which of these genes are expressed in brain at various levels of development?

A: Use our API to pull out data from different datasets to produce that.

Q: Different imaging methods and approaches - what’s the Allen’s approach to presenting the information in a way that could be queried at different domains and at the cell level?

A: The level of registration is not down to cell - it’s domains.


The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences #

Jason R. Swedlow, University of Dundee, Scotland #

Abstract #

Despite significant advances in cell and tissue imaging instrumentation and analysis algorithms, major informatics challenges remain unsolved: file formats are proprietary, facilities to store, analyze and query numerical data or analysis results are not routinely available, integration of new algorithms into proprietary packages is difficult at best, and standards for sharing image data and results are lacking. We have developed an open-source software framework to address these limitations called the Open Microscopy Environment (OME). OME has three components: an open data model for biological imaging; standardised file formats and software libraries for data file conversion; and software tools for image data management and analysis.

The OME Data Model provides a common specification for scientific image data and has recently been updated to more fully support fluorescence filter sets, the requirement for unique identifiers, and screening experiments using multi-well plates.

The OME-TIFF file format and the Bio-Formats file format library provide an easy-to-use set of tools for converting data from proprietary file formats. These resources enable access to data by different processing and visualization applications, sharing of data between scientific collaborators and interoperability in third party tools like Fiji/ImageJ.

The Java-based OMERO platform includes server and client applications that combine an image metadata database, a binary image data repository and visualization and analysis by remote access. The current stable release of OMERO (OMERO-4.4) includes a single mechanism for accessing image data of all types, regardless of original file format, via Java, C/C++ and Python and a variety of applications and environments (e.g., ImageJ, Matlab and CellProfiler). This version of OMERO includes a number of new functions, including SSL-based secure access, distributed compute facility, filesystem access for OMERO clients, and a scripting facility for image processing. An open script repository allows users to share scripts with one another. A permissions system controls access to data within OMERO and enables sharing of data with users in a specific group or even publishing of image data to the worldwide community. Several applications that use OMERO are now released by the OME Consortium, including a FLIM analysis module, an object tracking module, two image-based search applications, an automatic image taggi

Notes #

Representing consortium of 10 different groups US, UK, Europe

Problem #

BUT - the most important thing to understand:

2 Possible Solutions #

OME - towards image informatics

OME - founded over lunch w/ cell bio

Open data formats: a lot of time goes into the OME data model (XML-based specs for data types).
It is built around the image acquisition event itself: it models the status of the detector, lens, etc.
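To make "XML-based specs around the acquisition event" a bit more concrete, here is a minimal sketch that records detector and lens state next to the image dimensions. The element and attribute names are loosely modeled on OME-XML but are assumptions for illustration only; nothing here is validated against the real OME schema, and all values are made up.

```python
import xml.etree.ElementTree as ET


def build_metadata(size_x, size_y, size_z, size_c, size_t):
    """Build a small XML record describing one image and how it was acquired.

    Element and attribute names are illustrative, not the official OME schema.
    """
    root = ET.Element("OME")

    # Acquisition hardware: the state of the detector and lens at capture time.
    instrument = ET.SubElement(root, "Instrument", ID="Instrument:0")
    ET.SubElement(instrument, "Detector", ID="Detector:0", Type="CCD", Gain="2.0")
    ET.SubElement(instrument, "Objective", ID="Objective:0",
                  NominalMagnification="60", LensNA="1.4")

    # The image itself: five dimensions (X, Y, Z, channel, time).
    image = ET.SubElement(root, "Image", ID="Image:0", Name="example")
    ET.SubElement(image, "InstrumentRef", ID="Instrument:0")
    ET.SubElement(image, "Pixels", ID="Pixels:0", DimensionOrder="XYZCT",
                  SizeX=str(size_x), SizeY=str(size_y), SizeZ=str(size_z),
                  SizeC=str(size_c), SizeT=str(size_t))
    return root


def plane_count(root):
    """Number of 2D planes stored for the first image (Z * C * T)."""
    pixels = root.find("./Image/Pixels")
    return (int(pixels.get("SizeZ"))
            * int(pixels.get("SizeC"))
            * int(pixels.get("SizeT")))


metadata = build_metadata(512, 512, 10, 3, 20)
print(ET.tostring(metadata, encoding="unicode"))
print(plane_count(metadata))
```

The point of the sketch is only that acquisition context travels with the pixels in one self-describing record, so any tool that can parse XML can recover it.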




  1. treat result outputs as an annotation,
  2. text-based results indexed with Lucene
  3. large tabular results - relational HDF5

// accidentally closed my browser…//

Sharing and Publishing data #

Directions #

how do we build an application that can work in a rapidly changing field like imaging?

example: using Galaxy, clinical data set - need a metadata management system

Uses of Omero

Imaging Community #

Publishing Large Imaging Datasets #

publishing image data: PerkinElmer’s Columbus - OMERO in a box

Journal of Cell Biology - built the JCB viewer - js

phenotypic screening - hi content screens

More datatypes, more storage, more analysis

Q & A #

Q: (Schatz) a number of the image formats are copyrighted, etc. What is your experience as you reverse engineer these formats? Legal problems?

A: Almost every commercial vendor, when they build a new imaging system, builds a new image format. That is just changing now. In general, if you look at the end user license - it will forbid you from reverse engineering. It does not forbid you from uploading to us so that we reverse engineer it. That’s what we do. Last few years - vendors coming to us - please make sure that this file format is supported on the date that we release it. Sometimes they take our metadata specs and drop them into theirs. A lot is opening up and ppl are more willing to work with us.

Q: From a CS lab that does open source dev: you said you release everything GPL. We release everything apache - a lot of people in industry like it better. Why choose GPL? Feedback?

A: Short version: when we started, there wasn’t the richness in licenses. To be blunt, we want people to contribute. As the guy who has to pay an enormous number of salaries, we’re fine when a company wants to use our software, but we need some way to keep the project going and feed everyone. We get a licensing fee from PerkinElmer (closed) to help development.


Short Talk: Decentralizing Image Informatics #

Douglas P.W. Russell, University of Oxford, UK #

department of biochemistry
member of open microscopy consortium

Abstract #

The Open Microscopy Environment (OME) builds software tools that facilitate image informatics. An open file format (OME-TIFF) and software library (Bio-Formats) enable the free access to multidimensional (5D+) image data regardless of software or platform. A data management server (OMERO) provides an image data management solution for labs and institutes by centralizing the storage of image data and providing the biologist a means to manage that data remotely through a multi-platform API. This is made possible by the Bio-Formats library, extracting image metadata into a PostgreSQL database for fast lookup, and multi-zoom image previews enable visual inspection without the cost of transmitting the actual raw data to the user. In addition to the convenience for individual biologists, sharing data with collaborators becomes simpler and avoids data duplication.
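The split described in the abstract (metadata in a relational database, pixels stored elsewhere) can be sketched in a few lines. This uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the table, columns, and file paths are hypothetical and do not reflect OMERO's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE image_metadata (
        image_id     INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        objective    TEXT,
        channels     INTEGER,
        binary_path  TEXT NOT NULL  -- where the raw pixels live; never read here
    )
    """
)
conn.execute("CREATE INDEX idx_objective ON image_metadata (objective)")

# Hypothetical records, as if extracted from image files at import time.
rows = [
    (1, "mitosis_001", "60x", 3, "/repo/a/mitosis_001.bin"),
    (2, "mitosis_002", "60x", 3, "/repo/a/mitosis_002.bin"),
    (3, "tissue_scan", "20x", 1, "/repo/b/tissue_scan.bin"),
]
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?)", rows)


def find_by_objective(connection, objective):
    """Return (name, binary_path) pairs without touching any pixel data."""
    cursor = connection.execute(
        "SELECT name, binary_path FROM image_metadata "
        "WHERE objective = ? ORDER BY image_id",
        (objective,),
    )
    return cursor.fetchall()


print(find_by_objective(conn, "60x"))
```

A query like this answers "which images were taken with this lens?" from the index alone; the large binary files are only fetched once the user picks something to view.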

Addressing the next scale of data challenges, e.g. at the national or international level, has brought the OME platform up against some hard barriers. Already, the data output of individual imaging systems has grown to the multi-TB level. Integrating multi-TB datasets from dispersed locations, and integrating analysis workflows will soon challenge the basic assumptions that underlie a system like OMERO. This is particularly true for automated processing: OMERO.scripts provides a facility for running executables in the locality of the data. The use of ZeroC’s IceGrid permits farming out such tasks in Python, C++, Java, and in OMERO5 even ImageJ2 tasks to nodes which all use the same remote API. However, OMERO does not yet provide a solution for decentralised data and workflow management.

A logical next step for OMERO is to decentralize the data by increasing the proximity of data storage to processing resources, reducing bottlenecks through redundancy, and enabling vast data storage on commodity hardware rather than expensive, enterprise storage.
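One generic way to spread datasets over commodity nodes with redundancy is rendezvous (highest-random-weight) hashing. The sketch below is only an illustration of that general idea; it was not described in the talk and is not how OMERO works. The node names and dataset key are hypothetical.

```python
import hashlib


def _score(node, key):
    """Deterministic pseudo-random score for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def place_replicas(key, nodes, replicas=2):
    """Pick `replicas` distinct nodes for a dataset using rendezvous hashing."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    ranked = sorted(nodes, key=lambda node: _score(node, key), reverse=True)
    return ranked[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas("screen-42/plate-7", nodes)
print(placement)

# Losing a node that holds no copy of a dataset leaves its placement unchanged.
spare = [n for n in nodes if n not in placement][0]
survivors = [n for n in nodes if n != spare]
print(place_replicas("screen-42/plate-7", survivors) == placement)
```

Every client can compute the same placement from the key and the node list alone, so there is no central lookup table to become a bottleneck, and each dataset has a second copy if one node fails.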

Notes #

How OMERO can scale with big data, higher demand #

1) as scope and # of users increase, total data increases

2) Data set size: high content screens

Once data is in OMERO - excellent data management tool

OMERO services #

Ice gives us the capability to distribute some services to other hosts


That’s how we’d like to cope with big data in OMERO, while keeping it accessible for a single user who wants to install it locally

Q & A #

Q: (Schatz) are you considering map-reduce or just storage?

A: we could definitely use them both, yes


Spanning Molecular and Genomic Data in Drug Discovery #

John Overington, European Molecular Biology Laboratory, UK #

Abstract #

The link between biological and chemical worlds is of critical importance in many fields, not least that of healthcare and chemical safety assessment. A major focus in the integrative understanding of biology are genes/proteins and the networks and pathways describing their interactions and functions; similarly, within chemistry there is much interest in efficiently identifying drug-like, cell-penetrant compounds that specifically interact with and modulate these targets. The number of genes of interest is in the range of 10^5 to 10^6, which is modest with respect to plausible drug-like chemical space (10^20 to 10^60). We have built a public database linking chemical structures (10^6) to molecular targets (10^4), covering molecular interactions and pharmacological activities and Absorption, Distribution, Metabolism and Excretion (ADME) properties, in an attempt to map the general features of molecular properties and features important for both small molecule and protein targets in drug discovery. We have then used this empirical kernel of data to extend analysis across the human genome, and to large virtual databases of compound structures - we have also integrated these data with genomics datasets, such as the GWAS catalogue.
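A quick back-of-the-envelope check of the scale gap in the abstract, reading its figures as powers of ten (about 10^6 known structures against a drug-like space of 10^20 to 10^60). The function name is mine; the numbers are order-of-magnitude estimates, not exact counts.

```python
import math

known_compounds = 10**6                   # structures in the database
space_low, space_high = 10**20, 10**60    # plausible drug-like chemical space


def log10_coverage(known, space):
    """log10 of the fraction of chemical space covered by known compounds."""
    return math.log10(known) - math.log10(space)


# Even against the smallest estimate, known chemistry is a vanishing fraction.
print(round(log10_coverage(known_compounds, space_low)))    # -14
print(round(log10_coverage(known_compounds, space_high)))   # -54
```

In other words, the measured data covers somewhere between one part in 10^14 and one part in 10^54 of the space, which is why it is used as a training kernel to extrapolate from rather than as a map.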

Notes #

Chemistry. Mapping of Chemistry - interface of chemistry with genomic and drug discovery data.

Background #

chemical space: how big is the chemical space? GDB-13 - all possible (stable) molecules with up to 13 heavy atoms

not all molecules can be drugs - needs to be bioactive

Lipinski’s rule of five - a molecule within these parameters is likely to have good oral drug properties.
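For reference, Lipinski's four criteria are: molecular weight of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. A minimal sketch of the check follows, assuming the descriptors have already been computed elsewhere. The aspirin values are approximate and the second molecule is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors for one compound."""
    molecular_weight: float   # daltons
    log_p: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of Lipinski's four criteria a molecule breaks."""
    checks = [
        d.molecular_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_lipinski_like(d, max_violations=1):
    """Treat a molecule as orally drug-like if it breaks at most one criterion."""
    return rule_of_five_violations(d) <= max_violations


# Approximate values for aspirin; the macrocycle is a made-up example.
aspirin = Descriptors(180.2, 1.2, 1, 4)
big_macrocycle = Descriptors(1200.0, 6.5, 6, 15)

print(is_lipinski_like(aspirin))
print(is_lipinski_like(big_macrocycle))
```

The rule is a filter for likely oral absorption, not for activity, so it narrows chemical space without saying anything about whether a molecule hits a target.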

10^19 to 10^23 Lipinski-like small molecules - potential drugs

around 21-23 peak in curve - size of heavy atom counts for drugs.
drug discovery - making molecules slightly larger than they need to be

GDB - 30% of all known drugs ?

Targets: Homo sapiens, 21K genes.
Only 1% of the genome is a drug target that we’ve been able to develop drugs against.
We’ve tried many, many more.

Chemogenomics = chemistry + genome derived objects #
ChEMBL - training set; largest db of medicinal chemistry data 1.4M compounds #

different drug structures - ligand efficiency
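Ligand efficiency is commonly defined as binding free energy per heavy (non-hydrogen) atom, which lets molecules of different sizes be compared on one scale. The sketch below uses the standard relation ΔG = -RT ln(10) × pKd. The function name and example numbers are mine and are not taken from the talk or from ChEMBL.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal / (mol * K)


def ligand_efficiency(p_activity, heavy_atoms, temperature_k=298.15):
    """Approximate binding free energy per heavy atom, in kcal/mol.

    p_activity is -log10 of the affinity or potency in molar units
    (for example pKd or pKi, or pIC50 used as a proxy).
    """
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    delta_g = -R_KCAL * temperature_k * math.log(10) * p_activity
    return -delta_g / heavy_atoms


# A small fragment with modest potency versus a larger, more potent compound.
fragment = ligand_efficiency(p_activity=6.0, heavy_atoms=12)
lead = ligand_efficiency(p_activity=9.0, heavy_atoms=35)
print(round(fragment, 2), round(lead, 2))
```

With these made-up inputs the smaller molecule is less potent overall but gets more binding energy out of each atom, which is the kind of comparison the metric is meant to support.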

assay organism data

SureChEMBL - acquired SureChem

Compound Integration

Different Types of Drugs

Visualizations #
Pharma Productivity problem #

how many compounds does a company need to make before they develop a compound

Cancer Drugs and Targets #

come out with option of potential targets

Q & A #

Q: (Ouellette) finding out the chemical structures of various organisms; What about Micro-biome space?

A: Different animals have got different physical space for drugs they like. Controversy in literature - physical space for antibiotics. Micro-biome - fascinating - orally, also bacteria and guts. Effect of microbiome by gut bacteria - sometimes needed to activate substance

Q: (Stein) Curious about 1B+ compounds in GDB-17. Can’t release because of IP? Algorithm or structures?

A: Just too big. Drug discovery community -
Can publish the structures of all possible drugs => nobody can patent them - so that would destroy all possible intellectual property.

Q: For compounds w/ rich sequence information (transcriptome wide/proteomic) is it integrated?

A: yes and no, transcript microarray data goes in GO or express. Links - compounds in ChEMBL. Reality - very small numbers right now. ChEMBL part of a suite of resources at EBI, link to other resources.

Q: Is there a way through ChEMBL to discover drugs that are potentially synergistic? Drugs with same structures and hit same targets. Connectivity map? X-ref between ChEMBL and connectivity map?

A: One of the most common uses of ChEMBL. combine drugs against the same targets. No links to connectivity map - people have done that.


