<h1>How to bring open source to a closed community</h1>
<p><em>Abigail Cabunoc Mayes · 2016-09-19</em></p>
<p>This is (roughly) a transcript of my talk at <a href="http://www.thestrangeloop.com/">Strange Loop</a> this year! At least, it’s what I meant to say. <a href="https://youtu.be/iR8xqEVTeMQ">Watch the video</a> for all the fun Canada facts and nervous rambling.</p>
<p><a href="https://github.com/acabunoc/open-source-strangeloop-2016">Slides</a> made using <a href="http://lab.hakim.se/reveal-js/">reveal.js</a>; screenshots captured using <a href="https://github.com/astefanutti/decktape">Decktape</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/"><img src="https://svbtleusercontent.com/vymnuvzhom1dig_small.png" alt="open-source-strangeloop-2016-001.png"></a></p>
<p>First off, I want to thank the organizers for this opportunity. Strange Loop is such an amazing conference – I can’t believe I first attended with an opportunity grant two years ago. The friendships and community I’ve built here have been amazing.</p>
<p>Let’s get started!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/1"><img src="https://svbtleusercontent.com/ybamlbl52fekqa_small.png" alt="open-source-strangeloop-2016-002.png"></a></p>
<p>Hi, I’m Abby! This is me. I work for the <a href="https://www.mozilla.org/en-US/foundation/">Mozilla Foundation</a> as Lead Developer of Open Source Engagement. This means I work with the open source projects and communities around the different programs at the Mozilla Foundation, including Open Science, Internet of Things, Women and Web Literacy, Learning, and Advocacy.</p>
<p>Also, I’m from Toronto. This is important because Toronto is great.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/2"><img src="https://svbtleusercontent.com/rhygl05m6chfzq_small.png" alt="open-source-strangeloop-2016-003.png"></a></p>
<p>A bit of history: I came to Mozilla because of the <a href="http://science.mozilla.org/">Mozilla Science Lab</a>. Before Mozilla, I was working in research labs where we were dealing with so much data and analysis. It was easy to see how the openness and collaboration available on the web could make science better.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/3"><img src="https://svbtleusercontent.com/7o0ksz7uok5mdq_small.png" alt="open-source-strangeloop-2016-004.png"></a></p>
<p>At Mozilla, our mission is to ensure the Internet is a global public resource, open and accessible to all.</p>
<p>The Science Lab is applying Mozilla’s mission to a specific community of practice. Most of the work I’m covering today was done within the Mozilla Science Lab.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/4"><img src="https://svbtleusercontent.com/8shoueaa7redfg_small.png" alt="open-source-strangeloop-2016-005.png"></a></p>
<p>So, today we’re talking about bringing open source to a closed community. Slight disclaimer: this is my story! This is not a how-to that will work for everyone.</p>
<p>The past eight years of my career, I’ve been working on open source projects for researchers and thinking of ways to bring more open source to academia. I want to share some of the lessons I’ve learned and hear from you as we start to expand to other Mozilla Foundation programs.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/5"><img src="https://svbtleusercontent.com/jmdiliyqamgaug_small.png" alt="open-source-strangeloop-2016-006.png"></a></p>
<p>My story starts with open source. I actually wrote open source code for years before I fully understood this movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/6"><img src="https://svbtleusercontent.com/5ivdyt51hrasfa_small.png" alt="open-source-strangeloop-2016-007.png"></a></p>
<p>I find it’s helpful to look at the origins of terms to give some cultural context around what they meant at the time.</p>
<p>‘Open source’ is interesting because the free software movement predates this term by over a decade.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/7"><img src="https://svbtleusercontent.com/dle7jxeosk9w_small.png" alt="open-source-strangeloop-2016-008.png"></a></p>
<p>In 1997, Eric Raymond published an essay on the state of free software at the time, <a href="http://www.catb.org/esr/writings/cathedral-bazaar/">“The Cathedral and the Bazaar”</a>. He saw two types of free software:</p>
<ul>
<li>The Cathedral is a public space where anyone is welcome to attend a service, but the experience is put on by a small group of people in charge. They decide what happens and when. This is like a development team working on software among their trusted group, then releasing a new version to the public.</li>
<li>The Bazaar is an open space where people come along, set up tables, and start bartering and selling whatever goods they have. Anyone can come and shape the experience in this space. Raymond saw this happening in Linux at the time: a diverse group full of differing agendas that was able to work together to build a stable system.</li>
</ul>
<p>Also, I can’t be sure, but it looks like there might be a fire in this Bazaar. Metaphor for open source? :)</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/8"><img src="https://svbtleusercontent.com/fenbyji6xiqcjg_small.png" alt="open-source-strangeloop-2016-009.png"></a></p>
<p>This essay inspired the Netscape Corporation to <a href="https://www.mozilla.org/en-US/about/history/details/">release the Netscape browser suite as free software</a> the following year. This became the basis of the Mozilla Project and inspired the term <a href="http://opensource.org/history">open source</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/9"><img src="https://svbtleusercontent.com/jzitdn48eqdmpg_small.png" alt="open-source-strangeloop-2016-010.png"></a></p>
<p>I don’t know if you all remember the early 2000s, but there were no browser wars then – Internet Explorer was everywhere. The fact that a group of passionate open source contributors were able to come together and build Firefox, the browser that toppled the giant, was really amazing.</p>
<p><a href="https://svbtleusercontent.com/kxuge9fw3fiba.gif"><img src="https://svbtleusercontent.com/kxuge9fw3fiba_small.gif" alt="glee.gif"></a></p>
<p>At the heart of open source is the idea that a diverse group working on a problem is better.</p>
<p>But how do we get there? How do we work in a way that brings this diverse group together in the first place?</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/10"><img src="https://svbtleusercontent.com/sqthi3c2etena_small.png" alt="open-source-strangeloop-2016-012.png"></a></p>
<p>At Mozilla, we call this <a href="https://wiki.mozilla.org/Working_open">“Working Open”</a>, being public and participatory. This requires structuring efforts so that “outsiders” can meaningfully participate and become “insiders” as appropriate.</p>
<p>For me, this way of thinking helped me understand what open source should look like in our day-to-day work.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/12"><img src="https://svbtleusercontent.com/kd3tn4hhzygv3a_small.png" alt="open-source-strangeloop-2016-013.png"></a></p>
<p>For the official definition of open source, the <a href="https://opensource.org/">Open Source Initiative</a> has <a href="https://opensource.org/osd">ten points</a> outlining what exactly open source software is. This comprehensive definition, along with the OSI’s stewardship, has helped the open source movement stay strong today.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/13"><img src="https://svbtleusercontent.com/bjzywpoxp5hoig_small.png" alt="open-source-strangeloop-2016-014.png"></a></p>
<p>The next part of my story is Science! I worked in research labs writing scientific software for most of my career.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/14"><img src="https://svbtleusercontent.com/jabwsdm1iyzjw_small.png" alt="open-source-strangeloop-2016-015.png"></a></p>
<p>Sometimes, trying to participate in research can feel like this. As soon as I left academia I lost access to most published research in academic journals. Even within academia, institutions can feel like ivory towers where only the invited few can participate.</p>
<p>These drawings are by John McKiernan for <a href="http://whyopenresearch.org/">“Why Open Research?”</a></p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/15"><img src="https://svbtleusercontent.com/k8qal8q67rs4ea_small.png" alt="open-source-strangeloop-2016-016.png"></a></p>
<p>On the other side of the wall, there can be a lot of fear around getting scooped or someone stealing your data. This stems from a lack of knowledge around open licensing options.</p>
<p>Contrasted with my experience in open source, this helped me see that on both sides of the wall, there’s a need for culture change if academia is going to work openly.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/16"><img src="https://svbtleusercontent.com/vl2wvojctmybg_small.png" alt="open-source-strangeloop-2016-017.png"></a></p>
<p>One of the first projects I worked on when I joined Mozilla Science was <a href="https://science.mozilla.org/projects/">Collaborate</a>, a collection of open source software for scientists. This was a great way to highlight some of the work going on in this community, but after watching these projects for awhile, I learned that researchers weren’t very good at open source.</p>
<p>In general, the projects weren’t as welcoming as they could be. Sometimes, requests from potential contributors for more information would be ignored for weeks. This list of projects still exists today and helps the open science space tremendously, but we thought we could make it better.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/17"><img src="https://svbtleusercontent.com/qbantmxqdfknua_small.png" alt="open-source-strangeloop-2016-018.png"></a></p>
<p>This brings us to the final part of my story (and most of this talk): Fueling the movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/18"><img src="https://svbtleusercontent.com/xsoxgokosbpngq_small.png" alt="open-source-strangeloop-2016-019.png"></a></p>
<p>I couldn’t find a definition of ‘movement’ that I liked, so I defined it here as “mobilizing a community around a shared purpose”.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/fW8amMCVAJQ"></iframe>
<p>One of my favourite visual representations of a movement is this clip of a dancing guy from <a href="https://www.youtube.com/watch?v=fW8amMCVAJQ">Derek Sivers’ TED talk</a> on leadership and movements.</p>
<p>One guy dancing enthusiastically slowly mobilizes those around him. Once you hit critical mass, you have a movement! You can watch him change the culture in just a few minutes.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/19"><img src="https://svbtleusercontent.com/52g3gpmhznohdw_small.png" alt="open-source-strangeloop-2016-021.png"></a></p>
<p>This is a figure from <a href="http://marshallganz.com/publications/">Marshall Ganz</a>’s essay <a href="http://marshallganz.usmblogs.com/files/2012/08/Public-Narrative-Collective-Action-and-Power.pdf">“Public narrative, collective action, and power”</a>. A key part of a movement is mobilizing people to action. This diagram shows how we need both the strategy and narrative (head + heart) to take action.</p>
<p>Working with researchers, many of them want to be working open and collaborating more – you can see how many open-source-for-science projects wanted to be listed in ‘Collaborate’. However, there’s a lack of knowledge or strategy around <em>how</em> to do this effectively. This is when we realized we needed to create resources outlining the steps involved in running an open source project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/21"><img src="https://svbtleusercontent.com/sl7bmamxqwi6g_small.png" alt="open-source-strangeloop-2016-022.png"></a></p>
<p>So we started to think about how we can best fuel the open source movement within academia. I think we can summarize it in these three steps:</p>
<ol>
<li>Resources: Creating the resources needed to mobilize others</li>
<li>Leaders: Selecting leaders in our community and using the resources we created to mobilize them.</li>
<li>Mentorship: Helping our leaders mobilize others through mentorship.</li>
</ol>
<p>First up, resources!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/22"><img src="https://svbtleusercontent.com/acymkas4ouffug_small.png" alt="open-source-strangeloop-2016-023.png"></a></p>
<p>To create resources, we did an exercise focusing on the “Working Open” aspect of open source. How do outsiders become insiders on our projects? We’re going to do this exercise now as the audience participation section of the talk!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/23"><img src="https://svbtleusercontent.com/oe13gftyhfafw_small.png" alt="open-source-strangeloop-2016-024.png"></a></p>
<p>Think of a place you felt welcomed the first time you visited. This can be in person or online. I’ll give you a minute to think of a place in your head.</p>
<p>Okay, what places did you think of?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li><em>Strange Loop</em></li>
<li><em>Canada</em></li>
<li><em>College</em></li>
<li><em>Niagara Falls</em></li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/24"><img src="https://svbtleusercontent.com/ml6dhkfw13l5a_small.png" alt="open-source-strangeloop-2016-025.png"></a></p>
<p>Now, what made it welcoming?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li>
<em>Strange Loop</em>
<ul>
<li><em>Everyone is friendly and wants to know where you’re from and what you do. >> friendly, human welcome</em></li>
<li><em>Food and snacks. >> takes care of our needs</em></li>
<li><em>Smaller Preconf events >> makes it easy to find connections</em></li>
<li><em>Opportunity grants >> makes it easy to get involved</em></li>
</ul>
</li>
<li>
<em>College</em>
<ul>
<li><em>Orientation week >> orient people to their new environment, show them where they can get involved and make friends</em></li>
</ul>
</li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/25"><img src="https://svbtleusercontent.com/rhg4jie624nda_small.png" alt="open-source-strangeloop-2016-026.png"></a></p>
<p>How can we apply these to software projects?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li>
<em>friendly, human welcome</em>
<ul>
<li><em>say hi and welcome new people in chat, mailing list, etc</em></li>
</ul>
</li>
<li>
<em>take care of our needs</em>
<ul>
<li><em>clear installation instructions, contributing guidelines</em></li>
</ul>
</li>
<li>
<em>make it easy to get involved</em>
<ul>
<li><em>good README, starter issues for new people</em></li>
</ul>
</li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/26"><img src="https://svbtleusercontent.com/pmtqcqwjp4ifw_small.png" alt="open-source-strangeloop-2016-027.png"></a></p>
<p>We went through this exercise and came up with a bunch of ways to make open source projects more welcoming, distilled them into these seven points, and put together a handout for each.</p>
<p>I think we came up with a lot of these in the exercise we just did!</p>
<ol>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/github_for_collaboration/">Public repository</a>: make sure your code, history, and discussion are public and available on the web.</li>
<li>
<a href="http://choosealicense.com/">Open license</a>: this goes back to the official Open Source definition. Make sure your code is licensed in a way that others can legally contribute and remix your work.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/writing_readme/">README</a>: Especially with GitHub, this is often people’s first introduction to a project. Be welcoming!</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">Roadmap</a>: At the very least, break down what you plan to do into issues. This way people know how they can get involved and what work you’re looking for.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/code_of_conduct/">Code of Conduct</a>: Collaboration is hard and collaboration with a diverse group can be messy. A code of conduct is a good step towards making people feel safe and outlining the behaviour expected in the group.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/contributing/">CONTRIBUTING.md</a>: This is another file that has become more important because of the GitHub experience. Your contributing guidelines can outline how a new contributor can participate in your community.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/">Mentorship</a>: This is a larger topic that covers both the attitude and strategy needed to make something welcoming and fuel a movement.</li>
</ol>
<p>I’ll be sharing more about each of these steps later in the talk.</p>
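<p>As a concrete sketch, the file-level items on the checklist (open license, README, code of conduct, contributing guidelines) can be scaffolded in a few lines. This is an illustration only – the project name and all file contents here are placeholders, not official templates:</p>

```python
from pathlib import Path

# Hypothetical project name; the file names follow the checklist's conventions.
root = Path("my-open-project")
root.mkdir(exist_ok=True)

# 2. Open license: paste the full text chosen at choosealicense.com (e.g. MIT).
(root / "LICENSE").write_text("MIT License\n(full text from choosealicense.com)\n")

# 3. README: often a newcomer's first introduction to the project -- be welcoming.
(root / "README.md").write_text(
    "# my-open-project\n\n"
    "One sentence on what this project does and who it is for.\n\n"
    "## Getting started\n\nClear installation instructions go here.\n"
)

# 5. Code of Conduct: outline the behaviour expected in the group.
(root / "CODE_OF_CONDUCT.md").write_text("Expected behaviour, and how to report problems.\n")

# 6. Contributing guidelines: how a new contributor can participate.
(root / "CONTRIBUTING.md").write_text("How to set up, find starter issues, and open a pull request.\n")

print(sorted(p.name for p in root.iterdir()))
# -> ['CODE_OF_CONDUCT.md', 'CONTRIBUTING.md', 'LICENSE', 'README.md']
```

<p>The remaining points (public repository, roadmap, mentorship) are about where this folder lives and how you run it, not files you can generate.</p>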
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/27"><img src="https://svbtleusercontent.com/8nmagxqybcrv0q_small.png" alt="open-source-strangeloop-2016-028.png"></a></p>
<p>Next part of ‘Fueling the Movement’ is investing and mobilizing leaders. We can use the resources we just created to mobilize some of our more involved community members.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/28"><img src="https://svbtleusercontent.com/cdfjwu7oxv6zjq_small.png" alt="open-source-strangeloop-2016-029.png"></a></p>
<p>We did this at our first <a href="http://mozillascience.github.io/working-open-workshop/">Working Open Workshop</a> in February in Berlin. We brought together some of our existing project leads and more active community members. This was a group of people passionate about what we’re doing and eager to learn skills that would help their work be more open.</p>
<p>We put on a two day workshop going over most of the lessons from the Open Source Checklist. We built in lots of time for group work where participants could start applying the lessons they’ve learned to their open source projects.</p>
<p>This was a great start, but we wanted to keep up momentum after the workshop. We’ve all done weekend courses and workshops where we leave with the best intentions, but then life gets in the way and we forget. To combat this, we offered 1:1 mentorship after the workshop.</p>
<p>We planned this workshop to happen three months before our <a href="https://science.mozilla.org/programs/events/global-sprint-2016">Global Sprint</a>, a two day hackathon on open source and open data projects. The 1:1 mentorship would occur over the three months preparing the projects for the Sprint.</p>
<p>Now we’re going to draw out the movement in action! </p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/29"><img src="https://svbtleusercontent.com/rpfhknxuqhe8w_small.png" alt="open-source-strangeloop-2016-030.png"></a></p>
<p>We start here with Abby (that’s me!) and <a href="https://twitter.com/auremoser">Aurelia</a>, Community Lead for the Mozilla Science Lab. Aurelia is also a strong open source developer in her own right. The two of us decided to offer mentorship to all Working Open Workshop (WOW) participants.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/30"><img src="https://svbtleusercontent.com/qncfoofadl41aq_small.png" alt="open-source-strangeloop-2016-032.png"></a></p>
<p>27 people attended WOW.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/32"><img src="https://svbtleusercontent.com/hxp82y9pelfg_small.png" alt="open-source-strangeloop-2016-034.png"></a></p>
<p>25 of them signed up for 1:1 mentorship. We called this group the Open Leadership Cohort (OLC).</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/34"><img src="https://svbtleusercontent.com/inn2scpgrgilzq_small.png" alt="open-source-strangeloop-2016-035.png"></a></p>
<p>We met with each project every two weeks for a quick 30-minute check-in.</p>
<p>We started our mentorship meetings by setting goals. WOW was fresh in their minds! We helped set goals around:</p>
<ul>
<li>Their community: what do they want their contributor base / user base to look like?</li>
<li>Their product: Will they ship a new feature or release an MVP at the sprint?</li>
</ul>
<p>Then, we set a loose plan around how to accomplish this over three months. This set us up to be able to do lightweight check-ins every two weeks to see how things are going and where we need to troubleshoot.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/35"><img src="https://svbtleusercontent.com/cnxgxrzjdo1oba_small.png" alt="open-source-strangeloop-2016-038.png"></a></p>
<p>As soon as we started, 8 new people were added to the program since many projects had co-leads who wanted to join in.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/38"><img src="https://svbtleusercontent.com/fsbgaaqbth5dw_small.png" alt="open-source-strangeloop-2016-040.png"></a></p>
<p>The yellow nodes are all the people that made significant contributions to mentored projects at the Global Sprint at the end of this round of mentorship. The contributions were significant enough that the project lead decided to give them a shout-out on the <a href="https://science.mozilla.org/programs/events/project-call-june-23">Mozilla Science Project Call</a>.</p>
<p>It was great to see the project leads start to engage and mentor new contributors on their projects.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/40"><img src="https://svbtleusercontent.com/wj3jrfwf45ndgq_small.png" alt="open-source-strangeloop-2016-041.png"></a></p>
<p>For a bit more background on the Global Sprint, here’s a picture from our 2015 Global Sprint. This year, we had 40 sites around the world all hacking from 9-5 in their time zones.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/41"><img src="https://svbtleusercontent.com/3o8ekis13paqg_small.png" alt="open-source-strangeloop-2016-042.png"></a></p>
<p>We saw a massive increase in participation through GitHub activity this year. I think this is directly linked to the resources we made on working openly, which we offered to all participating projects.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/42"><img src="https://svbtleusercontent.com/pnkknjd9idvntq_small.png" alt="open-source-strangeloop-2016-045.png"></a></p>
<p>Now that we’d mobilized the leaders, we wanted to work with them as they mobilized others. We did this through more mentorship.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/45"><img src="https://svbtleusercontent.com/jnxavbxwwzymg_small.png" alt="open-source-strangeloop-2016-047.png"></a></p>
<p>We selected a few of the people we mentored to become mentors in round 2. We intentionally kept the group of mentors small as we tested this out.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/47"><img src="https://svbtleusercontent.com/8ltufo1ensoq_small.png" alt="open-source-strangeloop-2016-049.png"></a></p>
<p>We wanted to test out this type of mentorship around open source in other programs. We asked each program to nominate a few community members for mentorship. We have participants from Open Science, Internet of Things, Internet Policy &amp; Advocacy, and more. We paired each mentor with 1-2 participants.</p>
<p>This round of the program started mid-August and is running till the <a href="http://mozillafestival.org/">Mozilla Festival</a> (MozFest), Oct 28-30 in London UK. MozFest is the world’s leading event for and by the open Internet movement. All the participants and mentors in the program will be running sessions at MozFest – we’re using this program to help prepare their projects for the festival.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/49"><img src="https://svbtleusercontent.com/vcp1iuw9fvdmxg_small.png" alt="open-source-strangeloop-2016-051.png"></a></p>
<p>Now we’re going to look at a few stories and lessons we’ve learned going through this experience.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/51"><img src="https://svbtleusercontent.com/rpnirpwa962nra_small.png" alt="open-source-strangeloop-2016-052.png"></a></p>
<p>I’m going to go through each lesson from the Open Source Checklist and tell you a story about how that lesson affected someone in the mentorship program.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/52"><img src="https://svbtleusercontent.com/d9hbqurvvv4vq_small.png" alt="open-source-strangeloop-2016-053.png"></a></p>
<p>First up is having a Public Repository and looking at Achintya’s story.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/53"><img src="https://svbtleusercontent.com/gga021dam5gptg_small.png" alt="open-source-strangeloop-2016-054.png"></a></p>
<p><a href="http://twitter.com/raoofphysics">Achintya</a> is a science communicator at <a href="https://home.cern/">CERN</a> and a PhD student in scicomm at <a href="http://www.uwe.ac.uk/">UWE Bristol</a>. We’re going to talk about how GitHub usage helped him centralize and organize efforts around his project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/54"><img src="https://svbtleusercontent.com/v3myg7lp9sodfg_small.png" alt="open-source-strangeloop-2016-055.png"></a></p>
<p>Achintya has an interesting project, <a href="https://opencosmics.github.io/">Open Cosmics: Cosmic-ray physics for everyone!</a></p>
<p>For a bit of science background: cosmic rays are high-energy particles that bombard the earth’s atmosphere. This produces showers of particles that we can detect on the earth’s surface. You can even detect these particles with your phone by installing <a href="https://crayfis.io/">CRAYFIS</a>. You can also get a pocket sized detector from <a href="http://cosmicpi.org/">Cosmic Pi</a>.</p>
<p>The problem that Achintya is tackling is that there are all sorts of ways to measure cosmic-rays, but each project stores the data in different formats. Achintya’s project, Open Cosmics, attempts to bring together all these efforts and help with interoperability and data standards.</p>
<p>You may have noticed in the “movement graph” that Achintya brought three additional project leads onto this project. He was in a unique position where he acted as a facilitator between all the projects collecting cosmic-ray data.</p>
<p>At the end of our first round of mentorship, when we asked Achintya what he found most helpful, he said it was learning how to use GitHub for project management. GitHub gave his community a central place to communicate and the tools he needed to organize and discuss work.</p>
<p>Now, Achintya is mentoring two other projects!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/55"><img src="https://svbtleusercontent.com/jk24supopxmrwa_small.png" alt="open-source-strangeloop-2016-056.png"></a></p>
<p>So, make sure your code is available! At the Mozilla Foundation we rely a lot on <a href="http://github.com/">GitHub</a> and have produced some training on <a href="http://mozillascience.github.io/working-open-workshop/github_for_collaboration/">GitHub for collaboration</a>. But there are many <a href="https://bitbucket.org/">other</a> <a href="http://gitlab.com/">services</a> you can use for your public repository.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/56"><img src="https://svbtleusercontent.com/9fruvayc0g6vtw_small.png" alt="open-source-strangeloop-2016-057.png"></a></p>
<p>Next, we’re going to look at having an open license and how that helped Rob.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/57"><img src="https://svbtleusercontent.com/aahziwo9y2ryjq_small.png" alt="open-source-strangeloop-2016-058.png"></a></p>
<p>This is <a href="https://twitter.com/robertjsullivan">Rob</a>! He was fairly new to open source when he joined us.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/58"><img src="https://svbtleusercontent.com/moxunedclgxlcw_small.png" alt="open-source-strangeloop-2016-059.png"></a></p>
<p>This is a blurry Rob at our Working Open Workshop. We’re all doing the <a href="https://github.com/mozillascience/working-open-workshop/issues/42">‘Open Web Stretch’</a> here. I believe they’re all “leaning left to avoid the NSA”.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/59"><img src="https://svbtleusercontent.com/canqro2nxr0na_small.png" alt="open-source-strangeloop-2016-060.png"></a></p>
<p>Rob’s project was creating a tool built around <a href="http://www.ncbi.nlm.nih.gov/pmc/">PubMed Central</a>, a repository for life science and biomedical research. He created <a href="http://pmc-ref.herokuapp.com/">PMC-ref</a>, a tool where you input a paper, then it checks which references in the paper are free to read.</p>
<p>It’s a pretty simple tool that can have a huge impact for a life sciences researcher. Especially if they don’t have access to all the big journals.</p>
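<p>At its core, a tool like PMC-ref is a filter over a paper’s reference list. Here’s a minimal sketch of that idea, assuming each reference record already carries a flag saying whether it’s deposited in PubMed Central – the real tool looks this up against PubMed Central itself, and the function and field names here are hypothetical:</p>

```python
def free_to_read(references):
    """Return only the references readable without a journal subscription."""
    return [ref for ref in references if ref.get("in_pmc")]

# Toy reference list for a single paper; titles and flags are made up.
paper_refs = [
    {"title": "Gene expression atlas", "in_pmc": True},
    {"title": "Proprietary assay methods", "in_pmc": False},
    {"title": "Open neuroimaging pipeline", "in_pmc": True},
]

for ref in free_to_read(paper_refs):
    print(ref["title"])
```

<p>The hard part PMC-ref actually solves is populating that flag for real papers; the filtering itself is this simple.</p>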
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/60"><img src="https://svbtleusercontent.com/0om2wfgf485ig_small.png" alt="open-source-strangeloop-2016-061.png"></a></p>
<p>I paired Rob with this lesson because, going through his GitHub repo, I saw that he added an open license just days after the Working Open Workshop. Yay MIT license!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/61"><img src="https://svbtleusercontent.com/f1q6z0dxugbsvq_small.png" alt="open-source-strangeloop-2016-062.png"></a></p>
<p>If you look at the yellow dot linked to him, Rob received his first open source contribution ever during the Global Sprint! The contributor, Deborah, actually wrote <a href="http://deborah-digges.github.io/2016/06/04/Mozilla-Science-Lab-Global-Sprint-16/">a blog post</a> about her experience at the Global Sprint and contributing to this project. The fact that he had an open license made this possible and legal.</p>
<p>Rob is now mentoring Minn. Minn is running an interesting <a href="https://github.com/MozillaFoundation/mozfest-program-2016/issues/635">session at MozFest</a> around facial recognition to create art and generate metadata.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/62"><img src="https://svbtleusercontent.com/3kltoeppaqwkhq_small.png" alt="open-source-strangeloop-2016-063.png"></a></p>
<p><a href="http://choosealicense.com/">choosealicense.com</a> is a great resource for picking an open license for your software. For something easy, Mozilla Science recommends <a href="http://choosealicense.com/licenses/mit/">MIT</a> or <a href="http://choosealicense.com/licenses/bsd-2-clause/">BSD</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/63"><img src="https://svbtleusercontent.com/zquxuofccz4ypa_small.png" alt="open-source-strangeloop-2016-064.png"></a></p>
<p>Next we have <a href="http://twitter.com/kirstie_j">Kirstie</a> who really embraced writing a great README and having welcoming project communication.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/64"><img src="https://svbtleusercontent.com/yo0p6ljgt0j1bq_small.png" alt="open-source-strangeloop-2016-065.png"></a></p>
<p>This is Kirstie! She’s a postdoctoral researcher in the <a href="http://www.bmu.psychiatry.cam.ac.uk/">Brain Mapping Unit</a> at the <a href="http://www.cam.ac.uk/">University of Cambridge</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/65"><img src="https://svbtleusercontent.com/oftbvcdmeaf8ww_small.png" alt="open-source-strangeloop-2016-066.png"></a></p>
<p>We recently announced that Kirstie is one of the new <a href="https://science.mozilla.org/programs/fellowships">Mozilla Fellows for Science</a> this year! Mozilla Science has a fellowship program for researchers who want to influence the future of open science and data sharing within their communities. Fellows spend 10 months as community catalysts at their institutions, building lasting change in the global open science community.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/66"><img src="https://svbtleusercontent.com/j4o63kdpe4jtw_small.png" alt="open-source-strangeloop-2016-067.png"></a></p>
<p>During the first mentorship round, Kirstie worked on her project <a href="http://stemmrolemodels.com/">STEMM Role Models</a>, which aims to inspire future generations by providing exciting and diverse speakers for conferences. She built a simple database of great speakers for conference organizers to use when planning an event.</p>
<p>Kirstie took to heart the idea that to make our projects as welcoming as possible, we need to have clear and friendly communication. Even here, on her draft landing page, she makes a real effort to welcome everyone at the top.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/67"><img src="https://svbtleusercontent.com/tfevokpgie6wfa_small.png" alt="open-source-strangeloop-2016-068.png"></a></p>
<p>Looking back at the mentorship graph, Kirstie did such a great job explaining her project that she was able to engage a couple of contributors who did significant work building an MVP (minimum viable product). Kirstie has a background in neuroscience (not web development!), so watching her bring technologists and designers together to build something she is passionate about was really inspiring!</p>
<p>Now, as Kirstie begins her fellowship, she’s mentoring two projects including a group from the Detroit Community Technology Project. They’re <a href="https://github.com/MozillaFoundation/mozfest-program-2016/issues/669">addressing gentrification through storytelling</a> technology and plan to have a booth at MozFest.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/68"><img src="https://svbtleusercontent.com/qhl28xdhy1urg_small.png" alt="open-source-strangeloop-2016-069.png"></a></p>
<p>We have a few resources designed to help you write a good README and communicate your project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/69"><img src="https://svbtleusercontent.com/e3pibjrdpp9kpq_small.png" alt="open-source-strangeloop-2016-070.png"></a></p>
<p>First is the <a href="http://mozillascience.github.io/working-open-workshop/writing_readme/">Open Project Communication handout</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/70"><img src="https://svbtleusercontent.com/w8d1urgfbrxfg_small.png" alt="open-source-strangeloop-2016-071.png"></a></p>
<p>In the handout, we include the <a href="http://acabunoc.github.io/open-canvas">Open Canvas</a>, a tool I find very helpful when starting an open source project. Open Canvas is remixed from <a href="https://leanstack.com/lean-canvas/">Lean Canvas</a>, a popular tool from the startup world that helps you make a one page business plan.</p>
<p>I worked with <a href="https://twitter.com/jordanmayes">Jordan Mayes</a> from <a href="https://tophat.com/">Top Hat</a>, to remix this for open source projects. We removed some boxes that didn’t apply and added more thinking around community and contributors. You can read more about the process of creating Open Canvas in his <a href="https://medium.com/@jordanmayes/open-canvas-d6b2d346491c#.u60a10io8">blog post</a>.</p>
<p>The Canvas forces you to think through the problems you’re addressing and your proposed solution. We divide the canvas into two main sections, Product and Community, to get people to think about their community, what they’re building, and how others will get involved.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/71"><img src="https://svbtleusercontent.com/sqtqwbjorxxixw_small.png" alt="open-source-strangeloop-2016-072.png"></a></p>
<p>Next in our checklist is writing a roadmap, featuring Bastian.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/72"><img src="https://svbtleusercontent.com/wpvydksorfx2eq_small.png" alt="open-source-strangeloop-2016-073.png"></a></p>
<p>Here’s Bastian!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/73"><img src="https://svbtleusercontent.com/gm6mhxsylf9amg_small.png" alt="open-source-strangeloop-2016-074.png"></a></p>
<p>When I first introduced Bastian to his new mentee over email, the mentee replied with “Thanks for introducing the Mark Zuckerberg of open-source genetics! What a great mentor to have!” and linked to <a href="http://fusion.net/story/47945/this-guy-is-the-mark-zuckerberg-of-open-source-genetics/">this article</a>. I had no idea this article existed! But I am not surprised, considering Bastian’s project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/74"><img src="https://svbtleusercontent.com/oq66cqlwu1yw_small.png" alt="open-source-strangeloop-2016-075.png"></a></p>
<p>Bastian is a PhD student in bioinformatics, and was working on <a href="http://opensnp.org/">openSNP</a> (pronounced open snip). SNP stands for Single Nucleotide Polymorphism, a type of mutation that can occur in your DNA. openSNP lets you upload your <a href="https://www.23andme.com">23andMe</a> (or any other genotyping service) results online. You can learn more about your results, find others with similar genetic variations, and help scientists discover more genetic associations.</p>
<p>When Bastian first uploaded his genetic data to GitHub (before he made openSNP), he received an email from someone who found the data online and analyzed his genetic report. The analysis said he might have an increased risk of prostate cancer. Since this type of mutation is inherited, he told his dad to go to the doctor. They found a tumour growing in his dad’s prostate, but they were able to catch it early. His dad is alive and well today.</p>
<p>openSNP benefited greatly from going through the Roadmapping exercise! I worked with Bastian and Philipp (co-lead on the project) to plan out a few features and fixes that needed to be done. This helped them identify the need for new volunteers and shape a few projects they could submit to <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> (GSoC).</p>
<p>By making a roadmap, they were able to accomplish a tremendous amount in a few short months. You can read about their GSoC experience on the <a href="https://opensnp.wordpress.com/2016/08/24/google-summer-of-code-wrap-up/">openSNP blog</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/75"><img src="https://svbtleusercontent.com/eshh9tlxvgqk7a_small.png" alt="open-source-strangeloop-2016-076.png"></a></p>
<p>We have a <a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">couple of exercises</a> you can go through to write a roadmap for your project. Writing down what you plan to work on helps new contributors know where they can get involved.</p>
<p>A roadmap can be anything from a simple collection of issues in your issue tracker to a comprehensive wiki outlining the future of your project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/76"><img src="https://svbtleusercontent.com/ckvp1iofzkmspw_small.png" alt="open-source-strangeloop-2016-077.png"></a></p>
<p><a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">This handout</a> walks you through picking a few milestones and breaking down the tasks needed to get there.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/77"><img src="https://svbtleusercontent.com/fauvnq2j8t85wq_small.png" alt="open-source-strangeloop-2016-078.png"></a></p>
<p>Next up is codes of conduct with Richard.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/78"><img src="https://svbtleusercontent.com/n6kbfn2cgr9qlg_small.png" alt="open-source-strangeloop-2016-079.png"></a></p>
<p>You might notice from the graph that Richard wasn’t part of the first round of mentorship. Richard was actually a 2015 Mozilla Fellow for Science. He did some amazing open source work during his fellowship year, so I thought he would be a great mentor.</p>
<p>Notice the moss beard in his avatar.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/79"><img src="https://svbtleusercontent.com/fma71elxkzipw_small.png" alt="open-source-strangeloop-2016-080.png"></a></p>
<p>Sadly, he doesn’t walk around with a moss beard in real life. This is a picture of Richard and his partner Steph at MozFest 2015. MozFest is so awesome that we had capes, buttons, <em>and</em> fox masks. You should come.</p>
<p>I listed Richard under codes of conduct since he has an incredibly thoughtful approach to writing documentation for communities.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/80"><img src="https://svbtleusercontent.com/vobdpyy9sco2yw_small.png" alt="open-source-strangeloop-2016-081.png"></a></p>
<p>You can see the code of conduct Richard wrote in the last link, <a href="http://www.slidewinder.io/docs/01_code_of_conduct.html">Slidewinder Code of Conduct</a>.</p>
<p>In this particular code of conduct, he has a section called “Open [Source/Culture/Tech] Citizenship” that outlines the goals of having an open culture and encourages others to reward welcoming behaviour. I think this is incredibly important as we’re trying to build welcoming communities.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/81"><img src="https://svbtleusercontent.com/p1lfxdpyszvotq_small.png" alt="open-source-strangeloop-2016-082.png"></a></p>
<p>If you get stuck, <a href="https://github.com/mozillascience/code_of_conduct">Mozilla Science has a CC0 code of conduct</a> you’re free to take, remix, and use however you like!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/82"><img src="https://svbtleusercontent.com/worg8zpjxhortg_small.png" alt="open-source-strangeloop-2016-083.png"></a></p>
<p>Next is Contributor guidelines and Tim.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/83"><img src="https://svbtleusercontent.com/cfiqxvls731o1q_small.png" alt="open-source-strangeloop-2016-084.png"></a></p>
<p><a href="http://twitter.com/betatim">Tim</a> was a physicist at CERN when we started the program. He recently moved to Zurich and is now a tech consultant.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/84"><img src="https://svbtleusercontent.com/ojbzkf0iamhnyw_small.png" alt="open-source-strangeloop-2016-085.png"></a></p>
<p>Tim was working on <a href="http://everware.xyz/">Everware</a>, a project trying to address reproducibility in scientific software. Everware uses <a href="https://www.docker.com/">Docker</a> to launch an instance of a <a href="http://jupyter.org/">Jupyter notebook</a> directly from a GitHub repository.</p>
<p>Tim cares a <em>lot</em> about research reproducibility. I first met him at a hackathon at CERN, where he launched Everware and ruffled some feathers with his insistence that we need to focus on better research reproducibility.</p>
<p>Now, Tim’s mentoring two other groups including one looking at research reproducibility, <a href="http://refigure.org/">ReFigure</a>.</p>
<p>Tim and the other Everware developers wrote some great contributing guidelines that helped quite a few people get involved before and during the Global Sprint.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/85"><img src="https://svbtleusercontent.com/ubl5xay4kytyzg_small.png" alt="open-source-strangeloop-2016-086.png"></a></p>
<p>For resources, we have a <a href="http://mozillascience.github.io/working-open-workshop/contributing/">guide</a> that walks you through creating your contributing guidelines.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/86"><img src="https://svbtleusercontent.com/rvootrfpelna_small.png" alt="open-source-strangeloop-2016-087.png"></a></p>
<p>The file should be named CONTRIBUTING.md and placed in your repository’s root directory.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/87"><img src="https://svbtleusercontent.com/jrpve7qvr06b9q_small.png" alt="open-source-strangeloop-2016-088.png"></a></p>
<p>We break down the different parts of your contributing guidelines in the exercise.</p>
<p>Open with some cheer! You should celebrate someone looking to contribute to your project. Then, introduce the document and explain what these guidelines are for.</p>
<p>The bulk of the document should be how-to guides on contributing, along with the norms the group follows, like a style guide.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/88"><img src="https://svbtleusercontent.com/jcu3ordznubzwq_small.png" alt="open-source-strangeloop-2016-089.png"></a></p>
<p>The CONTRIBUTING.md naming convention has become popular since GitHub integrates it into their interface. If there’s a CONTRIBUTING.md file in the root directory of a project, GitHub will display this notice at the top of the page whenever someone opens a new issue or pull request.</p>
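<p>Most of the checklist items in this talk end up as files in a repository’s root directory. As a rough, hypothetical sketch (this is not an official Mozilla Science tool, and a filename like ROADMAP.md is just a common convention, unlike LICENSE, CONTRIBUTING.md, and CODE_OF_CONDUCT.md, which GitHub recognizes), you could audit a repo like this:</p>

```python
from pathlib import Path

# "Working open" checklist files covered in this talk.
CHECKLIST = [
    "LICENSE",             # open license (e.g. MIT or BSD)
    "README.md",           # project communication
    "ROADMAP.md",          # where the project is headed (convention only)
    "CODE_OF_CONDUCT.md",  # community norms
    "CONTRIBUTING.md",     # how to get involved
]

def missing_checklist_files(repo_dir):
    """Return the checklist files absent from a repo's root directory."""
    root = Path(repo_dir)
    return [name for name in CHECKLIST if not (root / name).exists()]
```

<p>Running it against your project root gives you a quick to-do list for working open.</p>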
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/89"><img src="https://svbtleusercontent.com/pn7cs5e3bd8gfa_small.png" alt="open-source-strangeloop-2016-090.png"></a></p>
<p>The last step is our catch-all for attitude and process, Mentorship. Here, I’m highlighting Madeleine since she’s done a great job of including others and delegating to them.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/90"><img src="https://svbtleusercontent.com/ujg57pl9qbjrfa_small.png" alt="open-source-strangeloop-2016-091.png"></a></p>
<p>Right off the bat, you can see how connected her node is in the graph since she’s been able to bring so many people into her work.</p>
<p><a href="https://twitter.com/mbonsma">Madeleine</a> is a PhD student at the <a href="https://www.utoronto.ca/">University of Toronto</a>. Madeleine not only runs an open source project with us, but she also runs a weekly open science meetup at her school, the <a href="https://uoftcoders.github.io/studyGroup/">UofT scientific coders</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/91"><img src="https://svbtleusercontent.com/zv4lxsl8kwka_small.png" alt="open-source-strangeloop-2016-092.png"></a></p>
<p>Madeleine (on the left) actually spoke about running events at our Working Open Workshop because of her experience with the UofT scientific coders.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/92"><img src="https://svbtleusercontent.com/mdeknnugfbqahw_small.png" alt="open-source-strangeloop-2016-093.png"></a></p>
<p>Her project is <a href="https://science.mozilla.org/projects/pathogens">phageParser</a> which uses open data to better understand CRISPR systems. CRISPR is all the rage nowadays because it’s opened the door for faster and cheaper targeted gene editing.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/93"><img src="https://svbtleusercontent.com/zh4a1vzaugetxq_small.png" alt="open-source-strangeloop-2016-094.png"></a></p>
<p>CRISPR stands for Clustered Regularly Interspaced Short Palindromic Repeats. In the diagram, the black diamonds are repeating DNA. In between the repeats are spacers. Spacers are pieces of DNA from a virus that attacked the system. The CRISPR system saves the virus DNA so that if it comes across the virus again, it can recognize it and cut it out, hence targeted gene editing.</p>
<p>Madeleine’s group realized that there are many openly published genomes with CRISPR systems. Her project is trying to collect and analyze these systems to try to find patterns and learn more about CRISPR.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/94"><img src="https://svbtleusercontent.com/odvudpzbokhfa_small.png" alt="open-source-strangeloop-2016-095.png"></a></p>
<p>Madeleine was able to engage so many people during the Global Sprint that she ran out of tasks for new contributors. I’ve noticed that Madeleine is naturally good at finding tasks and asking others for help, both in her project and with the UofT scientific coders.</p>
<p>At her first UofT scientific coders meeting, she delegated registering a club, managing the GitHub repository, and baking cookies for next week. Most of those people are now the green dots co-leading the group.</p>
<p>For those of us who need some instructions on how to delegate and involve others, we have a few exercises to help you start thinking about mentorship.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/95"><img src="https://svbtleusercontent.com/xsy4gzzwmhhuvg_small.png" alt="open-source-strangeloop-2016-096.png"></a></p>
<p><a href="https://wiki.mozilla.org/Good_first_bug">Good first bugs</a> can be a great way to give a new contributor a small win when they first start working on your project. Identify a few smaller issues that would be appropriate for someone completely new to the project. Ideally, the hardest part of completing the issue would be setting up their development environment.</p>
<p>This helps you reward new contributors sooner.</p>
<p>Another exercise that helps you think about a contributor’s progression through a project is the <a href="http://mozillascience.github.io/working-open-workshop/personas_pathways/">Personas & Pathways exercise</a>. This gets you to create a persona of an ideal contributor. Then, you can outline their pathway from when they first hear about the project, to their first contribution, to becoming a maintainer, to maybe even running the project when you’re ready to hand it off.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/96"><img src="https://svbtleusercontent.com/fv9gqud38cpbuw_small.png" alt="open-source-strangeloop-2016-097.png"></a></p>
<p>To summarize, these resources are helping us mobilize leaders in the open source movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/97"><img src="https://svbtleusercontent.com/3sdtk6bm8ce1wa_small.png" alt="open-source-strangeloop-2016-098.png"></a></p>
<p>Combined with trainings and mentorship, we’re working to fuel the open source movement in science, advocacy, learning, IoT and more.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/98"><img src="https://svbtleusercontent.com/frb5ejs60uf1g_small.png" alt="open-source-strangeloop-2016-099.png"></a></p>
<p>I mentioned <a href="http://mozillafestival.org/">MozFest</a> a few times; you should all come! It’s a lot of fun, and you can meet many of the people and projects I highlighted in this talk.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/99"><img src="https://svbtleusercontent.com/lfnxoupywprx3g_small.png" alt="open-source-strangeloop-2016-100.png"></a></p>
<p>MozFest really is a place where “you can make things that matter”.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/100"><img src="https://svbtleusercontent.com/zhggq9yluqlhng_small.png" alt="open-source-strangeloop-2016-101.png"></a></p>
<p>Huge thanks to the many people who took part in the mentorship program as participants, mentors, and content creators. There are a lot of people who made this happen.</p>
<p><a href="https://svbtleusercontent.com/nbbgkc67xnqtnw.gif"><img src="https://svbtleusercontent.com/nbbgkc67xnqtnw_small.gif" alt="awesome.gif"></a></p>
<p>You’ve all been awesome!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/103"><img src="https://svbtleusercontent.com/r216vgh8vdgw_small.png" alt="open-source-strangeloop-2016-104.png"></a></p>
<p>Thank you!<br>
Slides: <a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/">acabunoc.github.io/open-source-strangeloop-2016</a></p>
<p><em>Note:<br>
I talk about projects from a lot of different fields in this presentation. I’m not an expert in all these fields, so I may have explained something wrong here. Happy to make corrections! Please be kind!</em></p>
tag:blog.abigailcabunoc.com,2014:Post/increasing-developer-engagement-at-mozilla-science-learning-advocacy2016-07-07T04:27:46-07:002016-07-07T04:27:46-07:00Increasing developer engagement at Mozilla {Science|Learning|Advocacy|++}<p>I love watching a community come together to solve problems.</p>
<p>The past two years, I’ve been testing ways to engage contributors on open source science projects. As Lead Developer for the <a href="http://science.mozilla.org/">Mozilla Science Lab</a>, I built prototypes in the open with our community while mentoring others to do the same. We’ve seen exponential growth in contributorship and mentorship, and I am incredibly proud of the work we accomplished.</p>
<p>I’m excited to be moving into a role where I’ll be extending the contributor pathways we’ve built in the Science Lab to other programs within the Foundation. As <strong>Lead Developer, Open Source Engagement</strong> at the Mozilla Foundation, I will be shaping how we interact with the open source community not just in Science, but also in <a href="https://learning.mozilla.org/">Learning</a>, <a href="https://advocacy.mozilla.org">Internet Policy & Advocacy</a> and newer efforts like <a href="https://blog.webmaker.org/exploring-the-internet-of-things-with-mozilla">Internet of Things</a> and <a href="https://blog.webmaker.org/new-partnership-with-un-women-to-teach-key-digital-skills-to-women">Women & Web Literacy</a>.</p>
<p>Mozilla’s mission is to ensure the Internet is a global public resource, open and accessible to all. People are key as we work towards Mozilla’s mission through the lens of each program.</p>
<h2 id="starting-in-science_2">Starting in Science <a class="head_anchor" href="#starting-in-science_2">#</a>
</h2>
<p>Starting this experiment among academic researchers in Mozilla Science helped prepare us to reach the broader Mozilla community.</p>
<p>The Science Lab community is a cross section of Mozilla’s community. Within Mozilla Science, we’ve hosted projects exploring <a href="https://app.mozillafestival.org/#_session-123">IoT</a>, <a href="https://app.mozillafestival.org/#_session-290">research policy</a>, <a href="https://science.mozilla.org/projects/KirstieJane-STEMMRoleModels">women in STEMM</a>, education and more. These projects helped us learn how to engage a diverse community.</p>
<p>Bringing the concept of working openly to academic research has helped us understand a wide array of complex challenges. The research world is full of competition, private data and a cutthroat need to publish. These challenges forced us to articulate why open matters and to emphasize a scalable mentorship model as we work towards culture change.</p>
<h2 id="contributor-pathways_2">Contributor Pathways <a class="head_anchor" href="#contributor-pathways_2">#</a>
</h2>
<p>Modelling the contributor pathways we used within the Science Lab, we’ve found four stages needed to create a cohesive pathway for contributors.</p>
<p><img src="https://cloud.githubusercontent.com/assets/617994/16631566/b83a6f76-438d-11e6-8e59-2ebfb0a0ab9d.png" alt="contributor pathways"></p>
<ol>
<li>
<strong>Sourcing:</strong> Finding new contributors. This can happen passively on a project or more actively at an event or through a specific ask.</li>
<li>
<strong>Onboarding:</strong> Intentionally onboard new contributors to answer:
<ul>
<li>
<strong>WHY:</strong> Why Mozilla? Why open source?</li>
<li>
<strong>HOW:</strong> How do they practically contribute? What steps or skills should they know?</li>
</ul>
</li>
<li>
<strong>Prototyping:</strong> We need to build <em>with</em> our community. This gives contributors a chance to learn and practice collaborating while building new features or tools.</li>
<li>
<strong>Training & Mentorship:</strong> While prototyping, work is constantly acknowledged, rewarded and refined. As contributors learn to bring others into their work, they may take on a mentor role for newcomers.</li>
</ol>
<p>Taking these ideas, I’ll be working to see how we can define and measure a contributor pathway across the Mozilla Foundation.</p>
<h2 id="what-next_2">What Next? <a class="head_anchor" href="#what-next_2">#</a>
</h2>
<p>Over the next few months, we’ll be studying how different programs across the Mozilla Foundation work with their contributors. At the same time, I’ll be continuing to work with the volunteers and mentors in the Science Lab.</p>
<p>I’d love to hear your thoughts and feedback on the idea of setting contributor pathways across the Foundation. You can reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a> or email me directly at abby at mozillafoundation.org.</p>
<p>We’re entering an exciting time at the Mozilla Foundation as we break out of our prototyped, siloed programs and share how we’ve been successful. The Mozilla Science Lab - and our other community centred programs - will be so much stronger as we collaborate across our combined networks. Together, let’s build a better internet! </p>
tag:blog.abigailcabunoc.com,2014:Post/what-i-learned-working-at-wormbase-oicr2014-09-09T05:34:10-07:002014-09-09T05:34:10-07:00What I learned working at WormBase / OICR<p>Three weeks ago, I left the <a href="http://www.oicr.on.ca">Ontario Institute for Cancer Research</a> (OICR) to join the <a href="http://mozillascience.org">Mozilla Science Lab</a>. Yesterday would have been my five year work anniversary at OICR. Since I don’t get a plaque now, this seemed like a good time to reflect on what I’ve learned as I begin a new chapter at the Mozilla Science Lab.</p>
<p>For the majority of my time at OICR, I served as lead developer on the <a href="http://www.wormbase.org">WormBase</a> project. I learned a lot about software, leading development teams and dealing with biological data, but my biggest takeaways came from watching the interaction between the scientific research community and the web.</p>
<p>Here are three lessons I learned in my five years at WormBase / OICR:</p>
<p><strong>1. A simple web app can have a huge impact on a research community</strong></p>
<p>WormBase is a highly curated biological database for nematode (aka roundworm) research. I worked to make it as easy as possible for researchers to find and consume this information.</p>
<p>It took me a few years to realize how unique WormBase is - an entire research community, <a href="http://cedevap2014.sakura.ne.jp/">spanning</a> <a href="http://www.union.wisc.edu/ceaging/">several</a> <a href="http://www.union.wisc.edu/CeNeuro/index.html">topics</a>, depends on this tool. The information there is vital to getting new students up to speed, and it also facilitates insights and new discoveries. Thanks to regular worm meetings, you don’t have the usual barriers between fields within the worm community.</p>
<p>WormBase and the worm research community have helped each other grow over the years. Having all this information available has been hugely beneficial to anyone interested in nematodes. I want to see this happening in more areas of science.</p>
<p><strong>2. Open source and open access make science better</strong></p>
<p>We were able to build WormBase with a small development team by using many open source tools. From the web framework (<a href="http://www.catalystframework.org/">Catalyst</a>) to bioinformatics tools (<a href="http://gmod.org/wiki/GBrowse">GBrowse</a>, <a href="http://intermine.github.io/intermine.org/">Intermine</a>, more), most of WormBase was written by other people. I am grateful that so many bioinformatics research groups have embraced open source and given us tools that make the web useful for science.</p>
<p>OICR has <a href="http://en.wikipedia.org/wiki/Lincoln_Stein">some</a> <a href="http://www.bioinformatics.org/franklin/2004/">great</a> <a href="https://twitter.com/bffo">champions</a> for open source and open access in the research community. They understand that the best ideas don’t always come to the people who have access to resources today. Working beside these giants, it’s easy to imagine a world where any researcher - even the lowly undergrad - has access to tools and data to help make discoveries and further science.</p>
<p><strong>3. Doing this right is hard. We need to communicate</strong></p>
<p>Mistakes are made. Development resources and talent aren’t always available in the research world. Barriers to access range from technical to legal to personal. I don’t always understand what a worm researcher (or any researcher!) is looking for.</p>
<p>I want to see this all work - I want more discoveries, more tools and more science. But I’ve learned that this takes a lot of communication to do properly. WormBase is a huge <a href="http://www.wormbase.org/about/staff">team</a> with even more stakeholders. It works because they know and are involved in their community; they understand what they can do to help.</p>
<p>By contrast, I’m joining a team that serves a much larger research community (i.e. the <em>whole</em> research community) that I personally don’t fully understand. I am so thankful that Mozilla Science Lab is full of volunteers spanning a wide set of research fields. Together, we can understand this space and help research thrive on the open web.</p>
<p><small>Join us: <a href="https://wiki.mozilla.org/ScienceLab/Calls">community call</a>, <a href="https://mail.mozilla.org/listinfo/mozillascience">mailing list</a>, <a href="https://twitter.com/MozillaScience/">@MozillaScience</a>. And of course, you can always reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a>.</small></p>
tag:blog.abigailcabunoc.com,2014:Post/joining-the-mozilla-science-lab2014-08-21T04:52:40-07:002014-08-21T04:52:40-07:00Joining the Mozilla Science Lab!<p>Breaking news: I’ve <a href="http://mozillascience.org/welcoming-two-new-team-members-abby-cabunoc-and-bill-mills/">joined</a> the über-talented team at the <a href="http://mozillascience.org/">Mozilla Science Lab</a> as lead developer! I’ll be leading technical prototyping efforts and engaging the community about our <a href="http://mozillascience.org/code-as-a-research-object-updates-prototypes-next-steps/">technical</a> <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">projects</a>.</p>
<h4 id="why-mozilla-science-lab_4">Why Mozilla Science Lab? <a class="head_anchor" href="#why-mozilla-science-lab_4">#</a>
</h4>
<p>From <a href="http://mozillascience.org/sample-page/">mozillascience.org</a>:</p>
<blockquote class="short">
<p>The Mozilla Science Lab is a new initiative that will help researchers around the world use the open web to shape science’s future.</p>
</blockquote>
<p>I have unashamedly fallen in love with the ideals of open source and open science. I’m enamored with what openness means and what it could look like in the scientific community.</p>
<p>The need for openness in research is there: I’ve seen the struggles of data sharing, the fear of collaborating and the uncertainty of best practices. It leaves you with duplicated efforts and more file types than you can count. On the other hand, I’ve witnessed the beauty of open source software driving analysis and innovation within a community. I’ve watched ideas spark when communication lines open up. The time I spent at <a href="http://oicr.on.ca/">OICR</a> and <a href="http://wormbase.org/">WormBase</a> introduced me to openness in science in a tangible way – and it looks <em>good</em>.</p>
<p>I joined the Mozilla Science Lab because I love their mission of <strong>making the web work for science</strong>. This group has the power and means to change the culture within the research community.</p>
<p>There is incredible potential when you apply a movement that wants to “<a href="https://air.mozilla.org/nature-of-mozilla/">build the internet the world needs</a>” to scientific research – a discipline that desperately needs an open internet to build on, but doesn’t quite know it yet.</p>
<h4 id="what-now_4">What Now? <a class="head_anchor" href="#what-now_4">#</a>
</h4>
<p>It’s been a few days since I joined Mozilla and I’m already inspired by the community and Mozillians surrounding me. These people gathered around a <a href="https://www.mozilla.org/en-US/mission/">shared mission</a> – one that has and will continue to change the world we live in.</p>
<p>In the Science Lab, I’m getting to know the different people and projects involved (more on the projects soon!). These efforts would be nothing without the community (ie YOU). From researchers to developers to educators, we are here to <a href="http://mozillascience.org/get-involved/">help you learn, build and connect to others</a> with the same mission.</p>
<p>So come! Help us make research more like the web: open, collaborative and accessible.</p>
<p><small>Join us: <a href="https://wiki.mozilla.org/ScienceLab/Calls">community call</a>, <a href="https://mail.mozilla.org/listinfo/mozillascience">mailing list</a>, <a href="https://twitter.com/MozillaScience/">@MozillaScience</a>. And of course, you can always reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a>.</small></p>
tag:blog.abigailcabunoc.com,2014:Post/biocuration-2014-battle-of-the-new-curation-methods2014-04-16T14:10:37-07:002014-04-16T14:10:37-07:00Biocuration 2014: Battle of the new curation methods<p>Biocuration is incredibly important to progress in science. The process of sorting through and annotating scientific data to make it available and searchable to the public is at the heart of the ideas behind <strong>open science</strong>. I work at <a href="http://www.wormbase.org">WormBase</a> because I believe in its mission to curate our knowledge of nematode biology to make it freely available to the scientific community.</p>
<blockquote class="twitter-tweet" lang="en">
<p><a href="https://twitter.com/search?q=%23isb2014&src=hash">#isb2014</a> great turnout <a href="http://t.co/V14if4ecgr">pic.twitter.com/V14if4ecgr</a></p>— Paul Davis (@bayamo2003) <a href="https://twitter.com/bayamo2003/statuses/453173599684554752">April 7, 2014</a>
</blockquote>
<p>The Seventh <a href="http://biocuration2014.events.oicr.on.ca/biocuration"><strong>International Biocuration Conference</strong></a> (ISB2014) was held at the University of Toronto last week. The theme of the conference this year was <em>“Bridging the gap between genomes and phenomes”</em>, focusing on bringing the results of the biocuration efforts to the clinicians. However, a slightly different theme stood out to me during the meeting - the tension between different methods for improved curation. </p>
<p>There was a clear consensus that we’ve come to an inflection point in this field. It’s no longer worthwhile, or even possible, to have detailed manual curation for each piece of biological information. Data is being generated (see <a href="http://en.wikipedia.org/wiki/Next-generation_sequencing">NGS</a>) and papers are being published at a tremendous rate (<a href="http://www.slideshare.net/goodb/mturk-biocuration2014-pdf/2">>100 publications/hour</a>). Human eyes can’t keep up.</p>
<blockquote class="twitter-tweet" lang="en">
<p>“That is the slide”. Not a mistake, showing mess abundant data <a href="https://twitter.com/search?q=%23isb2014&src=hash">#isb2014</a> <a href="http://t.co/uRG6wAShEB">pic.twitter.com/uRG6wAShEB</a></p>— Marc RobinsonRechavi (@marc_rr) <a href="https://twitter.com/marc_rr/statuses/453177036262359040">April 7, 2014</a>
</blockquote>
<p>We need to look at data as a whole. Many groups have come up with ways to automate/distribute the biocuration process, with a focus on information extraction from text. Three main approaches were presented: (dictionary based) <strong>text mining</strong>, <strong>machine learning</strong> and <strong>crowdsourcing</strong>. While biocurators are civil individuals, there’s still a sense of competition between the different methods and tools.</p>
<blockquote class="twitter-tweet" lang="en">
<p>Big Data Curation panel. My current status <a href="https://twitter.com/search?q=%23ISB2014&src=hash">#ISB2014</a> <a href="http://t.co/y2QYjX21WC">pic.twitter.com/y2QYjX21WC</a></p>— Abigail Cabunoc (@abbycabs) <a href="https://twitter.com/abbycabs/statuses/453621447940780033">April 8, 2014</a>
</blockquote>
<p><small>‘Best Tweet’ award winner at #ISB2014 by yours truly. <em>Disclaimer: I did not create this meme. I saw <a href="https://twitter.com/escalant3">@escalant3</a> RT it a while ago</em></small></p>
<p>We’re at a period of unrest while the community is deciding how to handle this ‘Big Data’ we’re faced with. In the coming years, we’ll see best practices and standard tools emerge. In the meantime, here’s a brief overview of the work presented in this area.</p>
<h1 id="text-mining_1">Text mining <a class="head_anchor" href="#text-mining_1">#</a>
</h1>
<p>Text mining based on a knowledge dictionary: this has been a friend of biocuration for a long time. Everyone and their uncle has a text mining tool and strategy they love and support! Most tools focused on being a first-pass on a paper or abstract to help call out/screen information before an expert curator takes a closer look. The <a href="http://www.biocreative.org/">BioCreative</a> workshop in particular demonstrated some of the recent work in this area.</p>
<p>Overall, the community is generating much more usable and intuitive text mining tools meant to be used as a first-pass for biocuration. A couple tools that stood out to me: </p>
<ul>
<li>
<a href="http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/">PubTator</a>: If you have a list of articles, this helps you sort through and find the most relevant publications to focus on</li>
<li>
<a href="http://factoid.baderlab.org/">Factoid</a> (Bader Lab): turns an abstract into an editable model of biological processes. Really nice UI, using Cytoscape.</li>
</ul>
<p><small>Notably missing from the meeting: <a href="http://www.textpresso.org/">Textpresso</a> from WormBase.</small></p>
<p>The tools in this space are getting more usable and accurate. However, they still require an expert curator to look at the results. We saw in some talks that this approach may not perform as accurately as some machine learning algorithms. I’m interested to see if the research and development focus will shift away from text mining in the coming years.</p>
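<p>As a rough illustration of the dictionary-based first-pass idea (this is a generic sketch, not how PubTator or Factoid actually work - and the gene terms and identifiers below are invented for the example):</p>

```python
import re

# Toy "knowledge dictionary" mapping surface forms to curated identifiers.
# Both the terms and the IDs are invented for illustration.
GENE_DICT = {
    "unc-22": "GENE:0001",
    "daf-16": "GENE:0002",
    "let-7": "GENE:0003",
}

def first_pass(text):
    """Return (term, identifier, offset) for every dictionary hit in text."""
    hits = []
    for term, ident in GENE_DICT.items():
        # Boundary checks so 'let-7' does not fire inside 'let-70'.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for m in re.finditer(pattern, text):
            hits.append((term, ident, m.start()))
    return sorted(hits, key=lambda h: h[2])

for hit in first_pass("Mutations in unc-22 suppress daf-16 phenotypes."):
    print(hit)
```

<p>A real tool adds ontology-scale dictionaries, abbreviation handling and disambiguation on top of this, then hands the candidate mentions to an expert curator - which is exactly where the accuracy ceiling discussed above comes from.</p>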
<h1 id="machine-learning_1">Machine learning <a class="head_anchor" href="#machine-learning_1">#</a>
</h1>
<p>Just a few years ago, machine learning algorithms weren’t performing as well as text mining on biological data. However, with larger and larger datasets becoming normal, machine learning has begun to surpass text mining in accuracy in some literature-based curation tasks. <small>(Gobeill, <em>Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years</em> [<a href="http://etherpad.wikimedia.org/p/isb2014-functional">notes - line #582</a>])</small></p>
<p>More researchers are writing machine learning algorithms to extract information from their data. So far, these are generally ad-hoc and highly specialized algorithms, with some exceptions (GOCat). We are beginning to see some user-centred tools powered by machine learning algorithms, and I hope to see even more in the future.</p>
<ul>
<li>
<a href="http://eagl.unige.ch/GOCat/">GOCat</a>: Offers both dictionary based and machine learning models for extracting Gene Ontology terms from text. </li>
<li>
<a href="http://compsysbio.org/gist">GIST</a>: uses machine learning to provide improved annotations for species from sequencing reads [<a href="http://etherpad.wikimedia.org/p/isb2014-microbe">notes - line #130</a>]</li>
</ul>
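<p>To make the contrast with dictionary lookup concrete, here’s a minimal supervised text classifier - a multinomial naive Bayes with add-one smoothing, trained on invented toy data. This is a generic sketch of the technique, not GOCat’s or GIST’s actual algorithm:</p>

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Minimal multinomial naive Bayes text classifier, add-one smoothed."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-class token counts
        self.class_counts = Counter(labels)       # class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            toks = tokenize(doc)
            self.word_counts[label].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, doc):
        def score(label):
            # log P(label) + sum of log P(token | label)
            total = sum(self.word_counts[label].values())
            s = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            for tok in tokenize(doc):
                s += math.log((self.word_counts[label][tok] + 1)
                              / (total + len(self.vocab)))
            return s
        return max(self.class_counts, key=score)

# Invented toy training data: sentences labelled with a curation category.
docs = ["kinase phosphorylates substrate protein",
        "transcription factor binds promoter dna",
        "kinase activity regulates signaling",
        "dna binding transcription regulation"]
labels = ["kinase activity", "DNA binding",
          "kinase activity", "DNA binding"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("the kinase phosphorylates the protein"))
```

<p>The appeal for curation is that a classifier learns from the curated corpus itself instead of a hand-maintained dictionary - which is why, as the datasets grow, these models have started to win.</p>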
<h1 id="crowdsourcing_1">Crowdsourcing <a class="head_anchor" href="#crowdsourcing_1">#</a>
</h1>
<p>Crowdsourcing is the cool kid in this space. Science has a history of failing where Wikipedia and others have succeeded. But this meeting showed a couple promising approaches to crowdsourcing in biocuration.</p>
<p><a href="https://twitter.com/bgood">Ben Good</a> deservedly won the ‘Best Presentation’ award for his talk, <em>Microtask crowdsourcing for disease mention annotation in Pubmed abstracts</em>. An interesting application of <a href="https://www.mturk.com/mturk/">Amazon Mechanical Turk</a> to science. [<a href="http://www.slideshare.net/goodb/mturk-biocuration2014-pdf">slides</a>]</p>
<p><a href="http://genomearchitect.org/">Apollo</a>: less about information extraction from text, more about community genome annotation [<a href="http://www.slideshare.net/MonicaMunozTorres/threes-a-crowdsource-observations-on-collaborative-genome-annotation">slides</a>]</p>
<p>This approach is shiny and full of potential. I think many scientists in the audience were inspired by the idea of microtask crowdsourcing in particular. While I think the ‘microtask’ of interpreting biomedical literature is unusually difficult, there are huge possibilities in this space if the right tools and approaches are developed. </p>
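<p>The core mechanics of a microtask pipeline are easy to sketch: collect redundant judgments per item, accept the majority answer, and route low-consensus items to an expert curator. This is a generic illustration with made-up item IDs and thresholds, not the pipeline from the talk:</p>

```python
from collections import Counter

def aggregate(judgments, min_workers=3, threshold=0.5):
    """Majority-vote aggregation of redundant microtask judgments.

    judgments: {item_id: [worker_answer, ...]}
    Returns (accepted, needs_review): accepted holds items with enough
    agreement; the rest go back to an expert curator.
    """
    accepted, needs_review = {}, []
    for item, answers in judgments.items():
        if len(answers) < min_workers:
            needs_review.append(item)
            continue
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) > threshold:
            accepted[item] = answer
        else:
            needs_review.append(item)
    return accepted, needs_review

# Invented example: 5 workers mark whether an abstract mentions a disease.
votes = {
    "PMID:1": ["yes", "yes", "yes", "no", "yes"],
    "PMID:2": ["no", "yes", "no", "yes"],   # split vote: route to a curator
}
accepted, review = aggregate(votes)
print(accepted, review)
```

<p>The hard part, of course, isn’t the vote counting - it’s designing a microtask that untrained workers can actually answer reliably, which is why interpreting biomedical literature is such an unusually difficult fit.</p>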
<h1 id="conclusion_1">Conclusion <a class="head_anchor" href="#conclusion_1">#</a>
</h1>
<p>These are the unsolicited opinions of one web developer with a particular interest in WormBase on the state of biocuration today. There is a lot of innovation in this space - I’m excited to see what happens in the next few years. Don’t worry about being replaced, biocurators! Even with all the automation going on, everyone agrees that biocurators are needed more than ever.</p>
<blockquote class="twitter-tweet" lang="en">
<p><a href="https://twitter.com/search?q=%23ISB2014&src=hash">#ISB2014</a> L Stein says there are in fact, jobs for biocurators <a href="http://t.co/BGzpk21rbP">pic.twitter.com/BGzpk21rbP</a></p>— Melissa Haendel (@ontowonka) <a href="https://twitter.com/ontowonka/statuses/453886106438995968">April 9, 2014</a>
</blockquote>
<p><small>Biocuration jobs: (<strong>metadata</strong>) massage therapist, (<strong>data</strong>) wrangler, (<strong>complex data</strong>) modeler</small></p>
<h1 id="further-reading_1">Further reading <a class="head_anchor" href="#further-reading_1">#</a>
</h1>
<ul>
<li><a href="http://biocuration2014.events.oicr.on.ca/files/abstractbooklet.pdf">Abstract booklet - ISB2014</a></li>
<li><a href="http://wiki.wormbase.org/index.php/ISB2014_group_notes">Collection of speaker slides and notes</a></li>
<li><a href="http://etherpad.wikimedia.org/p/isb2014">Etherpad used at ISB2014</a></li>
<li><a href="https://twitter.com/search?q=isb2014&src=typd">#ISB2014 tweets</a></li>
<li><a href="http://f1000.com/posters/browse?conferenceId=259713564">ISB2014 posters available on F1000</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/wormbase-website-and-biocuration2014-04-04T13:51:38-07:002014-04-04T13:51:38-07:00WormBase Website and Biocuration<p>The <a href="http://biocuration2014.events.oicr.on.ca/biocuration">Seventh International Biocuration Conference (ISB2014)</a> begins tonight here in Toronto.</p>
<h4 id="correction-poster-44_4">Correction: Poster #44! <a class="head_anchor" href="#correction-poster-44_4">#</a>
</h4>
<p><a href="http://wiki.wormbase.org/images/Isb_poster.pdf">Poster: WormBase Website: Supporting the Biocuration Process</a></p>
<hr>
<h3 id="wormbase-website-supporting-the-biocuration-p_3">WormBase Website: Supporting the Biocuration Process <a class="head_anchor" href="#wormbase-website-supporting-the-biocuration-p_3">#</a>
</h3>
<p>Abigail Cabunoc, Todd W. Harris, Lincoln D. Stein</p>
<p>WormBase (<a href="http://www.wormbase.org/">http://www.wormbase.org/</a>) is a highly curated central data repository for Caenorhabditis biology. Our objective is to capture the wealth of experimental data available from C. elegans and related nematodes via published literature and personal communication, and present it to the research community in a way that facilitates new biological insights. Although the website is geared towards end users, we added several features to support the biocuration process.</p>
<p>Flexible views were a central design factor in the new website allowing users to customize the information presented to them. We extended this customizability to WormBase curators, with a “Curator only” view. This view allows our curators to view specific metadata related to the curation process and use tools for exploring the arcana of the underlying data model that are not available to the general public.</p>
<p>The ability for curators to add real-time annotations to the website was added as a response to the current lag between data curation, integration, database build and website display. Curators use this to create up-to-date summaries and descriptions for each species or data class, typically information not specifically tied to any release of the website. Such a system could also be used by end users to annotate current data in real-time.</p>
<p>The website was also redesigned to help encourage community annotations and engage the community in the curation process. Every page on the site has a tab prompting users to submit any content corrections or feedback they may have. Users also have the ability to create public comments directly on a report page in WormBase. </p>
<p>Aiding biocuration was one of the main goals of the website redesign. This has provided a space for real time updates, customized views of the data for curators and increased community engagement. While these features are currently available on the website, more work can be done to use them effectively and help bridge the gap between curators and the community.</p>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes2014-04-01T06:05:57-07:002014-04-01T06:05:57-07:00Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="panel-big-data-challenges-and-solutions-contr_1">Panel: Big Data Challenges and Solutions: Control Access to Individual Genomes <a class="head_anchor" href="#panel-big-data-challenges-and-solutions-contr_1">#</a>
</h1>
<p>Monday, March 24th, 2014 2:15pm - 4:00pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288348">http://ks.eventmobi.com/14f2/agenda/35704/288348</a></p>
<h4 id="panel-members_4">Panel members <a class="head_anchor" href="#panel-members_4">#</a>
</h4>
<ul>
<li>
<em>Moderator</em> - Doreen Ware (<strong>DW</strong>), Cold Spring Harbor Laboratory, USA </li>
<li>David Haussler (<strong>DH</strong>), University of California, Santa Cruz, USA </li>
<li>Laura Clarke (<strong>LC</strong>), European Bioinformatics Institute, UK </li>
<li>Jill P. Mesirov (<strong>JM</strong>), Broad Institute, USA </li>
<li>Andrew Carroll (<strong>AC</strong>), DNAnexus, USA </li>
<li>Lincoln D. Stein (<strong>LS</strong>), Ontario Institute for Cancer Research, Canada </li>
<li>Mark Gerstein (<strong>MG</strong>), Yale University, USA </li>
</ul>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p><strong>DW:</strong> Interaction between panel members/audience<br><br>
Started planning this meeting almost 1.5 years ago - we decided that controlled access would be a main talking point</p>
<h5 id="challenges-and-opportunities_5">Challenges and opportunities <a class="head_anchor" href="#challenges-and-opportunities_5">#</a>
</h5>
<ul>
<li>scale (volume)</li>
<li>variety - the heterogeneity of the data. Representation and analysis of this. How will we deal with metadata? how to integrate?</li>
<li>timeliness - velocity, getting data, operated on, updates</li>
<li>privacy, topic of this session - key point</li>
<li>usability, want this data to be useful - accept human input and support collaborations. Interpretation of the data.</li>
</ul>
<h5 id="personal-genomes_5">Personal Genomes <a class="head_anchor" href="#personal-genomes_5">#</a>
</h5>
<ul>
<li>1000 genomes</li>
<li>publishing their own genomes</li>
<li>personal genomes project (George church)</li>
<li>GigaDB - liver cancer patients</li>
<li>more examples show that having access to this data is not easy! Privacy and bio-ethics (see Nature’s “Privacy protections: The genome hacker”). Some of the privacy we think we have isn’t as private as we think: ‘anonymized’ sets can be re-identified by combining the data. How will we handle integration?</li>
</ul>
<h4 id="panel-introductions_4">Panel Introductions <a class="head_anchor" href="#panel-introductions_4">#</a>
</h4>
<p><strong>Mark Gerstein</strong> (<strong>MG</strong>) - Yale (bioinformatics)</p>
<ul>
<li>originally worked in model organisms</li>
<li>transitioned to human genomics - scale issues, but not really a privacy issue</li>
<li>disease genomics - privacy issues!</li>
</ul>
<p><strong>David Haussler</strong> (<strong>DH</strong>) - UCSC</p>
<ul>
<li>running into all kinds of data issues</li>
<li>go through long protocol to get to all data sets</li>
<li>cancer datasets - didn’t make it clear it was a childhood cancer study, was rejected</li>
<li>subtle consents get crazy</li>
</ul>
<p><strong>Laura Clarke</strong> (<strong>LC</strong>) 1000 genomes project</p>
<ul>
<li>managed access data on some projects - trying to make applications as lightweight as possible</li>
<li>new open/managed accessed project - not clear how to make useful</li>
</ul>
<p><strong>Lincoln Stein</strong> (<strong>LS</strong>) OICR</p>
<ul>
<li>works with ICGC DCC - make cancer genome datasets available as frictionless as possible</li>
<li>open and controlled tiers</li>
<li>main concern: maximize access to data and make it useful, but do not violate donors’ trust -
data donated under agreement that it be used for research and no other purposes (identification)</li>
</ul>
<p><strong>Jill Mesirov</strong> (<strong>JM</strong>) Broad</p>
<ul>
<li>most of the time collaborators worry about permissions etc.</li>
<li>there’s a tension </li>
<li>mostly clinical studies - patients want to do whatever they can to help us understand their diseases.
BUT: learned that consents that they sign aren’t necessarily consumable by the average person</li>
<li>many patients don’t understand - if I share my data it doesn’t just affect me, but my relatives and other people who share large pieces of their genome with me</li>
<li>other issue: ethical/legal.
A lot of the problems with disclosing the identity of patients’ data, clinical info and genetic info is that it can affect things like hiring, insurance, liability.
These risks need to be made clear to them</li>
</ul>
<p><strong>Andrew Carroll</strong> (<strong>AC</strong>) DNAnexus</p>
<ul>
<li>used 1000 genomes data</li>
<li>used CHARGE consortium data - under IRB restrictions.
only combine in appropriate way, keep data flowing consistently</li>
<li>used pharma company sequencing data - internal for R&amp;D</li>
</ul>
<p><strong>DW: Q: Are the current support systems right now sufficient?</strong></p>
<p><strong>LS:</strong> The issue a lot of researchers are encountering - like cell phone makers: every phone contains thousands of licensed technologies, and you need to negotiate with each maker of a hardware/software component. It’s beginning to get a lot like that in genomics - each dataset is consented under different rules. Cancer research, pediatric research, general research… you must observe the restrictions on each of the components. Makes it very difficult to combine two datasets. Even using controls - can’t use other sets as normal controls in a cancer study if they only consented for a diabetes (or other) study. Need uniform consent - stop focusing on the dataset and focus on the researcher: have an ethically approved researcher status that is re-certified every year</p>
<p><strong>JM:</strong> One of the things I observed at Broad - datasets will come in to Broad and will take on a life of its own. Shared in ways that are not appropriate for consent - through ignorance. Implications for the data not made clear. We put in place a training program around how you handle this kind of data, and to minimize the replication of this kind of data. Got authorisation - did not duplicate the data. Track better who is accessing what.</p>
<p><strong>LC:</strong> As we move towards centralised compute and moving analysis to compute - these sorts of challenges will be easier. One of the key points of making this data useful is better defined consent.</p>
<p><strong>MG:</strong> I second the points of LS and JM. Most inappropriate use of private data is accidental - ignorance. People do it because it’s easier - just copy the dataset, don’t go through protocols. Need to make good tools and infrastructure so there’s no incentive to do it wrong.</p>
<p><strong>DW: Moving forward (question to JM), do you think there’s a need for some sort of education on handling this data?</strong></p>
<p><strong>JM:</strong> Yes, especially trainees who are beginning their research career. Human subject certification test - goes on forever. These are the key important things you need to understand: These people are giving you a gift, something very personal about themselves for you to further your knowledge and help treat the disease. In turn you have to respect that, and here are some simple rules on how to do this.</p>
<p><strong>AC:</strong> Looking at this in a technical sense. Many people working on this in a flexible way, someone will make a mistake. We need to architect technical solutions that make it easier for the graduate student to not make a mistake.</p>
<p><strong>MG:</strong> Not only the grad students - also in clinical settings. Clinicians are sloppy about where they put their data and how they move it. Need to educate.</p>
<p><strong>LS:</strong> There’s a lot of debate in the USA on the legality on putting genomic data in clouds. Misguided debate - more secure than letting grad students play with it on laptops.</p>
<p><strong>DW: Do you think the compute clouds are secure enough to share among collaborators?</strong></p>
<p><strong>DH:</strong> Appropriate levels, the cloud vendors can be more secure than the NSA. It’s going to be so much more secure than at any medical institution. Need to work with the cloud vendors to come to terms with a compliance framework.<br>
Institutions may not want to change for historical reasons (e.g. consent forms specifying where data is stored). Why does banking accept cloud and not NIH?</p>
<p><strong>DW: Are the current restrictions on whole genomes too restrictive?</strong></p>
<p><strong>AC:</strong> depends on what you want to do, how ambitious you want to be. There’s an immense amount to discover. If it’s not acted on in an academic setting, pharma will go out and sequence their own pools and make their own discoveries. The value is there - if there isn’t a means to get at it they’ll find their own way.</p>
<p><strong>DH:</strong> There is a willingness to use controls and share in Pharma</p>
<p><strong>LC:</strong> Pharma doesn’t want to make this massive investment by themselves individually.</p>
<p><strong>LS:</strong> technology is enabling lots of things. Patients that have a serious disease are very willing to share their genomic data for the greater good if its handled appropriately. PMH (Princess Margaret Hospital, Toronto) study - has sociology group to get attitudes on genomic sequencing. When patients were asked: </p>
<ol>
<li>‘would you be willing to share your mutations with researchers?’: 100% positive response rate.<br>
</li>
<li>‘would you share your germline polymorphism around areas relevant to your cancer’: still positive responses<br>
</li>
<li>‘would you share incidental findings’: Complete drop off - almost nobody in the study wanted incidental findings disclosed.<br>
</li>
</ol>
<p>Need to rework the regulatory framework and the way consents are posed in order to address the real and perceived harms to patients/donors/family members</p>
<p><strong>JM:</strong> This is an area of intense activity. Regulations/consents/risks conveyed. It’s a tricky business.</p>
<p><strong>Q:</strong> (Ouellette) <strong>What if people were told that these germline/incidental findings would help others?</strong></p>
<p><strong>LS:</strong> the way you ask greatly affects response. Wording. We want to look directly at what the short term and potential long term harms are. Short term - non-paternity. There will be people trying to figure out if their friends/neighbours are in a cancer db. </p>
<p><strong>MG:</strong> I am a privacy advocate in this context. What is the harm that can happen? People don’t know exactly what the disclosure of their genetic info will affect. There will be a major harm to genomics and bioinformatics as a field if people commit stunts or a db gets hacked and privacy is broken. We have to think about how this reflects on the field. It’s potentially a bad thing - consent implies they’re really trusting us. You really have to understand the trust. If you breach the trust, everyone looks bad.</p>
<p><strong>LC:</strong> Mark (MG), what do you think would be appropriate consequences? How do we maintain society’s trust?</p>
<p><strong>MG:</strong> Concept of license (LS) - you are a responsible researcher, prove it and update it.</p>
<p><strong>JM:</strong> Read Yaniv Erlich’s paper. We need to understand what data we can and what data we shouldn’t share. Some data was on ancestry.com - the db had addresses (city and state locations). It wasn’t the case that he went to a repository of data that was just genomic data. He was very clever, used a lot of ancillary information around the particular genomes to get that information. Great paper, raises a lot of issues. We could be disclosing identity.</p>
<p><em>NB: Paper mentioned:</em><br><br>
Gymrek, Melissa, et al. “Identifying personal genomes by surname inference.” Science 339.6117 (2013): 321-324.<br><br>
<a href="http://data2discovery.org/dev/wp-content/uploads/2013/05/Gymrek-et-al.-2013-Genome-Hacking-Science-2013-Gymrek-321-4.pdf">http://data2discovery.org/dev/wp-content/uploads/2013/05/Gymrek-et-al.-2013-Genome-Hacking-Science-2013-Gymrek-321-4.pdf</a></p>
<p><strong>DH:</strong> We need to start thinking about privacy in terms of granular facts and how they are linked. Separate the idea of what information can be public from the associations between facts, which are private. Internet of things/facts - if you can link multiple facts to the same person it causes a violation of privacy. Sharing the linkage means previously anonymous data becomes controlled information. </p>
<ol>
<li>who can see it </li>
<li>for what purpose. </li>
</ol>
<p>Need to have remedies and ways of looking at controlling and approaching privacy.<br>
Linking too many facts about one person</p>
<p><strong>AC:</strong> Where this chain is broken - where someone is able to tie outside information to a piece of genomic sequence. It becomes easy to identify everyone related. For example, Bitcoin - if you can break some of these hashes, you can determine entire transactional history. Single break in the link will expose many people.</p>
<p><strong>LS:</strong> So we need to ban the genealogy databases (laughter). That will break all the links and allow any piece of information to be shared</p>
<p><strong>DH:</strong> If you break it up to enough pieces, each piece will be uninformative</p>
<p><strong>MG:</strong> There’s still the issue of outliers: you’re going to have outliers. Maximum income in a survey - you know who that is. Correlation - a lot of these factoids have subtle correlations, can do re-identification with a few bits of information and some simple correlation</p>
<p><strong>DH:</strong> I disagree. Suppose there’s a position on the genome where only one person has an A. Suppose I publicize that only one person has an A at this position.</p>
<p><strong>LS:</strong> But what’s the usefulness of this for research - one single position? Once we get to the usefulness part, we run privacy risk.</p>
<p><strong>DH:</strong> Yes, when we link things. dbSNP is fine, no one argues it ruins our privacy. There are stats</p>
<p><strong>LC:</strong> There’s a lot of info they won’t put in dbSNP - because it becomes identification. These are the pieces that are important for research</p>
<p><strong>DH:</strong> once we establish world of anonymous facts - can have private exchange of links</p>
<p><strong>LC:</strong> Are the barriers too high to establish that system?</p>
<p><strong>DH:</strong> it’s all out there with UUID, everywhere. Whole protocol is based on keeping secure private key chains</p>
<p><strong>Q:</strong> (Schatz) <strong>Do you think the perception in the popular press of accurate identification is a problem?</strong></p>
<p><strong>LS:</strong> The issue in the popular press is that the informative power was oversold during the Human Genome Project. Identifications aren’t happening at the rate people expect</p>
<p><strong>AC:</strong> Scope, sensitivity isn’t great. The problem will take care of itself </p>
<p><strong>DH:</strong> people tend to overestimate the impact in the 5-10 year range, underestimate in 20+ range</p>
<p><strong>LC:</strong> Global alliance, verification</p>
<p><strong>Q: How do we contain/quantify the privacy that is consented for? Can we come up with metrics that quantify 1) uniqueness 2) identifiability? Actuarial tables to find uniqueness?</strong></p>
<p><strong>DH:</strong> We need to come up with categories - this granularity, if it’s anonymous, is not identifiable in itself. Then the only thing that’s private is the linking of pieces. Don’t think it’s a matter of counting how many people have that type of value. We can make assessments where it’s granular enough. </p>
<p><strong>MG:</strong> Agree with DH, make a few observations: theory of information. Risk: relationship between the amount of info leaked and the amount of risk taken.<br>
When we talk about this information leakage, we’re talking about identifiability risk AND characterization risk. But people don’t consent to having all their proclivities/characteristics unearthed over time.</p>
<p><strong>Q: Danger of privacy - when the first db gets hacked. Are we selling these databases as being secure to the public? Change legislation - can’t be discriminated against. This will eventually be leaked - add more security or lessen consequence of being identified.</strong></p>
<p><strong>JM:</strong> This is critical - legislation. If I can’t get health insurance because of BRCA mutation, it’s important. Can’t get a job because genetics are known. Some initial legislation has been passed, but it’s up to us to lay out what this will look like. Serious risks to people in terms of daily life.</p>
<p><strong>AC:</strong> everyone agrees we need the strongest possible legislation protecting people. But even if it’s passed, it will not be sufficient - discrimination still happens.</p>
<p><strong>Q: Whole genome will be cheap and accessible and non-scientists will be able to get this done. In that world would you be able to get a hair from someone and collect the data yourself, you can circumvent all these security issues.</strong></p>
<p><strong>LS:</strong> Real and scary scenario - happening in law enforcement. Suspects are routinely genotyped without their knowledge.</p>
<p><strong>JM:</strong> You should all watch the movie GATTACA - logical and scary extension of all of this.<br><br>
<a href="http://www.imdb.com/title/tt0119177/">http://www.imdb.com/title/tt0119177/</a></p>
<p><strong>LS:</strong> The federal databases of genotype data are extremely well thought out. The number of SSRs is just enough to narrow down a suspect pool, but not enough to pick out a single person in the whole US. But the state databases are unregulated.</p>
<p><strong>MG:</strong> Privacy bias - genetics has a checkered history. Darwin, 1920, etc. Given that history it’s good to reflect on this future.</p>
<p><strong>Q: There are other communities that have faced this: Census department. They understand the benefit to provide data to researchers (summary statistics, de-identified sub-sets, experimenting with creating simulated sets where the probabilities of the data are mirrored and operated on). Can there be a parallel track where we start to experiment?</strong></p>
<p><strong>MG:</strong> Big Data is about data not simulations. Very doubtful if a simulation could recreate the linkage.</p>
<p><strong>DH:</strong> The linkage of all of our genomes is a product of our common heritage, as it gets dense we’ll reach a critical point where we can do a lot of inference. </p>
<p><strong>LC:</strong> If someone comes up with a Facebook for human genomes, way more useful </p>
<p><strong>LS:</strong> I once ran a thought experiment on my wife’s relatives (South Indians). Suppose you had a cell phone app where you could search for all your relatives within an nth-degree radius. “Oh yes, I’d love it! This would save so much time!” - Then I say, all you have to do is donate a bit of DNA - “sure!”</p>
<p><strong>JM:</strong> I look at my children and their friends - their notion of privacy is very different. Sharing their genomes would be a drop in the bucket.</p>
<p><strong>AC:</strong> Enough people would share so that if your genome data is linked they could identify you.</p>
<p><strong>MG:</strong> Then when you’re the only person left who cares about privacy, you’re identified as the one person who hasn’t shared.</p>
<p><strong>Q: We talked in circles around the legal issues - imagine the outcomes. Danger: if we don’t do this, we’ll end up in a dystopian situation where we can’t talk to each other.</strong></p>
<p><strong>Q: At a big data conference. Difficult to link these entities - that’s why we’re here, to make these links. How privacy affects the downstream. Should there be a consideration ‘upon your expiration your data is withdrawn’?</strong></p>
<p><strong>LC:</strong> I had 23andMe done. I discussed this with my parents, but not sisters.</p>
<p><strong>JM:</strong> Watson’s personal genome was published, but not his APOE status<br>
<a href="http://www.nature.com/ejhg/journal/v17/n2/full/ejhg2008198a.html">http://www.nature.com/ejhg/journal/v17/n2/full/ejhg2008198a.html</a></p>
<p><strong>MG:</strong> People were able to trivially determine his APOE status. </p>
<p><strong>JM:</strong> Concerning: people often don’t understand that a lot of these genetic variants are just a predisposition to a certain endpoint. The kind of education that is required is huge - helping people understand probabilistic risk.</p>
<p><strong>LS:</strong> Recently discussed Canada’s policy on withdrawal of genetic information. Proposed to allow 1st-degree relatives to withdraw data on the donor’s death. This was unworkable - what if siblings disagree?</p>
<p><strong>LC:</strong> We’re getting paternalistic - remember getting this data out and easy to use will be of such benefit to health and science. We shouldn’t put too many barriers up.</p>
<p><strong>DH:</strong> We owe it to our grandchildren to do our best to understand how genomes and disease are related</p>
<p><strong>MG:</strong> One thing that’s important - there’s a lot of countries that don’t care about privacy. Their legal system setup is not ready to worry about this. I can imagine a future where most of the genomics and discoveries are centred in places that don’t put up these barriers</p>
<p><strong>Q: We should license people to use big data analysis. Age of big data - privacy is an illusion. You can go to someone’s home and know everything about them. If someone wilfully wants to know you - it doesn’t cost that much.</strong></p>
<p><strong>LS:</strong> agree</p>
<p><strong>AC:</strong> it all comes down to how many people have access to data. We want to provide a technical solution robust enough to share with researchers and help cut some of that off.</p>
<p><strong>DW: What are some of the technical barriers in the next 5 years?</strong></p>
<p><strong>LS:</strong> enabling people to get into cloud (or whatever) and use it. Accessible to as many people as possible in a secure manner</p>
<p><strong>JM:</strong> How do I find out what data is out there that’s relevant to my particular project/study. Better metadata. If I could find the sets I need - I don’t mind going to whoever owns them and get permission. There’s a lot of data that’s acquired that people don’t know about and it’s not described. Description, registry and search - without command line.</p>
<p><strong>AC:</strong> Making sure everyone is doing what they know how to do best. Bioinformaticians aren’t tied down doing things outside their expertise, biologists have access, researchers have access</p>
<p><strong>MG:</strong> Having lots of worked out exchange standards for secondary analysis files. Want to share reads/BAMs, but secondary (summarized) data sets are very useful. Very little standardization now. </p>
<p><strong>DH:</strong> technology moving so fast - have to be nimble. Have flexible standards/evolving. Up to speed to transfer/process/exchange data. APIs are important. Metadata is important. Require goodwill, work together to create standards. e.g. W3C - internet standards. Not easy.</p>
<p><strong>LS:</strong> analytic pipelines are complicated and finicky. Small changes get dramatically different results. Projects like Galaxy and synapse - keep track of steps of a workflow. Track the output/input files - human and machine readable and reproducible.</p>
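<p>The workflow tracking LS describes (as in Galaxy or Synapse) can be sketched as an append-only provenance log: each step records the tool, its parameters, and content hashes of input and output files, in a form both humans and machines can read. The class and method names below are mine, not from either project:</p>

```python
import hashlib
import json

def _digest(data: bytes) -> str:
    """Short content hash used to fingerprint a file's bytes."""
    return hashlib.sha256(data).hexdigest()[:12]

class Provenance:
    """Append-only log of workflow steps, serializable to JSON."""

    def __init__(self):
        self.steps = []

    def record(self, tool, params, inputs, outputs):
        # inputs/outputs map file names to raw bytes; we store only hashes
        self.steps.append({
            "tool": tool,
            "params": params,
            "inputs": {name: _digest(data) for name, data in inputs.items()},
            "outputs": {name: _digest(data) for name, data in outputs.items()},
        })

    def to_json(self):
        return json.dumps(self.steps, indent=2, sort_keys=True)

# Toy pipeline step: trim trailing N bases from reads
prov = Provenance()
raw = b"ACGTACGTNNN"
trimmed = raw.rstrip(b"N")
prov.record("trim", {"trailing": "N"},
            {"reads.fq": raw}, {"trimmed.fq": trimmed})
```

<p>Re-running a step with the same tool, parameters and input hashes should reproduce the same output hashes - which is exactly the reproducibility check LS wants, given how sensitive pipelines are to small changes.</p>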
<p><strong>DW: Any other points? Any predictions for the next 5-10 years?</strong></p>
<p><strong>LS:</strong> In the next 10-15min, we’ll all enjoy a nice reception.</p>
<p><strong>MG:</strong> sports genomics and superstar genomics</p>
<p><strong>DH:</strong> I see turmoil and opportunity - research projects talking to each other at a large scale. Work with clinical world.</p>
<p><strong>JM:</strong> Great promise for translation. we’re doing better at identifying the genetic variants and signatures associated with disease. Beginning to make progress on mechanism. Treatment is a greater challenge - hopefully it will come. </p>
<p><strong>LS:</strong> The nature of the clinical trial is going to change - not just a single region/centre with 100 patients. Globally distributed clinical trials - networks of independent physicians. Patients with rare genetic variants enrolled. Precision genetics clinical trials.</p>
<p><strong>LC:</strong> Hope: we can start answering basic biological questions and providing clinical outcomes</p>
<p><strong>AC:</strong> Predict: tools will become more robust: Clinical applications - cancer will lead the way. Drug companies will combine genotype and phenotype data. The majority of sequencing will be cattle, plants ($2 a plant!)- humans are backwards.</p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-imagingparmacogenomics2014-04-01T06:05:41-07:002014-04-01T06:05:41-07:00Big Data in Biology: Imaging/Parmacogenomics<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="imagingparmacogenomics_1">Imaging/Pharmacogenomics <a class="head_anchor" href="#imagingparmacogenomics_1">#</a>
</h1>
<p>Tuesday, March 25th, 2014 1:00pm - 3:00pm<br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288362">http://ks.eventmobi.com/14f2/agenda/35704/288362</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Susan Sunkin</strong>, Allen Institute for Brain Science, USA<br><br>
<a href="#sunkin"><em>Allen Brain Atlas: An Integrated Neuroscience Resource</em></a> -<br>
[<a href="#sunkin-abstract">Abstract</a>]<br>
[<a href="#sunkin-qa">Q&A</a>]</p>
<p><strong>Jason R. Swedlow</strong>, University of Dundee, Scotland<br><br>
<a href="#swedlow"><em>The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences</em></a> -<br>
[<a href="#swedlow-abstract">Abstract</a>]<br>
[<a href="#swedlow-qa">Q&A</a>]</p>
<p><strong>Douglas P. W. Russell</strong>, University of Oxford, UK<br><br>
<a href="#russell"><em>Short Talk: Decentralizing Image Informatics</em></a> -<br>
[<a href="#russell-abstract">Abstract</a>]<br>
[<a href="#russell-qa">Q&A</a>]</p>
<p><strong>John Overington</strong>, European Molecular Biology Laboratory, UK<br><br>
<a href="#overington"><em>Spanning Molecular and Genomic Data in Drug Discovery</em></a> -<br>
[<a href="#overington-abstract">Abstract</a>]<br>
[<a href="#overington-qa">Q&A</a>]</p>
<hr>
<h2 id="a-namesunkinallen-brain-atlas-an-integrated-n_2">
<a name="sunkin">Allen Brain Atlas: An Integrated Neuroscience Resource</a> <a class="head_anchor" href="#a-namesunkinallen-brain-atlas-an-integrated-n_2">#</a>
</h2><h3 id="susan-sunkin-allen-institute-for-brain-scienc_3">Susan Sunkin, Allen Institute for Brain Science, USA <a class="head_anchor" href="#susan-sunkin-allen-institute-for-brain-scienc_3">#</a>
</h3><blockquote>
<h4 id="a-namesunkinabstractabstracta_4">
<a name="sunkin-abstract">Abstract</a> <a class="head_anchor" href="#a-namesunkinabstractabstracta_4">#</a>
</h4>
<p>The Allen Brain Atlas (<a href="http://www.brain-map.org">www.brain-map.org</a>) is a collection of open public resources (2 PB of raw data, >3,000,000 images) integrating high-resolution gene expression, structural connectivity, and neuroanatomical data with annotated brain structures, offering whole-brain and genome-wide coverage. The eight major resources currently available span across species (mouse, monkey and human) and development. In mouse, gene expression data covers the entire brain and spinal cord at multiple developmental time points through adult. Mouse data also includes brain-wide long-range axonal projections in the adult mouse as part of the Allen Mouse Brain Connectivity Atlas.</p>
<p>Complementing the mouse atlases, there are four human and non-human primate atlases. The Allen Human Brain Atlas, the NIH-funded BrainSpan Atlas of the Developing Human Brain, and the NIH Blueprint NHP Atlas contain genome-wide gene expression data (microarray and/or RNA sequencing) and high-resolution in situ hybridization (ISH) data for selected sets of genes and brain regions across human and non-human primate development and/or in adult. In addition, the Ben and Catherine Ivy Foundation-funded Ivy Glioblastoma Atlas Project contains gene expression data in human glioblastoma.</p>
<p>While the Allen Brain Atlas data portal serves as the entry point and enables searches across data sets, each atlas has its own web application and specialized search and visualization tools that maximize the scientific value of those data sets. Tools include gene searches; ISH image viewers and graphical displays; microarray and RNA sequencing data viewers; Brain Explorer® software for 3D navigation and visualization of gene expression, connectivity and anatomy; and an interactive reference atlas viewer. For the mouse, integrated search and visualization is through automated signal quantification and mapping to a common reference framework. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>10 years of work and contributions from >200 people.</p>
<h5 id="allen-institute-primarily-studying-mouse-amp_5">Allen Institute: primarily studying mouse & human <a class="head_anchor" href="#allen-institute-primarily-studying-mouse-amp_5">#</a>
</h5>
<ul>
<li>largest publicly available neuroscience resource</li>
<li>gene expression to connectivity, cell type and circuitry</li>
<li>RNA-Seq</li>
<li>generated in standardized manner then mapped to framework</li>
<li>generated 3PB of data</li>
<li>mouse brain atlas - mouse spinal cord, mouse developing, then human brain, human dev brain</li>
<li>all data accessed through data portal <a href="http://www.brain-map.org/">http://www.brain-map.org/</a>
</li>
</ul>
<h4 id="allen-mouse-brain-atlas_4">Allen mouse brain Atlas <a class="head_anchor" href="#allen-mouse-brain-atlas_4">#</a>
</h4>
<ul>
<li>genome wide cellular resolution atlas of gene expression in adult mouse brain - in situ hybridization</li>
<li>20K genes surveyed</li>
<li>informatics goals: aid search, navigation and visualization (make it easy to find what you’re looking for)</li>
</ul>
<p>Informatics pipeline, broken down into:</p>
<ul>
<li>preprocessing</li>
<li>detection</li>
<li>alignment: mapped to 3D space -> where expression occurs in the brain and how much</li>
<li>gridding</li>
<li>search</li>
<li>production - very product focused. Publicly available. Mine data and ask biological questions.
Ends with an expression data matrix</li>
</ul>
<p>Tools to harness data generated from the pipeline</p>
<ul>
<li>3d viewing tool to view neuro-anatomy and 3d gene expression for one or multiple experiments</li>
<li>gene expression summaries</li>
<li>synchronization feature- same location different experiments</li>
<li>image tool etv - higher resolution image viewer. interactive 3D representation. probe and gene data available. histogram of expression energy.
Nice snapshot of expression; users decide if they’ll do a deeper dive into the info</li>
<li>Reference atlas -
<ul>
<li>structure ontology</li>
<li>annotated reference atlas plates</li>
<li>can look at experimental image and look up regions</li>
</ul>
</li>
<li>grid data search - users can search over 25K datasets to find genes with specific expression pattern
<ul>
<li>
<em>differential search</em>: high expression in one set (target) compared to contrast</li>
<li>
<em>correlative search</em>: find genes with similar spatial expression profile</li>
</ul>
</li>
</ul>
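<p>The correlative search described above can be sketched as ranking genes by the correlation of their flattened expression-energy grids against a seed gene. The grids and gene names below are made up for illustration; the real atlas works on 3D grids at scale:</p>

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def correlative_search(seed, grids):
    """Rank genes by spatial similarity of their (flattened)
    expression-energy grids to the seed gene's grid."""
    return sorted(grids, key=lambda g: pearson(grids[seed], grids[g]),
                  reverse=True)

# Hypothetical 4-voxel expression-energy grids
grids = {
    "geneA": [9.0, 1.0, 8.0, 0.5],
    "geneB": [8.5, 0.8, 7.9, 0.4],   # similar spatial pattern to geneA
    "geneC": [0.3, 7.0, 0.2, 9.1],   # roughly inverted pattern
}
```

<p>A differential search is the complementary query: instead of correlating against a seed gene, score each gene by mean expression in a target structure versus a contrast structure.</p>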
<h4 id="developing-mouse-brain-atlas_4">Developing mouse brain atlas <a class="head_anchor" href="#developing-mouse-brain-atlas_4">#</a>
</h4>
<ul>
<li>build on allen mouse brain atlas</li>
<li>pick genes for neural development</li>
<li>use reference atlas</li>
<li>creation of 3D and 4D tools and data analysis</li>
<li>high quality specimens selected, stained, generate images, annotate regions, make 2d and 3d output (Adobe Illustrator)</li>
<li>Search and analysis tools - pick 2d images and get extrapolated 3d expression</li>
<li>Imaging synchronization feature - variety of transcription factor targets
<ul>
<li>select location as seed object</li>
<li>will synchronize all the images you are looking at to the same location</li>
</ul>
</li>
</ul>
<p>Allen mouse connectivity atlas</p>
<ul>
<li>high-res map of neural connections in the whole mouse brain.
Generate a comprehensive db of neural projections.
Generate 140 images per specimen at 100 micron intervals</li>
<li>after injection, one mouse brain is embedded and placed on the stage, two-photon images are taken, then the brain is moved over and a section slice taken off, then another image is taken.
Block-face imaging throughout the entire brain</li>
<li>looking at fluorescent projections</li>
<li>spatially map brain to 3D reference model</li>
<li>comprehensive coverage for projection mapping - wt mouse, but interested in cell type.
Projection profiling with Cre-driver mice</li>
<li>can look at trajectory and topography</li>
</ul>
<p>Other tools - brain-wide data - can pinpoint a region of interest and dive deeper</p>
<h4 id="allen-human-brain-atlas_4">Allen Human Brain Atlas <a class="head_anchor" href="#allen-human-brain-atlas_4">#</a>
</h4>
<ul>
<li>all genes - all structures. classical histology and neuroanatomy</li>
<li>cellular resolution data - scale. only looked at a subset of genes on a subset of structures (very question driven, autism, schizophrenia, etc)</li>
<li>not possible to process the whole genome on a whole brain. generate large slabs - create a jigsaw puzzle and assemble at the end</li>
<li>generate histology data, neuroanatomical regions of interest generated</li>
<li>LIMS system to assemble the puzzle</li>
<li>structural ontology - to generate summary stats</li>
<li>Search: search by gene or structure, neuroblast correlative search, differential search</li>
<li>3D brain explorer</li>
<li>Tissue acquisition processing. postmortem brains. no neuropsychiatric disorder</li>
<li>MR Registration volume renderings: rigid and non-rigid registering had to be done</li>
<li>tissue sampling: slabs partitioned, sectioned and map back in MR space</li>
<li>tissue block to MR Registration: place landmarks on scans matched with corresponding image in 3d space</li>
</ul>
<h4 id="developing-human-brain-project_4">Developing human Brain project <a class="head_anchor" href="#developing-human-brain-project_4">#</a>
</h4>
<p>four main components</p>
<ol>
<li>developmental transcriptome</li>
<li>prenatal microarray: hi res, 300 distinct structures</li>
<li>ISH: just a subset of regions/genes</li>
<li>reference atlases: few generated for this project (prenatal and adults), include histology and imaging data</li>
</ol>
<p>Prenatal - LMD Microarray Data</p>
<ul>
<li>fresh tissue frozen and slabbed</li>
<li>histology determines regions of interest</li>
<li>sent for hybridization to Agilent microarrays.
same as adult data for cross-comparison</li>
<li>display with online tool:
anatomical view and heat map view</li>
</ul>
<h4 id="a-namesunkinqaq-amp-aa_4">
<a name="sunkin-qa">Q & A</a> <a class="head_anchor" href="#a-namesunkinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>interested in how labour intensive the human tissue blocks were - were the markers placed by hand?</strong></p>
<p><strong>A:</strong> Not for every Z level of the MRI, but yes labour intensive. Many steps in order to use the automated pipeline.</p>
<p><strong>Q:</strong> (Schatz) <strong>at CSHL big study in exome sequencing - which of these genes are expressed in brain at various levels of development?</strong></p>
<p><strong>A:</strong> Use our API to pull out data from different datasets to produce that.</p>
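<p>The answer refers to the Allen Brain Atlas API at api.brain-map.org. As a hedged illustration, an RMA-style query URL for expression datasets for one gene might be built like this - the endpoint and criteria syntax here are my recollection, not from the talk, and should be checked against the current API documentation:</p>

```python
from urllib.parse import urlencode

# Assumed RMA endpoint for the Allen Brain Atlas API (verify against
# the api.brain-map.org docs before relying on it).
BASE = "http://api.brain-map.org/api/v2/data/query.json"

def gene_expression_query(gene_symbol, rows=25):
    """Build a query URL for section datasets of one gene (sketch)."""
    criteria = (
        "model::SectionDataSet,"
        f"rma::criteria,genes[acronym$eq'{gene_symbol}'],"
        "rma::include,genes,plane_of_section"
    )
    return BASE + "?" + urlencode(
        {"criteria": criteria, "num_rows": rows, "start_row": 0})

url = gene_expression_query("Pdyn")  # hypothetical example gene
```

<p>The idea of the Q&amp;A stands regardless of exact syntax: cross-dataset questions like "which exome-study genes are expressed in the developing brain" are answered by scripting against the API rather than the web viewers.</p>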
<p><strong>Q: Different imaging methods and approaches - what’s the Allen’s approach to presenting the information in some way that could be queried across different domains and at the cell level?</strong></p>
<p><strong>A:</strong> The level of registration is not down to cell - it’s domains.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameswedlowthe-open-microscopy-environment_2">
<a name="swedlow">The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences</a> <a class="head_anchor" href="#a-nameswedlowthe-open-microscopy-environment_2">#</a>
</h2><h3 id="jason-r-swedlow-university-of-dundee-scotland_3">Jason R. Swedlow, University of Dundee, Scotland <a class="head_anchor" href="#jason-r-swedlow-university-of-dundee-scotland_3">#</a>
</h3><blockquote>
<h4 id="a-nameswedlowabstractabstracta_4">
<a name="swedlow-abstract">Abstract</a> <a class="head_anchor" href="#a-nameswedlowabstractabstracta_4">#</a>
</h4>
<p>Despite significant advances in cell and tissue imaging instrumentation and analysis algorithms, major informatics challenges remain unsolved: file formats are proprietary, facilities to store, analyze and query numerical data or analysis results are not routinely available, integration of new algorithms into proprietary packages is difficult at best, and standards for sharing image data and results are lacking. We have developed an open-source software framework to address these limitations called the Open Microscopy Environment (<a href="http://openmicroscopy.org">http://openmicroscopy.org</a>). OME has three components—an open data model for biological imaging, standardised file formats and software libraries for data file conversion and software tools for image data management and analysis.</p>
<p>The OME Data Model (<a href="http://openmicroscopy.org/site/support/ome-model/">http://openmicroscopy.org/site/support/ome-model/</a>) provides a common specification for scientific image data and has recently been updated to more fully support fluorescence filter sets, the requirement for unique identifiers, screening experiments using multi-well plates.</p>
<p>The OME-TIFF file format (<a href="http://openmicroscopy.org/site/support/ome-model/ome-tiff">http://openmicroscopy.org/site/support/ome-model/ome-tiff</a>) and the Bio-Formats file format library (<a href="http://openmicroscopy.org/site/products/bio-formats">http://openmicroscopy.org/site/products/bio-formats</a>) provide an easy-to-use set of tools for converting data from proprietary file formats. These resources enable access to data by different processing and visualization applications, sharing of data between scientific collaborators and interoperability in third party tools like Fiji/ImageJ. </p>
<p>The Java-based OMERO platform (<a href="http://openmicroscopy.org/site/products/omero">http://openmicroscopy.org/site/products/omero</a>) includes server and client applications that combine an image metadata database, a binary image data repository and visualization and analysis by remote access. The current stable release of OMERO (OMERO-4.4; <a href="http://openmicroscopy.org/site/support/omero4/downloads">http://openmicroscopy.org/site/support/omero4/downloads</a>) includes a single mechanism for accessing image data of all types– regardless of original file format– via Java, C/C++ and Python and a variety of applications and environments (e.g., ImageJ, Matlab and CellProfiler). This version of OMERO includes a number of new functions, including SSL-based secure access, distributed compute facility, filesystem access for OMERO clients, and a scripting facility for image processing. An open script repository allows users to share scripts with one another. A permissions system controls access to data within OMERO and enables sharing of data with users in a specific group or even publishing of image data to the worldwide community. Several applications that use OMERO are now released by the OME Consortium, including a FLIM analysis module, an object tracking module, two image-based search applications, and an automatic image tagging application.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Representing a consortium of 10 different groups across the US, UK and Europe<br>
Outline: </p>
<ul>
<li>Problem, </li>
<li>2 possible solutions, </li>
<li>sharing and publishing data, </li>
<li>directions, </li>
<li>imaging community, </li>
<li>publishing large imaging datasets</li>
</ul>
<h4 id="problem_4">Problem <a class="head_anchor" href="#problem_4">#</a>
</h4>
<ul>
<li>image: cancer cell preparing to divide in mitosis.</li>
<li>In the early days, taking such an image was a big deal - huge improvement. detectors and computation power.</li>
<li>we take these images and work hard to get them on journal covers</li>
</ul>
<p>BUT - the most important thing to understand: </p>
<ul>
<li><strong>every one of these pixels is a quantitative measurement</strong></li>
<li>this is a temporally resolved measurement</li>
<li>easy to generate 50G of data in an afternoon.
biologists are enterprise data generators</li>
<li>trying to use these images as measurement.
this data should be a resource - collaboration, release the data to the community</li>
<li>the image problem is ubiquitous, electron microscopy, physiology, cells, in vivo, pathology, and more ->
all major enterprise data generators</li>
<li>the scientists that use these technologies are not data scientists.
they need these kinds of technologies and have ambition to make measurements at scale, but not tools</li>
</ul>
<h4 id="2-possible-solutions_4">2 Possible Solutions <a class="head_anchor" href="#2-possible-solutions_4">#</a>
</h4>
<ul>
<li>aspire to build solutions that address all these domains</li>
</ul>
<p>OME - towards image informatics</p>
<ul>
<li>do not create new imaging tools, visualization</li>
<li>all about interoperability:
<ul>
<li>some new imaging modality is developed and can be accessed by existing tools</li>
<li>new method for image analysis can be run on existing modalities</li>
<li>modalities are changing so quickly - standards are useless</li>
<li>no matter what’s coming off this imaging system, some tool will be able to interact</li>
</ul>
</li>
</ul>
<p>OME - founded over lunch with cell biologists</p>
<ul>
<li>well plates becoming popular</li>
<li>people making microscopes, chemical libraries and cell lines -> no one is doing anything about the data coming off</li>
<li>partner with other institutions - open source work (GPL license)</li>
<li>public road mapping, GitHub, continuous integration, Kanban</li>
<li>release:
<ul>
<li>specification for data - OME-TIFF - open image data file</li>
<li>bio-formats</li>
<li>Omero, image-data management platform</li>
</ul>
</li>
</ul>
<p>Open data formats: spend time worrying about the OME data model (XML-based specs for datatypes).<br>
Around the image acquisition event itself: model status of detector, lens, etc</p>
<p>Bio-Formats</p>
<ul>
<li>simple and tedious: reverse engineer proprietary formats, java lib, read each one convert to common model</li>
<li>doing this for 10 years</li>
<li>we get data from the community</li>
<li>best collection of imaging files in the world:
don’t have facilities to do anything other than hold this privately</li>
<li>installed at 65K sites worldwide</li>
<li>2 FTEs working on this project</li>
<li>standardize interface to all formats</li>
</ul>
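<p>The "standardized interface to all formats" idea can be sketched as a reader registry: each proprietary format gets a reader keyed on the file’s leading magic bytes, and one entry point dispatches to whichever reader recognizes the data. This is a toy sketch of the pattern with invented format names, not Bio-Formats’ actual code (which is a Java library):</p>

```python
class Image:
    """Minimal common model: dimensions plus free-form metadata."""
    def __init__(self, width, height, metadata):
        self.width, self.height, self.metadata = width, height, metadata

READERS = []

def reader(magic):
    """Decorator registering a reader keyed by leading magic bytes."""
    def wrap(fn):
        READERS.append((magic, fn))
        return fn
    return wrap

@reader(b"FAKE1")
def read_fake1(data):
    # A real reader would parse the proprietary layout here.
    return Image(64, 64, {"format": "fake1"})

@reader(b"FAKE2")
def read_fake2(data):
    return Image(128, 128, {"format": "fake2"})

def open_image(data: bytes) -> Image:
    """One entry point regardless of original file format."""
    for magic, fn in READERS:
        if data.startswith(magic):
            return fn(data)
    raise ValueError("unrecognized format")

img = open_image(b"FAKE2" + b"\x00" * 16)
```

<p>Adding support for a new vendor format then means adding one reader, and every downstream tool that talks to the common model gets it for free - which is the interoperability point of the talk.</p>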
<h6 id="omero_6">OMERO <a class="head_anchor" href="#omero_6">#</a>
</h6>
<ul>
<li>clients on top, servers on bottom</li>
<li>storage on images - relational for metadata: HDF5 based structure</li>
<li>text search</li>
<li>building Omero - solve a problem in a lab, an institute repository, a journal, a national repo</li>
<li>idea is that we have to support as many client architectures as we can.
Ice - middleware, used by Skype, great for large data graphs/binary data
<a href="http://www.zeroc.com/ice.html">http://www.zeroc.com/ice.html</a>
</li>
<li>rich java client- Omero insight
<ul>
<li>tree based files, thumbnail, region views</li>
<li>client-server architecture - 300G of data viewed across the wire</li>
<li>remote-access</li>
</ul>
</li>
<li>web based view (x-platform)</li>
<li>high content assays - modelled in data model</li>
<li>digital pathology - tile based viewer.
web based and java based on same api</li>
</ul>
<p>results</p>
<ol>
<li>treat result outputs as an annotation,</li>
<li>text based, indexed with Lucene</li>
<li>large tabular results - relational HDF5</li>
</ol>
<p><em>// accidentally closed my browser…//</em></p>
<h4 id="sharing-and-publishing-data_4">Sharing and Publishing data <a class="head_anchor" href="#sharing-and-publishing-data_4">#</a>
</h4>
<ul>
<li>sharing data: e.g. lab web page, few lines of js, embed viewer</li>
<li>institutional repo:
publish paper, release data based on Omero based system</li>
<li>public resources:
compiling dynamic data</li>
<li>PDB, EMDataBank - publishing with OMERO</li>
</ul>
<h4 id="directions_4">Directions <a class="head_anchor" href="#directions_4">#</a>
</h4>
<p>how do we build an application that can work in a rapidly changing field like imaging?</p>
<ul>
<li>leverage the OME model</li>
<li>meta-compute - </li>
</ul>
<p>example: using Galaxy, clinical data set - need a metadata management system</p>
<ul>
<li>uses Omero underneath to store metadata</li>
<li>problem: every time there’s a new gene release it needs to recalculate.
Changed the data model to handle metadata.
Also used Omero for histological images</li>
</ul>
<p>Uses of Omero</p>
<ul>
<li>Omero and ImageJ - plugins</li>
<li>MATLAB and Omero</li>
<li>Omero & u-track (custom object tracking software - MATLAB based)</li>
<li>Omero & FLIMfit - fluorescence lifetime</li>
<li>Omero.searcher</li>
<li>Omero & auto-tagging
<ul>
<li>user trying to access data - scan data and pick up tags</li>
<li>figure: when we submit figure to journals, wrestle with adobe illustrator</li>
<li>always remove from original data structure and create a jpeg - lose original context</li>
<li>js based viewer - to keep linkage between representation of data and data itself.
figure = js / not tiff</li>
</ul>
</li>
<li>Omero and bioformats
<ul>
<li>data import and access</li>
<li>digital pathology and high-content screenings</li>
<li>data will be written once (at multi TB scales) use Omero and pull image off directly - don’t copy data</li>
</ul>
</li>
</ul>
<h4 id="imaging-community_4">Imaging Community <a class="head_anchor" href="#imaging-community_4">#</a>
</h4>
<ul>
<li>Annual user meetings</li>
<li>active community of open source projects</li>
<li>working towards progress</li>
</ul>
<h4 id="publishing-large-imaging-datasets_4">Publishing Large Imaging Datasets <a class="head_anchor" href="#publishing-large-imaging-datasets_4">#</a>
</h4>
<p>publishing image data: PerkinElmer’s Columbus - OMERO in a box</p>
<p>Journal of Cell Biology - built the JCB viewer - JavaScript-based</p>
<ul>
<li>large image data</li>
<li>digital pathology to scale</li>
</ul>
<p>phenotypic screening - high-content screens</p>
<ul>
<li>many TB of data</li>
<li>published data, author calls, genomic information</li>
<li>authors listed free-text descriptions of the phenotypes they saw</li>
<li>cell phenotype database @ EBI
<ul>
<li>combines all published high-content screens</li>
<li>takes the manual author annotations</li>
<li>creates an ontology: a common way to annotate this data</li>
</ul>
</li>
</ul>
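<p>The ontology step above - collapsing authors’ free-text phenotype descriptions onto shared term ids - can be sketched with a simple synonym table. The term ids and phrases below are illustrative, not the EBI database’s actual vocabulary:</p>

```python
# Toy synonym table: free-text phenotype -> ontology term id (made-up ids)
SYNONYMS = {
    "round cells": "CMPO:0000001",
    "rounded cells": "CMPO:0000001",
    "cell rounding": "CMPO:0000001",
    "elongated cells": "CMPO:0000002",
}

def normalize(phrase):
    """Lowercase and collapse whitespace so spelling variants match."""
    return " ".join(phrase.lower().split())

def map_annotations(free_text_terms):
    """Map authors' free-text phenotypes onto shared ontology ids;
    report the terms that need manual curation."""
    mapped, unmapped = {}, []
    for term in free_text_terms:
        key = normalize(term)
        if key in SYNONYMS:
            mapped[term] = SYNONYMS[key]
        else:
            unmapped.append(term)
    return mapped, unmapped
```

<p>Anything landing in <code>unmapped</code> is exactly the kind of annotation that drives new ontology terms.</p>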
<p>More datatypes, more storage, more analysis</p>
<h4 id="a-nameswedlowqaq-amp-aa_4">
<a name="swedlow-qa">Q & A</a> <a class="head_anchor" href="#a-nameswedlowqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>A number of the image formats are copyrighted, etc. What is your experience as you reverse engineer these formats? Legal problems?</strong></p>
<p><strong>A:</strong> Almost every commercial vendor, when they build a new imaging system, builds a new image format. That’s just changing now. In general, if you look at the end-user license, it will forbid you from reverse engineering. It does not forbid you uploading it to us so we can reverse engineer it. That’s what we do. In the last few years, vendors have been coming to us: please make sure this file format is supported on the date we release it. Sometimes they take our metadata specs and drop them into theirs. A lot is opening up and people are more willing to work with us.</p>
<p><strong>Q: From a CS lab that does open source dev: you said you release everything GPL. We release everything Apache - a lot of people in industry like it better. Why choose GPL? Feedback?</strong></p>
<p><strong>A:</strong> Short version: when we started, there wasn’t the richness in licenses. To be blunt, we want people to contribute. As the guy who has to pay an enormous number of salaries: we’re fine when a company wants to use our software, but we need some way to keep the project going and feed everyone. We get a licensing fee from PerkinElmer (closed source) to help development.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namerussellshort-talk-decentralizing-image_2">
<a name="russell">Short Talk: Decentralizing Image Informatics</a> <a class="head_anchor" href="#a-namerussellshort-talk-decentralizing-image_2">#</a>
</h2><h3 id="douglas-pw-russell-university-of-oxford-uk_3">Douglas P.W. Russell, University of Oxford, UK <a class="head_anchor" href="#douglas-pw-russell-university-of-oxford-uk_3">#</a>
</h3>
<p>Department of Biochemistry<br>
member of the Open Microscopy consortium</p>
<blockquote>
<h4 id="a-namerussellabstractabstracta_4">
<a name="russell-abstract">Abstract</a> <a class="head_anchor" href="#a-namerussellabstractabstracta_4">#</a>
</h4>
<p>The Open Microscopy Environment (OME; <a href="http://openmicroscopy.org">http://openmicroscopy.org</a>) builds software tools that facilitate image informatics. An open file format (OME-TIFF) and software library (Bio-Formats) enable the free access to multidimensional (5D+) image data regardless of software or platform. A data management server (OMERO) provides an image data management solution for labs and institutes by centralizing the storage of image data and providing the biologist a means to manage that data remotely through a multi-platform API. This is made possible by the Bio-Formats library, extracting image metadata into a PostgreSQL database for fast lookup, and multi-zoom image previews enable visual inspection without the cost of transmitting the actual raw data to the user. In addition to the convenience for individual biologists, sharing data with collaborators becomes simpler and avoids data duplication.</p>
<p>Addressing the next scale of data challenges, e.g. at the national or international level, has brought the OME platform up against some hard barriers. Already, the data output of individual imaging systems has grown to the multi-TB level. Integrating multi-TB datasets from dispersed locations, and integrating analysis workflows will soon challenge the basic assumptions that underly a system like OMERO. This is particularly true for automated processing: OMERO.scripts provides a facility for running executables in the locality of the data. The use of ZeroC’s IceGrid permits farming out such tasks in Python, C++, Java, and in OMERO5 even ImageJ2 tasks to nodes which all use the same remote API. However, OMERO does not yet provide a solution for decentralised data and workflow management. </p>
<p>A logical next step for OMERO is to decentralize the data by increasing the proximity of data storage to processing resources, reducing bottlenecks through redundancy, and enabling vast data storage on commodity hardware rather than expensive, enterprise storage.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="how-omero-can-scale-with-big-data-higher-dema_5">How OMERO can scale with big data, higher demand <a class="head_anchor" href="#how-omero-can-scale-with-big-data-higher-dema_5">#</a>
</h5>
<p>1) as scope and # of users increase, total data increases</p>
<ul>
<li>one end: 1 user or a small group of users</li>
<li>a user with a minimal amount of sysadmin experience can install it and get it working</li>
<li>other end: national resources, institutes: need a serious sysadmin team</li>
<li>tradeoffs: </li>
</ul>
<p>2) Dataset size: high-content screen</p>
<ul>
<li>many images in each well, many dimensions</li>
<li>phenotypic data attached to each well</li>
<li>links to external genomic resources</li>
<li>all of this is a huge amount of data. One screen can be TBs in size</li>
</ul>
<p>Once data is in OMERO, it’s an excellent data management tool</p>
<ul>
<li>until you get it in there - need to make choices about how to put it in</li>
<li>smaller scale: import data and archive the original image; extract metadata for search</li>
<li>when analysis needs pixel data - extract at runtime</li>
<li>in reality, users need access to the filesystem where the raw data is; moving data around is infeasible.
Now: extract metadata and keep a reference to where the raw file is.
Helps with the data duplication problem</li>
<li>preferable to store data in a read-optimized format.
Trade some operational efficiency for some possible data loss</li>
</ul>
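<p>The “extract metadata, keep a reference to the raw file” pattern can be sketched with an in-memory SQLite table standing in for OMERO’s metadata store (the schema and paths are invented for illustration - OMERO actually uses PostgreSQL and a much richer model):</p>

```python
import sqlite3

# In-memory stand-in for a metadata store: searchable metadata lives in the
# database; only a *reference* to the raw file on the filesystem is kept.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    name TEXT, size_x INTEGER, size_y INTEGER, channels INTEGER,
    raw_path TEXT)""")

def register_image(name, size_x, size_y, channels, raw_path):
    """Record metadata and the raw file's location - without copying pixels."""
    cur = conn.execute(
        "INSERT INTO image (name, size_x, size_y, channels, raw_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, size_x, size_y, channels, raw_path))
    return cur.lastrowid

def find_images(min_channels):
    """Metadata query; pixel data is only touched later, via raw_path."""
    rows = conn.execute(
        "SELECT name, raw_path FROM image WHERE channels >= ?",
        (min_channels,))
    return rows.fetchall()

register_image("plate1_A01", 2048, 2048, 4, "/data/screens/plate1/A01.ome.tiff")
```

<p>Search and browsing run entirely against the metadata; analysis code follows <code>raw_path</code> to the original file at runtime, avoiding duplication.</p>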
<h5 id="omero-services_5">OMERO services <a class="head_anchor" href="#omero-services_5">#</a>
</h5>
<ul>
<li>all run on Ice</li>
<li><a href="http://www.zeroc.com/ice.html">http://www.zeroc.com/ice.html</a></li>
<li>process, indexer, and more - all on Ice</li>
</ul>
<p>Ice gives us the capability to distribute some services to other hosts </p>
<ul>
<li>pretty seamless - can take advantage of local compute</li>
<li>can do this multiple times to access more compute resources</li>
<li>but then each has to communicate back to the original</li>
<li>=> decentralizing OMERO</li>
</ul>
<p>Decentralized</p>
<ul>
<li>access data directly - both servers can access resources (filesystem) directly</li>
<li>once we have that, we can scale - more servers</li>
<li>this has the potential to address image management at scale</li>
<li>can deploy many OMERO components on many hosts - make it more powerful, absorb volumes of data</li>
<li>can take advantage of cloud computing - can scale permanently or temporarily - spin up more hosts</li>
<li>will be necessary to augment OMERO’s resources with distributed filesystems - to store huge amounts of pixel or image data</li>
<li>can also make use of Cassandra clusters - caching frequently accessed data.
much bigger scale</li>
</ul>
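<p>A core question in the decentralized picture above is which server owns which image. One toy way to make that decision deterministic is hash-based placement - this is an illustrative sketch, not OMERO’s actual mechanism, and the host names are invented:</p>

```python
import hashlib

HOSTS = ["omero-a", "omero-b", "omero-c"]   # hypothetical server names

def owner(image_id, hosts=HOSTS):
    """Deterministically assign an image to a host by hashing its id.
    Every server computes the same answer with no central lookup."""
    digest = hashlib.sha256(str(image_id).encode()).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]

# Any node can route a request for image 42 to the right host:
target = owner(42)
```

<p>Real systems use consistent hashing (so adding a host moves only a fraction of the data) plus replication for redundancy, but the routing idea is the same.</p>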
<p>That’s how we’d like to cope with big data in OMERO while keeping it accessible for a single user who wants to install it locally</p>
<p><a href="https://github.com/openmicroscopy">github.com/openmicroscopy</a></p>
<h4 id="a-namerussellqaq-amp-aa_4">
<a name="russell-qa">Q & A</a> <a class="head_anchor" href="#a-namerussellqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>are you considering map-reduce or just storage?</strong></p>
<p><strong>A:</strong> we could definitely use them both, yes</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameoveringtonspanning-molecular-and-genomi_2">
<a name="overington">Spanning Molecular and Genomic Data in Drug Discovery</a> <a class="head_anchor" href="#a-nameoveringtonspanning-molecular-and-genomi_2">#</a>
</h2><h3 id="john-overington-european-molecular-biology-la_3">John Overington, European Molecular Biology Laboratory, UK <a class="head_anchor" href="#john-overington-european-molecular-biology-la_3">#</a>
</h3><blockquote>
<h4 id="a-nameoveringtonabstractabstracta_4">
<a name="overington-abstract">Abstract</a> <a class="head_anchor" href="#a-nameoveringtonabstractabstracta_4">#</a>
</h4>
<p>The link between biological and chemical worlds is of critical importance in many fields, not least that of healthcare and chemical safety assessment. A major focus in the integrative understanding of biology are genes/proteins and the networks and pathways describing their interactions and functions; similarly, within chemistry there is much interest in efficiently identifying drug-like, cell-penetrant compounds that specifically interact with and modulate these targets. The number of genes of interest is of the range of 105 to 106, which is modest with respect to plausible drug-like chemical space - 1020 to 1060. We have built a public database linking chemical structures (106) to molecular targets (104), covering molecular interactions and pharmacological activities and Absorption, Distribution, Metabolism and Excretion (ADME) properties (<a href="http://www.ebi.ac.uk/chembl">http://www.ebi.ac.uk/chembl</a>) in an attempt to map the general features of molecular properties and features important for both small molecule and protein targets in drug discovery. We have then used this empirical kernel of data to extend analysis across the human genome, and to large virtual databases of compound structures - we have also integrated these data with genomics datasets, such as the GWAS catalogue.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Chemistry. Mapping of Chemistry - interface of chemistry with genomic and drug discovery data.</p>
<h5 id="background_5">Background <a class="head_anchor" href="#background_5">#</a>
</h5>
<p>chemical space: how big is chemical space? GDB-13 - all possible (stable) molecules with up to 13 heavy atoms</p>
<ul>
<li>1B structures</li>
<li>the largest database of small organic molecules</li>
<li>GDB-17 - 166B structures - not available. Intellectual property issues</li>
</ul>
<p>not all molecules can be drugs - they need to be bioactive</p>
<ul>
<li>physical properties determine access to the ‘target’</li>
<li>ADMET - absorption, distribution, metabolism, excretion & toxicity</li>
</ul>
<p>Lipinski - a molecule within these parameters was likely to have good oral drug properties. <a href="http://en.wikipedia.org/wiki/Lipinski's_rule_of_five">http://en.wikipedia.org/wiki/Lipinski’s_rule_of_five</a></p>
<ul>
<li>different for topical and parenterally dosed drugs</li>
<li>pretty good guide</li>
</ul>
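<p>The rule of five itself is simple enough to state as code - a compound is flagged as likely orally active if it violates no more than one of the four criteria (a minimal sketch; real pipelines compute these descriptors from structures with a cheminformatics toolkit):</p>

```python
def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    """Rule of five: likely good oral drug properties if at most
    one of the criteria is violated."""
    violations = sum([
        mol_weight > 500,    # molecular weight <= 500 Da
        logp > 5,            # octanol-water partition coefficient <= 5
        h_donors > 5,        # <= 5 hydrogen bond donors
        h_acceptors > 10,    # <= 10 hydrogen bond acceptors
    ])
    return violations <= 1

# Aspirin-like descriptor values: MW ~180, logP ~1.2, 1 donor, 4 acceptors
ok = passes_lipinski(180.2, 1.2, 1, 4)
```

<p>As the notes say, the rule is a pretty good guide for oral drugs but doesn’t apply to topically or parenterally dosed compounds.</p>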
<p>10<sup>19</sup>-10<sup>23</sup> Lipinski-like small molecules - potential drugs</p>
<p>around 21-23: peak in the curve of heavy-atom counts for drugs.<br>
drug discovery - making molecules slightly larger than they need to be</p>
<p>GDB covers ~30% of all known drugs?</p>
<p>Targets: Homo sapiens, ~21K genes.<br>
Only ~1% of the genome is a drug target we’ve been able to develop drugs against.<br>
we’ve tried many, many more</p>
<h5 id="chemogenomics-chemistry-genome-derived-object_5">Chemogenomics = chemistry + genome derived objects <a class="head_anchor" href="#chemogenomics-chemistry-genome-derived-object_5">#</a>
</h5>
<ul>
<li>exploration of small-molecule bioactivity space at genomic scale</li>
<li>possible space: 10<sup>6</sup> (targets), drug target proteins 10<sup>2</sup>
</li>
<li>drugs: all reasonable 10<sup>22</sup>, screened: 10<sup>7</sup>
</li>
<li>similar compound structures have similar functions</li>
</ul>
<h5 id="chembl-training-set-largest-db-of-medicinal-c_5">ChEMBL - training set; largest db of medicinal chemistry data 1.4M compounds <a class="head_anchor" href="#chembl-training-set-largest-db-of-medicinal-c_5">#</a>
</h5>
<ul>
<li>adding plant data later this year</li>
<li>open</li>
<li>download/access - db dumps, semantic web rdf - SPARQL, virtualization (ChEMBL appliances)</li>
<li>ChEMpi - raspberry pi</li>
<li>data comes from the literature - extract structures from the text, link to assays, link to sequences, store functional data.
allows chaining targets to phenotypic effects</li>
<li>quantitative data</li>
<li>target types: single gene - all the way to - organisms</li>
<li>compound searching - matching structure space (2D blast)</li>
</ul>
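<p>The structure-space matching above ("2D blast") rests on the idea that similar compound structures have similar functions. A standard way to score that is the Tanimoto coefficient on structural fingerprints - sketched here with toy fingerprints as sets of "on" bit positions (real pipelines derive the bits from the molecule’s substructures):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprint bit sets:
    |A & B| / |A | B|; 1.0 means identical fingerprints."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy structural fingerprints (bit positions are made up)
query = {3, 17, 42, 99, 150}
hit   = {3, 17, 42, 99}
decoy = {5, 200}

similarity = tanimoto(query, hit)    # 4 shared bits / 5 total = 0.8
```

<p>A similarity search then ranks the database by this score against the query structure, much as BLAST ranks sequences.</p>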
<p>different drug structures - ligand efficiency</p>
<ul>
<li>drugs are efficient, every atom counts - avoid lipophilicity</li>
<li>interested in balance between binding efficiency and molecular size</li>
<li>target class data</li>
</ul>
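<p>The binding-efficiency-vs-size balance has a standard metric: ligand efficiency, the free energy of binding per heavy atom. A minimal sketch using the common approximation ΔG ≈ -2.303·R·T·pIC50 (so 2.303·R·T ≈ 1.37 kcal/mol at 300 K; example values are invented):</p>

```python
def ligand_efficiency(pIC50, heavy_atoms, temp_k=300.0):
    """LE = -dG / N_heavy, with dG approximated as -2.303*R*T*pIC50,
    in kcal/mol per heavy atom."""
    R = 1.987e-3  # gas constant, kcal/(mol*K)
    return 2.303 * R * temp_k * pIC50 / heavy_atoms

# A 10 nM compound (pIC50 = 8) with 25 heavy atoms:
le = ligand_efficiency(8.0, 25)   # ~0.44 kcal/mol per heavy atom
```

<p>Since every atom counts, two compounds with the same potency can have very different LE - the smaller one is usually the better starting point.</p>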
<p>assay organism data</p>
<ul>
<li>differences between animal model and the effects of compounds in humans</li>
<li>failure in pre-clinical - works in animal models, but not in humans</li>
<li>trying to understand systematic reasons</li>
</ul>
<p>SureChEMBL - acquired SureChem</p>
<ul>
<li>new public chemistry</li>
<li>extends coverage of chemical structures from full-text patents - 15M structures</li>
<li>adds target, sequences, disease, animal model, cell-line</li>
</ul>
<p>Compound Integration</p>
<ul>
<li>ChEMBL - literature</li>
<li>SureChEMBL- patent</li>
</ul>
<p>Different Types of Drugs</p>
<ul>
<li>2/3 drugs are small molecules</li>
<li>in late stage development - majority are small molecules</li>
<li>Therefore, focus on small molecules for drug discovery</li>
</ul>
<h5 id="visualizations_5">Visualizations <a class="head_anchor" href="#visualizations_5">#</a>
</h5>
<ul>
<li>Polypharmacology via binding sites:
majority of pharmacological activity focused on brain</li>
<li>Affinity of drugs for ‘Targets’:
drugs are weaker than we think - penalty for tight binding drugs</li>
<li>Clinical Candidates:
coverage of clinical development candidates - </li>
<li>Selectivity - circos plot:
map promiscuity across tree</li>
</ul>
<h5 id="pharma-productivity-problem_5">Pharma Productivity problem <a class="head_anchor" href="#pharma-productivity-problem_5">#</a>
</h5>
<ul>
<li>biotech boom</li>
<li>productivity has fallen off a cliff</li>
</ul>
<p>how many compounds does a company need to make before they develop a drug?</p>
<ul>
<li>100K compounds synthesized to develop a drug</li>
<li>now 32x that to get a potential drug</li>
<li>Now: pharma needs on average to synthesize and test 250K compounds for each launched drug.
not sustainable</li>
<li>Trying to be smarter, use db, to help with this</li>
</ul>
<h5 id="cancer-drugs-and-targets_5">Cancer Drugs and Targets <a class="head_anchor" href="#cancer-drugs-and-targets_5">#</a>
</h5>
<ul>
<li>taking ChEMBL and thinking of drug discovery in a cancer setting</li>
<li>huge investment in genomic studies looking for genomic variation - causes of cancer.
sequencing, find driver genes, look at other datasets, find overlaps</li>
</ul>
<p>come out with a list of potential targets</p>
<ul>
<li>how do you select from these?</li>
<li>we can compare against things we had in the past</li>
<li>majority of the success from the past we would not have discovered using genomic sequencing techniques</li>
<li>canSAR - large-scale integration of public and proprietary data built on top of ChEMBL - select compounds likely to be good <a href="https://cansar.icr.ac.uk/">https://cansar.icr.ac.uk/</a>
</li>
</ul>
<h4 id="a-nameoveringtonqaq-amp-aa_4">
<a name="overington-qa">Q & A</a> <a class="head_anchor" href="#a-nameoveringtonqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ouellette) <strong>finding out the chemical structures of various organisms; What about Micro-biome space?</strong></p>
<p><strong>A:</strong> Different animals have different physical space for drugs they like. Controversy in the literature - physical space for antibiotics. Microbiome - fascinating - orally, also bacteria and guts. Effect of gut bacteria on the compound - sometimes needed to activate the substance</p>
<p><strong>Q:</strong> (Stein) <strong>Curious about the 1B+ compounds in GDB-17. Can’t release because of IP? Algorithm or structures?</strong></p>
<p><strong>A:</strong> Just too big. Drug discovery community -<br>
Publishing the structures of all possible drugs => you can’t patent them - so it would destroy all possible intellectual property.</p>
<p><strong>Q: For compounds w/ rich sequence information (transcriptome wide/proteomic) is it integrated?</strong></p>
<p><strong>A:</strong> yes and no; transcript microarray data goes into GEO or ArrayExpress. Links to compounds in ChEMBL. In reality - very small numbers right now. ChEMBL is part of a suite of resources at EBI, linked to other resources.</p>
<p><strong>Q: Is there a way through ChEMBL to discover drugs that are potentially synergistic? Drugs with same structures and hit same targets. Connectivity map? X-ref between ChEMBL and connectivity map?</strong></p>
<p><strong>A:</strong> One of the most common uses of ChEMBL - combining drugs against the same targets. No links to the connectivity map, but people have done that.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-personal-genomes2014-04-01T06:05:22-07:002014-04-01T06:05:22-07:00Big Data in Biology: Personal Genomes<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken english</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="personal-genomes_1">Personal Genomes <a class="head_anchor" href="#personal-genomes_1">#</a>
</h1>
<p>Tuesday, March 25th, 2014 8:30am - 12:00pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288359">http://ks.eventmobi.com/14f2/agenda/35704/288359</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Lincoln D. Stein</strong>, Ontario Institute for Cancer Research, Canada<br><br>
<a href="#stein"><em>The International Cancer Genome Consortium Database</em></a> -<br>
[<a href="#stein-abstract">Abstract</a>]<br>
[<a href="#stein-qa">Q&A</a>]</p>
<p><strong>Ajay Royyuru</strong>, IBM T.J. Watson Research Center, USA<br><br>
<a href="#royyuru"><em>Genome Analytics with IBM Watson</em></a> -<br>
[<a href="#royyuru-abstract">Abstract</a>]<br>
[<a href="#royyuru-qa">Q&A</a>]</p>
<p><strong>Mark Gerstein</strong>, Yale University, USA<br><br>
<a href="#gerstein"><em>Human Genome Analysis</em></a> -<br>
[<a href="#gerstein-abstract">Abstract</a>]<br>
[<a href="#gerstein-qa">Q&A</a>]<br>
[<a href="http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/">slides</a>]</p>
<p><strong>Stuart Young</strong>, Annai Systems Inc., USA<br><br>
<a href="#young"><em>The BioCompute Farm: Colocated Compute for Cancer Genomics</em></a> -<br>
[<a href="#young-abstract">Abstract</a>]<br>
[<a href="#young-qa">Q&A</a>]</p>
<p><strong>Adam Butler</strong>, Wellcome Trust Sanger Institute, UK<br><br>
<a href="#butler"><em>Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets</em></a> -<br>
[<a href="#butler-abstract">Abstract</a>]<br>
[<a href="#butler-qa">Q&A</a>]</p>
<p><strong>Maya M. Kasowski</strong>, Yale University, USA<br><br>
<a href="#kasowski"><em>Short Talk: Extensive Variation in Chromatin States Across Humans</em></a> -<br>
[<a href="#kasowski-abstract">Abstract</a>]<br>
[<a href="#kasowski-qa">Q&A</a>]</p>
<p><strong>Robert L. Grossman</strong>, University of Chicago, USA<br><br>
<a href="#grossman"><em>Short Talk: An Overview of the Bionimbus Protected Data Cloud</em></a> -<br>
[<a href="#grossman-abstract">Abstract</a>]<br>
[<a href="#grossman-qa">Q&A</a>]</p>
<hr>
<h2 id="a-namesteinthe-international-cancer-genome-co_2">
<a name="stein">The International Cancer Genome Consortium Database</a> <a class="head_anchor" href="#a-namesteinthe-international-cancer-genome-co_2">#</a>
</h2><h3 id="lincoln-d-stein-ontario-institute-for-cancer_3">Lincoln D. Stein, Ontario Institute for Cancer Research, Canada <a class="head_anchor" href="#lincoln-d-stein-ontario-institute-for-cancer_3">#</a>
</h3><blockquote>
<h4 id="a-namesteinabstractabstracta_4">
<a name="stein-abstract">Abstract</a> <a class="head_anchor" href="#a-namesteinabstractabstracta_4">#</a>
</h4>
<p>The International Cancer Genome Consortium (ICGC; <a href="http://www.icgc.org">www.icgc.org</a>) <a href="http://www.icgc.org/">http://www.icgc.org/</a> is a multinational effort to identify patterns of germline and somatic genomic variation in the major cancer types. Currently consisting of 71 cancer-specific projects spanning 18 different countries, ICGC has sequenced the tumor and normal genomes of over 10,000 donors (>20,000 genomes). When the current phase of the project is completed in 2018, we expect to have sequenced more than 25,000 donors.</p>
<p>All analyzed data from the project is available to the public, including clinical information about the donors, somatic mutations identified in the tumors, and the potential functional significance of these mutations. The raw sequencing data and other potentially-identifiable information is available to researchers who have signed an agreement promising not to attempt to identify the donors. The total data set is now 500 terabytes in size, but growing rapidly as the project switches from exome sequencing (sequencing just the transcribed regions of the genome) to whole-genome sequencing. We anticipate that the full data set will be on the order of 10 petabytes.</p>
<p>To maximize the utility of the data to the public, the analyzed data is available at the ICGC data portal (dcc.icgc.org) <a href="http://dcc.icgc.org/">http://dcc.icgc.org/</a>, where users can browse donors, mutations and genes using an attractive highperformance web application based on Elastic Search at the backend and AngularJS and D3.js on the front end. The portal uses faceted search as its dominant user interface metaphor. This allows researchers to pose general queries, such as “find all non-synonymous mutations” and then successively refine them “…affecting genes in the hedgehog pathway”, “…affecting donors with stage I disease.” A series of interactive graphics allows researchers to readily compare different sets of mutations, donors and genes.</p>
<p>A limitation of ICGC is that the raw sequencing data must still be downloaded from a static file repository. We are addressing this limitation by moving the data into the compute cloud, where software and data can be co-resident. In the Whole Genome Pan-Cancer Analysis Project, which began earlier this year, 2000 whole genome pairs from ICGC are being placed into several compute cloud analysis facilities to allow for uniform mutation-calling and data mining by ICGC researchers. In the “Cancer Genome Collaboratory”, a project just approved in March 2014, we will be placing the entire ICGC data set into two compute cloud centers for access by the general research community. I will talk about the challenges and solutions that we are working on in connection to these two projects.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>ICGC Project</p>
<ul>
<li>International Cancer Genome Sequencing Consortium</li>
<li>5th year of operation</li>
<li>multi-national collaboration</li>
<li>Includes all of the TCGA projects</li>
<li><strong>Goal: Identify the common patterns of mutation in all major cancer types</strong></li>
</ul>
<p>Simple experimental design:</p>
<ul>
<li>take normal (blood) and tumour (biopsy) samples from a series of donors</li>
<li>sequence</li>
<li>identify cancer-related mutations</li>
<li>relate mutations to tumor bio</li>
<li>translate this knowledge to improved diagnosis and treatment & make avail</li>
</ul>
<p>ICGC db growing in size - moved from exome sequencing to whole genome</p>
<ul>
<li>10K+ donors</li>
<li>4M+ somatic mutations</li>
<li>49K CNVs</li>
<li>6K+ methylation profiles</li>
</ul>
<p>Available to public - Website @ <a href="http://dcc.icgc.org">http://dcc.icgc.org</a></p>
<ul>
<li>very nice data browser</li>
<li>faceted view of various data types and donor types</li>
<li>changes in a context sensitive way</li>
<li>updates list with dynamically updated graphs/summary</li>
<li>links to raw data @ CGHub</li>
<li>view most mutated genes in selected cancer subtype.
Can keep drilling down through stats/projects.
Or look at summary - transcript level / protein level.</li>
</ul>
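<p>The faceted drill-down described above - a general query successively refined facet by facet - can be sketched with plain dicts standing in for the portal’s index (the real portal does this with ElasticSearch; the records and facet values here are invented):</p>

```python
# Toy mutation records standing in for the ICGC portal's index
mutations = [
    {"id": "MU1", "type": "non-synonymous", "pathway": "hedgehog", "stage": "I"},
    {"id": "MU2", "type": "non-synonymous", "pathway": "wnt",      "stage": "II"},
    {"id": "MU3", "type": "synonymous",     "pathway": "hedgehog", "stage": "I"},
]

def refine(records, **facets):
    """Narrow a result set, faceted-search style: keep records
    matching every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

step1 = refine(mutations, type="non-synonymous")   # "all non-synonymous mutations"
step2 = refine(step1, pathway="hedgehog")          # "...in the hedgehog pathway"
step3 = refine(step2, stage="I")                   # "...in stage I donors"
```

<p>Each refinement operates on the previous result set, which is why the portal’s summary graphs can update dynamically at every step.</p>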
<p>Original Database - based on BioMart</p>
<ul>
<li>MySQL-based data mart - developed and used by the Ensembl project</li>
<li>de-normalized data schema (reverse-star schema)</li>
<li>scaled well for human and other vertebrate genomes</li>
<li>worked well until release 12</li>
<li>One problem: as the data got larger, BioMart didn’t scale</li>
<li>Release 8 & 9: three-month release cycle (freeze, prep, load, QC)</li>
<li>by release 11 - load phase taking 2-3 months! Missing the release window. Were announcing a new freeze before the new db was released</li>
</ul>
<p>September - complete rewrite of the entire DCC (Ferretti). Heavy use of distributed computing. </p>
<p>Process:</p>
<ul>
<li>genome centres submit flat files + metadata</li>
<li>validation (Hadoop cluster - HDFS distributed filesystem)</li>
<li>loaded into MongoDB (on cluster)</li>
<li>Combined w/ other info (gene annotation from Ensembl, UniProt, COSMIC, etc)</li>
<li>Indexed by ElasticSearch (another cluster)</li>
<li>Indexed info stored in Mongo - drives the portal</li>
<li>Total time for loading release 15: 42 hours (not yet optimized)</li>
</ul>
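<p>The submit → validate → enrich → index flow above can be sketched as three small pure-Python stages (the field names and annotation source are invented stand-ins; in the real DCC these stages run on Hadoop, MongoDB, and ElasticSearch):</p>

```python
def validate(submission):
    """Reject records missing required fields (the Hadoop stage's job)."""
    required = {"donor_id", "gene", "mutation"}
    return [r for r in submission if required <= r.keys()]

def enrich(records, gene_annotations):
    """Join in external gene annotation (Ensembl/UniProt in the real pipeline)."""
    return [dict(r, annotation=gene_annotations.get(r["gene"], "unknown"))
            for r in records]

def index(records):
    """Build a gene -> records lookup (ElasticSearch's job in the DCC)."""
    idx = {}
    for r in records:
        idx.setdefault(r["gene"], []).append(r)
    return idx

submission = [
    {"donor_id": "DO1", "gene": "TP53", "mutation": "R175H"},
    {"donor_id": "DO2", "gene": "KRAS"},   # invalid: no mutation field
]
idx = index(enrich(validate(submission), {"TP53": "tumour suppressor"}))
```

<p>Because each stage is a pure transformation over record batches, the whole load can be parallelized across a cluster - which is how the rewrite brought a 2-3 month load down to 42 hours.</p>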
<p>What about raw read data?</p>
<ul>
<li>~10 PB Genome data by 2018</li>
<li>depositing all genome data in EGA.
In theory, researchers go to EGA and download the data.
In practice, the data is too large. Takes too long.</li>
<li>will soon be completely inaccessible - except maybe for some large groups, or those located in the UK</li>
<li>This is an important legacy dataset that can still be mined</li>
<li>Current mutation-calling algorithms are not perfect.
Different groups have low overlap. Different filtering systems. Many false positives (e.g. titan).
Our ability to predict gene rearrangements is quite poor.</li>
<li>want to go back to the data to get more info as our algorithms improve</li>
</ul>
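<p>One simple response to the low overlap between callers is consensus calling: keep only mutations reported by a minimum number of independent callers. A toy sketch (the call tuples are invented; real pan-cancer pipelines use far more sophisticated merging):</p>

```python
from collections import Counter

def consensus(call_sets, min_support=2):
    """Keep mutation calls made by at least `min_support` callers -
    one simple way to cut false positives when callers disagree."""
    counts = Counter(call for calls in call_sets for call in set(calls))
    return {call for call, n in counts.items() if n >= min_support}

# Calls as (chromosome, position, change) tuples from three hypothetical callers
caller_a = {("chr1", 12345, "A>T"), ("chr2", 555, "G>C")}
caller_b = {("chr1", 12345, "A>T"), ("chr9", 42, "T>G")}
caller_c = {("chr1", 12345, "A>T"), ("chr2", 555, "G>C")}

kept = consensus([caller_a, caller_b, caller_c])
```

<p>Raising <code>min_support</code> trades sensitivity for precision - which is exactly why the raw reads need to stay accessible, so calls can be redone as algorithms improve.</p>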
<h5 id="the-solution-gt-the-pancancer-whole-genome-an_5">The solution => The Pan-Cancer Whole Genome Analysis Project (PAWG/Pan-Can) <a class="head_anchor" href="#the-solution-gt-the-pancancer-whole-genome-an_5">#</a>
</h5>
<ul>
<li><strong>Goal: understand what’s going on in the 95% of the cancer genome that isn’t protein-coding</strong></li>
<li>Resources: 2K whole genome tumor/normal pairs from ICGC</li>
<li>Analytic issues: calling cancer mutations in non-coding regions is an evolving art.
Need uniform pipeline.
Dataset - 0.5PB.</li>
<li>Cloud based approach - six cloud compute centres in USA, Europe, Asia</li>
<li>
<strong>Phase 1:</strong> Partition data among the data centres.
Perform alignment and mutation calling in a distributed fashion</li>
<li>
<strong>Phase 2:</strong> Synchronize alignments and mutation calls.
Each centre will have the complete set of alignments and mutation calls</li>
<li>
<strong>Phase 3:</strong> Open up (a subset of) clouds to allow researchers to do analysis</li>
</ul>
<p>Technologies: OpenStack (5 centers) and vCloud (EBI)</p>
<ul>
<li>Vagrant - VM abstraction layer (makes clouds look similar)</li>
<li>network transfer and metadata - GNOS / GeneTorrent (from Annai Systems Inc) - commercial solution</li>
<li>Workflow management - SeqWare pipeline manager (OICR & UNC developed - O'Connor), Synapse from Sage</li>
</ul>
<p>Status</p>
<ul>
<li>Ethical approval, usage agreements signed - Legal</li>
<li>OpenStack/VMware, vagrant SeqWare installed</li>
<li>alignment workflows executed on some vms</li>
</ul>
<p>Challenges</p>
<ol>
<li>Legal - regional differences have not gone away.
Datasets from TCGA (US) can be hosted by certain US-based institutions trusted by the NIH.
The NIH has not approved phase II of the project due to the way the consent was written. It can be interpreted as ‘not allowed to use on the cloud’ (but the cloud didn’t exist when the consent was written).
Europe - some countries are sensitive about distributing their data to US-based data centres (Snowden & NSA).</li>
<li>Technical - adapting grid-based HPC to use cloud-based technologies.
Running 8 weeks behind</li>
</ol>
<p>Why not a commercial cloud? Amazon, Google, MS</p>
<ul>
<li>legal and ethical issues</li>
<li>preliminary ethics approval for ICGC. Some restrictions - can’t cross regulatory borders without notice</li>
<li>NIH reviewing approval for TCGA sets</li>
</ul>
<p>What happens when Pan-Can is done ~ 1 year? The group has received funding from Canadian funders: <strong>The Cancer Genome Collaboratory</strong></p>
<ul>
<li>long-lived private cloud compute centre, pre-populated with ICGC datasets</li>
<li>any individual can create an account and access the data via api</li>
<li>have an integrated benchmarking core, bioethics, community outreach</li>
<li>Initially two physical data centres: Chicago (w/ Grossman) & Toronto. <strong>Connected by high speed link</strong>
</li>
<li>Funded as of March 1</li>
</ul>
<h4 id="a-namesteinqaq-amp-aa_4">
<a name="stein-qa">Q & A</a> <a class="head_anchor" href="#a-namesteinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>Many of us have been using BioMart and the scalability - how portable is your new system as a replacement for BioMart?</strong></p>
<p><strong>A:</strong> On a scale of 0 - 100: -1. This is a highly specialized system designed just to work with our data. BioMart is alive and well in Italy</p>
<p><strong>Q: What cancer types were chosen for the pan-cancer analysis? And why?</strong></p>
<p><strong>A:</strong> Our criteria for inclusion: at least 30x coverage for whole genome, a tumor/normal pair, and proper consent from the donor.<br>
Of that, we have ovarian, breast, lung, pancreatic, liver, leukemias – about 13 in all<br>
The final list of tumor types won’t be selected till we’ve QC’ed all the data and know what the distribution is</p>
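<p>The inclusion criteria in the answer above amount to a simple conjunction of checks. A minimal sketch (field names are hypothetical, not the real ICGC metadata schema):</p>

```python
# Minimal sketch of the Pan-Cancer inclusion criteria described above.
# Field names are hypothetical; the real ICGC metadata schema differs.

def eligible(donor):
    """A donor qualifies with >= 30x whole-genome coverage,
    a matched tumor/normal pair, and proper consent."""
    return (donor["wgs_coverage"] >= 30
            and donor["has_tumor_normal_pair"]
            and donor["consent_ok"])

donors = [
    {"id": "D1", "wgs_coverage": 42, "has_tumor_normal_pair": True,  "consent_ok": True},
    {"id": "D2", "wgs_coverage": 25, "has_tumor_normal_pair": True,  "consent_ok": True},
    {"id": "D3", "wgs_coverage": 38, "has_tumor_normal_pair": False, "consent_ok": True},
]

included = [d["id"] for d in donors if eligible(d)]
print(included)  # only D1 satisfies all three criteria
```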
<p>*<em>Q: If the 10PB of data that will be generated will be harmful - look at quality compression and other *</em></p>
<p><strong>A:</strong> No chance that we’ll be storing and distributing the full uncompressed 10PB. Actively benchmarking compression systems. Hopefully get it down to a few PB without loss of information</p>
<p><strong>Q: What is the main objective of this project? Biological objective?</strong></p>
<p><strong>A:</strong> The main biological objective - focusing on patterns of alteration in non-coding regions. E.g. we know there are mutations in regulatory regions that we haven’t characterized.<br>
Groups are looking at: </p>
<ol>
<li>Looking at regulatory networks - interactions with coding regions.<br>
</li>
<li>Patterns of rearrangement<br>
</li>
<li>Evidence of insertion of known and unknown pathogens / virus that may be driving the tumours<br>
</li>
</ol>
<p>Looking at this in a uniform way, we’ll learn mechanisms that are common and mechanisms that are distinct</p>
<p><strong>Q: How willing are your users to get random samples in return as opposed to the full data? Plus confidence score</strong></p>
<p><strong>A:</strong> Key method of access - take slices of the raw data in the region that you’re interested in. Or extend and do a random sampling - a feature available on CGHub and widely used. Not a feature of EGA - an annoying deficit. One of the reasons we want to move away.</p>
<p><strong>Q: Majority of researchers - don’t need to develop alignment algorithms. Are processed data available to researchers?</strong></p>
<p><strong>A:</strong> The interpreted data (still large, but much smaller - in GB not TB) is available for browsing and download from <a href="http://dcc.icgc.org">http://dcc.icgc.org</a></p>
<p><strong>Q: Curious how you are designing your APIs? APIs for visualization are different from tools</strong></p>
<p><strong>A:</strong> Start with the user interface, figure out what it needs to display, and work back to the API. A genome browser has a very different api than the faceted browser where you’re looking at a particular biological pathway. Specialized APIs and indexes for each of those.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameroyyuruthe-genographic-projecta_2">
<a name="royyuru">The Genographic Project</a> <a class="head_anchor" href="#a-nameroyyuruthe-genographic-projecta_2">#</a>
</h2><h3 id="genome-analytics-with-ibm-watson_3">Genome Analytics with IBM Watson <a class="head_anchor" href="#genome-analytics-with-ibm-watson_3">#</a>
</h3>
<p>Ajay Royyuru, IBM T.J. Watson Research Center, USA<br><br>
Director of computational biology </p>
<blockquote class="short">
<h4 id="a-nameroyyuruabstractabstracta_4">
<a name="royyuru-abstract">Abstract</a> <a class="head_anchor" href="#a-nameroyyuruabstractabstracta_4">#</a>
</h4>
<p><em>// last minute topic change, no published abstract</em></p>
<p>Press release: <a href="http://www-03.ibm.com/press/us/en/pressrelease/43444.wss">http://www-03.ibm.com/press/us/en/pressrelease/43444.wss</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Research group at IBM - very focused on computational biology.<br><br>
Intersection of everything IT and Life Sciences.</p>
<p>3 pillars of work (IBM computational biology)</p>
<ol>
<li>managing and analyzing the data explosion - makes biology more amenable to quantitative outcomes</li>
<li>predicting biological outcomes with scale of computing</li>
<li>dealing with complexity. DREAM - the IBM team is heavily involved with the community</li>
</ol>
<p>Why:</p>
<ul>
<li>Intrigued by connections made yesterday (DH, JM)</li>
<li>Sequencing is reaching a point where we have to look at the translational aspects</li>
<li>beginning to make an impact in the clinic</li>
<li>takes a community</li>
<li>IBM Watson - can be used here</li>
<li>On IBM’s cloud system - rapidly scale. These sorts of analytics capabilities become scalable and accessible so they can have an impact on the clinic down the road</li>
</ul>
<p>What are we up to: gathering raw sequencing input, through a large number of steps, so that we will eventually get useful info that may lead to action</p>
<p>3 pillars in the journey of genomic medicine</p>
<ol>
<li>sequencing (includes downstream analysis - variant calling)</li>
<li>translational medicine (have VCF) ← will focus on this piece (VCF to actionable)</li>
<li>Actionable intelligence - Personalized healthcare. Something publishable is our goal</li>
</ol>
<h5 id="translational-medicine_5">Translational Medicine: <a class="head_anchor" href="#translational-medicine_5">#</a>
</h5>
<p>System that generates insights</p>
<p>Input:</p>
<ol>
<li>data coming from sequencing (VCF) - patient specific information</li>
<li>Entirety of what you can point Watson to - All available biological knowledge (PubMed, NCI PDQ)</li>
</ol>
<p>All this is ingested. Running on IBM’s cloud layer (SoftLayer) - large/global/scalable/acquired by IBM.<br>
Generates some actionable insights. <br>
Goal: this goes to tumor oncologists, look at data in context of decision trying to make. Hopefully make informed correct decision.</p>
<h5 id="ibm-watson_5">IBM Watson <a class="head_anchor" href="#ibm-watson_5">#</a>
</h5>
<ul>
<li>began 2008 - research project</li>
<li>Jeopardy - grand challenge (got attention)</li>
<li>Added genomics capabilities!</li>
</ul>
<h5 id="genomics-not-just-about-genes-how-we-connect_5">Genomics - not just about genes. How we connect that knowledge <a class="head_anchor" href="#genomics-not-just-about-genes-how-we-connect_5">#</a>
</h5>
<p>The traditional way: read papers, develop hypotheses -> interpretation -> actionable output. Can we automate this? Can we come up with new research approaches from the literature?</p>
<h5 id="p53-project-example-ingest-a-lot-mine-the-lit_5">p53 project example - ingest a lot - mine the literature. <a class="head_anchor" href="#p53-project-example-ingest-a-lot-mine-the-lit_5">#</a>
</h5>
<ul>
<li>lots of text, natural language, analytics happening</li>
<li>specific to diseases, compounds (drug molecules)</li>
<li>Human readable sentences - use Watson based technology to translate the information into machine readable.
‘the results show that EPK2 phosphorylated p53 at Thr55’ - extract info with Watson</li>
<li>Extraction is working</li>
</ul>
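<p>The sentence-to-triple step described above can be caricatured with a regular expression. Watson’s actual NLP pipeline is far richer, so treat this as a toy illustration only:</p>

```python
import re

# Toy caricature of turning a human-readable sentence into a
# machine-readable (kinase, substrate, site) triple; the real
# Watson extraction uses much more sophisticated NLP.
PATTERN = re.compile(
    r"(?P<kinase>\w+) phosphorylated (?P<substrate>\w+) at (?P<site>\w+)")

def extract(sentence):
    """Return (kinase, substrate, site) or None if no relation found."""
    m = PATTERN.search(sentence)
    return (m.group("kinase"), m.group("substrate"), m.group("site")) if m else None

print(extract("the results show that EPK2 phosphorylated p53 at Thr55"))
# ('EPK2', 'p53', 'Thr55')
```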
<p>Application to genomics: <br>
on SoftLayer, a physician managing cases (biopsy samples) submits by uploading a VCF.<br>
What analysis can be done - </p>
<ul>
<li>circos representation - where mutations occur, what they translate to</li>
<li>map to available info on pathways </li>
<li>what more can you find in the literature, Watson? - adds links (to literature) from text mining.
Can drill down and find out why links were generated</li>
<li>Drugs - targeting pathways: added in data model</li>
</ul>
<p>Summary: researcher can browse, print report for the record.</p>
<ul>
<li>see provenance of the data and keep a record of it</li>
<li>see all visualizations, records, summary</li>
<li>list of all possible drugs, status (approved?)</li>
<li>this insight is available to the researcher</li>
</ul>
<p>Looking for active collaborations - IBM doesn’t generate this data themselves</p>
<ul>
<li>last week: partnership with NY genome centre (collaboration of research centres in NY area).
Can take this technology and apply it with them.
Get practical use of this technology</li>
<li>Not exclusive to NY genome, can open collaborations with others</li>
</ul>
<p>Sample report- generated with early data</p>
<ul>
<li>TCGA GBM data - reshaped to put in system</li>
<li>generated report (many pages long)</li>
<li>list of drugs with reasons why the drug is contextually relevant</li>
</ul>
<p>e.g. Lidocaine in report: not prepared to see this in here</p>
<ul>
<li>showed to oncologists - click through to evidence.
Watson points to papers - Lidocaine assay on cancer cells (tongue, EGFR receptor). Lidocaine being tested in context of thyroid cancer cells</li>
<li>so this is not out of the realm of what we should be thinking about</li>
<li>helps us be current and comprehensive</li>
</ul>
<h4 id="a-nameroyyuruqaq-amp-aa_4">
<a name="royyuru-qa">Q & A</a> <a class="head_anchor" href="#a-nameroyyuruqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ouellette) <strong>Do you have any evidence on how Watson will do if it read full papers (not just abstracts)?</strong></p>
<p><strong>A:</strong> Not tested in this context. Watson does read full papers in a clinical context </p>
<p><strong>Q:</strong> (Mesirov) -<br><br>
<strong>1. Are you aiming with that package towards the practicing oncologists or the research physician?</strong><br><br>
<strong>2. To what extent have you compared what Watson is able to mine from the data with other approaches/algorithms/packages published and available to the community?</strong></p>
<p><strong>A:</strong> </p>
<ol>
<li>It’s a journey - early adopters, research clinicians who have the expertise and interest to be partners. A lot of learning. For example, Watson shows lots of evidence. You need a clinician researcher who understands the subtleties of the research and how to make decisions that will be useful</li>
<li>Not whole scale comparison yet - still in ingest and build mode. Some benchmarking and testing - working on the baseline. Full scale comparison for later. Watson can also do chemical extraction - full scale comparison here.</li>
</ol>
<p><strong>Q:</strong> </p>
<ol>
<li>
<strong>Is there any way to integrate other sources of information not text based? Images? Protein structures?</strong><br>
</li>
<li><strong>human value added in human curation databases?</strong></li>
</ol>
<p><strong>A:</strong> </p>
<ol>
<li>Image analytics is an interest to us. Study going on here. Working with some large medical institutions on this project.</li>
<li>Melding between machine and human curation -> this accelerates the process. Makes it more usable.</li>
</ol>
<p><strong>Q: Doubts whether a practicing physician will know what a VCF is, understand a Circos plot? Watson to user or user to Watson?</strong></p>
<p><strong>A:</strong> Initial set of end users - clinician researchers. They got the sample, they know what a VCF is. This is the community that will find this useful. What can we simplify to make this more usable. <br>
Right now, collaboration.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namegersteinhuman-genome-analysisa_2">
<a name="gerstein">Human Genome Analysis</a> <a class="head_anchor" href="#a-namegersteinhuman-genome-analysisa_2">#</a>
</h2><h3 id="mark-gerstein-yale-university-usa_3">Mark Gerstein, Yale University, USA <a class="head_anchor" href="#mark-gerstein-yale-university-usa_3">#</a>
</h3>
<p>Director: computational biology<br>
ENCODE, 1000 genomes</p>
<blockquote>
<h4 id="a-namegersteinabstractabstracta_4">
<a name="gerstein-abstract">Abstract</a> <a class="head_anchor" href="#a-namegersteinabstractabstracta_4">#</a>
</h4>
<p>Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in non-coding regions, is a central challenge for genomics.</p>
<p>One data science construct that is particularly useful for genome interpretation is networks. My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression.</p>
<p>Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).</p>
<p><a href="http://networks.gersteinlab.org">http://networks.gersteinlab.org</a><br><br>
<a href="http://tyna.gersteinlab.org">http://tyna.gersteinlab.org</a> </p>
<p>Architecture of the human regulatory network derived from ENCODE data.<br><br>
Gerstein et al. Nature 489: 91</p>
<p>Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.<br><br>
KY Yip et al. (2012). Genome Biol 13: R48.</p>
<p>Understanding transcriptional regulation by integrative analysis of transcription factor binding data.<br><br>
C Cheng et al. (2012). Genome Res 22: 1658-67.</p>
<p>The GENCODE pseudogene resource.<br><br>
B Pei et al. (2012). Genome Biol 13: R51.</p>
<p>Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.<br><br>
KK Yan et al. (2010). Proc Natl Acad Sci U S A 107:9186-91.</p>
</blockquote><h4 id="slides_4">Slides <a class="head_anchor" href="#slides_4">#</a>
</h4>
<p><a href="http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/">http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/</a></p>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="my-perspective-on-big-data_5">My perspective on Big Data <a class="head_anchor" href="#my-perspective-on-big-data_5">#</a>
</h5>
<ul>
<li>buzz word, data science</li>
<li>HBR - data science the sexiest job of the 21st century (<a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1">http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1</a>)</li>
<li>transforming science</li>
<li>explosion of data in genomics - sequencing price going down faster than Moore’s law.
Cost is in management of data</li>
<li>Current state of large sequencing datasets: TCGA 910 TB in CGHub, + smaller datasets</li>
</ul>
<h5 id="what-do-people-do-with-big-data_5">What do people do with big data? <a class="head_anchor" href="#what-do-people-do-with-big-data_5">#</a>
</h5>
<p>Take this data to answer a question, make a prediction, modelling</p>
<p>Two ways to approach:</p>
<ol>
<li>don’t care about structure, just want answer (google search)</li>
<li>with explicit organization of dataset (google maps, google earth)</li>
</ol>
<p>In science - search for Higgs boson - searching through many for a few needles (fits in #1)</p>
<p>In genomics - we’re in #2</p>
<ul>
<li>we want to make a map of the molecular world we have</li>
<li>but we don’t have an immediate metaphor we can hang all our information on</li>
<li>but we don’t know what the structure of that map is</li>
<li>ENCODE - thought about the structure of the map. Layer information down</li>
<li>genomics has been around for a while - one of the first big data disciplines.
Inspired by pandora - music genome project which was inspired by how geneticists organize information.
We should learn from other disciplines</li>
</ul>
<h5 id="how-we-can-organize-information-in-genomics-n_5">How we can organize information in genomics - networks <a class="head_anchor" href="#how-we-can-organize-information-in-genomics-n_5">#</a>
</h5>
<ul>
<li>regulatory networks as a hierarchy</li>
<li>more connectivity - constraint</li>
</ul>
<h5 id="what-is-genome-annotation_5">What is genome annotation? <a class="head_anchor" href="#what-is-genome-annotation_5">#</a>
</h5>
<p>Tracks in a genome browser - a linear view of how to think of the genome.<br>
Will this scale with thousands of tracks? No</p>
<p>What type of information do we want? Actually thinking of 3D molecules - but not quite possible</p>
<p>Network diagram - middle ground</p>
<ul>
<li>works for cancers/biology pathways</li>
<li>compelling approach to big data</li>
<li>Example: we started off with linear annotation (ChIP-Seq experiments)</li>
<li>Then, created proximal edges at peaks.<br>
Generated a hairball of 0.5 million edges, pared down to 25K edges.<br>
Many edges far away from genes - distal sites.</li>
</ul>
<p>analyze networks - network science</p>
<ul>
<li>Hub - point with many neighbours</li>
<li>bottleneck - node carrying the max # of shortest paths</li>
<li>Identify bottlenecks & hubs (like roads, bridges can be bottlenecks)</li>
</ul>
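<p>The hub/bottleneck distinction above can be made concrete on a toy graph - two triangles joined through a low-degree bridge node. This is a hypothetical network for illustration; real analyses compute betweenness centrality over the full regulatory network:</p>

```python
from collections import deque
from itertools import combinations

# Toy illustration of "hub" (many neighbours) vs "bottleneck" (node
# lying on many shortest paths): two triangles joined through a
# low-degree bridge node "x". Hypothetical graph, not real TF data.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
         ("x", "f"), ("f", "g"), ("f", "h"), ("g", "h")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def shortest_path(src, dst):
    """One BFS shortest path from src to dst."""
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                queue.append(w)

degree = {n: len(adj[n]) for n in adj}     # hub score = # of neighbours
between = {n: 0 for n in adj}              # bottleneck score
for s, t in combinations(adj, 2):
    for n in shortest_path(s, t)[1:-1]:    # interior nodes only
        between[n] += 1

print("top hub:", max(degree, key=degree.get))
print("top bottleneck:", max(between, key=between.get))
```

<p>Note that the bridge node "x" has only two neighbours yet carries every path between the two triangles - a bottleneck without being a hub.</p>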
<p>Directed entities - regulatory networks</p>
<ul>
<li>one thing regulates another</li>
<li>Hierarchy - intuitive - people understand this</li>
<li>optimally arrange transcription factors (ENCODE) into 3 levels by simulated annealing, maximizing downward pointing edges</li>
<li>higher bottleneck-ness in centre layer - information flow</li>
<li>Can think about molecules - does this make sense for molecules?<br>
Integration of TF hierarchy with other ‘omic information.<br>
More connected and influential on top</li>
<li>Same thing with miRNA networks (bi directional)</li>
<li>Can look at how transcription factors are working together.
Pick two, can look at the degree they co-regulate the target</li>
</ul>
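<p>The level-assignment idea above (place regulators so that as many edges as possible point downward) can be shown on a toy network small enough to enumerate outright. The actual ENCODE analysis used simulated annealing because the full TF network is far too large for brute force; the edge list here is hypothetical:</p>

```python
from itertools import product

# Toy version of arranging regulators into 3 levels so that as many
# edges as possible point downward (regulator above its target).
# Hypothetical mini-network; the real ENCODE analysis optimized the
# same objective with simulated annealing on the full TF network.
edges = [("A", "B"), ("A", "C"), ("B", "D"),
         ("B", "E"), ("C", "E"), ("D", "B")]
nodes = sorted({n for e in edges for n in e})

def downward(levels):
    # an edge points downward when its source sits on a higher level
    # (level 0 = top of the hierarchy)
    return sum(levels[u] < levels[v] for u, v in edges)

# exhaustive search over all 3^5 level assignments
best = max((dict(zip(nodes, assignment))
            for assignment in product(range(3), repeat=len(nodes))),
           key=downward)
print(downward(best), "of", len(edges), "edges point downward")
```

<p>The D → B feedback edge can never point downward once B → D does, so the best layout satisfies 5 of the 6 edges.</p>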
<h5 id="other-organisms-yeast-genome_5">Other organisms: Yeast genome <a class="head_anchor" href="#other-organisms-yeast-genome_5">#</a>
</h5>
<p>Similar, but has four levels. Multi-regulated network with bottlenecks</p>
<p>Different types of hierarchies</p>
<ol>
<li>autocratic (military)</li>
<li>democratic (things at top mostly regulating, bottom mostly being regulated)</li>
<li>intermediate - between the two. Ease some information bottlenecks</li>
</ol>
<p>Developed a scheme to measure the degree of cross-linking structure. Degree of collaboration</p>
<ul>
<li>number of overlapping </li>
<li>find over many organisms: get a lot more confidence that conclusions are true</li>
<li>middle layer has highest degree of collaboration</li>
</ul>
<p>Compare humans w/ E. coli & yeast & rat: humans more collaborative nodes</p>
<p>Yeast network similar structure to government hierarchy w/ middle managers: matches gov’t of Macao</p>
<p>Social science - there is literature studying how important it is to have middle managers talking to each other</p>
<p>Variation network</p>
<ul>
<li>map all SNPs in 1000 genomes on network</li>
<li>more SNPs at bottom</li>
<li>higher parts of hierarchy more conserved, less variable</li>
<li>Trend: more hubs - less variation/ more connectivity, more constraint.<br>
Seen in many studies/organisms.<br>
Human protein-protein interaction network - rapidly changing on the outskirts</li>
</ul>
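<p>The “more connectivity, more constraint” trend above boils down to a negative correlation between a node’s degree and its variation. A sketch with made-up numbers (illustrative only, not real 1000 Genomes counts):</p>

```python
from math import sqrt

# Hypothetical per-gene numbers illustrating the trend described
# above: hub genes (high degree) tolerate fewer variants, so degree
# and variation correlate negatively. Values are made up.
degree = [12, 9, 7, 5, 3, 2, 1]   # network connectivity per gene
snps   = [ 1, 2, 2, 4, 6, 7, 9]   # variants observed per gene

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

print(round(pearson(degree, snps), 2))  # strongly negative
```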
<h5 id="analogy-to-understand-more-connectivity-gt-mo_5">Analogy to understand more connectivity -> more constraint <a class="head_anchor" href="#analogy-to-understand-more-connectivity-gt-mo_5">#</a>
</h5>
<p>Comparison between e. coli regulatory network and Linux OS</p>
<ul>
<li>call graph in linux compared to e. coli regulatory network</li>
<li>linux is top heavy in comparison</li>
<li>
<em>E. coli</em>: dominated by out degree hubs - turn on a lot of molecules</li>
<li>linux: dominated by in hubs - routines called by many programs</li>
<li>linux OS evolves - we can watch it through each of its releases</li>
<li>plot changes & compare.<br>
<em>E. coli</em>: less change.<br>
Linux: certain things don’t change, some things change constantly. Some releases are coupled to hardware, have to change</li>
<li>In the biological system - negative correlation: more connectivity is less change</li>
<li>In Linux - positive correlation: more connectivity is more change</li>
<li>Perspectives on random change v. Intelligent Design.<br>
Intelligent designer - they believe they can make changes where there is a lot of constraint and connectivity.<br>
If changes are random - best to not put them in central points</li>
</ul>
<p>Applications of more connectivity leads to more constraint - no time to talk today. Building a practical workflow & tool for disease genomes.</p>
<p>Network stuff available - <a href="http://encodenets.gersteinlab.org/">encodenets.gersteinlab.org</a></p>
<h4 id="a-namegersteinqaq-amp-aa_4">
<a name="gerstein-qa">Q & A</a> <a class="head_anchor" href="#a-namegersteinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>you showed this relationship between hub-ness and the kernel call graph. Have you looked at the evolution of the call signature? Highly connected subroutines do not have their call signature changed frequently - more similar to bio</strong></p>
<p><strong>A:</strong> No, very interested in that. Evolution - even package dependencies.</p>
<p><strong>Q: Information flow: makes sense in regulatory networkers. What’s your reasoning with protein-protein networks?</strong></p>
<p><strong>A:</strong> Some types of protein-protein interaction networks, but other times not so much. Key network params - regulatory, focused on bottlenecks. Protein-protein - focus on hubs. When you do the correlations of connectivity with constraint - more on bottlenecks.</p>
<p><strong>Q: Interested in <em>E. coli</em> v. linux - we compare a lot to engineering ideas</strong></p>
<p><strong>A:</strong> Maybe not a lot of engineering ideas apply to biology. Sometimes people look at biological networks to apply to engineering problems</p>
<p><strong>Q: have you looked at hubs in organisms with recent genome duplications to see how they occur?</strong></p>
<p><strong>A:</strong> genome duplicates, suddenly have these two things interact with your hub or what’s there. Lots of network literature on scale free networks - plays into that.</p>
<p><strong>Q: What do you think about the cell type specificity - do you think different cells depending on their needs will have different hierarchies?</strong></p>
<p><strong>A:</strong> Controversy in how I present this. Cell type non-specific hierarchy - this is a global wiring diagram. In my mind, if you go to a certain cell type, certain lights turn on. Other view - cell type specific hierarchies. I think this doesn’t make sense - no one talks about gene lists</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameyoungthe-biocompute-farm-colocated-comp_2">
<a name="young">The BioCompute Farm: Colocated Compute for Cancer Genomics</a> <a class="head_anchor" href="#a-nameyoungthe-biocompute-farm-colocated-comp_2">#</a>
</h2><h3 id="stuart-young-annai-systems-inc-usa_3">Stuart Young, Annai Systems Inc., USA <a class="head_anchor" href="#stuart-young-annai-systems-inc-usa_3">#</a>
</h3><blockquote>
<h4 id="a-nameyoungabstractabstracta_4">
<a name="young-abstract">Abstract</a> <a class="head_anchor" href="#a-nameyoungabstractabstracta_4">#</a>
</h4>
<p>Petabyte-scale genomic data repositories such as the Cancer Genomics Hub (CGHub) require collocated compute resources to fully leverage the value of the genomic data. The traditional model of data download from a repository to a research center followed by local computational analysis suffers from high file transfer costs, significant delays and file storage problems. The BioCompute Farm, a highly-scalable computing resource colocated with CGHub, provides a 99.9% reduction in data storage and 120 times reduction in time for analysis of all 40TB of the current Cancer Genome Atlas (TCGA) RNA-Seq data set. The BioCompute Farm combines high-speed BAM slicing for DNA analysis and the latest in bioinformatics tools and standardized pipelines with the flexibility to customize pipelines and rapidly scale up computational capacity to meet the needs of cancer researchers. As data growth continues to outpace the growth of Internet bandwidth, the BioCompute Farm can serve as a model for the emerging paradigm of colocated compute resources serving the users of large genomic databases.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="motivation-for-talk-why-colocated-compute_5">Motivation for talk: why colocated compute <a class="head_anchor" href="#motivation-for-talk-why-colocated-compute_5">#</a>
</h5>
<ul>
<li>'07/'08 - next gen suddenly became a viable product</li>
<li>before this, fairly expensive Sanger sequencing</li>
<li>soon - began to overshoot the cost of storage and bandwidth</li>
<li>only will become worse</li>
<li>to address this: need to provide a solution to provide capacity and service</li>
</ul>
<h5 id="annai-systems-director-of-bioinformatics_5">Annai systems: director of bioinformatics <a class="head_anchor" href="#annai-systems-director-of-bioinformatics_5">#</a>
</h5>
<ul>
<li>Software underpinning CGHub - Annai-GNOS</li>
<li>server to GeneTorrent - download sequences</li>
<li>BioCompute - colocated w/ CGHub</li>
</ul>
<p>How big is this problem?</p>
<ul>
<li>TCGA data ~ 1PB -> 2.5PB in the next few years</li>
<li>download rates: several months to download it all. Store it. Need infrastructure.</li>
<li>researchers limited by financial and logistical constraints (IT)</li>
</ul>
<p>Survey by NCI - wish list for cancer genomics researchers</p>
<ul>
<li>#1 Run workflows on data in cloud (13%)</li>
<li>Annai covered about 50% of what they want. Maybe biased sample (online)</li>
</ul>
<p>NCI’s colocation model</p>
<ul>
<li>Genomic Data Commons - integrate multiple datatypes, provide API</li>
<li>Cloud Pilots - $20M, colocated compute.
The successful bidders will provide workflows and be scalable</li>
</ul>
<p>BioCompute Farm (TCGA data)</p>
<ul>
<li>what they’re doing with sequencing - shifts cost of sequencing to getting data and results out</li>
<li>upstream costs: technology development, pipelines, bioinfo tools</li>
<li>downstream costs: tools for sequence analysis, management of </li>
</ul>
<p><em>/// LOST CONNECTIVITY FOR A WHILE ///</em></p>
<p>HIPAA Compliance</p>
<ul>
<li>holistic expectation - bookkeeping where access is controlled</li>
<li>Physical security: Cage in SDSC - monitored, power, alarms</li>
</ul>
<p>Provide farms with subscription based access</p>
<p>Provide custom analysis</p>
<ul>
<li>farm loaded with standard pipelines: Broad GATK, PanCancer BWA alignment</li>
<li>Custom Pipelines - latest versions</li>
<li>Workflow tools: SeqWare (O’Connor), agua, synapse</li>
<li>Use Case Baylor - BAM-slicing of TCGA RNA-Seq data
<ul>
<li>would have taken 9 weeks of download time + storage (no capacity)</li>
<li>They used the BioCompute Farm, with BAM-slicing of CGHub BAM files via Annai’s GTFuse</li>
</ul>
</li>
<li>Pipeline Optimization - look at runtimes, will this benefit w/ parallelization or throwing more cpu?</li>
</ul>
<h5 id="collaborations_5">Collaborations <a class="head_anchor" href="#collaborations_5">#</a>
</h5>
<p>PanCancer project</p>
<ul>
<li>prototype of global federated colocated compute</li>
<li>setting up servers, SeqWare, </li>
</ul>
<p>DREAM challenge</p>
<ul>
<li>variant calling </li>
<li>Annai provides GNOS platform for data security and download</li>
</ul>
<p>ShareSeq</p>
<ul>
<li>hosting ICGC - common free access to download free data</li>
<li>provide colo-compute</li>
</ul>
<h5 id="conclusion_5">Conclusion <a class="head_anchor" href="#conclusion_5">#</a>
</h5>
<ul>
<li>colo compute is a no brainer</li>
<li>useful functionalities - fast access, flexible use, tools for workflow, and custom analysis and scalability</li>
</ul>
<h4 id="a-nameyoungqaq-amp-aa_4">
<a name="young-qa">Q & A</a> <a class="head_anchor" href="#a-nameyoungqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: Only 5 or 10 labs in the world are interested in whole PB scale data. I think if we make the VCF file available - this should be sufficient for most researchers.</strong></p>
<p><strong>A:</strong> I think with the way things are going, the issue is not only going to be huge data access, but secure access, and how can we search through the data to find the datasets you want.</p>
<p><strong>Q: Most of the pipelines are focused on variant calling, alignments - what are the priorities for what’s next?</strong></p>
<p><strong>A:</strong> Yes, it’s variant calling right now. One other area of interest- systems approach, pathways, integrating different types of data. Looking at different standards, read pathology or clinical data. Hospital data is very rich for researchers, but not very accessible. Looking at integrating with genomic data.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namegrossmanshort-talk-an-overview-of-the-b_2">
<a name="grossman">Short Talk: An Overview of the Bionimbus Protected Data Cloud </a> <a class="head_anchor" href="#a-namegrossmanshort-talk-an-overview-of-the-b_2">#</a>
</h2><h3 id="robert-l-grossman-university-of-chicago-usa_3">Robert L. Grossman, University of Chicago, USA <a class="head_anchor" href="#robert-l-grossman-university-of-chicago-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namegrossmanabstractabstracta_4">
<a name="grossman-abstract">Abstract</a> <a class="head_anchor" href="#a-namegrossmanabstractabstracta_4">#</a>
</h4>
<p>Bionimbus is a petabyte scale community cloud for managing, analyzing and sharing large genomics datasets that is operated by the not-for-profit Open Cloud Consortium. With a cloud computing model, large genomic datasets can be analyzed in place without the necessity of moving it to your local institution. Bionimbus contains a variety of open access datasets, including ENCODE and the 1000 Genomes dataset. In 2013, we updated Bionimbus so that researchers can analyze data from controlled access datasets, such as The Cancer Genome Atlas (TCGA) in a secure and compliant fashion. We describe some case studies using Bionimbus, some of the bioinformatics tools available with Bionimbus, some different ways of interoperating with Bionimbus, the Bionimbus architecture, and the security and compliance framework.</p>
<p>The Bionimbus Protected Data Cloud is supported in part by NIH/NCI (grant NIH/SAIC Contract 13XS021 / HHSN261200800001E), the Gordon and Betty Moore Foundation, and the National Science Foundation (Grants OISE - 1129076 and CISE 1127316). </p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>I’m going to pose a few questions. In the next 10 min I will not try to answer them. Hopefully your answers will be more interesting than mine. I will give you a framework of how we think of big data.</p>
<h5 id="four-questions_5">Four questions <a class="head_anchor" href="#four-questions_5">#</a>
</h5>
<ol>
<li>Is big data in bioinfo/biomed any different from big data in science? Is big data in science any different from big data in general?</li>
<li>What instrument should we use to make discoveries over big biomedical data?</li>
<li>Do we need new types of mathematical and statistical models for big biomedical data?</li>
<li>How do we organize our data?</li>
</ol>
<h5 id="bionimbus-protected-data-cloud_5">Bionimbus protected data cloud <a class="head_anchor" href="#bionimbus-protected-data-cloud_5">#</a>
</h5>
<p>Supporting Pan-Can analysis - open source core</p>
<ul>
<li>interoperate with as much proprietary software as they can</li>
<li>log in with NIH/eRA credentials - immediate access to TCGA data</li>
<li>pipelines, analysis, install your own software</li>
</ul>
<p>Right now in the process of scaling up</p>
<ul>
<li>10-20 projects a month</li>
<li>contains TCGA data - operates at PB scale</li>
<li>sometime next week, another PB of data & 16K cores, ICGC Pan-Can analysis</li>
<li>question: how do we make sure, on this limited resource, we get the most science out?
Traditionally handled by allocation committees</li>
<li>this month, would have cost >$3K on Amazon</li>
</ul>
<h5 id="open-science-data-cloud_5">Open science data cloud <a class="head_anchor" href="#open-science-data-cloud_5">#</a>
</h5>
<ul>
<li>support integrative analysis:
Can look at how disease is impacted by socio-economic factors and more.
Text analytics & geospatial analytics</li>
<li>4 years old (Bionimbus 1 year)</li>
</ul>
<p>biomedical commons cloud</p>
<ul>
<li>involves cancer centres, open source core but operates with proprietary software around it</li>
<li>want to peer at scale with other providers (biomed commons providers)</li>
<li>like how the internet started with tier-one ISPs</li>
<li>sometimes faster to get data over a high-performance network than from disk with certain protocols</li>
</ul>
<h5 id="new-era_5">New era <a class="head_anchor" href="#new-era_5">#</a>
</h5>
<ul>
<li>'05-'15: bioinformatic tools and integration (Galaxy, GenomeSpace, workflows, portals)</li>
<li>'10-'20: data center scale science (Bionimbus, CGHub, cancer collaboratory).
At that scale what changes and how do we build models</li>
<li>'15-'25: new modelling techniques</li>
</ul>
<p>What are the new models? In ’72 Phil Anderson wrote a piece: “More Is Different”</p>
<ul>
<li><a href="http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf</a></li>
<li>up to us to decide if more is different at this scale and, if it is, how do we model that</li>
<li>backlash on Google Flu</li>
</ul>
<p>How do you scale machine learning to data centers?</p>
<ul>
<li>take large complex datasets and chop them up into small pieces you can analyze at scale</li>
</ul>
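<p>The chop-and-merge idea above can be sketched in a few lines of Python - a toy map-reduce that computes a global mean from per-chunk partial sums (the data and chunk size here are made up for illustration):</p>

```python
# Toy map-reduce: compute a global mean from per-chunk partial results,
# so no single worker ever needs to hold the full dataset.

def partial_stats(chunk):
    """Map step: reduce one small piece to (sum, count)."""
    return sum(chunk), len(chunk)

def global_mean(dataset, chunk_size=1000):
    chunks = (dataset[i:i + chunk_size] for i in range(0, len(dataset), chunk_size))
    total, count = 0, 0
    # each map call could run on a separate node; only tiny tuples come back
    for s, n in map(partial_stats, chunks):
        total += s
        count += n
    return total / count

print(global_mean(list(range(10_000))))  # 4999.5, same as a single-machine mean
```

The same shape (chunk, compute a small summary, merge summaries) covers many of the statistics you would want over data-center-scale genomics data.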
<p>Is more different at this scale? And if so, how do we discover it?</p>
<h4 id="a-namegrossmanqaq-amp-aa_4">
<a name="grossman-qa">Q & A</a> <a class="head_anchor" href="#a-namegrossmanqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>as you see these data centres emerging, do you think they’ll focus on specific questions? How do you see the data centres forming?</strong></p>
<p><strong>A:</strong> The ones I mentioned are around cancer genomics. Sustainability and payment - putting small taxes on some of our projects so that we can make larger amounts of our data available. Driven by some funding agencies. There’s a certain interest from private donors in funding certain parts of this. Some economic incentives. Some combination of that is going to change the way we do science.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namebutlershort-talk-pancancer-analysis-of_2">
<a name="butler">Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets</a> <a class="head_anchor" href="#a-namebutlershort-talk-pancancer-analysis-of_2">#</a>
</h2><h3 id="adam-butler-wellcome-trust-sanger-institute-u_3">Adam Butler, Wellcome Trust Sanger Institute, UK <a class="head_anchor" href="#adam-butler-wellcome-trust-sanger-institute-u_3">#</a>
</h3><blockquote>
<h4 id="a-namebutlerabstractabstracta_4">
<a name="butler-abstract">Abstract</a> <a class="head_anchor" href="#a-namebutlerabstractabstracta_4">#</a>
</h4>
<p>The advent of massively parallel sequencing technology has revolutionised the way we characterise cancer genomes and provided new insights in our understanding of the mechanisms of oncogenesis. The International Cancer Genome Consortium (ICGC) was instigated in 2007 with the aim to systematically screen hundreds of Cancer Genomes for 50 distinct tumour types and catalogue the somatic variation present. This endeavor aims to prevent duplication of effort, ensure rare tumours are included and generate large datasets for the scientific community. A similar project is underway in the USA, The Cancer Genome Atlas (TCGA).</p>
<p>In late 2013 at the ICGC conference in Toronto, Peter Campbell announced an ambitious plan to undertake a Pan-Cancer analysis of whole genome data available from ICGC and TCGA. This would provide a comprehensive dataset of somatic variant calls with standardised output for 2,000 cancer genomes, which will be available for subsequent downstream analyses.</p>
<p>The primary analysis will include detection of somatic point mutations, small insertions and deletions, copy number changes, rearrangements and retrotransposon/viral integration sites. To ensure integrity of the dataset, three independent analysis pipelines, provided by the Broad Institute, DKFZ and the Sanger Institute, will be utilised. The data will be generated and stored at 6 data centres around the world: Spain, Germany, Japan, UK, and two centres in the USA. </p>
<p>The Sanger Institute’s contribution to this initiative is to provide our analysis pipeline as one of three to be run over the data. Consequently our algorithms have been assessed via rigorous comparison with comparable software and their performance optimised. The pipeline is currently being ported into a VM (Virtual Machine), automated and the code adapted for running all variant detection analyses within a cloud environment.</p>
<p>The primary analysis will deliver a high-quality catalogue of somatic variants in a standardised VCF format, which will be made available from the six centres for downstream investigation.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Going over our part in, and experience with, the Pan-Cancer analysis of large datasets</p>
<h5 id="the-cancer-genome-project_5">The Cancer Genome Project <a class="head_anchor" href="#the-cancer-genome-project_5">#</a>
</h5>
<ul>
<li>2000 - working through Sanger sequencing, then next-gen in '07</li>
<li>In order to handle different datasets - built analysis tools, pipelines and systems</li>
<li>used to this day for analysis</li>
<li>heavily integrated into Sanger infrastructure.
Now have to look at bigger-scale data</li>
</ul>
<p>Pipeline:</p>
<ul>
<li>BWA alignment</li>
<li>Tools: copy number caller - ASCAT - ins/del, rearrangements, transposon, RNA-Seq pipeline</li>
<li>generate VCF, BAM, allow researchers to get useful parts of info and drill down</li>
</ul>
<h5 id="pancancer-large-international-collaboration_5">PanCancer - large international collaboration <a class="head_anchor" href="#pancancer-large-international-collaboration_5">#</a>
</h5>
<ul>
<li>2K genome pairs (4K genomes) from multiple tumour types, 30x coverage</li>
<li>uniform dataset</li>
<li>analysed using 3 pipelines (Broad, DKFZ, Sanger)</li>
</ul>
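<p>The notes don’t say how the three pipelines’ calls get reconciled, but a common pattern for multi-caller setups is majority voting across call sets; a hypothetical sketch (the actual Pan-Cancer merging rules are not described here):</p>

```python
# Hypothetical consensus step: keep variants reported by at least 2 of 3 callers.
# Variants are (chrom, pos, ref, alt) tuples; all example calls are invented.
from collections import Counter

def consensus(call_sets, min_support=2):
    votes = Counter(v for calls in call_sets for v in calls)
    return {v for v, n in votes.items() if n >= min_support}

broad  = {("chr1", 12345, "A", "T"), ("chr2", 999, "G", "C")}
dkfz   = {("chr1", 12345, "A", "T")}
sanger = {("chr1", 12345, "A", "T"), ("chr3", 55, "T", "G")}

print(consensus([broad, dkfz, sanger]))  # only the call seen by all three survives
```

Running three independent pipelines and intersecting like this trades sensitivity (singleton calls are dropped) for a higher-confidence shared call set.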
<p>CGP -> PanCancer</p>
<ul>
<li>need to take out each part and make it Sanger-free</li>
<li>optimize for different versions of the aligner</li>
<li>pipeline the whole lot using SeqWare (O'Connor)</li>
<li>Just a few seconds - but they add up over a few billion bp</li>
</ul>
<p>Phase 1</p>
<ul>
<li>identify data for upload, align each sample pair</li>
<li>using GeneTorrent to download data from CGHub - works very well.
Personal concern was on getting data from where it was to where it needed to be.
Getting astonishing transfer rate.
Automatic data upload.</li>
</ul>
<h4 id="useful-outcomes_4">Useful outcomes <a class="head_anchor" href="#useful-outcomes_4">#</a>
</h4>
<ul>
<li>we moved over to using a version of BWA-MEM (from BWA)- significantly faster and smaller memory footprint.
May use for in-house pipelines</li>
</ul>
<p>optimized callers</p>
<ul>
<li>looked at where their code was spending time</li>
<li>made huge steps forward - substitution caller is 50% faster</li>
<li>indel caller 2x faster</li>
<li>ICGC benchmarking exercise - invaluable.
Allowed us to make much better judgements on how well we are doing</li>
<li>new sequencing technologies go faster still…</li>
</ul>
<h4 id="a-namebutlerqaq-amp-aa_4">
<a name="butler-qa">Q & A</a> <a class="head_anchor" href="#a-namebutlerqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>interested in optimization for indels - can you push that any further? Many of our bottlenecks are in aligners built for human (work in plant)</strong></p>
<p><strong>A:</strong> What’s it written in? Perl/Java - eyes roll back in heads and they start shaking. Joking aside, with Caveman (substitution caller) - giving someone the time to go back and just re-code proved to give us a massive improvement. Recoded in C. Not glamorous or groundbreaking - C really is faster.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namekasowskishort-talk-extensive-variation_2">
<a name="kasowski">Short Talk: Extensive Variation in Chromatin States Across Humans</a> <a class="head_anchor" href="#a-namekasowskishort-talk-extensive-variation_2">#</a>
</h2><h3 id="maya-m-kasowski-yale-university-usa_3">Maya M. Kasowski, Yale University, USA <a class="head_anchor" href="#maya-m-kasowski-yale-university-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namekasowskiabstractabstracta_4">
<a name="kasowski-abstract">Abstract</a> <a class="head_anchor" href="#a-namekasowskiabstractabstracta_4">#</a>
</h4>
<p>The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Chromatin variation among people</p>
<h5 id="what-makes-people-different_5">What makes people different? <a class="head_anchor" href="#what-makes-people-different_5">#</a>
</h5>
<ul>
<li>Level of DNA sequence - SNPs</li>
<li>But how do these variants translate to phenotypic differences</li>
<li>Look at gene expression. Look at differences in chromatin</li>
<li>Mapped NFkB</li>
</ul>
<h5 id="differences-in-histone-marks-differences-in-g_5">Do differences in histone marks mean differences in gene expression? <a class="head_anchor" href="#differences-in-histone-marks-differences-in-g_5">#</a>
</h5>
<p>Aim:</p>
<ul>
<li>Characterize variation in chromatin state</li>
<li>Genetic basis, functional consequences</li>
</ul>
<p>Used HapMap populations - 19 individuals</p>
<ul>
<li>9-13 histone marks - deeply sequenced data</li>
<li>Convenient - powerful tool for functionally annotating genome</li>
<li>Enhancers/promoters/ etc</li>
</ul>
<h5 id="how-much-variation-in-chromatin-among-individ_5">How much variation in chromatin among individuals? <a class="head_anchor" href="#how-much-variation-in-chromatin-among-individ_5">#</a>
</h5>
<p>There’s an enhancer that is active in Caucasians and 2 Asians, but not Africans - SNP in NFkB motif</p>
<p>Striking variation - more than 30% variation at some marks</p>
<p>Combinatorial - chromatin states based on combinations of the marks</p>
<ul>
<li>promoter states</li>
<li>transcribed states</li>
<li>variety of enhancer states</li>
<li>repressed states</li>
</ul>
<p>Found that it was more meaningful to ask whether a particular mark varies in the context of a particular state than overall</p>
<ul>
<li>looking at active enhancer mark - varied more in enhancer state than promoter state</li>
<li>state specific variability</li>
<li>enhancer states more variable than transcribed or promoter states</li>
<li>repressed mark - varies more in combination with active marks than on its own</li>
</ul>
<p>Do states switch among individuals?</p>
<ul>
<li>largely not the case - an enhancer is an enhancer across individuals</li>
<li>some reciprocal states</li>
</ul>
<p>Genetic basis of variation</p>
<ul>
<li>Active enhancer mark - evidence of a strong genetic basis.
Stronger correlation to genotype for variable than non-variable regions</li>
<li>Family trios: heritability. Found that the extent of variance in the daughter correlates with the parents</li>
</ul>
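<p>One way to quantify the trio observation above - correlating a child’s signal with the mid-parent average across regions. All numbers here are invented, just to show the shape of the calculation:</p>

```python
# Hedged illustration: correlate a child's per-region chromatin signal with the
# mid-parent average. Signal values below are made up.
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

mother = [2.0, 5.1, 3.3, 8.0, 1.2]   # signal per region
father = [2.4, 4.7, 2.9, 7.2, 1.6]
child  = [2.1, 5.0, 3.0, 7.8, 1.3]
mid_parent = [(m + f) / 2 for m, f in zip(mother, father)]

print(round(pearson(mid_parent, child), 3))  # close to 1 -> heritable signal
```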
<p>Possible mechanism - differences in TF binding motifs</p>
<ul>
<li>Strong evidence of this</li>
<li>Link variation to specific motif disruption</li>
<li>Looked at peaks, ENCODE</li>
</ul>
<p>Functional consequences:</p>
<ul>
<li>There’s a strong correlation with gene expression (active enhancer - RNA-Seq data).
For known enhancer-gene links (but imperfectly known)</li>
<li>Not all enhancer variation influences expression (but most of it does).
Why? - the enhancers are buffering each other.
Non-consequential enhancer variation</li>
<li>Chromatin variation is likely to influence phenotypes.
Variant regions enriched in eQTLs and GWAS SNPs</li>
</ul>
<h4 id="a-namekasowskiqaq-amp-aa_4">
<a name="kasowski-qa">Q & A</a> <a class="head_anchor" href="#a-namekasowskiqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>epigenetic change- were you able to use those as biomarkers and retest GWAS? Uncover hidden variation?</strong></p>
<p><strong>A:</strong> Haven’t looked at that. This study had 19 individuals, but as we scale up, perhaps.</p>
<p><strong>Q: Did you look at the trios to see if there’s more concordance among their epigenetic marks than you would have expected on the basis of shared SNPs?</strong></p>
<p><strong>A:</strong> Didn’t look at that, we had two trios.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Parmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-databases-and-clouds2014-04-01T06:05:00-07:002014-04-01T06:05:00-07:00Big Data in Biology: Databases and Clouds<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Parmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="databases-and-clouds_1">Databases and Clouds <a class="head_anchor" href="#databases-and-clouds_1">#</a>
</h1>
<p>Monday, March 24th, 2014 9:30am - 2:15pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288348">http://ks.eventmobi.com/14f2/agenda/35704/288348</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Laura Clarke</strong>, European Bioinformatics Institute, UK<br><br>
<a href="#clarke"><em>The 1000 Genomes Project, Community Access and Management for Large Scale Public Data</em></a> -<br>
[<a href="#clarke-abstract">Abstract</a>]<br>
[<a href="#clarke-qa">Q&A</a>]</p>
<p><strong>Dan Stanzione</strong>, University of Texas at Austin, USA<br><br>
<a href="#stanzione"><em>The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology</em></a> -<br>
[<a href="#stanzione-abstract">Abstract</a>]<br>
[<a href="#stanzione-qa">Q&A</a>]</p>
<p><strong>Jill P. Mesirov</strong>, Broad Institute, USA<br><br>
<a href="#mesirov"><em>GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools</em></a> -<br>
[<a href="#mesirov-abstract">Abstract</a>]<br>
[<a href="#mesirov-qa">Q&A</a>]</p>
<p><strong>Ronald C. Taylor</strong>, Pacific Northwest National Laboratory, USA (replaced by Francis Ouellette)<br><br>
<a href="#taylor"><em>FGED: The Functional Genomics Data Society</em></a> -<br>
[<a href="#taylor-abstract">Abstract</a>]<br>
[<a href="#taylor-qa">Q&A</a>]</p>
<p><strong>Andrew Carroll</strong>, DNAnexus, USA<br><br>
<a href="#carroll"><em>Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific</em></a> -<br>
[<a href="#carroll-abstract">Abstract</a>]<br>
[<a href="#carroll-qa">Q&A</a>]</p>
<p><strong>Michael Schatz</strong>, Cold Spring Harbor Laboratory, USA<br><br>
<a href="#schatz"><em>The Next 10 Years of Quantitative Biology</em></a> -<br>
[<a href="#schatz-abstract">Abstract</a>]<br>
[<a href="#schatz-qa">Q&A</a>]<br>
[<a href="http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf">slides</a>]</p>
<hr>
<h2 id="a-nameclarkethe-1000-genomes-project-communit_2">
<a name="clarke">The 1000 Genomes Project, Community Access and Management for Large Scale Public Data</a> <a class="head_anchor" href="#a-nameclarkethe-1000-genomes-project-communit_2">#</a>
</h2><h3 id="laura-clarke-european-bioinformatics-institut_3">Laura Clarke, European Bioinformatics Institute, UK <a class="head_anchor" href="#laura-clarke-european-bioinformatics-institut_3">#</a>
</h3><blockquote>
<h4 id="a-nameclarkeabstractabstracta_4">
<a name="clarke-abstract">Abstract</a> <a class="head_anchor" href="#a-nameclarkeabstractabstracta_4">#</a>
</h4>
<p>The 1000 genomes data continues to be the largest public variation resource available to the community. Providing coherent and useful resources based on this data continues to be a key goal for the project Data Coordination Center (DCC). </p>
<p>The resource now stands at more than 500 Tbytes in size and nearly 500,000 files on the FTP site; this presents challenges both for us to manage and for users to discover what data we have available.</p>
<p>Here I both describe these challenges and present the solutions and tools the project has created to enable the widest level of usefulness for the 1000genomes project data.</p>
<p><a href="http://www.1000genomes.org/">http://www.1000genomes.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="1000-genomes-project_5">1000 genomes project <a class="head_anchor" href="#1000-genomes-project_5">#</a>
</h5>
<ul>
<li>Largest human project</li>
<li>Aims:
<ul>
<li>complete a baseline of human variation</li>
<li>all variation - at 1% MAF or higher genome-wide.</li>
<li>0.1%-0.5% MAF in exonic regions</li>
<li>structural variations as well as SNVs</li>
</ul>
</li>
<li>BAM and VCF formats started on this project</li>
<li>99% of all variation in an individual is already present in the public catalogue</li>
<li>sequenced 26 populations around the globe.
Started with HapMap, NHGRI helped get more</li>
<li>collaboration - 10 different sequencing centres.
many analysis groups</li>
</ul>
<p>Strategy</p>
<ul>
<li>collect shotgun reads, align to reference </li>
<li>detect variations based on alignment from all samples.
statistical issues for allowing errors in sampling</li>
<li>in 2008 this was impossible at scale</li>
</ul>
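<p>The detect-from-alignment step can be sketched as a toy pileup caller; real callers model sequencing error statistically (the statistical issues mentioned above), so this is illustration only, with invented reads:</p>

```python
# Toy SNV caller: call a variant where most aligned reads disagree with the
# reference base. Real 1000 Genomes callers use error models, not a threshold.
from collections import Counter

def call_snvs(reference, aligned_reads, min_fraction=0.8):
    """aligned_reads: list of (start_position, read_sequence) pairs."""
    pileup = {}  # position -> Counter of observed bases
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup.setdefault(start + offset, Counter())[base] += 1
    variants = []
    for pos, bases in sorted(pileup.items()):
        base, count = bases.most_common(1)[0]
        if base != reference[pos] and count / sum(bases.values()) >= min_fraction:
            variants.append((pos, reference[pos], base))
    return variants

ref = "ACGTACGT"
reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC")]  # all reads support T->A at position 3
print(call_snvs(ref, reads))  # [(3, 'T', 'A')]
```

Calling jointly across all samples, as the project does, gives the caller far more reads per site than this single-sample toy.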
<p>Analysis Approach</p>
<ul>
<li>final phase 70bp+ Illumina.
Take much more complicated variations and create phased genomes</li>
<li>multiple centres, multiple technologies</li>
</ul>
<p>In final phase now</p>
<ul>
<li>technologies progressed so rapidly, can change aims in the duration of the project</li>
<li>0.5 PB of data</li>
</ul>
<h5 id="challenges_5">Challenges <a class="head_anchor" href="#challenges_5">#</a>
</h5>
<p>Data Transfer</p>
<ul>
<li>FTP site growing</li>
<li>20TB 2009 – 580 TB today</li>
<li>synchronizing challenging</li>
<li>download speeds.
Aspera (proprietary).
Download and upload clients</li>
</ul>
<p>Within Consortium Data Exchange</p>
<ul>
<li>Data Freezes
<ul>
<li>stable release of sequence data</li>
<li>dated sequence index file</li>
<li>alignments based on this index</li>
<li>variant set calls created from these BAMs</li>
</ul>
</li>
<li>Machine Readable FTP Site: Text file which points to FTP</li>
<li>Standardized naming formats: used sample and population names and what programs/technologies used</li>
<li>Regular communication</li>
</ul>
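<p>A dated, machine-readable index like the one described above is easy to consume programmatically. A sketch with an invented, simplified column layout (the real 1000 Genomes sequence index files have many more columns):</p>

```python
# Parse a (simplified) sequence index: tab-separated lines mapping an FTP path
# to a sample and population. Column layout here is invented for illustration.
import csv, io

INDEX = """\
ftp/data/HG00096/seq1.fastq.gz\tHG00096\tGBR
ftp/data/HG00097/seq1.fastq.gz\tHG00097\tGBR
ftp/data/NA19625/seq1.fastq.gz\tNA19625\tASW
"""

def files_for_population(index_text, population):
    reader = csv.reader(io.StringIO(index_text), delimiter="\t")
    return [path for path, sample, pop in reader if pop == population]

print(files_for_population(INDEX, "GBR"))  # the two GBR file paths
```

Because the index is dated and frozen, every downstream alignment and variant set can name exactly which index it was built from.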
<p>Public Accessibility</p>
<ul>
<li>FTP site - raw data files <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/</a>
</li>
<li>AWS Amazon Cloud</li>
<li>web site</li>
<li>ensembl browser</li>
</ul>
<p>Tools to Assist Data Use</p>
<ul>
<li>Data slicer
<ul>
<li>slicing remote BAM or VCF files</li>
<li>web front end of samtools</li>
<li>returns subsection of given file - subset by population, individual</li>
</ul>
</li>
<li>Variant Pattern Finder</li>
<li>VCF to PED: haploview (PED)</li>
<li>Ensembl Variant Effect Predictor
<ul>
<li>Predicts functional consequences of variants - SNPs, Indels, Structural Variation</li>
<li>Web & API based</li>
<li>Can provide Sift and PolyPhen, HGVS, Refseq gene name</li>
</ul>
</li>
<li>Population Allele Frequency Tool (coming soon!): range of variations</li>
</ul>
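<p>The Data Slicer described above returns a subsection of a VCF. The same idea in pure Python over an inline toy VCF (real slicing works on indexed files via samtools/tabix, not a linear scan):</p>

```python
# Toy version of the Data Slicer: return VCF records inside a genomic region.
# The inline VCF and its records are invented; headers are kept as-is.

TOY_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT
1\t10000\trs1\tA\tG
1\t10500\trs2\tC\tT
2\t10100\trs3\tG\tA
"""

def slice_vcf(vcf_text, chrom, start, end):
    out = []
    for line in vcf_text.splitlines():
        if line.startswith("#"):
            out.append(line)          # keep header lines
            continue
        c, pos, rest = line.split("\t", 2)
        if c == chrom and start <= int(pos) <= end:
            out.append(line)
    return "\n".join(out)

print(slice_vcf(TOY_VCF, "1", 10000, 10400))  # headers plus rs1 only
```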
<h4 id="a-nameclarkeqaq-amp-aa_4">
<a name="clarke-qa">Q & A</a> <a class="head_anchor" href="#a-nameclarkeqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: 1000 genomes project - many 340bp all deletions without insertions?</strong></p>
<p><strong>A:</strong> Quality - false discovery rate <5%. Structural variants are very difficult. Weren’t sufficiently confident in structural variations that aren’t deletions - did not include in db. Structural variations will always be more limited.</p>
<p><strong>Q: Idea of a data freeze and recall - uuid, public key trust network - possible route?</strong></p>
<p><strong>A:</strong> sounds like a good idea</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namestanzionethe-iplant-collaborative-cyber_2">
<a name="stanzione">The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology</a> <a class="head_anchor" href="#a-namestanzionethe-iplant-collaborative-cyber_2">#</a>
</h2><h3 id="dan-stanzione-university-of-texas-at-austin-u_3">Dan Stanzione, University of Texas at Austin, USA <a class="head_anchor" href="#dan-stanzione-university-of-texas-at-austin-u_3">#</a>
</h3><blockquote>
<h4 id="a-namestanzioneabstractabstracta_4">
<a name="stanzione-abstract">Abstract</a>: <a class="head_anchor" href="#a-namestanzioneabstractabstracta_4">#</a>
</h4>
<p>iPlant is a new kind of virtual organization, a cyberinfrastructure (CI) collaborative created to catalyze progress in computationally-based discovery in plant biology. iPlant has created a comprehensive and widely used CI, driven by community needs, and adopted by a number of large-scale informatics projects and thousands of individual users. iPlant holds more than 1.5 petabytes of user data comprising several hundred million files today, and is thus deeply involved in the “Big Data” challenges of biologists, from storing to analyzing to sharing rapidly growing amounts of data. </p>
<p>This talk will outline the iPlant CI, and discuss what iPlant is doing today to address data challenges, as well as plans for the future. The talk will also address trends the project sees in how users are handling data, and the potential technological solution on the horizon to address them. </p>
<p>iPlant is supported by the National Science Foundation via Award #DBI-1265383. </p>
<p><a href="https://www.iplantcollaborative.org/">https://www.iplantcollaborative.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>iPlant - co-director (until 8 weeks ago). Passed co-director to <a href="http://www.iplantcollaborative.org/connect/staff-collaborators/matthew-w-vaughn-phd">Matthew W. Vaughn</a></p>
<p>What is iPlant:<br>
community-driven organization building cyberinfrastructure for the plant (and animal) sciences</p>
<h5 id="cyberinfrastructure_5">cyberinfrastructure <a class="head_anchor" href="#cyberinfrastructure_5">#</a>
</h5>
<p>combination of computing, data storage, networking and humans.<br>
to achieve some scientific goal</p>
<h5 id="iplant_5">iPlant <a class="head_anchor" href="#iplant_5">#</a>
</h5>
<ul>
<li>6th year</li>
<li>14K researchers access services or data - from ecology to epigenomics</li>
</ul>
<p>Achievements through iPlant’s open infrastructure</p>
<ul>
<li>BIEN - generate range maps for species</li>
<li>1KP project - 100M sequence reads - richer tree of plant data.
blast annotation</li>
<li>animal mandate - cattle/buffalo pipelines</li>
<li>GWAS and more</li>
</ul>
<h5 id="iplant-services_5">iPlant Services <a class="head_anchor" href="#iplant-services_5">#</a>
</h5>
<ul>
<li>Atmosphere - on demand cloud computing:
friendly front end for cloud - web interface.
pick images.
can log in via shell to image</li>
<li>iPlant data store</li>
<li>discovery environment.
rich catalog of bioinformatics machines/tools you can choose from.
put together pipelines - gui</li>
<li>iPlant APIs: embed iplant CI capabilities</li>
<li>foundation of computation by TACC </li>
<li>TACC: one of the world’s largest data providers.
provides a comprehensive cyberinfrastructure ecosystem.
not just machines, tools, apis, team</li>
</ul>
<p>Powered by iPlant </p>
<ul>
<li>build your own informatics project!</li>
<li>rPlant - r project built on iPlant</li>
<li>Araport - uses iPlant services</li>
</ul>
<p>Workflow Optimization and Consulting</p>
<ul>
<li>12 year analysis - down to 3 days on cluster, working with iPlant</li>
<li>Code optimization: PINT - code written in R, rewritten; now done in 4h</li>
</ul>
<p><strong>Democratizing access to high-throughput genome annotation</strong></p>
<h5 id="data-store_5">Data store: <a class="head_anchor" href="#data-store_5">#</a>
</h5>
<ul>
<li>federated sources iRODS (DFC) - AWS</li>
<li>geographic replication - U of Austin and TACC</li>
<li>600 TB user data and growing<br>
700 TB Galaxy<br>
200 TB special projects</li>
<li>community collections</li>
<li>100GB in 27min - UC Berkeley to UA</li>
<li>Evolving the Data Strategy: open file storage, few roles. iDS - some filetype detection, manual metadata tagging, elastic search</li>
<li>Scaling for team science: easy scaling when too large for laptop to open</li>
</ul>
<h5 id="big-data-observations_5">Big Data Observations <a class="head_anchor" href="#big-data-observations_5">#</a>
</h5>
<ul>
<li>About 5B files at TACC - 3.5x more than Jan 2013</li>
<li>We delete at least 300M files per month</li>
<li>About 30PB in use</li>
<li>file count and size increasing rapidly</li>
<li>95% of I/O operations don’t actually move data</li>
</ul>
<p>Soap Box</p>
<ul>
<li>Average practice is getting worse in data transfer, file i/o and programming</li>
<li>best practice- amazing! - 1,024 core job, generate 1PB in 2h, reanalyzed dozen times < day.
good user, know what they’re doing</li>
<li>worst practice - 128-core job - generated 80x the metadata traffic of the above job and crashed the filesystem.
moving 1PB over a 10GB/s network via http will take about 1.4 years<br>
<strong>c:</strong> f=fopen("file.txt", "w"); <em>// 3 metadata writes</em><br>
<strong>python:</strong> f=open('file.txt', 'w') <em># 17 metadata writes</em>
</li>
<li><strong>Cloud lets us do stupid things we do in software and run it on a large scale</strong></li>
</ul>
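<p>The metadata-write counts quoted above would come from a tracer like strace; the underlying batching point can be shown with a toy counter - many tiny writes vs one big one (in-memory file, no real I/O):</p>

```python
# Toy illustration of why batching I/O matters: count .write() calls on a
# wrapped in-memory file. Real metadata/syscall counts come from a tracer.
import io

class CountingFile(io.StringIO):
    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, s):
        self.write_calls += 1
        return super().write(s)

lines = [f"record {i}\n" for i in range(1000)]

naive = CountingFile()
for line in lines:             # one write call per record
    naive.write(line)

batched = CountingFile()
batched.write("".join(lines))  # a single write call, same bytes

print(naive.write_calls, batched.write_calls)  # 1000 vs 1
```

Same data on disk either way; the difference is how many operations the filesystem has to absorb, which is exactly what separates the best-practice and worst-practice jobs above.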
<p>Speed things up</p>
<ul>
<li>Technological solutions are coming that can meet demand</li>
<li>machine learning, data transfer can help speed things up. But we still need good software</li>
</ul>
<h4 id="a-namestanzioneqaq-amp-aa_4">
<a name="stanzione-qa">Q & A</a> <a class="head_anchor" href="#a-namestanzioneqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (illumina) <strong>Are there tools to analyze applications to determine their lack of efficiencies?</strong></p>
<p><strong>A:</strong> Yes, there are. Caveats: some tools - perfexpert (tooling and analysis) - low level performance tools. Not as useful with non-low level languages. Not great for python.<br>
Build job stats on system - can tell you efficiencies of your code on their system.</p>
<p><strong>Q:</strong> (Mesirov) <strong>What’s your process on who gets to use it, who doesn’t?</strong></p>
<p><strong>A:</strong> iPlant: all resources NSF funded. Some XSEDE. XRAC - any open-science-funded researcher. Must be US and published.<br>
iPlant - will open up under 10K hours. tiers on higher use, compare with other users.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namemesirovgenomespace-a-community-web-envi_2">
<a name="mesirov">GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools</a> <a class="head_anchor" href="#a-namemesirovgenomespace-a-community-web-envi_2">#</a>
</h2><h3 id="jill-p-mesirov-cio-at-broad-institute-usa_3">Jill P. Mesirov, CIO at Broad Institute, USA <a class="head_anchor" href="#jill-p-mesirov-cio-at-broad-institute-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namemesirovabstractabstracta_4">
<a name="mesirov-abstract">Abstract</a> <a class="head_anchor" href="#a-namemesirovabstractabstracta_4">#</a>
</h4>
<p>Over the last two decades genomics has accelerated at an exponential pace, driven by new sequencing and other genomic technologies, promising to transform biomedical research. These data offer a new era of potential for the understanding of the basic mechanisms of disease and identification of novel treatments. Concurrently, there has been a growing emphasis on integrating all of the available data types to better inform scientific discovery. There are now thousands of bioinformatic analysis and visualization tools for this wealth of data. To leverage these tools to make biomedical discoveries, biologists must be empowered to access them and combine them in creative ways to explore their data. However, this vision has been out of reach for almost all biomedical researchers.</p>
<p>We will describe and give example applications of GenomeSpace, <a href="http://www.genomespace.org">http://www.genomespace.org</a>, an open environment that brings together a community of 14 diverse computational genomics tools and data sources, and enables scientists to easily combine their capabilities without the need to write scripts or programs. Begun as a collaboration of six core bioinformatics tools - Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), the Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC) - the GenomeSpace community continues to grow. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analytic workflows, automatic conversion of data formats, and ease of connecting new tools to the environment.<br>
Funding provided by NHGRI and Amazon Web Services</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="a-hrefhttpwwwgenomespaceorggenomespacea-fairl_5">
<a href="http://www.genomespace.org/">GenomeSpace</a> - fairly recent project <a class="head_anchor" href="#a-hrefhttpwwwgenomespaceorggenomespacea-fairl_5">#</a>
</h5>
<p>Background</p>
<ul>
<li>accelerated rate at which biological data acquired. enabled us to do all sorts of global analysis projects</li>
<li>Swamped by development of next gen sequencing technologies</li>
<li>availability of this data has led to progress towards goals to understand disease at the molecular level and understand the genetic basis and mechanisms for disease</li>
<li>now know ~3K Mendelian disease genes; 5K loci have been associated with over 6K common diseases and traits</li>
<li>ENCODE- all functional elements of genome and dark matter</li>
<li>ICGC/TCGA tumour types</li>
</ul>
<h5 id="new-trends_5">New Trends <a class="head_anchor" href="#new-trends_5">#</a>
</h5>
<ul>
<li>cost down, methods up</li>
<li>more types of data are acquired</li>
<li>mRNA, copy number, microRNA, epigenetic methylation, RNAi.
more sensitive and less messy data</li>
<li>increase in integrative approaches.
leveraging all these kinds of data</li>
<li>more large-scale projects (x-lab, x-institution)</li>
<li>moved from single gene analysis -> pathway/network view.
how genes <em>really</em> work</li>
</ul>
<h5 id="what-do-we-need-to-take-advantage_5">What do we need to take advantage? <a class="head_anchor" href="#what-do-we-need-to-take-advantage_5">#</a>
</h5>
<p>integrate large data sets and multiple data types.<br>
data management/identification - how do I find what helps me?</p>
<p>more complex workflows and algorithms</p>
<ul>
<li>increasing computational complexity</li>
<li>compute power demands</li>
<li>need to interoperate methods and tools</li>
<li>available and accessible to biologists:
in a more friendly way.
can’t be just the computational cadre - but whole community</li>
</ul>
<p>visualize large integrated data sets:<br>
viewers, help us look at reads and see if that call makes sense</p>
<p>validate computational results</p>
<h5 id="will-focus-on-gt-more-complex-workflowsalgori_5">Will focus on -> More complex workflows/algorithms <a class="head_anchor" href="#will-focus-on-gt-more-complex-workflowsalgori_5">#</a>
</h5>
<ul>
<li>interoperate methods and tools</li>
<li>available to all</li>
</ul>
<p>Integrative genomics</p>
<ul>
<li>tremendous advances last 10 years</li>
<li>by integrating lots of different kinds of data</li>
</ul>
<p>Difficulty of getting these tools to work together - need to develop infrastructure.<br>
<strong>Challenge:</strong> flood of data & proliferation of tools</p>
<ul>
<li>tools don’t always play well together, want to use them all in one place</li>
<li>2012: 7-10K bioinformatics tools on the web.
just Broad - 60-70 tools. not counting internal tools</li>
<li>5K public databases</li>
<li>use case (breast cancer): 12 steps, 6 tools, 7 transitions
<ul>
<li>transitions -> data formats different between tools</li>
<li>how can we democratize this data analysis and bring to the rest of the community?</li>
</ul>
</li>
</ul>
<p>One monolithic tool OR a cooperative approach</p>
<ul>
<li>lightweight layer for interoperability with automatic data transfer.
lightest weight possible - do data transfer for the users</li>
<li>leverage multiple groups and existing tools</li>
<li>access to familiar tools with usual look and feel.
so users don’t have to learn how to use them again</li>
</ul>
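<p>A toy sketch of what such a lightweight interoperability layer might look like - a registry of format converters that moves data between tools without the user writing scripts. All names here are illustrative, not GenomeSpace’s actual API:</p>

```python
# Hypothetical converter registry: tools declare the formats they
# understand, and the layer converts data automatically in between.

converters = {}

def converter(src, dst):
    """Register a function that converts format `src` to `dst`."""
    def register(fn):
        converters[(src, dst)] = fn
        return fn
    return register

@converter("tsv", "csv")
def tsv_to_csv(text):
    # naive conversion, fine for illustration
    return "\n".join(line.replace("\t", ",") for line in text.splitlines())

def convert(text, src, dst):
    """Convert `text` from `src` to `dst`, if a converter is registered."""
    if src == dst:
        return text
    try:
        return converters[(src, dst)](text)
    except KeyError:
        raise ValueError(f"no converter from {src} to {dst}")

print(convert("gene\tvalue\nTP53\t3.1", "tsv", "csv"))
```

The real system also has to handle richer formats (GCT, BED, VCF, …); the point is that the layer stays as light as possible and does the transfer for the user.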
<h5 id="genomespace_5">GenomeSpace: <a class="head_anchor" href="#genomespace_5">#</a>
</h5>
<ul>
<li>shared vision of 6 bioinformatics tools.
get them to talk to each other very easily</li>
<li>have it live in the cloud - server in cloud.
talks to GS data sources or components</li>
<li>14 tools right now (4 or 5 on the way).
infrastructure at a place where the new tools were enabled in ~1 programmer day.
portals: access portals from genome space (eg IM)</li>
<li>Use GenomeSpace S3 storage or add your own Amazon account.
Dropbox can be connected.
in development: OpenStack & Google Drive</li>
</ul>
<h5 id="how-do-i-use-it_5">How do I use it? <a class="head_anchor" href="#how-do-i-use-it_5">#</a>
</h5>
<p>Go to the cookbook for: how to build a more complex analysis,<br>
how to leverage these different tools</p>
<p>GenomeSpace recipe collection</p>
<ul>
<li>summary of what the recipe does & high level steps and tools</li>
<li>summary of workflow and steps in recipe</li>
<li>video of someone going through the recipe</li>
<li>more detail on recipe - real biological use case</li>
<li>walk through a protocol of all detailed steps</li>
<li>easy to use!</li>
</ul>
<h5 id="join-the-community-a-hrefhttpwwwgenomespaceor_5">Join the community! <a href="http://www.genomespace.org/">http://www.genomespace.org/</a> <a class="head_anchor" href="#join-the-community-a-hrefhttpwwwgenomespaceor_5">#</a>
</h5>
<p>open source, on bitbucket <a href="https://bitbucket.org/genomespace/">https://bitbucket.org/genomespace/</a></p>
<h4 id="a-namemesirovqaq-amp-aa_4">
<a name="mesirov-qa">Q & A</a> <a class="head_anchor" href="#a-namemesirovqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>loved the recipes. Regular recipes still work 50 years later (broccoli doesn’t change). Bioinformatics paper 10 years ago will not work. How much time and effort is required to create a recipe in an environment where tools will be updated? Will it work in 5 years?</strong></p>
<p><strong>A:</strong> Tried to limit the scope of the recipes - not a beginning-to-end paper. More simple - just 2 or 3 tools. Committed to setting up a steering committee for the recipe collection to keep them honest.<br>
RNASeq - many are beginning to use it in their work. Yet the methods for analyzing RNASeq haven’t been settled. A challenge they recognize. Community resource - users can report when recipes aren’t working. Go to the forum.</p>
<p><strong>Q:</strong> (illumina) <strong>Data from different sources, does GenomeSpace provide info on challenges on combining different data?</strong></p>
<p><strong>A:</strong><br>
Can do: put warnings. Watch out for the follow… etc. People who develop these recipes must understand the workflow fairly well so they know the gotchas.<br>
Can’t do: cannot anticipate all the ways in which a biologist will misuse the resource.<br>
People misuse tools. Try to give enough info and warnings to keep the probability low.</p>
<p><strong>Q:</strong> followup: <strong>Account for differences in platforms?</strong></p>
<p><strong>A:</strong> Don’t have funding for all, but we do contact vendors. </p>
<p><strong>Q: Thank you for making something more user friendly!</strong></p>
<p><strong>Q: Clinical data - do you have the security to handle this?</strong></p>
<p><strong>A:</strong> Security that Amazon Cloud provides. New round of funding: agreed to put warnings for ppl who are uploading data. If you have data that needs to be kept private - can use your own Amazon S3/Dropbox.<br><br>
GenomeSpace does not do analysis - it’s on the tools.</p>
<p><strong>Q:</strong> (IBM - Royyuru) <strong>Reproducibility - read about a tool in a paper, but can’t reproduce. Can GenomeSpace add machine readable script to run the tool?</strong></p>
<p><strong>A:</strong> Can’t go into tools themselves - lightweight. Will talk offline.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nametaylorfged-the-functional-genomics-data_2">
<a name="taylor">FGED: The Functional Genomics Data Society</a> <a class="head_anchor" href="#a-nametaylorfged-the-functional-genomics-data_2">#</a>
</h2><h3 id="francis-ouellette-ontario-institute-for-cance_3">Francis Ouellette, Ontario Institute for Cancer Research, Canada <a class="head_anchor" href="#francis-ouellette-ontario-institute-for-cance_3">#</a>
</h3>
<ul>
<li>(Replaced: Ronald C. Taylor, Pacific Northwest National Laboratory, USA)</li>
</ul>
<p><em>Selected on merit - not invited talk. Ron has laryngitis - Francis Ouellette is presenting slides.</em></p>
<blockquote>
<h4 id="a-nametaylorabstractabstracta_4">
<a name="taylor-abstract">Abstract</a> <a class="head_anchor" href="#a-nametaylorabstractabstracta_4">#</a>
</h4>
<p>The Functional Genomics Data Society (FGED), founded in 1999 as the MGED Society, is a registered international society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. Our work on defining minimum information specifications for reporting data in functional genomics papers (e.g., MIAME) has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by biological research efforts in data integration and meta-analysis.</p>
<p><a href="http://fged.org/">http://fged.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Spirit of openness - share everything</p>
<h5 id="functional-genomics-data-society-amp-its-miss_5">Functional Genomics Data Society & Its Mission <a class="head_anchor" href="#functional-genomics-data-society-amp-its-miss_5">#</a>
</h5>
<p>In the beginning there were microarrays - MGED</p>
<p>MIAME - standard for exchanging raw microarray data</p>
<ul>
<li>too much to ask - researchers should publish fully documented code</li>
<li>do reviewers check these?</li>
<li>ArrayExpress and GEO have >6M high throughput assays from 30K functional genomic studies.
use MIAME, so it’s working for this group</li>
<li>Many studies have shown the reusability of these data</li>
</ul>
<p>MINSEQE - minimal standards for a nucleotide sequencing experiment.<br>
General description of the aim, metadata, raw reads, processed data</p>
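<p>A rough illustration of how a MINSEQE-style completeness check could work - the field names below are hypothetical, not the standard’s actual vocabulary:</p>

```python
# The standard asks for a general description of the aim, metadata,
# raw reads, and processed data; model those as required fields.
REQUIRED_FIELDS = {"aim", "sample_metadata", "raw_reads", "processed_data"}

def missing_fields(submission):
    """Return the required fields absent from a submission dict, sorted."""
    return sorted(REQUIRED_FIELDS - submission.keys())

sub = {"aim": "RNA-seq of treated vs. control cells", "raw_reads": "SRR000001"}
print(missing_fields(sub))  # fields still needed before the data are reusable
```

This is the kind of mechanical check a repository or reviewer tool could run, which is where "do reviewers check these?" above becomes automatable.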
<p>FGED Standards: big data needs standards, and FGED creates and aids the development of such standards</p>
<p>FGED is an open society, welcome feedback, input and volunteers</p>
<h4 id="a-nametaylorqaq-amp-aa_4">
<a name="taylor-qa">Q & A</a> <a class="head_anchor" href="#a-nametaylorqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>What is the journal policy in the continued evolution of this effort?</strong></p>
<p><strong>A:</strong> Publishers in general have great interest and support. They are looking for things like this. PLoS - new data release policy. Publishers are keen to see what the community-agreed-upon standards are.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namecarrollinsights-from-the-genomic-analys_2">
<a name="carroll">Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific Results</a> <a class="head_anchor" href="#a-namecarrollinsights-from-the-genomic-analys_2">#</a>
</h2><h3 id="andrew-carroll-dnanexus-usa_3">Andrew Carroll, DNAnexus, USA <a class="head_anchor" href="#andrew-carroll-dnanexus-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namecarrollabstractabstracta_4">
<a name="carroll-abstract">Abstract</a> <a class="head_anchor" href="#a-namecarrollabstractabstracta_4">#</a>
</h4>
<p>As one of five institutions participating in the global CHARGE Consortium, the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine needed a compute and data management infrastructure solution to handle the massive amount of data (3,751 whole genomes and 10,940 exomes) they would be processing for this project. The large burst computational demands for this project would have unacceptably taxed existing resources, requiring either many months of using spare capacity or forcing other users off the cluster for 4-5 weeks to complete it faster. To address this challenge, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra large-scale genomic analysis project quickly and flexibly, with zero capital investment. At the project’s peak, HGSC was able to spin up more than 20,000 cores on-demand in order to run the analysis pipeline of the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>DNAnexus - a 2009 spin-out from Stanford. A darling of successful startups. Applies the cloud at scale</p>
<p>Two parts:</p>
<ol>
<li>Philosophy of the Cloud</li>
<li>Application to large project (10-11K exomes)</li>
</ol>
<h5 id="what-is-dnanexus_5">What is DNAnexus <a class="head_anchor" href="#what-is-dnanexus_5">#</a>
</h5>
<ul>
<li>scalable solution deploys on AWS (Amazon Web Services) cloud</li>
<li>handles spinning up lots of nodes, sharing data across users</li>
<li>publish own tools - external or internal</li>
</ul>
<h5 id="scientific-vision_5">Scientific Vision: <a class="head_anchor" href="#scientific-vision_5">#</a>
</h5>
<p>Challenges looming over data @ scale</p>
<p>Science is like driving</p>
<ul>
<li>car = bioinfo tool</li>
<li>these come out we can do things we couldn’t do before</li>
<li>car accidents (user error, car itself)</li>
<li>improving tools is important -> need to think about the infrastructure used to make these run</li>
</ul>
<p>Tool development - profile runtimes and cost</p>
<ul>
<li>optimize for resources (cpu, memory, bandwidth)</li>
<li>now: your tools don’t work on all platforms - configuration headaches</li>
<li>cloud: configure once, run where you want it to run</li>
</ul>
<p>Benchmarking</p>
<ul>
<li>Need good benchmark sets - prevent scientific degradation (unit test).
Know that you are correct</li>
<li>drive scientific innovation</li>
<li>extend visualizations to reach to more basic biologists.
expert bioinformaticians working with basic biologists</li>
<li>deploy at scale</li>
<li>collaboration - prevent data duplication, contribution</li>
</ul>
<p>Tool Optimization</p>
<ul>
<li>resource optimization - profile through</li>
<li>DNAnexus - waterfall view of tools! see parallelization</li>
</ul>
<p>Benchmark sets</p>
<ul>
<li>compile benchmarks and tools in a single place.
can run all tools and benchmark sets.
see differences between sets</li>
<li>Configure workflow ui - run 6 variant callers and compare.<br>
visualization - how basic biologists will access the data</li>
</ul>
<p>Collaborations</p>
<ul>
<li>managing access to data - admin, viewer, collaborator (roles).
can restrict</li>
<li>delivering the data - shipping large-scale data on physical drives will always be faster and more robust than network transfer.
local sftp works for small projects.
likely true forever</li>
</ul>
<h5 id="dnanexus-hgsccharge-collaboration_5">DNAnexus - HGSC-CHARGE Collaboration <a class="head_anchor" href="#dnanexus-hgsccharge-collaboration_5">#</a>
</h5>
<p>Analysis of 11K exomes and 4K whole genomes for the CHARGE consortium.<br>
Compute scale and distribution of results across 300 investigators</p>
<p>Baylor: 20 HiSeqs ~25TB of sequence per month</p>
<ul>
<li>growth at an exponential rate</li>
<li>load on cluster - pretty much fully booked (w/ some planned down time) </li>
<li>Mercury DNAseq pipeline
<ul>
<li>BWA + GATK realign + variant calling</li>
<li>They took out the most computationally intensive parts of the pipeline and put in DNAnexus</li>
<li>10K exomes in 5 days</li>
<li>2K nodes, 3.5M cpu hours over 10 days</li>
</ul>
</li>
<li>How much more do you get as you increase the scale?
<ul>
<li>new variants as you increase the exome scale - plot sqrt(x)</li>
<li>as we continue to sequence more and more we are going to find more and more rare variants</li>
</ul>
</li>
<li>compared with variants found in the first exome, more likely to be synonymous.
variants found in the latest 5K+ - less synonymous</li>
<li>SIFT - tolerant at first, damaging later</li>
<li>Novel - exome 1, most found in dbSNP, exome 5K+ - not found in dbsnp</li>
</ul>
<h4 id="a-namecarrollqaq-amp-aa_4">
<a name="carroll-qa">Q & A</a> <a class="head_anchor" href="#a-namecarrollqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>On projects like this the first half is well structured, but gets very ad-hoc by the end. How is this structured in DNAnexus for ad-hoc queries?</strong></p>
<p><strong>A:</strong> We take advantage of the expertise of the ppl working with us. Relying on the CHARGE consortium in collaboration. Directed hypothesis generated by partners. </p>
<p><strong>Q: Can you elaborate on the datasets you’re using as benchmarks?</strong></p>
<p><strong>A:</strong> An opportunity for the community to come together - benchmarking sets are the way to go, and DNAnexus gives us an opportunity to go into this space. We are not curators of benchmark sets.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameschatzthe-next-10-years-of-quantitative_2">
<a name="schatz">The Next 10 Years of Quantitative Biology</a> <a class="head_anchor" href="#a-nameschatzthe-next-10-years-of-quantitative_2">#</a>
</h2><h3 id="michael-schatz-cold-spring-harbor-laboratory_3">Michael Schatz, Cold Spring Harbor Laboratory, USA <a class="head_anchor" href="#michael-schatz-cold-spring-harbor-laboratory_3">#</a>
</h3><blockquote class="short">
<h4 id="a-nameschatzabstractabstracta_4">
<a name="schatz-abstract">Abstract</a> <a class="head_anchor" href="#a-nameschatzabstractabstracta_4">#</a>
</h4>
<p>Topic change, no abstract </p>
</blockquote><h4 id="slides_4">Slides <a class="head_anchor" href="#slides_4">#</a>
</h4>
<p><a href="http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf">http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf</a></p>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="questions-in-biology-some-broad-some-focused_5">Questions in Biology - some broad, some focused <a class="head_anchor" href="#questions-in-biology-some-broad-some-focused_5">#</a>
</h5>
<p>The interesting thing about these questions: there is no single instrument that answers any of them</p>
<p>Answer these questions:</p>
<ul>
<li>big stack of technologies</li>
<li>raw sensors at the bottom</li>
<li>then systems, compute systems, algorithms, machine learning -> results</li>
<li>Will walk through this pyramid and see what major trends</li>
</ul>
<h5 id="bottom-tier-sensors-cost-per-genome-drives-mu_5">Bottom tier - sensors : Cost per Genome - drives much of the talks today. need scalability <a class="head_anchor" href="#bottom-tier-sensors-cost-per-genome-drives-mu_5">#</a>
</h5>
<ul>
<li>map of where the major sequencing instruments are across the planet</li>
<li>interesting thing: how widely distributed they are (not like other fields)</li>
<li>worldwide capacity exceeds 15 Pbp/year… 25 Pbp/year on Jan 15 (Illumina X10 systems announcement)</li>
<li>How much is a PB? Sequencing human genomes to 30x - 10K genomes - stacked on DVDs: 787 feet of DVDs (~1/6 of a mile tall).
Or 500 2 TB drives, ~$500K</li>
</ul>
<p>DNA data tsunami - growth of sequencing around 3x per year</p>
<ul>
<li>not too distant future: ~1 exabyte by 2018</li>
<li>~1 zettabyte by 2024.
<ul>
<li>How big? zettabyte is 1M PB</li>
<li>stack of DVDs = 10B genomes = halfway to moon</li>
<li>YouTube and astronomy datasets - roughly ~100PB today, growing exponentially</li>
</ul>
</li>
</ul>
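<p>A quick back-of-envelope check of the projections above, compounding the ~25 Pbp/year capacity (early 2014) at ~3x per year:</p>

```python
# Simple exponential extrapolation of worldwide sequencing capacity.
PB_PER_EB = 1_000        # petabytes per exabyte
PB_PER_ZB = 1_000_000    # petabytes per zettabyte

def capacity_pb(year, base_pb=25, base_year=2014, growth=3):
    """Projected capacity in PB for a given year, under 3x/year growth."""
    return base_pb * growth ** (year - base_year)

print(f"2018: ~{capacity_pb(2018) / PB_PER_EB:.1f} EB")  # on the order of an exabyte
print(f"2024: ~{capacity_pb(2024) / PB_PER_ZB:.1f} ZB")  # on the order of a zettabyte
```

So the "~1 exabyte by 2018" and "~1 zettabyte by 2024" figures in the notes fall straight out of the 3x/year growth assumption.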
<p>Sequencing Centres map - will be roughly the same</p>
<ul>
<li>see widespread network of sequencing networks across the planet</li>
<li>biological sensor network nanopore - @ewanbirney <a href="https://twitter.com/ewanbirney/status/448423540472422400">https://twitter.com/ewanbirney/status/448423540472422400</a>
mobile - can embed in many remote locations (hospitals, schools, …)</li>
<li>the rise of a digital immune system - Schatz.
<a href="http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf">http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf</a>
</li>
</ul>
<p>compression will help - need to be aggressive about throwing out data</p>
<ul>
<li>particle physics - strength here. massive amount of data produced is discarded</li>
<li>resequencing will be negligible </li>
<li>preciousness of the data/sample: cancer is the high watermark of complexity.
in principle we may want to hold on to every read</li>
</ul>
<p>major applications: </p>
<ul>
<li>human health - where $$ available</li>
<li>widespread distributed mobile sensors</li>
<li>digital immune system - constantly monitoring what’s coming up (microbes, etc)</li>
</ul>
<h5 id="next-phase-compute-algorithms_5">Next phase - compute, algorithms <a class="head_anchor" href="#next-phase-compute-algorithms_5">#</a>
</h5>
<ul>
<li>the compute will be everywhere - Cloud</li>
<li>I had the distinction of having the first paper in PubMed that ever used AWS for sequence analysis</li>
<li>will be multi-cloud - specialized for geographic or political reasons.
centric on model organism or disease of study.
makes sense to have concentrated system</li>
</ul>
<p>compute - parallel algorithm spectrum</p>
<ul>
<li>better parallelization</li>
<li>
<strong>embarrassingly parallel</strong>: problems most easy to run on cluster.
building a city? hire 100s of crews, build in parallel</li>
<li>
<strong>loosely coupled algorithms</strong>: MapReduce.
building a skyscraper - can’t build every floor at the same time.
a lot of the work is independent but then is aggregated together</li>
<li>
<strong>Tightly coupled</strong>: graphs and MD simulations.
growing one massive tree - more farmers will not help.
<em>“nine women cannot make a baby in one month”</em>
</li>
</ul>
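<p>A minimal sketch of the first two points on the spectrum - counting bases per read chunk is embarrassingly parallel (chunks are independent), and merging the partial counts is the loosely coupled MapReduce-style aggregation. Toy data, not a real pipeline:</p>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_bases(chunk):
    """'Map' step: count bases in one chunk of reads, independently."""
    return Counter("".join(chunk))

reads = ["ACGT", "AACC", "GGTT", "ACGG"]
chunks = [reads[:2], reads[2:]]  # each chunk can run on its own worker

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_bases, chunks))

# 'Reduce' step: aggregate the independent partial results.
total = sum(partials, Counter())
print(total)
```

The tightly coupled end of the spectrum (graphs, MD simulations) has no such clean split, which is why "nine women cannot make a baby in one month".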
<p>Better hardware:</p>
<ul>
<li>MUMmerGPU </li>
<li>specialized hardware (GPU)</li>
</ul>
<p>Crossbow - algorithm on map reduce</p>
<ul>
<li>using many commodity computers - run algorithm in parallel (map reduce)</li>
<li>use Bowtie and SOAPsnp</li>
<li>compelling example of cloud computing in genomics.
transfer time and cost – improving</li>
<li>challenge: requires more applications!</li>
<li>each algorithm requires customization - need skilled developers</li>
</ul>
<p>PanGenome alignment and assembly</p>
<ul>
<li>shifting to paradigm where raw input is set of complete genomes</li>
<li>emerging long read sequencing technologies</li>
<li>can assemble entire microbial / yeast genomes into complete assemblies</li>
<li>could be the case we have complete human genomes - get started now</li>
<li>start with a set of individual genomes - segments of genomes in a graph.
get context from the graph - a De Bruijn graph</li>
</ul>
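<p>A minimal De Bruijn graph construction over k-mers, as a sketch of the graph structure described above:</p>

```python
from collections import defaultdict

def de_bruijn(seqs, k):
    """Edge from each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy example: one short segment; real pan-genome graphs are built
# from many complete genomes at once.
g = de_bruijn(["ACGTACG"], k=3)
print(dict(g))
```

Each node is a (k-1)-mer shared across input genomes, which is how the graph provides context when the raw input shifts from reads to sets of complete genomes.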
<p>Will see major informatics centers organized around topics</p>
<ul>
<li>moving code to data</li>
<li>driven by parallel algorithms/hardware</li>
<li>shift to large populations</li>
<li>applications: read mapping will fade out, new problems (at population level) will replace it</li>
</ul>
<h5 id="top-of-slice-results-work-at-cshl-genetics-of_5">Top of slice: Results: work at CSHL - genetics of autism <a class="head_anchor" href="#top-of-slice-results-work-at-cshl-genetics-of_5">#</a>
</h5>
<p>Sample set: 3K families - simplex families</p>
<ul>
<li>one child has autism but rest of siblings not autistic</li>
<li>sequence exomes of all individuals across families</li>
<li>what do we observe relative to siblings/parents?</li>
<li>focus: gene-killing mutations.
loss-of-function mutations specific to autistic children, to find genes associated with the disease</li>
<li>identifying SNPs is quite mature - GATK (Broad), handles biases</li>
</ul>
<p>SCALPEL - find indels from short read sequencing data</p>
<ul>
<li>combine best of alignment and assembly</li>
<li>use standard aligner to map reads to genome.
purpose of this alignment is to localize the problem (locally, not globally - one exon/region at a time)</li>
<li>extract out reads that localize to a particular part</li>
<li>on the fly assembly with de Bruijn graph</li>
<li>find end to end haplotype paths spanning graph</li>
<li>align assembled sequence to region</li>
</ul>
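<p>A toy sketch (not SCALPEL’s actual implementation) of the “find end-to-end haplotype paths” step - enumerating simple source-to-sink paths in a small assembly graph:</p>

```python
def haplotype_paths(graph, source, sink):
    """All simple paths from source to sink in a dict-of-lists graph (DFS)."""
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))
    return paths

# Toy graph with a bubble: two branches = two candidate haplotypes
# (e.g. reference vs. an indel allele in the localized region).
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(haplotype_paths(g, "A", "D"))
```

Each spanning path is a candidate haplotype to align back against the region, which is the final step in the list above.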
<p>Experimental analysis and validation</p>
<ul>
<li>selected one deep coverage exome for deep analysis</li>
<li>GATK, SCALPEL, SOAPindel</li>
<li>99% accuracy where all overlap</li>
<li>specific to SCALPEL - 77% (more than others)</li>
</ul>
<p>de novo genetics of autism - same number of mutations as siblings</p>
<ul>
<li>but gene killers - enrichment in autistic kids</li>
<li>2:1 enrichment in nonsense</li>
<li>2:1 enrichment of frameshift</li>
<li>4:1 splice site mutations</li>
<li>correlation to age of father</li>
</ul>
<p>paper available on bioRxiv, code available on SourceForge</p>
<h4 id="potential-for-big-data_4">Potential for big data <a class="head_anchor" href="#potential-for-big-data_4">#</a>
</h4>
<ul>
<li>folks from Google: flu trends in nature - 2009</li>
<li>google searches for flu like symptoms - then outbreaks occur</li>
<li>Fallacy of big data? - They’ve gotten it wrong.
‘big data hubris’ - the assumption that big data are a substitute for traditional data collection and analysis.
pipelines are extremely important</li>
<li>risks of big data - given birthday and hometown - can predict SSN with good accuracy</li>
</ul>
<h4 id="power-from-data-aggregation-champion-ourselve_4">Power from data aggregation - champion ourselves and the future <a class="head_anchor" href="#power-from-data-aggregation-champion-ourselve_4">#</a>
</h4>
<ul>
<li>mindful of risks - over-fitting, reproducibility</li>
<li>caution is prudent</li>
<li>data aggregation isn’t going to solve everything - keep being critical - does this make sense? continuous feedback loop</li>
</ul>
<p>What is a data scientist? Many fields. To be really successful, you need strengths, experience and expertise in these fields.</p>
<h4 id="a-nameschatzqaq-amp-aa_4">
<a name="schatz-qa">Q & A</a> <a class="head_anchor" href="#a-nameschatzqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: Observation: Talking about the sequencing coming down in price - What happens when sequencing becomes so cheap and democratized that any can do this? How do we as a community get the legislature to start thinking of these privacy concern? We need to look at this data</strong></p>
<p><strong>A:</strong> No simple answer. Part of it will come through scientific discoveries - congressmen pay attention when there are big breakthroughs. Lobby - we need to talk to the rest of the world. Part of it is going to come in response to outbreaks - when data is abused. There’s already some legislation in place so you can’t get discriminated against for, say, insurance. But there are implicit discriminations. Don’t know how to fix this outside of education and reaching out to the next generation.</p>
<p><strong>Q:</strong> (Mesirov)</p>
<ol>
<li><strong>Congratulations: terrific meeting!</strong></li>
<li><strong>30+ years ago I heard Grace Murray Hopper speak - made a comment about how we are all going to be drowning in data. All kinds of data. I appreciated your comment on what we keep. Important: we have some kind of metric of utility - huge amounts of it not touched for long periods of time. Think about what happens with this data that is never used again. Otherwise we’re all going to drown</strong></li>
</ol>
<p><strong>A:</strong> The utility of data is certainly something to be considered. We’re bad at estimating it. We’re all hoarders. A system failure recently - couldn’t copy off a PB of data fast enough. Trying to assess the preciousness of data and time. Some metrics are hard to measure. I anticipate the storage vendors will get better at providing tools to assess what is on a filesystem. Tools today are crude; I hope these will improve. At the very least we can identify if there are big datasets we haven’t accessed in years</p>
<p><strong>Q:</strong> (Swedlow) <strong>At Dundee, hierarchical filesystems backed up by tape. Primary data is images and proteomics - 95% of it is not touched again 3 months later. Graph representations of sequences - we will be doing the same thing with images. Concerned with the computational cost of recalculating these graphs. How expensive will recalculation be?</strong></p>
<p><strong>A:</strong> today it’s expensive - but this is an opportunity for research. For example: at level of suffix trees - construction methods. We can dust those off and improve algorithms.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>