<h1>How to bring open source to a closed community</h1>
<p><em>Abigail Cabunoc Mayes · 2016-09-19</em></p>
<p>This is (roughly) a transcript of my talk at <a href="http://www.thestrangeloop.com/">Strange Loop</a> this year! At least, it’s what I meant to say. <a href="https://youtu.be/iR8xqEVTeMQ">Watch the video</a> for all the fun Canada facts and nervous rambling.</p>
<p><a href="https://github.com/acabunoc/open-source-strangeloop-2016">Slides</a> made using <a href="http://lab.hakim.se/reveal-js/">reveal.js</a>; screenshots captured using <a href="https://github.com/astefanutti/decktape">Decktape</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/"><img src="https://svbtleusercontent.com/vymnuvzhom1dig_small.png" alt="open-source-strangeloop-2016-001.png"></a></p>
<p>First off, I want to thank the organizers for this opportunity. Strange Loop is such an amazing conference – I can’t believe I first attended with an opportunity grant two years ago. The friendships and community I’ve built here have been amazing.</p>
<p>Let’s get started!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/1"><img src="https://svbtleusercontent.com/ybamlbl52fekqa_small.png" alt="open-source-strangeloop-2016-002.png"></a></p>
<p>Hi, I’m Abby! This is me. I work for the <a href="https://www.mozilla.org/en-US/foundation/">Mozilla Foundation</a> as Lead Developer of Open Source Engagement. This means I work with the open source projects and communities around the different programs at the Mozilla Foundation, including Open Science, Internet of Things, Women and Web Literacy, Learning, and Advocacy.</p>
<p>Also, I’m from Toronto. This is important because Toronto is great.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/2"><img src="https://svbtleusercontent.com/rhygl05m6chfzq_small.png" alt="open-source-strangeloop-2016-003.png"></a></p>
<p>A bit of history: I came to Mozilla because of the <a href="http://science.mozilla.org/">Mozilla Science Lab</a>. Before Mozilla, I was working in research labs where we were dealing with so much data and analysis. It was easy to see how the openness and collaboration available on the web could make science better.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/3"><img src="https://svbtleusercontent.com/7o0ksz7uok5mdq_small.png" alt="open-source-strangeloop-2016-004.png"></a></p>
<p>At Mozilla, our mission is to ensure the Internet is a global public resource, open and accessible to all.</p>
<p>The Science Lab is applying Mozilla’s mission to a specific community of practice. Most of the work I’m covering today was done within the Mozilla Science Lab.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/4"><img src="https://svbtleusercontent.com/8shoueaa7redfg_small.png" alt="open-source-strangeloop-2016-005.png"></a></p>
<p>So, today we’re talking about bringing open source to a closed community. Slight disclaimer: this is my story! This is not a how-to that will work for everyone.</p>
<p>The past eight years of my career, I’ve been working on open source projects for researchers and thinking of ways to bring more open source to academia. I want to share some of the lessons I’ve learned and hear from you as we start to expand to other Mozilla Foundation programs.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/5"><img src="https://svbtleusercontent.com/jmdiliyqamgaug_small.png" alt="open-source-strangeloop-2016-006.png"></a></p>
<p>My story starts with open source. I actually wrote open source code for years before I fully understood this movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/6"><img src="https://svbtleusercontent.com/5ivdyt51hrasfa_small.png" alt="open-source-strangeloop-2016-007.png"></a></p>
<p>I find it’s helpful to look at the origins of terms to give some cultural context around what they meant at the time.</p>
<p>‘Open source’ is interesting because the free software movement predates this term by over a decade.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/7"><img src="https://svbtleusercontent.com/dle7jxeosk9w_small.png" alt="open-source-strangeloop-2016-008.png"></a></p>
<p>In 1997, Eric Raymond published an essay on the state of free software at the time, <a href="http://www.catb.org/esr/writings/cathedral-bazaar/">“The Cathedral and the Bazaar”</a>. He saw two types of free software:</p>
<ul>
<li>The Cathedral is a public space where anyone is welcome to attend a service, but the experience is put on by a small group of people in charge. They decide what happens and when. This is like a development team working on software among their trusted group, then releasing a new version to the public.</li>
<li>The Bazaar is an open space where people come along, set up tables, and start bartering and selling whatever goods they have. Anyone can come and shape the experience in this space. Raymond saw this happening in Linux at the time: a diverse group full of differing agendas that was able to work together to build a stable system.</li>
</ul>
<p>Also, I can’t be sure, but it looks like there might be a fire in this Bazaar. Metaphor for open source? :)</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/8"><img src="https://svbtleusercontent.com/fenbyji6xiqcjg_small.png" alt="open-source-strangeloop-2016-009.png"></a></p>
<p>This essay inspired the Netscape Corporation to <a href="https://www.mozilla.org/en-US/about/history/details/">release the Netscape browser suite as free software</a> the following year. This became the basis of the Mozilla Project and inspired the term <a href="http://opensource.org/history">open source</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/9"><img src="https://svbtleusercontent.com/jzitdn48eqdmpg_small.png" alt="open-source-strangeloop-2016-010.png"></a></p>
<p>I don’t know if you all remember the early 2000s, but there were no browser wars then – Internet Explorer was everywhere. The fact that a group of passionate open source contributors were able to come together and build Firefox, the browser that toppled the giant, was really amazing.</p>
<p><a href="https://svbtleusercontent.com/kxuge9fw3fiba.gif"><img src="https://svbtleusercontent.com/kxuge9fw3fiba_small.gif" alt="glee.gif"></a></p>
<p>At the heart of open source is the idea that a diverse group working on a problem is better.</p>
<p>But how do we get there? How do we work in a way that brings this diverse group together in the first place?</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/10"><img src="https://svbtleusercontent.com/sqthi3c2etena_small.png" alt="open-source-strangeloop-2016-012.png"></a></p>
<p>At Mozilla, we call this <a href="https://wiki.mozilla.org/Working_open">“Working Open”</a>, being public and participatory. This requires structuring efforts so that “outsiders” can meaningfully participate and become “insiders” as appropriate.</p>
<p>For me, this way of thinking helped me understand what open source should look like in our day-to-day work.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/12"><img src="https://svbtleusercontent.com/kd3tn4hhzygv3a_small.png" alt="open-source-strangeloop-2016-013.png"></a></p>
<p>For the official definition of open source, the <a href="https://opensource.org/">Open Source Initiative</a> has <a href="https://opensource.org/osd">ten points</a> outlining what exactly open source software is. This comprehensive definition, along with the OSI’s stewardship, has helped the open source movement stay strong today.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/13"><img src="https://svbtleusercontent.com/bjzywpoxp5hoig_small.png" alt="open-source-strangeloop-2016-014.png"></a></p>
<p>The next part of my story is Science! I worked in research labs writing scientific software for most of my career.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/14"><img src="https://svbtleusercontent.com/jabwsdm1iyzjw_small.png" alt="open-source-strangeloop-2016-015.png"></a></p>
<p>Sometimes, trying to participate in research can feel like this. As soon as I left academia I lost access to most published research in academic journals. Even within academia, institutions can feel like ivory towers where only the invited few can participate.</p>
<p>These drawings are by John McKiernan for <a href="http://whyopenresearch.org/">“Why Open Research?”</a></p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/15"><img src="https://svbtleusercontent.com/k8qal8q67rs4ea_small.png" alt="open-source-strangeloop-2016-016.png"></a></p>
<p>On the other side of the wall, there can be a lot of fear around getting scooped or someone stealing your data. This stems from a lack of knowledge around open licensing options.</p>
<p>Contrasted with my experience in open source, this helped me see that on both sides of the wall, there’s a need for culture change if academia is going to work openly.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/16"><img src="https://svbtleusercontent.com/vl2wvojctmybg_small.png" alt="open-source-strangeloop-2016-017.png"></a></p>
<p>One of the first projects I worked on when I joined Mozilla Science was <a href="https://science.mozilla.org/projects/">Collaborate</a>, a collection of open source software for scientists. This was a great way to highlight some of the work going on in this community, but after watching these projects for awhile, I learned that researchers weren’t very good at open source.</p>
<p>In general, the projects weren’t as welcoming as they could be. Sometimes, requests from potential contributors for more information would be ignored for weeks. This list of projects still exists today and helps the open science space tremendously, but we thought we could make it better.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/17"><img src="https://svbtleusercontent.com/qbantmxqdfknua_small.png" alt="open-source-strangeloop-2016-018.png"></a></p>
<p>This brings us to the final part of my story (and most of this talk): Fueling the movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/18"><img src="https://svbtleusercontent.com/xsoxgokosbpngq_small.png" alt="open-source-strangeloop-2016-019.png"></a></p>
<p>I couldn’t find a definition of ‘movement’ that I liked, so I defined it here as “mobilizing a community around a shared purpose”.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/fW8amMCVAJQ"></iframe>
<p>One of my favourite visual representations of a movement is this clip of a dancing guy from <a href="https://www.youtube.com/watch?v=fW8amMCVAJQ">Derek Sivers’ TED talk</a> on leadership and movements.</p>
<p>One guy dancing enthusiastically slowly mobilizes those around him. Once you hit critical mass, you have a movement! You can watch him change the culture in just a few minutes.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/19"><img src="https://svbtleusercontent.com/52g3gpmhznohdw_small.png" alt="open-source-strangeloop-2016-021.png"></a></p>
<p>This is a figure from <a href="http://marshallganz.com/publications/">Marshall Ganz</a>’s essay <a href="http://marshallganz.usmblogs.com/files/2012/08/Public-Narrative-Collective-Action-and-Power.pdf">“Public narrative, collective action, and power”</a>. A key part of a movement is mobilizing people to action. This diagram shows how we need both the strategy and narrative (head + heart) to take action.</p>
<p>Working with researchers, many of them want to be working open and collaborating more – you can see how many open-source-for-science projects wanted to be listed in ‘Collaborate’. However, there’s a lack of knowledge or strategy around <em>how</em> to do this effectively. This is when we realized we needed to create resources outlining the steps involved in running an open source project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/21"><img src="https://svbtleusercontent.com/sl7bmamxqwi6g_small.png" alt="open-source-strangeloop-2016-022.png"></a></p>
<p>So we started to think about how we can best fuel the open source movement within academia. I think we can summarize it in these three steps:</p>
<ol>
<li>Resources: Creating the resources needed to mobilize others</li>
<li>Leaders: Selecting leaders in our community and using the resources we created to mobilize them.</li>
<li>Mentorship: Helping our leaders mobilize others through mentorship.</li>
</ol>
<p>First up, resources!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/22"><img src="https://svbtleusercontent.com/acymkas4ouffug_small.png" alt="open-source-strangeloop-2016-023.png"></a></p>
<p>To create resources, we did an exercise focusing on the “Working Open” aspect of open source. How do outsiders become insiders on our projects? We’re going to do this exercise now as the audience participation section of the talk!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/23"><img src="https://svbtleusercontent.com/oe13gftyhfafw_small.png" alt="open-source-strangeloop-2016-024.png"></a></p>
<p>Think of a place you felt welcomed the first time you visited. This can be in person or online. I’ll give you a minute to think of a place in your head.</p>
<p>Okay, what places did you think of?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li><em>Strange Loop</em></li>
<li><em>Canada</em></li>
<li><em>College</em></li>
<li><em>Niagara Falls</em></li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/24"><img src="https://svbtleusercontent.com/ml6dhkfw13l5a_small.png" alt="open-source-strangeloop-2016-025.png"></a></p>
<p>Now, what made it welcoming?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li>
<em>Strange Loop</em>
<ul>
<li><em>Everyone is friendly and wants to know where you’re from and what you do. >> friendly, human welcome</em></li>
<li><em>Food and snacks. >> takes care of our needs</em></li>
<li><em>Smaller Preconf events >> makes it easy to find connections</em></li>
<li><em>Opportunity grants >> makes it easy to get involved</em></li>
</ul>
</li>
<li>
<em>College</em>
<ul>
<li><em>Orientation week >> orient people to their new environment, show them where they can get involved and make friends</em></li>
</ul>
</li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/25"><img src="https://svbtleusercontent.com/rhg4jie624nda_small.png" alt="open-source-strangeloop-2016-026.png"></a></p>
<p>How can we apply these to software projects?</p>
<p><em>Some of the answers</em>:</p>
<ul>
<li>
<em>friendly, human welcome</em>
<ul>
<li><em>say hi and welcome new people in chat, mailing list, etc</em></li>
</ul>
</li>
<li>
<em>take care of our needs</em>
<ul>
<li><em>clear installation instructions, contributing guidelines</em></li>
</ul>
</li>
<li>
<em>make it easy to get involved</em>
<ul>
<li><em>good README, starter issues for new people</em></li>
</ul>
</li>
</ul>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/26"><img src="https://svbtleusercontent.com/pmtqcqwjp4ifw_small.png" alt="open-source-strangeloop-2016-027.png"></a></p>
<p>We went through this exercise and came up with a bunch of ways to make open source projects more welcoming, distilled them into these seven points, and put together a handout for each.</p>
<p>I think we came up with a lot of these in the exercise we just did!</p>
<ol>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/github_for_collaboration/">Public repository</a>: make sure your code, history, and discussion are public and available on the web.</li>
<li>
<a href="http://choosealicense.com/">Open license</a>: this goes back to the official Open Source definition. Make sure your code is licensed in a way that others can legally contribute and remix your work.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/writing_readme/">README</a>: Especially with GitHub, this is often people’s first introduction to a project. Be welcoming!</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">Roadmap</a>: At the very least, break down what you plan to do into issues. This way people know how they can get involved and what work you’re looking for.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/code_of_conduct/">Code of Conduct</a>: Collaboration is hard and collaboration with a diverse group can be messy. A code of conduct is a good step towards making people feel safe and outlining the behaviour expected in the group.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/contributing/">CONTRIBUTING.md</a>: This is another file that has become more important because of the GitHub experience. Your contributing guidelines can outline how a new contributor can participate in your community.</li>
<li>
<a href="http://mozillascience.github.io/working-open-workshop/">Mentorship</a>: This is a larger topic that covers both the attitude and strategy needed to make something welcoming and fuel a movement.</li>
</ol>
<p>I’ll be sharing more about each of these steps later in the talk.</p>
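<p>As a concrete sketch, the file-level items on the checklist (open license, README, code of conduct, contributing guidelines) can be scaffolded in a few lines. This is an illustration only – the project name and all file contents here are placeholders, not official templates:</p>

```python
from pathlib import Path

# Hypothetical project name; the file names follow the checklist's conventions.
root = Path("my-open-project")
root.mkdir(exist_ok=True)

# 2. Open license: paste the full text chosen at choosealicense.com (e.g. MIT).
(root / "LICENSE").write_text("MIT License\n(full text from choosealicense.com)\n")

# 3. README: often a newcomer's first introduction to the project -- be welcoming.
(root / "README.md").write_text(
    "# my-open-project\n\n"
    "One sentence on what this project does and who it is for.\n\n"
    "## Getting started\n\nClear installation instructions go here.\n"
)

# 5. Code of Conduct: outline the behaviour expected in the group.
(root / "CODE_OF_CONDUCT.md").write_text("Expected behaviour, and how to report problems.\n")

# 6. Contributing guidelines: how a new contributor can participate.
(root / "CONTRIBUTING.md").write_text("How to set up, find starter issues, and open a pull request.\n")

print(sorted(p.name for p in root.iterdir()))
# -> ['CODE_OF_CONDUCT.md', 'CONTRIBUTING.md', 'LICENSE', 'README.md']
```

<p>The remaining points (public repository, roadmap, mentorship) are about where this folder lives and how you run it, not files you can generate.</p>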
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/27"><img src="https://svbtleusercontent.com/8nmagxqybcrv0q_small.png" alt="open-source-strangeloop-2016-028.png"></a></p>
<p>Next part of ‘Fueling the Movement’ is investing and mobilizing leaders. We can use the resources we just created to mobilize some of our more involved community members.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/28"><img src="https://svbtleusercontent.com/cdfjwu7oxv6zjq_small.png" alt="open-source-strangeloop-2016-029.png"></a></p>
<p>We did this at our first <a href="http://mozillascience.github.io/working-open-workshop/">Working Open Workshop</a> in February in Berlin. We brought together some of our existing project leads and more active community members. This was a group of people passionate about what we’re doing and eager to learn skills that would help their work be more open.</p>
<p>We put on a two day workshop going over most of the lessons from the Open Source Checklist. We built in lots of time for group work where participants could start applying the lessons they’ve learned to their open source projects.</p>
<p>This was a great start, but we wanted to keep up momentum after the workshop. We’ve all done weekend courses and workshops where we leave with the best intentions, but then life gets in the way and we forget. To combat this, we offered 1:1 mentorship after the workshop.</p>
<p>We planned this workshop to happen three months before our <a href="https://science.mozilla.org/programs/events/global-sprint-2016">Global Sprint</a>, a two day hackathon on open source and open data projects. The 1:1 mentorship would occur over the three months preparing the projects for the Sprint.</p>
<p>Now we’re going to draw out the movement in action! </p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/29"><img src="https://svbtleusercontent.com/rpfhknxuqhe8w_small.png" alt="open-source-strangeloop-2016-030.png"></a></p>
<p>We start here with Abby (that’s me!) and <a href="https://twitter.com/auremoser">Aurelia</a>, Community Lead for the Mozilla Science Lab. Aurelia is also a strong open source developer in her own right. The two of us decided to offer mentorship to all Working Open Workshop (WOW) participants.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/30"><img src="https://svbtleusercontent.com/qncfoofadl41aq_small.png" alt="open-source-strangeloop-2016-032.png"></a></p>
<p>27 people attended WOW.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/32"><img src="https://svbtleusercontent.com/hxp82y9pelfg_small.png" alt="open-source-strangeloop-2016-034.png"></a></p>
<p>25 of them signed up for 1:1 mentorship. We called this group the Open Leadership Cohort (OLC).</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/34"><img src="https://svbtleusercontent.com/inn2scpgrgilzq_small.png" alt="open-source-strangeloop-2016-035.png"></a></p>
<p>We met with each project every two weeks for a quick 30-minute check-in.</p>
<p>We started our mentorship meetings by setting goals. WOW was fresh in their minds! We helped set goals around:</p>
<ul>
<li>Their community: what do they want their contributor base / user base to look like?</li>
<li>Their product: Will they ship a new feature or release an MVP at the sprint?</li>
</ul>
<p>Then, we set a loose plan around how to accomplish this over three months. This set us up to be able to do lightweight check-ins every two weeks to see how things are going and where we need to troubleshoot.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/35"><img src="https://svbtleusercontent.com/cnxgxrzjdo1oba_small.png" alt="open-source-strangeloop-2016-038.png"></a></p>
<p>As soon as we started, 8 new people were added to the program since many projects had co-leads who wanted to join in.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/38"><img src="https://svbtleusercontent.com/fsbgaaqbth5dw_small.png" alt="open-source-strangeloop-2016-040.png"></a></p>
<p>The yellow nodes are all the people that made significant contributions to mentored projects at the Global Sprint at the end of this round of mentorship. The contributions were significant enough that the project lead decided to give them a shout-out on the <a href="https://science.mozilla.org/programs/events/project-call-june-23">Mozilla Science Project Call</a>.</p>
<p>It was great to see the project leads start to engage and mentor new contributors on their projects.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/40"><img src="https://svbtleusercontent.com/wj3jrfwf45ndgq_small.png" alt="open-source-strangeloop-2016-041.png"></a></p>
<p>For a bit more background on the Global Sprint, here’s a picture from our 2015 Global Sprint. This year, we had 40 sites around the world all hacking from 9-5 in their time zones.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/41"><img src="https://svbtleusercontent.com/3o8ekis13paqg_small.png" alt="open-source-strangeloop-2016-042.png"></a></p>
<p>We saw a massive increase in participation through GitHub activity this year. I think this is directly linked to the resources we made on working openly, which we offered to all participating projects.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/42"><img src="https://svbtleusercontent.com/pnkknjd9idvntq_small.png" alt="open-source-strangeloop-2016-045.png"></a></p>
<p>Now that we’d mobilized the leaders, we wanted to work with them as they mobilized others. We did this through more mentorship.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/45"><img src="https://svbtleusercontent.com/jnxavbxwwzymg_small.png" alt="open-source-strangeloop-2016-047.png"></a></p>
<p>We selected a few of the people we mentored to become mentors in round 2. We intentionally kept the group of mentors small as we tested this out.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/47"><img src="https://svbtleusercontent.com/8ltufo1ensoq_small.png" alt="open-source-strangeloop-2016-049.png"></a></p>
<p>We wanted to test out this type of mentorship around open source in other programs. We asked each program to nominate a few community members for mentorship. We have participants from Open Science, Internet of Things, Internet Policy &amp; Advocacy, and more. We paired each mentor with 1-2 participants.</p>
<p>This round of the program started mid-August and is running till the <a href="http://mozillafestival.org/">Mozilla Festival</a> (MozFest), Oct 28-30 in London UK. MozFest is the world’s leading event for and by the open Internet movement. All the participants and mentors in the program will be running sessions at MozFest – we’re using this program to help prepare their projects for the festival.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/49"><img src="https://svbtleusercontent.com/vcp1iuw9fvdmxg_small.png" alt="open-source-strangeloop-2016-051.png"></a></p>
<p>Now we’re going to look at a few stories and lessons we’ve learned going through this experience.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/51"><img src="https://svbtleusercontent.com/rpnirpwa962nra_small.png" alt="open-source-strangeloop-2016-052.png"></a></p>
<p>I’m going to go through each lesson from the Open Source Checklist and tell you a story about how that lesson affected someone in the mentorship program.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/52"><img src="https://svbtleusercontent.com/d9hbqurvvv4vq_small.png" alt="open-source-strangeloop-2016-053.png"></a></p>
<p>First up is having a Public Repository and looking at Achintya’s story.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/53"><img src="https://svbtleusercontent.com/gga021dam5gptg_small.png" alt="open-source-strangeloop-2016-054.png"></a></p>
<p><a href="http://twitter.com/raoofphysics">Achintya</a> is a science communicator at <a href="https://home.cern/">CERN</a> and a PhD student in scicomm at <a href="http://www.uwe.ac.uk/">UWE Bristol</a>. We’re going to talk about how GitHub usage helped him centralize and organize efforts around his project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/54"><img src="https://svbtleusercontent.com/v3myg7lp9sodfg_small.png" alt="open-source-strangeloop-2016-055.png"></a></p>
<p>Achintya has an interesting project, <a href="https://opencosmics.github.io/">Open Cosmics: Cosmic-ray physics for everyone!</a></p>
<p>For a bit of science background: cosmic rays are high-energy particles that bombard the earth’s atmosphere. This produces showers of particles that we can detect on the earth’s surface. You can even detect these particles with your phone by installing <a href="https://crayfis.io/">CRAYFIS</a>. You can also get a pocket sized detector from <a href="http://cosmicpi.org/">Cosmic Pi</a>.</p>
<p>The problem that Achintya is tackling is that there are all sorts of ways to measure cosmic-rays, but each project stores the data in different formats. Achintya’s project, Open Cosmics, attempts to bring together all these efforts and help with interoperability and data standards.</p>
<p>You may have noticed in the “movement graph” that Achintya brought three additional project leads onto this project. He was in a unique position where he acted as a facilitator between all the projects collecting cosmic-ray data.</p>
<p>At the end of our first round of mentorship, when we asked Achintya what he found most helpful, he said it was learning how to use GitHub for project management. GitHub gave his community a central place to communicate and the tools he needed to organize and discuss work.</p>
<p>Now, Achintya is mentoring two other projects!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/55"><img src="https://svbtleusercontent.com/jk24supopxmrwa_small.png" alt="open-source-strangeloop-2016-056.png"></a></p>
<p>So, make sure your code is available! At the Mozilla Foundation we rely a lot on <a href="http://github.com/">GitHub</a> and have produced some training on <a href="http://mozillascience.github.io/working-open-workshop/github_for_collaboration/">GitHub for collaboration</a>. But there are many <a href="https://bitbucket.org/">other</a> <a href="http://gitlab.com/">services</a> you can use for your public repository.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/56"><img src="https://svbtleusercontent.com/9fruvayc0g6vtw_small.png" alt="open-source-strangeloop-2016-057.png"></a></p>
<p>Next, we’re going to look at having an open license and how that helped Rob.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/57"><img src="https://svbtleusercontent.com/aahziwo9y2ryjq_small.png" alt="open-source-strangeloop-2016-058.png"></a></p>
<p>This is <a href="https://twitter.com/robertjsullivan">Rob</a>! He was fairly new to open source when he joined us.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/58"><img src="https://svbtleusercontent.com/moxunedclgxlcw_small.png" alt="open-source-strangeloop-2016-059.png"></a></p>
<p>This is a blurry Rob at our Working Open Workshop. We’re all doing the <a href="https://github.com/mozillascience/working-open-workshop/issues/42">‘Open Web Stretch’</a> here. I believe they’re all “leaning left to avoid the NSA”.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/59"><img src="https://svbtleusercontent.com/canqro2nxr0na_small.png" alt="open-source-strangeloop-2016-060.png"></a></p>
<p>Rob’s project was creating a tool built around <a href="http://www.ncbi.nlm.nih.gov/pmc/">PubMed Central</a>, a repository for life science and biomedical research. He created <a href="http://pmc-ref.herokuapp.com/">PMC-ref</a>, a tool where you input a paper, then it checks which references in the paper are free to read.</p>
<p>It’s a pretty simple tool that can have a huge impact for a life sciences researcher. Especially if they don’t have access to all the big journals.</p>
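<p>At its core, a tool like PMC-ref is a filter over a paper’s reference list. Here’s a minimal sketch of that idea, assuming each reference record already carries a flag saying whether it’s deposited in PubMed Central – the real tool looks this up against PubMed Central itself, and the function and field names here are hypothetical:</p>

```python
def free_to_read(references):
    """Return only the references readable without a journal subscription."""
    return [ref for ref in references if ref.get("in_pmc")]

# Toy reference list for a single paper; titles and flags are made up.
paper_refs = [
    {"title": "Gene expression atlas", "in_pmc": True},
    {"title": "Proprietary assay methods", "in_pmc": False},
    {"title": "Open neuroimaging pipeline", "in_pmc": True},
]

for ref in free_to_read(paper_refs):
    print(ref["title"])
```

<p>The hard part PMC-ref actually solves is populating that flag for real papers; the filtering itself is this simple.</p>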
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/60"><img src="https://svbtleusercontent.com/0om2wfgf485ig_small.png" alt="open-source-strangeloop-2016-061.png"></a></p>
<p>I paired Rob with this lesson because, going through his GitHub repo, I saw that he added an open license just days after the Working Open Workshop. Yay MIT license!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/61"><img src="https://svbtleusercontent.com/f1q6z0dxugbsvq_small.png" alt="open-source-strangeloop-2016-062.png"></a></p>
<p>If you look at the yellow dot linked to him, Rob received his first open source contribution ever during the Global Sprint! The contributor, Deborah, actually wrote <a href="http://deborah-digges.github.io/2016/06/04/Mozilla-Science-Lab-Global-Sprint-16/">a blog post</a> about her experience at the Global Sprint and contributing to this project. The fact that he had an open license made this possible and legal.</p>
<p>Rob is now mentoring Minn. Minn is running an interesting <a href="https://github.com/MozillaFoundation/mozfest-program-2016/issues/635">session at MozFest</a> around facial recognition to create art and generate metadata.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/62"><img src="https://svbtleusercontent.com/3kltoeppaqwkhq_small.png" alt="open-source-strangeloop-2016-063.png"></a></p>
<p><a href="http://choosealicense.com/">choosealicense.com</a> is a great resource for picking an open license for your software. For something easy, Mozilla Science recommends <a href="http://choosealicense.com/licenses/mit/">MIT</a> or <a href="http://choosealicense.com/licenses/bsd-2-clause/">BSD</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/63"><img src="https://svbtleusercontent.com/zquxuofccz4ypa_small.png" alt="open-source-strangeloop-2016-064.png"></a></p>
<p>Next we have <a href="http://twitter.com/kirstie_j">Kirstie</a> who really embraced writing a great README and having welcoming project communication.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/64"><img src="https://svbtleusercontent.com/yo0p6ljgt0j1bq_small.png" alt="open-source-strangeloop-2016-065.png"></a></p>
<p>This is Kirstie! She’s a postdoctoral researcher in the <a href="http://www.bmu.psychiatry.cam.ac.uk/">Brain Mapping Unit</a> at the <a href="http://www.cam.ac.uk/">University of Cambridge</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/65"><img src="https://svbtleusercontent.com/oftbvcdmeaf8ww_small.png" alt="open-source-strangeloop-2016-066.png"></a></p>
<p>We recently announced that Kirstie is one of the new <a href="https://science.mozilla.org/programs/fellowships">Mozilla Fellows for Science</a> this year! Mozilla Science has a fellowship program for researchers who want to influence the future of open science and data sharing within their communities. Fellows spend 10 months as community catalysts at their institutions, building lasting change in the global open science community.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/66"><img src="https://svbtleusercontent.com/j4o63kdpe4jtw_small.png" alt="open-source-strangeloop-2016-067.png"></a></p>
<p>During the first mentorship round, Kirstie worked on her project <a href="http://stemmrolemodels.com/">STEMM Role Models</a>, which aims to inspire future generations by providing exciting and diverse speakers for conferences. She built a simple database of great speakers for conference organizers to use when planning an event.</p>
<p>Kirstie took to heart the idea that to make our projects as welcoming as possible, we need to have clear and friendly communication. Even here, on her draft landing page, she makes a real effort to welcome everyone at the top.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/67"><img src="https://svbtleusercontent.com/tfevokpgie6wfa_small.png" alt="open-source-strangeloop-2016-068.png"></a></p>
<p>Looking back at the mentorship graph, Kirstie did such a great job explaining her project that she was able to engage a couple of contributors who did significant work building an MVP (minimum viable product). Kirstie has a background in neuroscience (not web development!), so watching her bring technologists and designers together to build something she is passionate about was really inspiring!</p>
<p>Now, as Kirstie begins her fellowship, she’s mentoring two projects including a group from the Detroit Community Technology Project. They’re <a href="https://github.com/MozillaFoundation/mozfest-program-2016/issues/669">addressing gentrification through storytelling</a> technology and plan to have a booth at MozFest.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/68"><img src="https://svbtleusercontent.com/qhl28xdhy1urg_small.png" alt="open-source-strangeloop-2016-069.png"></a></p>
<p>We have a few resources designed to help you write a good README and communicate your project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/69"><img src="https://svbtleusercontent.com/e3pibjrdpp9kpq_small.png" alt="open-source-strangeloop-2016-070.png"></a></p>
<p>First is the <a href="http://mozillascience.github.io/working-open-workshop/writing_readme/">Open Project Communication handout</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/70"><img src="https://svbtleusercontent.com/w8d1urgfbrxfg_small.png" alt="open-source-strangeloop-2016-071.png"></a></p>
<p>In the handout, we include the <a href="http://acabunoc.github.io/open-canvas">Open Canvas</a>, a tool I find very helpful when starting an open source project. Open Canvas is remixed from <a href="https://leanstack.com/lean-canvas/">Lean Canvas</a>, a popular tool from the startup world that helps you make a one page business plan.</p>
<p>I worked with <a href="https://twitter.com/jordanmayes">Jordan Mayes</a> from <a href="https://tophat.com/">Top Hat</a>, to remix this for open source projects. We removed some boxes that didn’t apply and added more thinking around community and contributors. You can read more about the process of creating Open Canvas in his <a href="https://medium.com/@jordanmayes/open-canvas-d6b2d346491c#.u60a10io8">blog post</a>.</p>
<p>The Canvas forces you to think through the problems you’re addressing and your proposed solution. We divide the canvas into two main sections, Product and Community, to get people to think about their community, what they’re building, and how others will get involved.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/71"><img src="https://svbtleusercontent.com/sqtqwbjorxxixw_small.png" alt="open-source-strangeloop-2016-072.png"></a></p>
<p>Next in our checklist is writing a roadmap, featuring Bastian.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/72"><img src="https://svbtleusercontent.com/wpvydksorfx2eq_small.png" alt="open-source-strangeloop-2016-073.png"></a></p>
<p>Here’s Bastian!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/73"><img src="https://svbtleusercontent.com/gm6mhxsylf9amg_small.png" alt="open-source-strangeloop-2016-074.png"></a></p>
<p>When I first introduced Bastian to his new mentee over email, the mentee replied with “Thanks for introducing the Mark Zuckerberg of open-source genetics! What a great mentor to have!” and linked to <a href="http://fusion.net/story/47945/this-guy-is-the-mark-zuckerberg-of-open-source-genetics/">this article</a>. I had no idea this article existed! But I am not surprised, considering Bastian’s project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/74"><img src="https://svbtleusercontent.com/oq66cqlwu1yw_small.png" alt="open-source-strangeloop-2016-075.png"></a></p>
<p>Bastian is a PhD student in bioinformatics, and was working on <a href="http://opensnp.org/">openSNP</a> (pronounced open snip). SNP stands for Single Nucleotide Polymorphism, a type of mutation that can occur in your DNA. openSNP lets you upload your <a href="https://www.23andme.com">23andMe</a> (or any other genotyping service) results online. You can learn more about your results, find others with similar genetic variations, and help scientists discover more genetic associations.</p>
<p>When Bastian first uploaded his genetic data to GitHub (before he made openSNP), he received an email from someone who found the data online and analyzed his genetic report. The analysis said he might have an increased risk of prostate cancer. Since this type of mutation is inherited, he told his dad to go to the doctor. They found a tumour growing in his dad’s prostate, but they were able to catch it early. His dad is alive and well today.</p>
<p>openSNP benefited greatly from going through the Roadmapping exercise! I worked with Bastian and Philipp (co-lead on the project) to plan out a few features and fixes that needed to be done. This helped them identify the need for new volunteers and shape a few projects they could submit to <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> (GSoC).</p>
<p>By making a roadmap, they were able to accomplish a tremendous amount in a few short months. You can read about their GSoC experience on the <a href="https://opensnp.wordpress.com/2016/08/24/google-summer-of-code-wrap-up/">openSNP blog</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/75"><img src="https://svbtleusercontent.com/eshh9tlxvgqk7a_small.png" alt="open-source-strangeloop-2016-076.png"></a></p>
<p>We have a <a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">couple of exercises</a> you can go through to write a roadmap for your project. Writing down what you plan to work on helps new contributors know where they can get involved.</p>
<p>A roadmap can be anything from a simple collection of issues in your issue tracker to a comprehensive wiki outlining the future of your project.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/76"><img src="https://svbtleusercontent.com/ckvp1iofzkmspw_small.png" alt="open-source-strangeloop-2016-077.png"></a></p>
<p><a href="http://mozillascience.github.io/working-open-workshop/roadmapping/">This handout</a> walks you through picking a few milestones and breaking down the tasks needed to get there.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/77"><img src="https://svbtleusercontent.com/fauvnq2j8t85wq_small.png" alt="open-source-strangeloop-2016-078.png"></a></p>
<p>Next up is codes of conduct with Richard.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/78"><img src="https://svbtleusercontent.com/n6kbfn2cgr9qlg_small.png" alt="open-source-strangeloop-2016-079.png"></a></p>
<p>You might notice from the graph that Richard wasn’t part of the first round of mentorship. Richard was actually a 2015 Mozilla Fellow for Science. He did some amazing open source work during his fellowship year, so I thought he would be a great mentor.</p>
<p>Notice the moss beard in his avatar.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/79"><img src="https://svbtleusercontent.com/fma71elxkzipw_small.png" alt="open-source-strangeloop-2016-080.png"></a></p>
<p>Sadly, he doesn’t walk around with a moss beard in real life. This is a picture of Richard and his partner Steph at MozFest 2015. MozFest is so awesome that we had capes, buttons, <em>and</em> fox masks. You should come.</p>
<p>I listed Richard under codes of conduct since he has an incredibly thoughtful approach to writing documentation for communities.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/80"><img src="https://svbtleusercontent.com/vobdpyy9sco2yw_small.png" alt="open-source-strangeloop-2016-081.png"></a></p>
<p>You can see the code of conduct Richard wrote in the last link, <a href="http://www.slidewinder.io/docs/01_code_of_conduct.html">Slidewinder Code of Conduct</a>.</p>
<p>In this particular code of conduct, he has a section called “Open [Source/Culture/Tech] Citizenship” that outlines the goals of having an open culture and encourages others to reward welcoming behaviour. I think this is incredibly important as we’re trying to build welcoming communities.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/81"><img src="https://svbtleusercontent.com/p1lfxdpyszvotq_small.png" alt="open-source-strangeloop-2016-082.png"></a></p>
<p>If you get stuck, <a href="https://github.com/mozillascience/code_of_conduct">Mozilla Science has a CC0 code of conduct</a> you’re free to take, remix, and use however you like!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/82"><img src="https://svbtleusercontent.com/worg8zpjxhortg_small.png" alt="open-source-strangeloop-2016-083.png"></a></p>
<p>Next is Contributor guidelines and Tim.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/83"><img src="https://svbtleusercontent.com/cfiqxvls731o1q_small.png" alt="open-source-strangeloop-2016-084.png"></a></p>
<p><a href="http://twitter.com/betatim">Tim</a> was a physicist at CERN when we started the program. He recently moved to Zurich and is now a tech consultant.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/84"><img src="https://svbtleusercontent.com/ojbzkf0iamhnyw_small.png" alt="open-source-strangeloop-2016-085.png"></a></p>
<p>Tim was working on <a href="http://everware.xyz/">Everware</a>, a project trying to address reproducibility in scientific software. Everware uses <a href="https://www.docker.com/">Docker</a> to launch an instance of a <a href="http://jupyter.org/">Jupyter notebook</a> directly from a GitHub repository.</p>
<p>Tim cares a <em>lot</em> about research reproducibility. I first met him at a hackathon at CERN, where he launched Everware and ruffled some feathers with his insistence that we need to focus on better research reproducibility.</p>
<p>Now, Tim’s mentoring two other groups including one looking at research reproducibility, <a href="http://refigure.org/">ReFigure</a>.</p>
<p>Tim and the other Everware developers wrote some great contributing guidelines that helped quite a few people get involved before and during the Global Sprint.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/85"><img src="https://svbtleusercontent.com/ubl5xay4kytyzg_small.png" alt="open-source-strangeloop-2016-086.png"></a></p>
<p>For resources, we have a <a href="http://mozillascience.github.io/working-open-workshop/contributing/">guide</a> that walks you through creating your contributing guidelines.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/86"><img src="https://svbtleusercontent.com/rvootrfpelna_small.png" alt="open-source-strangeloop-2016-087.png"></a></p>
<p>The file should be named CONTRIBUTING.md and placed in your repository’s root directory.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/87"><img src="https://svbtleusercontent.com/jrpve7qvr06b9q_small.png" alt="open-source-strangeloop-2016-088.png"></a></p>
<p>We break down the different parts of your contributing guidelines in the exercise.</p>
<p>Open with some cheer! You should celebrate someone looking to contribute to your project. Then, introduce the document and explain what these guidelines are for.</p>
<p>The bulk of the document should be how-to guides on contributing, along with the norms the group follows, like a style guide.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/88"><img src="https://svbtleusercontent.com/jcu3ordznubzwq_small.png" alt="open-source-strangeloop-2016-089.png"></a></p>
<p>The CONTRIBUTING.md naming convention has become popular since GitHub integrates it into their interface. If there’s a CONTRIBUTING.md file in the root directory of a project, GitHub will display this notice at the top of the page whenever someone opens a new issue or pull request.</p>
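<p>Most of the checklist items in this talk end up as files in a repository’s root directory. As a rough, hypothetical sketch (this is not an official Mozilla Science tool, and a filename like ROADMAP.md is just a common convention, unlike LICENSE, CONTRIBUTING.md, and CODE_OF_CONDUCT.md, which GitHub recognizes), you could audit a repo like this:</p>

```python
from pathlib import Path

# "Working open" checklist files covered in this talk.
CHECKLIST = [
    "LICENSE",             # open license (e.g. MIT or BSD)
    "README.md",           # project communication
    "ROADMAP.md",          # where the project is headed (convention only)
    "CODE_OF_CONDUCT.md",  # community norms
    "CONTRIBUTING.md",     # how to get involved
]

def missing_checklist_files(repo_dir):
    """Return the checklist files absent from a repo's root directory."""
    root = Path(repo_dir)
    return [name for name in CHECKLIST if not (root / name).exists()]
```

<p>Running it against your project root gives you a quick to-do list for working open.</p>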
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/89"><img src="https://svbtleusercontent.com/pn7cs5e3bd8gfa_small.png" alt="open-source-strangeloop-2016-090.png"></a></p>
<p>The last step is our catch-all for attitude and process, Mentorship. Here, I’m highlighting Madeleine since she’s done a great job of including others and delegating to them.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/90"><img src="https://svbtleusercontent.com/ujg57pl9qbjrfa_small.png" alt="open-source-strangeloop-2016-091.png"></a></p>
<p>Right off the bat, you can see how connected her node is in the graph since she’s been able to bring so many people into her work.</p>
<p><a href="https://twitter.com/mbonsma">Madeleine</a> is a PhD student at the <a href="https://www.utoronto.ca/">University of Toronto</a>. Madeleine not only runs an open source project with us, but she also runs a weekly open science meetup at her school, the <a href="https://uoftcoders.github.io/studyGroup/">UofT scientific coders</a>.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/91"><img src="https://svbtleusercontent.com/zv4lxsl8kwka_small.png" alt="open-source-strangeloop-2016-092.png"></a></p>
<p>Madeleine (on the left) actually spoke about running events at our Working Open Workshop because of her experience with the UofT scientific coders.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/92"><img src="https://svbtleusercontent.com/mdeknnugfbqahw_small.png" alt="open-source-strangeloop-2016-093.png"></a></p>
<p>Her project is <a href="https://science.mozilla.org/projects/pathogens">phageParser</a> which uses open data to better understand CRISPR systems. CRISPR is all the rage nowadays because it’s opened the door for faster and cheaper targeted gene editing.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/93"><img src="https://svbtleusercontent.com/zh4a1vzaugetxq_small.png" alt="open-source-strangeloop-2016-094.png"></a></p>
<p>CRISPR stands for Clustered Regularly Interspaced Short Palindromic Repeats. In the diagram, the black diamonds are repeating DNA. In between the repeats are spacers. Spacers are pieces of DNA from a virus that attacked the system. The CRISPR system saves the virus DNA so that if it comes across the virus again, it can recognize it and cut it out, hence targeted gene editing.</p>
<p>Madeleine’s group realized that there are many openly published genomes with CRISPR systems. Her project is trying to collect and analyze these systems to try to find patterns and learn more about CRISPR.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/94"><img src="https://svbtleusercontent.com/odvudpzbokhfa_small.png" alt="open-source-strangeloop-2016-095.png"></a></p>
<p>Madeleine was able to engage so many people during the Global Sprint that she ran out of tasks for new contributors. I’ve noticed that Madeleine is naturally good at finding tasks and asking others for help, both in her project and with the UofT scientific coders.</p>
<p>At her first UofT scientific coders meeting, she delegated registering a club, managing the GitHub repository, and baking cookies for next week. Most of those people are now the green dots co-leading the group.</p>
<p>For those of us who need some instructions on how to delegate and involve others, we have a few exercises to help you start thinking about mentorship.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/95"><img src="https://svbtleusercontent.com/xsy4gzzwmhhuvg_small.png" alt="open-source-strangeloop-2016-096.png"></a></p>
<p><a href="https://wiki.mozilla.org/Good_first_bug">Good first bugs</a> can be a great way to give a new contributor a small win when they first start working on your project. Identify a few smaller issues that would be appropriate for someone completely new to the project. Ideally, the hardest part of completing the issue would be setting up their development environment.</p>
<p>This helps you reward new contributors sooner.</p>
<p>Another exercise that helps you think about a contributor’s progression through a project is the <a href="http://mozillascience.github.io/working-open-workshop/personas_pathways/">Personas & Pathways exercise</a>. This gets you to create a persona of an ideal contributor. Then, you can outline their pathway from when they first hear about the project, to their first contribution, to becoming a maintainer, to maybe even running the project when you’re ready to hand it off.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/96"><img src="https://svbtleusercontent.com/fv9gqud38cpbuw_small.png" alt="open-source-strangeloop-2016-097.png"></a></p>
<p>To summarize, these resources are helping us mobilize leaders in the open source movement.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/97"><img src="https://svbtleusercontent.com/3sdtk6bm8ce1wa_small.png" alt="open-source-strangeloop-2016-098.png"></a></p>
<p>Combined with trainings and mentorship, we’re working to fuel the open source movement in science, advocacy, learning, IoT and more.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/98"><img src="https://svbtleusercontent.com/frb5ejs60uf1g_small.png" alt="open-source-strangeloop-2016-099.png"></a></p>
<p>I mentioned <a href="http://mozillafestival.org/">MozFest</a> a few times; you should all come! It’s a lot of fun, and you can meet many of the people and projects I highlighted in this talk.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/99"><img src="https://svbtleusercontent.com/lfnxoupywprx3g_small.png" alt="open-source-strangeloop-2016-100.png"></a></p>
<p>MozFest really is a place where “you can make things that matter”.</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/100"><img src="https://svbtleusercontent.com/zhggq9yluqlhng_small.png" alt="open-source-strangeloop-2016-101.png"></a></p>
<p>Huge thanks to the many people who took part in the mentorship program as participants, mentors, and content creators. There are a lot of people who made this happen.</p>
<p><a href="https://svbtleusercontent.com/nbbgkc67xnqtnw.gif"><img src="https://svbtleusercontent.com/nbbgkc67xnqtnw_small.gif" alt="awesome.gif"></a></p>
<p>You’ve all been awesome!</p>
<p><a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/103"><img src="https://svbtleusercontent.com/r216vgh8vdgw_small.png" alt="open-source-strangeloop-2016-104.png"></a></p>
<p>Thank you!<br>
Slides: <a href="https://acabunoc.github.io/open-source-strangeloop-2016/#/">acabunoc.github.io/open-source-strangeloop-2016</a></p>
<p><em>Note:<br>
I talk about projects from a lot of different fields in this presentation. I’m not an expert in all these fields, so I may have explained something wrong here. Happy to make corrections! Please be kind!</em></p>
tag:blog.abigailcabunoc.com,2014:Post/increasing-developer-engagement-at-mozilla-science-learning-advocacy2016-07-07T04:27:46-07:002016-07-07T04:27:46-07:00Increasing developer engagement at Mozilla {Science|Learning|Advocacy|++}<p>I love watching a community come together to solve problems.</p>
<p>The past two years, I’ve been testing ways to engage contributors on open source science projects. As Lead Developer for the <a href="http://science.mozilla.org/">Mozilla Science Lab</a>, I built prototypes in the open with our community while mentoring others to do the same. We’ve seen exponential growth in contributorship and mentorship, and I am incredibly proud of the work we accomplished.</p>
<p>I’m excited to be moving into a role where I’ll be extending the contributor pathways we’ve built in the Science Lab to other programs within the Foundation. As <strong>Lead Developer, Open Source Engagement</strong> at the Mozilla Foundation, I will be shaping how we interact with the open source community not just in Science, but also in <a href="https://learning.mozilla.org/">Learning</a>, <a href="https://advocacy.mozilla.org">Internet Policy & Advocacy</a> and newer efforts like <a href="https://blog.webmaker.org/exploring-the-internet-of-things-with-mozilla">Internet of Things</a> and <a href="https://blog.webmaker.org/new-partnership-with-un-women-to-teach-key-digital-skills-to-women">Women & Web Literacy</a>.</p>
<p>Mozilla’s mission is to ensure the Internet is a global public resource, open and accessible to all. People are key as we work towards Mozilla’s mission through the lens of each program.</p>
<h2 id="starting-in-science_2">Starting in Science <a class="head_anchor" href="#starting-in-science_2">#</a>
</h2>
<p>Starting this experiment among academic researchers in Mozilla Science helped prepare us to reach the broader Mozilla community.</p>
<p>The Science Lab community is a cross section of Mozilla’s community. Within Mozilla Science, we’ve hosted projects exploring <a href="https://app.mozillafestival.org/#_session-123">IoT</a>, <a href="https://app.mozillafestival.org/#_session-290">research policy</a>, <a href="https://science.mozilla.org/projects/KirstieJane-STEMMRoleModels">women in STEMM</a>, education and more. These projects helped us learn how to engage a diverse community.</p>
<p>Bringing the concept of working openly to academic research has helped us understand a wide array of complex challenges. The research world is full of competition, private data and a cutthroat need to publish. These challenges forced us to articulate why open matters and to emphasize a scalable mentorship model as we work towards culture change.</p>
<h2 id="contributor-pathways_2">Contributor Pathways <a class="head_anchor" href="#contributor-pathways_2">#</a>
</h2>
<p>Modelling the contributor pathways we used within the Science Lab, we’ve found four stages needed to create a cohesive pathway for contributors.</p>
<p><img src="https://cloud.githubusercontent.com/assets/617994/16631566/b83a6f76-438d-11e6-8e59-2ebfb0a0ab9d.png" alt="contributor pathways"></p>
<ol>
<li>
<strong>Sourcing:</strong> Finding new contributors. This can happen passively on a project or more actively at an event or through a specific ask.</li>
<li>
<strong>Onboarding:</strong> Intentionally onboard new contributors to answer:
<ul>
<li>
<strong>WHY:</strong> Why Mozilla? Why open source?</li>
<li>
<strong>HOW:</strong> How do they practically contribute? What steps or skills should they know?</li>
</ul>
</li>
<li>
<strong>Prototyping:</strong> We need to build <em>with</em> our community. This gives contributors a chance to learn and practice collaborating while building new features or tools.</li>
<li>
<strong>Training & Mentorship:</strong> While prototyping, work is constantly acknowledged, rewarded and refined. As contributors learn to bring others into their work, they may take on a mentor role for newcomers.</li>
</ol>
<p>Taking these ideas, I’ll be working to see how we can define and measure a contributor pathway across the Mozilla Foundation.</p>
<h2 id="what-next_2">What Next? <a class="head_anchor" href="#what-next_2">#</a>
</h2>
<p>Over the next few months, we’ll be studying how different programs across the Mozilla Foundation work with their contributors. At the same time, I’ll be continuing to work with the volunteers and mentors in the Science Lab.</p>
<p>I’d love to hear your thoughts and feedback on the idea of setting contributor pathways across the Foundation. You can reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a> or email me directly at abby at mozillafoundation.org.</p>
<p>We’re entering an exciting time at the Mozilla Foundation as we break out of our prototyped, siloed programs and share how we’ve been successful. The Mozilla Science Lab - and our other community centred programs - will be so much stronger as we collaborate across our combined networks. Together, let’s build a better internet! </p>
tag:blog.abigailcabunoc.com,2014:Post/what-i-learned-working-at-wormbase-oicr2014-09-09T05:34:10-07:002014-09-09T05:34:10-07:00What I learned working at WormBase / OICR<p>Three weeks ago, I left the <a href="http://www.oicr.on.ca">Ontario Institute for Cancer Research</a> (OICR) to join the <a href="http://mozillascience.org">Mozilla Science Lab</a>. Yesterday would have been my five year work anniversary at OICR. Since I don’t get a plaque now, this seemed like a good time to reflect on what I’ve learned as I begin a new chapter at the Mozilla Science Lab.</p>
<p>For the majority of my time at OICR, I served as lead developer on the <a href="http://www.wormbase.org">WormBase</a> project. I learned a lot about software, leading development teams and dealing with biological data, but my biggest takeaways came from watching the interaction between the scientific research community and the web.</p>
<p>Here are three lessons I learned in my five years at WormBase / OICR:</p>
<p><strong>1. A simple web app can have a huge impact on a research community</strong></p>
<p>WormBase is a highly curated biological database for nematode (aka roundworm) research. I worked to make it as easy as possible for researchers to find and consume this information.</p>
<p>It took me a few years to realize how unique WormBase is - an entire research community, <a href="http://cedevap2014.sakura.ne.jp/">spanning</a> <a href="http://www.union.wisc.edu/ceaging/">several</a> <a href="http://www.union.wisc.edu/CeNeuro/index.html">topics</a>, depends on this tool. The information there is vital to getting new students up to speed, and it also facilitates insights and new discoveries. Thanks to regular worm meetings, you don’t have the usual barriers between fields within the worm community.</p>
<p>WormBase and the worm research community have helped each other grow over the years. Having all this information available has been hugely beneficial to anyone interested in nematodes. I want to see this happening in more areas of science.</p>
<p><strong>2. Open source and open access make science better</strong></p>
<p>We were able to build WormBase with a small development team by using many open source tools. From the web framework (<a href="http://www.catalystframework.org/">Catalyst</a>) to bioinformatics tools (<a href="http://gmod.org/wiki/GBrowse">GBrowse</a>, <a href="http://intermine.github.io/intermine.org/">Intermine</a>, more), most of WormBase was written by other people. I am grateful that so many bioinformatics research groups have embraced open source and given us tools that make the web useful for science.</p>
<p>OICR has <a href="http://en.wikipedia.org/wiki/Lincoln_Stein">some</a> <a href="http://www.bioinformatics.org/franklin/2004/">great</a> <a href="https://twitter.com/bffo">champions</a> for open source and open access in the research community. They understand that the best ideas don’t always come to the people who have access to resources today. Working beside these giants, it’s easy to imagine a world where any researcher - even the lowly undergrad - has access to tools and data to help make discoveries and further science.</p>
<p><strong>3. Doing this right is hard. We need to communicate</strong></p>
<p>Mistakes are made. Development resources and talent aren’t always available in the research world. Barriers to access range from technical to legal to personal. I don’t always understand what a worm researcher (or any researcher!) is looking for.</p>
<p>I want to see this all work - I want more discoveries, more tools and more science. But I’ve learned that this takes a lot of communication to do properly. WormBase is a huge <a href="http://www.wormbase.org/about/staff">team</a> with even more stakeholders. It works because they know and are involved in their community; they understand what they can do to help.</p>
<p>By contrast, I’m joining a team that serves a much larger research community (i.e. the <em>whole</em> research community) that I personally don’t fully understand. I am so thankful that Mozilla Science Lab is full of volunteers spanning a wide set of research fields. Together, we can understand this space and help research thrive on the open web.</p>
<p><small>Join us: <a href="https://wiki.mozilla.org/ScienceLab/Calls">community call</a>, <a href="https://mail.mozilla.org/listinfo/mozillascience">mailing list</a>, <a href="https://twitter.com/MozillaScience/">@MozillaScience</a>. And of course, you can always reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a>.</small></p>
tag:blog.abigailcabunoc.com,2014:Post/joining-the-mozilla-science-lab2014-08-21T04:52:40-07:002014-08-21T04:52:40-07:00Joining the Mozilla Science Lab!<p>Breaking news: I’ve <a href="http://mozillascience.org/welcoming-two-new-team-members-abby-cabunoc-and-bill-mills/">joined</a> the über-talented team at the <a href="http://mozillascience.org/">Mozilla Science Lab</a> as lead developer! I’ll be leading technical prototyping efforts and engaging the community about our <a href="http://mozillascience.org/code-as-a-research-object-updates-prototypes-next-steps/">technical</a> <a href="http://mozillascience.org/code-review-for-science-what-we-learned/">projects</a>.</p>
<h4 id="why-mozilla-science-lab_4">Why Mozilla Science Lab? <a class="head_anchor" href="#why-mozilla-science-lab_4">#</a>
</h4>
<p>From <a href="http://mozillascience.org/sample-page/">mozillascience.org</a>:</p>
<blockquote class="short">
<p>The Mozilla Science Lab is a new initiative that will help researchers around the world use the open web to shape science’s future.</p>
</blockquote>
<p>I have unashamedly fallen in love with the ideals of open source and open science. I’m enamored with what openness means and what it could look like in the scientific community.</p>
<p>The need for openness in research is there: I’ve seen the struggles of data sharing, the fear of collaborating and the uncertainty of best practices. It leaves you with duplicated efforts and more file types than you can count. On the other hand, I’ve witnessed the beauty of open source software driving analysis and innovation within a community. I’ve watched ideas spark when communication lines open up. The time I spent at <a href="http://oicr.on.ca/">OICR</a> and <a href="http://wormbase.org/">WormBase</a> introduced me to openness in science in a tangible way – and it looks <em>good</em>.</p>
<p>I joined the Mozilla Science Lab because I love their mission of <strong>making the web work for science</strong>. This group has the power and means to change the culture within the research community.</p>
<p>There is incredible potential when you apply a movement that wants to “<a href="https://air.mozilla.org/nature-of-mozilla/">build the internet the world needs</a>” to scientific research – a discipline that desperately needs an open internet to build on, but doesn’t quite know it yet.</p>
<h4 id="what-now_4">What Now? <a class="head_anchor" href="#what-now_4">#</a>
</h4>
<p>It’s been a few days since I joined Mozilla and I’m already inspired by the community and Mozillians surrounding me. These people gathered around a <a href="https://www.mozilla.org/en-US/mission/">shared mission</a> – one that has and will continue to change the world we live in.</p>
<p>In the Science Lab, I’m getting to know the different people and projects involved (more on the projects soon!). These efforts would be nothing without the community (ie YOU). From researchers to developers to educators, we are here to <a href="http://mozillascience.org/get-involved/">help you learn, build and connect to others</a> with the same mission.</p>
<p>So come! Help us make research more like the web: open, collaborative and accessible.</p>
<p><small>Join us: <a href="https://wiki.mozilla.org/ScienceLab/Calls">community call</a>, <a href="https://mail.mozilla.org/listinfo/mozillascience">mailing list</a>, <a href="https://twitter.com/MozillaScience/">@MozillaScience</a>. And of course, you can always reach me on twitter <a href="https://twitter.com/abbycabs">@abbycabs</a>.</small></p>
tag:blog.abigailcabunoc.com,2014:Post/biocuration-2014-battle-of-the-new-curation-methods2014-04-16T14:10:37-07:002014-04-16T14:10:37-07:00Biocuration 2014: Battle of the new curation methods<p>Biocuration is incredibly important to progress in science. The process of sorting through and annotating scientific data to make it available and searchable to the public is at the heart of the ideas behind <strong>open science</strong>. I work at <a href="http://www.wormbase.org">WormBase</a> because I believe in its mission to curate our knowledge of nematode biology to make it freely available to the scientific community.</p>
<blockquote class="twitter-tweet" lang="en">
<p><a href="https://twitter.com/search?q=%23isb2014&src=hash">#isb2014</a> great turnout <a href="http://t.co/V14if4ecgr">pic.twitter.com/V14if4ecgr</a></p>— Paul Davis (@bayamo2003) <a href="https://twitter.com/bayamo2003/statuses/453173599684554752">April 7, 2014</a>
</blockquote>
<p>The Seventh <a href="http://biocuration2014.events.oicr.on.ca/biocuration"><strong>International Biocuration Conference</strong></a> (ISB2014) was held at the University of Toronto last week. The theme of the conference this year was <em>“Bridging the gap between genomes and phenomes”</em>, focusing on bringing the results of the biocuration efforts to the clinicians. However, a slightly different theme stood out to me during the meeting - the tension between different methods for improved curation. </p>
<p>There was a clear consensus that we’ve come to an inflection point in this field. It’s no longer worthwhile, or even possible, to have detailed manual curation for each piece of biological information. Data is being generated (see <a href="http://en.wikipedia.org/wiki/Next-generation_sequencing">NGS</a>) and papers are being published at a tremendous rate (<a href="http://www.slideshare.net/goodb/mturk-biocuration2014-pdf/2">>100 publications/hour</a>). Human eyes can’t keep up.</p>
<blockquote class="twitter-tweet" lang="en">
<p>“That is the slide”. Not a mistake, showing mess abundant data <a href="https://twitter.com/search?q=%23isb2014&src=hash">#isb2014</a> <a href="http://t.co/uRG6wAShEB">pic.twitter.com/uRG6wAShEB</a></p>— Marc RobinsonRechavi (@marc_rr) <a href="https://twitter.com/marc_rr/statuses/453177036262359040">April 7, 2014</a>
</blockquote>
<p>We need to look at data as a whole. Many groups have come up with ways to automate/distribute the biocuration process, with a focus on information extraction from text. Three main approaches were presented: (dictionary based) <strong>text mining</strong>, <strong>machine learning</strong> and <strong>crowdsourcing</strong>. While biocurators are civil individuals, there’s still a sense of competition between the different methods and tools.</p>
<blockquote class="twitter-tweet" lang="en">
<p>Big Data Curation panel. My current status <a href="https://twitter.com/search?q=%23ISB2014&src=hash">#ISB2014</a> <a href="http://t.co/y2QYjX21WC">pic.twitter.com/y2QYjX21WC</a></p>— Abigail Cabunoc (@abbycabs) <a href="https://twitter.com/abbycabs/statuses/453621447940780033">April 8, 2014</a>
</blockquote>
<p><small>‘Best Tweet’ award winner at #ISB2014 by yours truly. <em>Disclaimer: I did not create this meme. I saw <a href="https://twitter.com/escalant3">@escalant3</a> RT it a while ago</em></small></p>
<p>We’re at a period of unrest while the community is deciding how to handle this ‘Big Data’ we’re faced with. In the coming years, we’ll see best practices and standard tools emerge. In the meantime, here’s a brief overview of the work presented in this area.</p>
<h1 id="text-mining_1">Text mining <a class="head_anchor" href="#text-mining_1">#</a>
</h1>
<p>Text mining based on a knowledge dictionary: this has been a friend of biocuration for a long time. Everyone and their uncle has a text mining tool and strategy they love and support! Most tools focused on being a first-pass on a paper or abstract to help call out/screen information before an expert curator takes a closer look. The <a href="http://www.biocreative.org/">BioCreative</a> workshop in particular demonstrated some of the recent work in this area.</p>
<p>Overall, the community is generating much more usable and intuitive text mining tools meant to be used as a first-pass for biocuration. A couple tools that stood out to me: </p>
<ul>
<li>
<a href="http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/">PubTator</a>: If you have a list of articles, this helps you sort through and find the most relevant publications to focus on</li>
<li>
<a href="http://factoid.baderlab.org/">Factoid</a> (Bader Lab): turns an abstract into an editable model of biological processes. Really nice UI, using Cytoscape.</li>
</ul>
<p><small>Notably missing from the meeting: <a href="http://www.textpresso.org/">Textpresso</a> from WormBase.</small></p>
<p>The tools in this space are getting more usable and accurate. However, they still require an expert curator to look at the results. We saw in some talks that this approach may not perform as accurately as some machine learning algorithms. I’m interested to see if the research and development focus will shift away from text mining in the coming years.</p>
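<p>As a rough illustration of the dictionary-based first-pass idea (this is a generic sketch, not how PubTator or Factoid actually work - and the gene terms and identifiers below are invented for the example):</p>

```python
import re

# Toy "knowledge dictionary" mapping surface forms to curated identifiers.
# Both the terms and the IDs are invented for illustration.
GENE_DICT = {
    "unc-22": "GENE:0001",
    "daf-16": "GENE:0002",
    "let-7": "GENE:0003",
}

def first_pass(text):
    """Return (term, identifier, offset) for every dictionary hit in text."""
    hits = []
    for term, ident in GENE_DICT.items():
        # Boundary checks so 'let-7' does not fire inside 'let-70'.
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        for m in re.finditer(pattern, text):
            hits.append((term, ident, m.start()))
    return sorted(hits, key=lambda h: h[2])

for hit in first_pass("Mutations in unc-22 suppress daf-16 phenotypes."):
    print(hit)
```

<p>A real tool adds ontology-scale dictionaries, abbreviation handling and disambiguation on top of this, then hands the candidate mentions to an expert curator - which is exactly where the accuracy ceiling discussed above comes from.</p>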
<h1 id="machine-learning_1">Machine learning <a class="head_anchor" href="#machine-learning_1">#</a>
</h1>
<p>Just a few years ago, machine learning algorithms weren’t performing as well as text mining on biological data. However, with larger and larger datasets becoming normal, machine learning has begun to surpass text mining in accuracy in some literature-based curation tasks. <small>(Gobeill, <em>Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years</em> [<a href="http://etherpad.wikimedia.org/p/isb2014-functional">notes - line #582</a>])</small></p>
<p>More researchers are writing machine learning algorithms to extract information from their data. So far, these are generally ad-hoc and highly specialized algorithms, with some exceptions (GOCat). We are beginning to see some user-centred tools powered by machine learning algorithms, and I hope to see even more in the future.</p>
<ul>
<li>
<a href="http://eagl.unige.ch/GOCat/">GOCat</a>: Offers both dictionary based and machine learning models for extracting Gene Ontology terms from text. </li>
<li>
<a href="http://compsysbio.org/gist">GIST</a>: uses machine learning to provide improved annotations for species from sequencing reads [<a href="http://etherpad.wikimedia.org/p/isb2014-microbe">notes - line #130</a>]</li>
</ul>
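<p>To make the contrast with dictionary lookup concrete, here’s a minimal supervised text classifier - a multinomial naive Bayes with add-one smoothing, trained on invented toy data. This is a generic sketch of the technique, not GOCat’s or GIST’s actual algorithm:</p>

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Minimal multinomial naive Bayes text classifier, add-one smoothed."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # per-class token counts
        self.class_counts = Counter(labels)       # class priors
        self.vocab = set()
        for doc, label in zip(docs, labels):
            toks = tokenize(doc)
            self.word_counts[label].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, doc):
        def score(label):
            # log P(label) + sum of log P(token | label)
            total = sum(self.word_counts[label].values())
            s = math.log(self.class_counts[label] / sum(self.class_counts.values()))
            for tok in tokenize(doc):
                s += math.log((self.word_counts[label][tok] + 1)
                              / (total + len(self.vocab)))
            return s
        return max(self.class_counts, key=score)

# Invented toy training data: sentences labelled with a curation category.
docs = ["kinase phosphorylates substrate protein",
        "transcription factor binds promoter dna",
        "kinase activity regulates signaling",
        "dna binding transcription regulation"]
labels = ["kinase activity", "DNA binding",
          "kinase activity", "DNA binding"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("the kinase phosphorylates the protein"))
```

<p>The appeal for curation is that a classifier learns from the curated corpus itself instead of a hand-maintained dictionary - which is why, as the datasets grow, these models have started to win.</p>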
<h1 id="crowdsourcing_1">Crowdsourcing <a class="head_anchor" href="#crowdsourcing_1">#</a>
</h1>
<p>Crowdsourcing is the cool kid in this space. Science has a history of failing where Wikipedia and others have succeeded. But this meeting showed a couple promising approaches to crowdsourcing in biocuration.</p>
<p><a href="https://twitter.com/bgood">Ben Good</a> deservedly won the ‘Best Presentation’ award for his talk, <em>Microtask crowdsourcing for disease mention annotation in Pubmed abstracts</em>. An interesting application of <a href="https://www.mturk.com/mturk/">Amazon Mechanical Turk</a> to science. [<a href="http://www.slideshare.net/goodb/mturk-biocuration2014-pdf">slides</a>]</p>
<p><a href="http://genomearchitect.org/">Apollo</a>: less about information extraction from text, more about community genome annotation [<a href="http://www.slideshare.net/MonicaMunozTorres/threes-a-crowdsource-observations-on-collaborative-genome-annotation">slides</a>]</p>
<p>This approach is shiny and full of potential. I think many scientists in the audience were inspired by the idea of microtask crowdsourcing in particular. While I think the ‘microtask’ of interpreting biomedical literature is unusually difficult, there are huge possibilities in this space if the right tools and approaches are developed. </p>
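<p>The core mechanics of a microtask pipeline are easy to sketch: collect redundant judgments per item, accept the majority answer, and route low-consensus items to an expert curator. This is a generic illustration with made-up item IDs and thresholds, not the pipeline from the talk:</p>

```python
from collections import Counter

def aggregate(judgments, min_workers=3, threshold=0.5):
    """Majority-vote aggregation of redundant microtask judgments.

    judgments: {item_id: [worker_answer, ...]}
    Returns (accepted, needs_review): accepted holds items with enough
    agreement; the rest go back to an expert curator.
    """
    accepted, needs_review = {}, []
    for item, answers in judgments.items():
        if len(answers) < min_workers:
            needs_review.append(item)
            continue
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) > threshold:
            accepted[item] = answer
        else:
            needs_review.append(item)
    return accepted, needs_review

# Invented example: 5 workers mark whether an abstract mentions a disease.
votes = {
    "PMID:1": ["yes", "yes", "yes", "no", "yes"],
    "PMID:2": ["no", "yes", "no", "yes"],   # split vote: route to a curator
}
accepted, review = aggregate(votes)
print(accepted, review)
```

<p>The hard part, of course, isn’t the vote counting - it’s designing a microtask that untrained workers can actually answer reliably, which is why interpreting biomedical literature is such an unusually difficult fit.</p>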
<h1 id="conclusion_1">Conclusion <a class="head_anchor" href="#conclusion_1">#</a>
</h1>
<p>These are the unsolicited opinions of one web developer with a particular interest in WormBase on the state of biocuration today. There is a lot of innovation in this space - I’m excited to see what happens in the next few years. Don’t worry about being replaced, biocurators! Even with all the automation going on, everyone agrees that biocurators are needed more than ever.</p>
<blockquote class="twitter-tweet" lang="en">
<p><a href="https://twitter.com/search?q=%23ISB2014&src=hash">#ISB2014</a> L Stein says there are in fact, jobs for biocurators <a href="http://t.co/BGzpk21rbP">pic.twitter.com/BGzpk21rbP</a></p>— Melissa Haendel (@ontowonka) <a href="https://twitter.com/ontowonka/statuses/453886106438995968">April 9, 2014</a>
</blockquote>
<p><small>Biocuration jobs: (<strong>metadata</strong>) massage therapist, (<strong>data</strong>) wrangler, (<strong>complex data</strong>) modeler</small></p>
<h1 id="further-reading_1">Further reading <a class="head_anchor" href="#further-reading_1">#</a>
</h1>
<ul>
<li><a href="http://biocuration2014.events.oicr.on.ca/files/abstractbooklet.pdf">Abstract booklet - ISB2014</a></li>
<li><a href="http://wiki.wormbase.org/index.php/ISB2014_group_notes">Collection of speaker slides and notes</a></li>
<li><a href="http://etherpad.wikimedia.org/p/isb2014">Etherpad used at ISB2014</a></li>
<li><a href="https://twitter.com/search?q=isb2014&src=typd">#ISB2014 tweets</a></li>
<li><a href="http://f1000.com/posters/browse?conferenceId=259713564">ISB2014 posters available on F1000</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/wormbase-website-and-biocuration2014-04-04T13:51:38-07:002014-04-04T13:51:38-07:00WormBase Website and Biocuration<p>The <a href="http://biocuration2014.events.oicr.on.ca/biocuration">Seventh International Biocuration Conference (ISB2014)</a> begins tonight here in Toronto.</p>
<h4 id="correction-poster-44_4">Correction: Poster #44! <a class="head_anchor" href="#correction-poster-44_4">#</a>
</h4>
<p><a href="http://wiki.wormbase.org/images/Isb_poster.pdf">Poster: WormBase Website: Supporting the Biocuration Process</a></p>
<hr>
<h3 id="wormbase-website-supporting-the-biocuration-p_3">WormBase Website: Supporting the Biocuration Process <a class="head_anchor" href="#wormbase-website-supporting-the-biocuration-p_3">#</a>
</h3>
<p>Abigail Cabunoc, Todd W. Harris, Lincoln D. Stein</p>
<p>WormBase (<a href="http://www.wormbase.org/">http://www.wormbase.org/</a>) is a highly curated central data repository for Caenorhabditis biology. Our objective is to capture the wealth of experimental data available from C. elegans and related nematodes via published literature and personal communication, and present it to the research community in a way that facilitates new biological insights. Although the website is geared towards end users, we added several features to support the biocuration process.</p>
<p>Flexible views were a central design factor in the new website allowing users to customize the information presented to them. We extended this customizability to WormBase curators, with a “Curator only” view. This view allows our curators to view specific metadata related to the curation process and use tools for exploring the arcana of the underlying data model that are not available to the general public.</p>
<p>The ability for curators to add real-time annotations to the website was added as a response to the current lag between data curation, integration, database build and website display. Curators use this to create up-to-date summaries and descriptions for each species or data class, typically information not specifically tied to any release of the website. Such a system could also be used by end users to annotate current data in real-time.</p>
<p>The website was also redesigned to help encourage community annotations and engage the community in the curation process. Every page on the site has a tab prompting users to submit any content corrections or feedback they may have. Users also have the ability to create public comments directly on a report page in WormBase. </p>
<p>Aiding biocuration was one of the main goals of the website redesign. This has provided a space for real time updates, customized views of the data for curators and increased community engagement. While these features are currently available on the website, more work can be done to use them effectively and help bridge the gap between curators and the community.</p>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes2014-04-01T06:05:57-07:002014-04-01T06:05:57-07:00Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="panel-big-data-challenges-and-solutions-contr_1">Panel: Big Data Challenges and Solutions: Control Access to Individual Genomes <a class="head_anchor" href="#panel-big-data-challenges-and-solutions-contr_1">#</a>
</h1>
<p>Monday, March 24th, 2014 2:15pm - 4:00pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288348">http://ks.eventmobi.com/14f2/agenda/35704/288348</a></p>
<h4 id="panel-members_4">Panel members <a class="head_anchor" href="#panel-members_4">#</a>
</h4>
<ul>
<li>
<em>Moderator</em> - Doreen Ware (<strong>DW</strong>), Cold Spring Harbor Laboratory, USA </li>
<li>David Haussler (<strong>DH</strong>), University of California, Santa Cruz, USA </li>
<li>Laura Clarke (<strong>LC</strong>), European Bioinformatics Institute, UK </li>
<li>Jill P. Mesirov (<strong>JM</strong>), Broad Institute, USA </li>
<li>Andrew Carroll (<strong>AC</strong>), DNAnexus, USA </li>
<li>Lincoln D. Stein (<strong>LS</strong>), Ontario Institute for Cancer Research, Canada </li>
<li>Mark Gerstein (<strong>MG</strong>), Yale University, USA </li>
</ul>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p><strong>DW:</strong> Interaction between panel members/audience<br><br>
Started planning this meeting almost 1.5 years ago - we decided that controlled access would be a main talking point</p>
<h5 id="challenges-and-opportunities_5">Challenges and opportunities <a class="head_anchor" href="#challenges-and-opportunities_5">#</a>
</h5>
<ul>
<li>scale (volume)</li>
<li>variety - the heterogeneity of the data. Representation and analysis of this. How will we deal with metadata? how to integrate?</li>
<li>timeliness - velocity, getting data, operated on, updates</li>
<li>privacy, topic of this session - key point</li>
<li>usability, want this data to be useful - accept human input and support collaborations. Interpretation of the data.</li>
</ul>
<h5 id="personal-genomes_5">Personal Genomes <a class="head_anchor" href="#personal-genomes_5">#</a>
</h5>
<ul>
<li>1000 genomes</li>
<li>publishing their own genomes</li>
<li>personal genomes project (George church)</li>
<li>GigaDB - liver cancer patients</li>
<li>more examples show that having access to this data is not easy! Privacy and bio-ethics (see Nature’s “Privacy protections: The genome hacker”). Some of the privacy we think we have isn’t as private as we think: ‘anonymized’ sets can be re-identified by combining the data. How will we handle integration?</li>
</ul>
<h4 id="panel-introductions_4">Panel Introductions <a class="head_anchor" href="#panel-introductions_4">#</a>
</h4>
<p><strong>Mark Gerstein</strong> (<strong>MG</strong>) - Yale (bioinformatics)</p>
<ul>
<li>originally worked in model organisms</li>
<li>transitioned to human genomics - scale issues, but not really a privacy issue</li>
<li>disease genomics - privacy issues!</li>
</ul>
<p><strong>David Haussler</strong> (<strong>DH</strong>) - UCSC</p>
<ul>
<li>running into all kinds of data issues</li>
<li>go through long protocol to get to all data sets</li>
<li>cancer datasets - didn’t make it clear it was a childhood cancer study, was rejected</li>
<li>subtle consents get crazy</li>
</ul>
<p><strong>Laura Clarke</strong> (<strong>LC</strong>) 1000 genomes project</p>
<ul>
<li>managed access data on some projects - trying to make applications as lightweight as possible</li>
<li>new open/managed accessed project - not clear how to make useful</li>
</ul>
<p><strong>Lincoln Stein</strong> (<strong>LS</strong>) OICR</p>
<ul>
<li>works with ICGC DCC - make cancer genome datasets available as frictionless as possible</li>
<li>open and controlled tiers</li>
<li>main concern: maximize access to data and make it useful, but do not violate donors’ trust -
data donated under agreement that it be used for research and no other purposes (identification)</li>
</ul>
<p><strong>Jill Mesirov</strong> (<strong>JM</strong>) Broad</p>
<ul>
<li>most of the time collaborators worry about permissions etc.</li>
<li>there’s a tension </li>
<li>mostly clinical studies - patients want to do whatever they can to help us understand their diseases.
BUT: learned that consents that they sign aren’t necessarily consumable by the average person</li>
<li>many patients don’t understand - if I share my data it doesn’t just affect me, but my relatives and other people who share large pieces of their genome with me</li>
<li>other issue: ethical/legal.
A lot of the problems with disclosing the identity of patients’ data, clinical info and genetic info is that it can affect things like hiring, insurance, liability.
These risks need to be made clear to them</li>
</ul>
<p><strong>Andrew Carroll</strong> (<strong>AC</strong>) DNAnexus</p>
<ul>
<li>used 1000 genomes data</li>
<li>used CHARGE consortium data - under IRB restrictions.
only combine in appropriate way, keep data flowing consistently</li>
<li>used pharma company sequencing data - internal for R&amp;D</li>
</ul>
<p><strong>DW: Q: Are the current support systems right now sufficient?</strong></p>
<p><strong>LS:</strong> The issue a lot of researchers are encountering - like cell phone makers: every phone contains thousands of licensed technologies, and you need to negotiate with each maker of a hardware/software component. It’s beginning to get a lot like that in genomics - each dataset is consented under different rules. Cancer research, pediatric research, general research… you must observe the restrictions on each of the components. Makes it very difficult to combine two datasets. Even using controls - can’t use other sets as normal controls in a cancer study if they only consented for a diabetes (or other) study. Need uniform consent - stop focusing on the dataset and focus on the researcher: have an ethically approved researcher status that is re-certified every year</p>
<p><strong>JM:</strong> One of the things I observed at Broad - datasets will come in to Broad and will take on a life of its own. Shared in ways that are not appropriate for consent - through ignorance. Implications for the data not made clear. We put in place a training program around how you handle this kind of data, and to minimize the replication of this kind of data. Got authorisation - did not duplicate the data. Track better who is accessing what.</p>
<p><strong>LC:</strong> As we move towards centralised compute and moving analysis to compute - these sorts of challenges will be easier. One of the key points of making this data useful is better defined consent.</p>
<p><strong>MG:</strong> I second the points of LS and JM. Most inappropriate use of private data is accidental - ignorance. People do it because it’s easier - just copy the dataset, don’t go through protocols. Need to make good tools and infrastructure so there’s no incentive to do it wrong.</p>
<p><strong>DW: Moving forward (question to JM), do you think there’s a need for some sort of education on handling this data?</strong></p>
<p><strong>JM:</strong> Yes, especially trainees who are beginning their research career. Human subject certification test - goes on forever. These are the key important things you need to understand: These people are giving you a gift, something very personal about themselves for you to further your knowledge and help treat the disease. In turn you have to respect that, and here are some simple rules on how to do this.</p>
<p><strong>AC:</strong> Looking at this in a technical sense. Many people working on this in a flexible way, someone will make a mistake. We need to architect technical solutions that make it easier for the graduate student to not make a mistake.</p>
<p><strong>MG:</strong> Not only the grad students - also in clinical settings. Clinicians are sloppy about where they put their data and how they move it. Need to educate.</p>
<p><strong>LS:</strong> There’s a lot of debate in the USA on the legality on putting genomic data in clouds. Misguided debate - more secure than letting grad students play with it on laptops.</p>
<p><strong>DW: Do you think the compute clouds are secure enough to share among collaborators?</strong></p>
<p><strong>DH:</strong> Appropriate levels, the cloud vendors can be more secure than the NSA. It’s going to be so much more secure than at any medical institution. Need to work with the cloud vendors to come to terms with a compliance framework.<br>
Institutions may not want to change for historical reasons (e.g. consent forms specifying where data is stored). Why does banking accept cloud and not NIH?</p>
<p><strong>DW: Are the current restrictions on whole genomes too restrictive?</strong></p>
<p><strong>AC:</strong> depends on what you want to do, how ambitious you want to be. There’s an immense amount to discover. If it’s not acted on in an academic setting, pharma will go out and sequence their own pools and make their own discoveries. The value is there - if there isn’t a means to get at it they’ll find their own way.</p>
<p><strong>DH:</strong> There is a willingness to use controls and share in Pharma</p>
<p><strong>LC:</strong> Pharma doesn’t want to make this massive investment by themselves individually.</p>
<p><strong>LS:</strong> technology is enabling lots of things. Patients that have a serious disease are very willing to share their genomic data for the greater good if its handled appropriately. PMH (Princess Margaret Hospital, Toronto) study - has sociology group to get attitudes on genomic sequencing. When patients were asked: </p>
<ol>
<li>‘would you be willing to share your mutations with researchers?’: 100% positive response rate.<br>
</li>
<li>‘would you share your germline polymorphism around areas relevant to your cancer’: still positive responses<br>
</li>
<li>‘would you share incidental findings’: Complete drop off - almost nobody in the study wanted incidental findings disclosed.<br>
</li>
</ol>
<p>Need to rework the regulatory framework and the way consents are posed in order to address the real and perceived harms to patients/donors/family members</p>
<p><strong>JM:</strong> This is an area of intense activity. Regulations/consents/risks conveyed. It’s a tricky business.</p>
<p><strong>Q:</strong> (Ouellette) <strong>What if people were told that these germline/incidental findings would help others?</strong></p>
<p><strong>LS:</strong> the way you ask greatly affects response. Wording. We want to look directly at what the short term and potential long term harms are. Short term - non-paternity. There will be people trying to figure out if their friends/neighbours are in a cancer db. </p>
<p><strong>MG:</strong> I am a privacy advocate in this context. What is the harm that can happen? People don’t know exactly what the disclosure of their genetic info will affect. There will be a major harm to genomics and bioinformatics as a field if people commit stunts or a db gets hacked and privacy is broken. We have to think about how this reflects on the field. It’s potentially a bad thing - consent implies they’re really trusting us. You really have to understand the trust. If you breach the trust, everyone looks bad.</p>
<p><strong>LC:</strong> Mark (MG), what do you think would be appropriate consequences? How do we maintain society’s trust?</p>
<p><strong>MG:</strong> Concept of license (LS) - you are a responsible researcher, prove it and update it.</p>
<p><strong>JM:</strong> Read Yaniv Erlich’s paper. We need to understand what data we can and what data we shouldn’t share. Some data was on ancestry.com - the db had addresses (city and state locations). It wasn’t the case that he went to a repository of data that was just genomic data. He was very clever, used a lot of ancillary information around the particular genomes to get that information. Great paper, raises a lot of issues. We could be disclosing identity.</p>
<p><em>NB: Paper mentioned:</em><br><br>
Gymrek, Melissa, et al. “Identifying personal genomes by surname inference.” Science 339.6117 (2013): 321-324.<br><br>
<a href="http://data2discovery.org/dev/wp-content/uploads/2013/05/Gymrek-et-al.-2013-Genome-Hacking-Science-2013-Gymrek-321-4.pdf">http://data2discovery.org/dev/wp-content/uploads/2013/05/Gymrek-et-al.-2013-Genome-Hacking-Science-2013-Gymrek-321-4.pdf</a></p>
<p><strong>DH:</strong> We need to start thinking about privacy in terms of granular facts and how they are linked. Separate the idea of what information can be public from the associations between facts, which are private. Internet of things/facts - if you can link multiple facts to the same person it causes a violation of privacy. Sharing the linkage means previously anonymous data becomes controlled information. </p>
<ol>
<li>who can see it </li>
<li>for what purpose. </li>
</ol>
<p>Need to have remedies and ways of looking at controlling and approaching privacy.<br>
Linking too many facts about one person</p>
<p><strong>AC:</strong> Where this chain is broken - where someone is able to tie outside information to a piece of genomic sequence. It becomes easy to identify everyone related. For example, Bitcoin - if you can break some of these hashes, you can determine entire transactional history. Single break in the link will expose many people.</p>
<p><strong>LS:</strong> So we need to ban the genealogy databases (laughter). That will break all the links and allow any piece of information to be shared</p>
<p><strong>DH:</strong> If you break it up to enough pieces, each piece will be uninformative</p>
<p><strong>MG:</strong> There’s still the issue of outliers: you’re going to have outliers. Maximum income in a survey - you know who that is. Correlation - a lot of these factoids have subtle correlations, can do re-identification with a few bits of information and some simple correlation</p>
<p><strong>DH:</strong> I disagree. Suppose there’s a position on the genome where only one person has an A. Suppose I publicize that only one person has an A at this position.</p>
<p><strong>LS:</strong> But what’s the usefulness of this for research - one single position? Once we get to the usefulness part, we run privacy risk.</p>
<p><strong>DH:</strong> Yes, when we link things. dbSNP is fine, no one argues it ruins our privacy. There are stats</p>
<p><strong>LC:</strong> There’s a lot of info they won’t put in dbSNP - because it becomes identification. These are the pieces that are important for research</p>
<p><strong>DH:</strong> once we establish world of anonymous facts - can have private exchange of links</p>
<p><strong>LC:</strong> Are the barriers too high to establish that system?</p>
<p><strong>DH:</strong> it’s all out there with UUID, everywhere. Whole protocol is based on keeping secure private key chains</p>
<p><strong>Q:</strong> (Schatz) <strong>Do you think the perception in the popular press of accurate identification is a problem?</strong></p>
<p><strong>LS:</strong> The issue in the popular press is that the informative power was oversold during the Human Genome Project. Identifications aren’t happening at the rate people expect</p>
<p><strong>AC:</strong> Scope, sensitivity isn’t great. The problem will take care of itself </p>
<p><strong>DH:</strong> people tend to overestimate the impact in the 5-10 year range, underestimate in 20+ range</p>
<p><strong>LC:</strong> Global alliance, verification</p>
<p><strong>Q: How do we contain/quantify the privacy that is consented for? Can we come up with metrics that quantify 1) uniqueness 2) identifiability? Actuarial tables to find uniqueness?</strong></p>
<p><strong>DH:</strong> We need to come up with categories - this granularity, if it’s anonymous, is not identifiable in itself. Then the only thing that’s private is the linking of pieces. Don’t think it’s a matter of counting how many people have that type of value. We can make assessments where it’s granular enough. </p>
<p><strong>MG:</strong> Agree with DH, make a few observations: theory of information. Risk: relationship between the amount of info leaked and the amount of risk taken.<br>
When we talk about this information leakage, we’re talking about identifiability risk AND characterization risk. But people don’t consent to having all their proclivities/characteristics unearthed over time.</p>
<p><strong>Q: Danger of privacy - when the first db gets hacked. Are we selling these databases as being secure to the public? Change legislation - can’t be discriminated against. This will eventually be leaked - add more security or lessen consequence of being identified.</strong></p>
<p><strong>JM:</strong> This is critical - legislation. If I can’t get health insurance because of BRCA mutation, it’s important. Can’t get a job because genetics are known. Some initial legislation has been passed, but it’s up to us to lay out what this will look like. Serious risks to people in terms of daily life.</p>
<p><strong>AC:</strong> everyone agrees we need the strongest possible legislation protecting people. But even if it’s passed, it will not be sufficient - discrimination still happens.</p>
<p><strong>Q: Whole genome will be cheap and accessible and non-scientists will be able to get this done. In that world would you be able to get a hair from someone and collect the data yourself, you can circumvent all these security issues.</strong></p>
<p><strong>LS:</strong> Real and scary scenario - happening in law enforcement. Suspects are routinely genotyped without their knowledge.</p>
<p><strong>JM:</strong> You should all watch the movie GATTACA - logical and scary extension of all of this.<br><br>
<a href="http://www.imdb.com/title/tt0119177/">http://www.imdb.com/title/tt0119177/</a></p>
<p><strong>LS:</strong> The federal databases of genotype data are extremely well thought out. The number of SSRs is just enough to narrow down a suspect pool, but not enough to pick out a single person in the whole US. But the state databases are unregulated.</p>
<p><strong>MG:</strong> Privacy bias - genetics has a checkered history. Darwin, 1920, etc. Given that history it’s good to reflect on this future.</p>
<p><strong>Q: There are other communities that have faced this: Census department. They understand the benefit to provide data to researchers (summary statistics, de-identified sub-sets, experimenting with creating simulated sets where the probabilities of the data are mirrored and operated on). Can there be a parallel track where we start to experiment?</strong></p>
<p><strong>MG:</strong> Big Data is about data not simulations. Very doubtful if a simulation could recreate the linkage.</p>
<p><strong>DH:</strong> The linkage of all of our genomes is a product of our common heritage, as it gets dense we’ll reach a critical point where we can do a lot of inference. </p>
<p><strong>LC:</strong> If someone comes up with a Facebook for human genomes, way more useful </p>
<p><strong>LS:</strong> I once ran a thought experiment on my wife’s relatives (South Indians). Suppose you had a cell phone app where you could search for all your relatives within an nth-degree radius. “Oh yes, I’d love it! This would save so much time!” - Then I say, all you have to do is donate a bit of DNA - “sure!”</p>
<p><strong>JM:</strong> I look at my children and their friends - their notion of privacy is very different. Sharing their genomes would be a drop in the bucket.</p>
<p><strong>AC:</strong> Enough people would share so that if your genome data is linked they could identify you.</p>
<p><strong>MG:</strong> Then when you’re the only person left who cares about privacy, you’re identified as the one person who hasn’t shared.</p>
<p><strong>Q: We talked in circles around the legal issues - imagine the outcomes. Danger: if we don’t do this, we’ll end up in a dystopian situation where we can’t talk to each other.</strong></p>
<p><strong>Q: At a big data conference. Difficult to link these entities - that’s why we’re here, to make these links. How privacy affects the downstream. Should there be a consideration ‘upon your expiration your data is withdrawn’?</strong></p>
<p><strong>LC:</strong> I had 23andMe done. I discussed this with my parents, but not sisters.</p>
<p><strong>JM:</strong> Watson’s personal genome was published, but not his APOE status<br>
<a href="http://www.nature.com/ejhg/journal/v17/n2/full/ejhg2008198a.html">http://www.nature.com/ejhg/journal/v17/n2/full/ejhg2008198a.html</a></p>
<p><strong>MG:</strong> People were able to trivially determine his APOE status. </p>
<p><strong>JM:</strong> Concerning: people often don’t understand that a lot of these genetic variants are just a predisposition to a certain endpoint. The kind of education that is required is huge - helping people understand probabilistic risk.</p>
<p><strong>LS:</strong> Recently discussed Canada’s policy on withdrawal of genetic information. Proposed to allow 1st-degree relatives to withdraw data on the donor’s death. This was unworkable - what if siblings disagree?</p>
<p><strong>LC:</strong> We’re getting paternalistic - remember getting this data out and easy to use will be of such benefit to health and science. We shouldn’t put too many barriers up.</p>
<p><strong>DH:</strong> We owe it to our grandchildren to do our best to understand how genomes and disease are related</p>
<p><strong>MG:</strong> One thing that’s important - there’s a lot of countries that don’t care about privacy. Their legal system setup is not ready to worry about this. I can imagine a future where most of the genomics and discoveries are centred in places that don’t put up these barriers</p>
<p><strong>Q: We should license people to use big data analysis. Age of big data - privacy is an illusion. You can go to someone’s home and know everything about them. If someone wilfully wants to know you - it doesn’t cost that much.</strong></p>
<p><strong>LS:</strong> agree</p>
<p><strong>AC:</strong> it all comes down to how many people have access to data. We want to provide a technical solution robust enough to share with researchers and help cut some of that off.</p>
<p><strong>DW: What are some of the technical barriers in the next 5 years?</strong></p>
<p><strong>LS:</strong> enabling people to get into cloud (or whatever) and use it. Accessible to as many people as possible in a secure manner</p>
<p><strong>JM:</strong> How do I find out what data is out there that’s relevant to my particular project/study. Better metadata. If I could find the sets I need - I don’t mind going to whoever owns them and get permission. There’s a lot of data that’s acquired that people don’t know about and it’s not described. Description, registry and search - without command line.</p>
<p><strong>AC:</strong> Making sure everyone is doing what they know how to do best. Bioinformaticians aren’t tied down doing things outside their expertise, biologists have access, researchers have access</p>
<p><strong>MG:</strong> Having lots of worked out exchange standards for secondary analysis files. Want to share reads/BAMs, but secondary (summarized) data sets are very useful. Very little standardization now. </p>
<p><strong>DH:</strong> technology moving so fast - have to be nimble. Have flexible standards/evolving. Up to speed to transfer/process/exchange data. APIs are important. Metadata is important. Require goodwill, work together to create standards. e.g. W3C - internet standards. Not easy.</p>
<p><strong>LS:</strong> analytic pipelines are complicated and finicky. Small changes get dramatically different results. Projects like Galaxy and synapse - keep track of steps of a workflow. Track the output/input files - human and machine readable and reproducible.</p>
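<p>The workflow tracking LS describes (as in Galaxy or Synapse) can be sketched as an append-only provenance log: each step records the tool, its parameters, and content hashes of input and output files, in a form both humans and machines can read. The class and method names below are mine, not from either project:</p>

```python
import hashlib
import json

def _digest(data: bytes) -> str:
    """Short content hash used to fingerprint a file's bytes."""
    return hashlib.sha256(data).hexdigest()[:12]

class Provenance:
    """Append-only log of workflow steps, serializable to JSON."""

    def __init__(self):
        self.steps = []

    def record(self, tool, params, inputs, outputs):
        # inputs/outputs map file names to raw bytes; we store only hashes
        self.steps.append({
            "tool": tool,
            "params": params,
            "inputs": {name: _digest(data) for name, data in inputs.items()},
            "outputs": {name: _digest(data) for name, data in outputs.items()},
        })

    def to_json(self):
        return json.dumps(self.steps, indent=2, sort_keys=True)

# Toy pipeline step: trim trailing N bases from reads
prov = Provenance()
raw = b"ACGTACGTNNN"
trimmed = raw.rstrip(b"N")
prov.record("trim", {"trailing": "N"},
            {"reads.fq": raw}, {"trimmed.fq": trimmed})
```

<p>Re-running a step with the same tool, parameters and input hashes should reproduce the same output hashes - which is exactly the reproducibility check LS wants, given how sensitive pipelines are to small changes.</p>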
<p><strong>DW: Any other points? Any predictions for the next 5-10 years?</strong></p>
<p><strong>LS:</strong> In the next 10-15min, we’ll all enjoy a nice reception.</p>
<p><strong>MG:</strong> sports genomics and superstar genomics</p>
<p><strong>DH:</strong> I see turmoil and opportunity - research projects talking to each other at a large scale. Work with clinical world.</p>
<p><strong>JM:</strong> Great promise for translation. we’re doing better at identifying the genetic variants and signatures associated with disease. Beginning to make progress on mechanism. Treatment is a greater challenge - hopefully it will come. </p>
<p><strong>LS:</strong> The nature of the clinical trial is going to change - not just a single region/centre with 100 patients. Globally distributed clinical trials - networks of independent physicians. Patients with rare genetic variants enrolled. Precision genetics clinical trials.</p>
<p><strong>LC:</strong> Hope: we can start answering basic biological questions and providing clinical outcomes</p>
<p><strong>AC:</strong> Predict: tools will become more robust: Clinical applications - cancer will lead the way. Drug companies will combine genotype and phenotype data. The majority of sequencing will be cattle, plants ($2 a plant!)- humans are backwards.</p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-imagingparmacogenomics2014-04-01T06:05:41-07:002014-04-01T06:05:41-07:00Big Data in Biology: Imaging/Parmacogenomics<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="imagingparmacogenomics_1">Imaging/Pharmacogenomics <a class="head_anchor" href="#imagingparmacogenomics_1">#</a>
</h1>
<p>Tuesday, March 25th, 2014 1:00pm - 3:00pm<br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288362">http://ks.eventmobi.com/14f2/agenda/35704/288362</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Susan Sunkin</strong>, Allen Institute for Brain Science, USA<br><br>
<a href="#sunkin"><em>Allen Brain Atlas: An Integrated Neuroscience Resource</em></a> -<br>
[<a href="#sunkin-abstract">Abstract</a>]<br>
[<a href="#sunkin-qa">Q&A</a>]</p>
<p><strong>Jason R. Swedlow</strong>, University of Dundee, Scotland<br><br>
<a href="#swedlow"><em>The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences</em></a> -<br>
[<a href="#swedlow-abstract">Abstract</a>]<br>
[<a href="#swedlow-qa">Q&A</a>]</p>
<p><strong>Douglas P. W. Russell</strong>, University of Oxford, UK<br><br>
<a href="#russell"><em>Short Talk: Decentralizing Image Informatics</em></a> -<br>
[<a href="#russell-abstract">Abstract</a>]<br>
[<a href="#russell-qa">Q&A</a>]</p>
<p><strong>John Overington</strong>, European Molecular Biology Laboratory, UK<br><br>
<a href="#overington"><em>Spanning Molecular and Genomic Data in Drug Discovery</em></a> -<br>
[<a href="#overington-abstract">Abstract</a>]<br>
[<a href="#overington-qa">Q&A</a>]</p>
<hr>
<h2 id="a-namesunkinallen-brain-atlas-an-integrated-n_2">
<a name="sunkin">Allen Brain Atlas: An Integrated Neuroscience Resource</a> <a class="head_anchor" href="#a-namesunkinallen-brain-atlas-an-integrated-n_2">#</a>
</h2><h3 id="susan-sunkin-allen-institute-for-brain-scienc_3">Susan Sunkin, Allen Institute for Brain Science, USA <a class="head_anchor" href="#susan-sunkin-allen-institute-for-brain-scienc_3">#</a>
</h3><blockquote>
<h4 id="a-namesunkinabstractabstracta_4">
<a name="sunkin-abstract">Abstract</a> <a class="head_anchor" href="#a-namesunkinabstractabstracta_4">#</a>
</h4>
<p>The Allen Brain Atlas (<a href="http://www.brain-map.org">www.brain-map.org</a>) is a collection of open public resources (2 PB of raw data, >3,000,000 images) integrating high-resolution gene expression, structural connectivity, and neuroanatomical data with annotated brain structures, offering whole-brain and genome-wide coverage. The eight major resources currently available span across species (mouse, monkey and human) and development. In mouse, gene expression data covers the entire brain and spinal cord at multiple developmental time points through adult. Mouse data also includes brain-wide long-range axonal projections in the adult mouse as part of the Allen Mouse Brain Connectivity Atlas.</p>
<p>Complementing the mouse atlases, there are four human and non-human primate atlases. The Allen Human Brain Atlas, the NIH-funded BrainSpan Atlas of the Developing Human Brain, and the NIH Blueprint NHP Atlas contain genome-wide gene expression data (microarray and/or RNA sequencing) and high-resolution in situ hybridization (ISH) data for selected sets of genes and brain regions across human and non-human primate development and/or in adult. In addition, the Ben and Catherine Ivy Foundation-funded Ivy Glioblastoma Atlas Project contains gene expression data in human glioblastoma.</p>
<p>While the Allen Brain Atlas data portal serves as the entry point and enables searches across data sets, each atlas has its own web application and specialized search and visualization tools that maximize the scientific value of those data sets. Tools include gene searches; ISH image viewers and graphical displays; microarray and RNA sequencing data viewers; Brain Explorer® software for 3D navigation and visualization of gene expression, connectivity and anatomy; and an interactive reference atlas viewer. For the mouse, integrated search and visualization is through automated signal quantification and mapping to a common reference framework. In addition, cross data set searches enable users to query multiple Allen Brain Atlas data sets simultaneously.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>10 years of work and contributions from >200 people.</p>
<h5 id="allen-institute-primarily-studying-mouse-amp_5">Allen Institute: primarily studying mouse & human <a class="head_anchor" href="#allen-institute-primarily-studying-mouse-amp_5">#</a>
</h5>
<ul>
<li>largest publicly available neuroscience resource</li>
<li>gene expression to connectivity, cell type and circuitry</li>
<li>RNA-Seq</li>
<li>generated in standardized manner then mapped to framework</li>
<li>generated 3PB of data</li>
<li>mouse brain atlas - mouse spinal cord, mouse developing, then human brain, human dev brain</li>
<li>all data accessed through data portal <a href="http://www.brain-map.org/">http://www.brain-map.org/</a>
</li>
</ul>
<h4 id="allen-mouse-brain-atlas_4">Allen mouse brain Atlas <a class="head_anchor" href="#allen-mouse-brain-atlas_4">#</a>
</h4>
<ul>
<li>genome wide cellular resolution atlas of gene expression in adult mouse brain - in situ hybridization</li>
<li>20K genes surveyed</li>
<li>informatics goals: aid search, navigation and visualization (make it easy to find what you’re looking for)</li>
</ul>
<p>Informatics pipeline, broken down into:</p>
<ul>
<li>preprocessing</li>
<li>detection</li>
<li>alignment: mapped to 3D space -> where expression occurs in the brain and how much</li>
<li>gridding</li>
<li>search</li>
<li>production - very product focused. Publicly available. Mine data and ask biological questions.
Ends with an expression data matrix</li>
</ul>
<p>Tools to harness data generated from the pipeline</p>
<ul>
<li>3d viewing tool to view neuro-anatomy and 3d gene expression for one or multiple experiments</li>
<li>gene expression summaries</li>
<li>synchronization feature- same location different experiments</li>
<li>image tool etv - higher resolution image viewer. interactive 3D representation. probe and gene data available. histogram of expression energy.
Nice snapshot of expression; users decide if they’ll do a deeper dive into the info</li>
<li>Reference atlas -
<ul>
<li>structure ontology</li>
<li>annotated reference atlas plates</li>
<li>can look at experimental image and look up regions</li>
</ul>
</li>
<li>grid data search - users can search over 25K datasets to find genes with specific expression pattern
<ul>
<li>
<em>differential search</em>: high expression in one set (target) compared to contrast</li>
<li>
<em>correlative search</em>: find genes with similar spatial expression profile</li>
</ul>
</li>
</ul>
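<p>The correlative search described above can be sketched as ranking genes by the correlation of their flattened expression-energy grids against a seed gene. The grids and gene names below are made up for illustration; the real atlas works on 3D grids at scale:</p>

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def correlative_search(seed, grids):
    """Rank genes by spatial similarity of their (flattened)
    expression-energy grids to the seed gene's grid."""
    return sorted(grids, key=lambda g: pearson(grids[seed], grids[g]),
                  reverse=True)

# Hypothetical 4-voxel expression-energy grids
grids = {
    "geneA": [9.0, 1.0, 8.0, 0.5],
    "geneB": [8.5, 0.8, 7.9, 0.4],   # similar spatial pattern to geneA
    "geneC": [0.3, 7.0, 0.2, 9.1],   # roughly inverted pattern
}
```

<p>A differential search is the complementary query: instead of correlating against a seed gene, score each gene by mean expression in a target structure versus a contrast structure.</p>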
<h4 id="developing-mouse-brain-atlas_4">Developing mouse brain atlas <a class="head_anchor" href="#developing-mouse-brain-atlas_4">#</a>
</h4>
<ul>
<li>build on allen mouse brain atlas</li>
<li>pick genes for neural development</li>
<li>use reference atlas</li>
<li>creation of 3D and 4D tools and data analysis</li>
<li>high quality specimens selected, stained, generate images, annotate regions, make 2d and 3d output (Adobe Illustrator)</li>
<li>Search and analysis tools - pick 2d images and get extrapolated 3d expression</li>
<li>Imaging synchronization feature - variety of transcription factor targets
<ul>
<li>select location as seed object</li>
<li>will synchronize all the images you are looking at to the same location</li>
</ul>
</li>
</ul>
<p>Allen mouse connectivity atlas</p>
<ul>
<li>high-res map of neural connections in the whole mouse brain.
Generate a comprehensive db of neural projections.
Generate 140 images per specimen at 100 micron intervals</li>
<li>after injection, one mouse brain is embedded and placed on the stage, two-photon images are taken, then the brain is moved over and a section slice taken off, then another image is taken.
Block-face imaging throughout the entire brain</li>
<li>looking at fluorescent projections</li>
<li>spatially map brain to 3D reference model</li>
<li>comprehensive coverage for projection mapping - wt mouse, but interested in cell type.
Projection profiling with Cre-driver mice</li>
<li>can look at trajectory and topography</li>
</ul>
<p>Other tools - brain-wide data - can pinpoint a region of interest and dive deeper</p>
<h4 id="allen-human-brain-atlas_4">Allen Human Brain Atlas <a class="head_anchor" href="#allen-human-brain-atlas_4">#</a>
</h4>
<ul>
<li>all genes - all structures. classical histology and neuroanatomy</li>
<li>cellular resolution data - scale. only looked at a subset of genes on a subset of structures (very question driven, autism, schizophrenia, etc)</li>
<li>not possible to process the whole genome on a whole brain. generate large slabs - create a jigsaw puzzle and assemble at the end</li>
<li>generate histology data, neuroanatomical regions of interest generated</li>
<li>LIMS system to assemble the puzzle</li>
<li>structural ontology - to generate summary stats</li>
<li>Search: search by gene or structure, neuroblast correlative search, differential search</li>
<li>3D brain explorer</li>
<li>Tissue acquisition processing. postmortem brains. no neuropsychiatric disorder</li>
<li>MR Registration volume renderings: rigid and non-rigid registering had to be done</li>
<li>tissue sampling: slabs partitioned, sectioned and map back in MR space</li>
<li>tissue block to MR Registration: place landmarks on scans matched with corresponding image in 3d space</li>
</ul>
<h4 id="developing-human-brain-project_4">Developing human Brain project <a class="head_anchor" href="#developing-human-brain-project_4">#</a>
</h4>
<p>four main components</p>
<ol>
<li>developmental transcriptome</li>
<li>prenatal microarray: hi res, 300 distinct structures</li>
<li>ISH: just a subset of regions/genes</li>
<li>reference atlases: few generated for this project (prenatal and adults), include histology and imaging data</li>
</ol>
<p>Prenatal - LMD Microarray Data</p>
<ul>
<li>fresh tissue frozen and slabbed</li>
<li>histology determines regions of interest</li>
<li>sent for hybridization to Agilent microarrays.
same as adult data for cross-comparison</li>
<li>display with online tool:
anatomical view and heat map view</li>
</ul>
<h4 id="a-namesunkinqaq-amp-aa_4">
<a name="sunkin-qa">Q & A</a> <a class="head_anchor" href="#a-namesunkinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>interested in how labour intensive the human tissue blocks were - were the markers placed by hand?</strong></p>
<p><strong>A:</strong> Not for every Z level of the MRI, but yes labour intensive. Many steps in order to use the automated pipeline.</p>
<p><strong>Q:</strong> (Schatz) <strong>at CSHL big study in exome sequencing - which of these genes are expressed in brain at various levels of development?</strong></p>
<p><strong>A:</strong> Use our API to pull out data from different datasets to produce that.</p>
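<p>The answer refers to the Allen Brain Atlas API at api.brain-map.org. As a hedged illustration, an RMA-style query URL for expression datasets for one gene might be built like this - the endpoint and criteria syntax here are my recollection, not from the talk, and should be checked against the current API documentation:</p>

```python
from urllib.parse import urlencode

# Assumed RMA endpoint for the Allen Brain Atlas API (verify against
# the api.brain-map.org docs before relying on it).
BASE = "http://api.brain-map.org/api/v2/data/query.json"

def gene_expression_query(gene_symbol, rows=25):
    """Build a query URL for section datasets of one gene (sketch)."""
    criteria = (
        "model::SectionDataSet,"
        f"rma::criteria,genes[acronym$eq'{gene_symbol}'],"
        "rma::include,genes,plane_of_section"
    )
    return BASE + "?" + urlencode(
        {"criteria": criteria, "num_rows": rows, "start_row": 0})

url = gene_expression_query("Pdyn")  # hypothetical example gene
```

<p>The idea of the Q&amp;A stands regardless of exact syntax: cross-dataset questions like "which exome-study genes are expressed in the developing brain" are answered by scripting against the API rather than the web viewers.</p>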
<p><strong>Q: Different imaging methods and approaches - what’s the Allen’s approach to presenting the information in some way that could be queried across different domains and at the cell level?</strong></p>
<p><strong>A:</strong> The level of registration is not down to cell - it’s domains.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameswedlowthe-open-microscopy-environment_2">
<a name="swedlow">The Open Microscopy Environment: Open Source Image Informatics for the Biological Sciences</a> <a class="head_anchor" href="#a-nameswedlowthe-open-microscopy-environment_2">#</a>
</h2><h3 id="jason-r-swedlow-university-of-dundee-scotland_3">Jason R. Swedlow, University of Dundee, Scotland <a class="head_anchor" href="#jason-r-swedlow-university-of-dundee-scotland_3">#</a>
</h3><blockquote>
<h4 id="a-nameswedlowabstractabstracta_4">
<a name="swedlow-abstract">Abstract</a> <a class="head_anchor" href="#a-nameswedlowabstractabstracta_4">#</a>
</h4>
<p>Despite significant advances in cell and tissue imaging instrumentation and analysis algorithms, major informatics challenges remain unsolved: file formats are proprietary, facilities to store, analyze and query numerical data or analysis results are not routinely available, integration of new algorithms into proprietary packages is difficult at best, and standards for sharing image data and results are lacking. We have developed an open-source software framework to address these limitations called the Open Microscopy Environment (<a href="http://openmicroscopy.org">http://openmicroscopy.org</a>). OME has three components—an open data model for biological imaging, standardised file formats and software libraries for data file conversion and software tools for image data management and analysis.</p>
<p>The OME Data Model (<a href="http://openmicroscopy.org/site/support/ome-model/">http://openmicroscopy.org/site/support/ome-model/</a>) provides a common specification for scientific image data and has recently been updated to more fully support fluorescence filter sets, the requirement for unique identifiers, screening experiments using multi-well plates.</p>
<p>The OME-TIFF file format (<a href="http://openmicroscopy.org/site/support/ome-model/ome-tiff">http://openmicroscopy.org/site/support/ome-model/ome-tiff</a>) and the Bio-Formats file format library (<a href="http://openmicroscopy.org/site/products/bio-formats">http://openmicroscopy.org/site/products/bio-formats</a>) provide an easy-to-use set of tools for converting data from proprietary file formats. These resources enable access to data by different processing and visualization applications, sharing of data between scientific collaborators and interoperability in third party tools like Fiji/ImageJ. </p>
<p>The Java-based OMERO platform (<a href="http://openmicroscopy.org/site/products/omero">http://openmicroscopy.org/site/products/omero</a>) includes server and client applications that combine an image metadata database, a binary image data repository and visualization and analysis by remote access. The current stable release of OMERO (OMERO-4.4; <a href="http://openmicroscopy.org/site/support/omero4/downloads">http://openmicroscopy.org/site/support/omero4/downloads</a>) includes a single mechanism for accessing image data of all types– regardless of original file format– via Java, C/C++ and Python and a variety of applications and environments (e.g., ImageJ, Matlab and CellProfiler). This version of OMERO includes a number of new functions, including SSL-based secure access, distributed compute facility, filesystem access for OMERO clients, and a scripting facility for image processing. An open script repository allows users to share scripts with one another. A permissions system controls access to data within OMERO and enables sharing of data with users in a specific group or even publishing of image data to the worldwide community. Several applications that use OMERO are now released by the OME Consortium, including a FLIM analysis module, an object tracking module, two image-based search applications, and an automatic image tagging application.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Representing a consortium of 10 different groups across the US, UK and Europe<br>
Outline: </p>
<ul>
<li>Problem, </li>
<li>2 possible solutions, </li>
<li>sharing and publishing data, </li>
<li>directions, </li>
<li>imaging community, </li>
<li>publishing large imaging datasets</li>
</ul>
<h4 id="problem_4">Problem <a class="head_anchor" href="#problem_4">#</a>
</h4>
<ul>
<li>image: cancer cell preparing to divide in mitosis.</li>
<li>In the early days, taking such an image was a big deal - huge improvement. detectors and computation power.</li>
<li>we take these images and work hard to get them on journal covers</li>
</ul>
<p>BUT - the most important thing to understand: </p>
<ul>
<li><strong>every one of these pixels is a quantitative measurement</strong></li>
<li>this is a temporally resolved measurement</li>
<li>easy to generate 50G of data in an afternoon.
biologists are enterprise data generators</li>
<li>trying to use these images as measurement.
this data should be a resource - collaboration, release the data to the community</li>
<li>the image problem is ubiquitous, electron microscopy, physiology, cells, in vivo, pathology, and more ->
all major enterprise data generators</li>
<li>the scientists that use these technologies are not data scientists.
they need these kinds of technologies and have ambition to make measurements at scale, but not tools</li>
</ul>
<h4 id="2-possible-solutions_4">2 Possible Solutions <a class="head_anchor" href="#2-possible-solutions_4">#</a>
</h4>
<ul>
<li>aspire to build solutions that address all these domains</li>
</ul>
<p>OME - towards image informatics</p>
<ul>
<li>do not create new imaging tools, visualization</li>
<li>all about interoperability:
<ul>
<li>some new imaging modality is developed and can be accessed by existing tools</li>
<li>new method for image analysis can be run on existing modalities</li>
<li>modalities are changing so quickly - standards are useless</li>
<li>no matter what’s coming off this imaging system, some tool will be able to interact</li>
</ul>
</li>
</ul>
<p>OME - founded over lunch with cell biologists</p>
<ul>
<li>well plates becoming popular</li>
<li>people making microscopes, chemical libraries and cell lines -> no one is doing anything about the data coming off</li>
<li>partner with other institutions - open source work (GPL license)</li>
<li>public road mapping, GitHub, continuous integration, Kanban</li>
<li>release:
<ul>
<li>specification for data - OME-TIFF - open image data file</li>
<li>bio-formats</li>
<li>Omero, image-data management platform</li>
</ul>
</li>
</ul>
<p>Open data formats: spend time worrying about the OME data model (XML-based specs for datatypes).<br>
Around the image acquisition event itself: model status of detector, lens, etc</p>
<p>Bio-Formats</p>
<ul>
<li>simple and tedious: reverse engineer proprietary formats, java lib, read each one convert to common model</li>
<li>doing this for 10 years</li>
<li>we get data from the community</li>
<li>best collection of imaging files in the world:
don’t have facilities to do anything other than hold this privately</li>
<li>installed at 65K sites worldwide</li>
<li>2 FTEs working on this project</li>
<li>standardize interface to all formats</li>
</ul>
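<p>The "standardized interface to all formats" idea can be sketched as a reader registry: each proprietary format gets a reader keyed on the file’s leading magic bytes, and one entry point dispatches to whichever reader recognizes the data. This is a toy sketch of the pattern with invented format names, not Bio-Formats’ actual code (which is a Java library):</p>

```python
class Image:
    """Minimal common model: dimensions plus free-form metadata."""
    def __init__(self, width, height, metadata):
        self.width, self.height, self.metadata = width, height, metadata

READERS = []

def reader(magic):
    """Decorator registering a reader keyed by leading magic bytes."""
    def wrap(fn):
        READERS.append((magic, fn))
        return fn
    return wrap

@reader(b"FAKE1")
def read_fake1(data):
    # A real reader would parse the proprietary layout here.
    return Image(64, 64, {"format": "fake1"})

@reader(b"FAKE2")
def read_fake2(data):
    return Image(128, 128, {"format": "fake2"})

def open_image(data: bytes) -> Image:
    """One entry point regardless of original file format."""
    for magic, fn in READERS:
        if data.startswith(magic):
            return fn(data)
    raise ValueError("unrecognized format")

img = open_image(b"FAKE2" + b"\x00" * 16)
```

<p>Adding support for a new vendor format then means adding one reader, and every downstream tool that talks to the common model gets it for free - which is the interoperability point of the talk.</p>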
<h6 id="omero_6">OMERO <a class="head_anchor" href="#omero_6">#</a>
</h6>
<ul>
<li>clients on top, servers on bottom</li>
<li>storage on images - relational for metadata: HDF5 based structure</li>
<li>text search</li>
<li>building Omero - solve a problem in a lab, an institute repository, a journal, a national repo</li>
<li>idea is that we have to support as many client architectures as we can.
Ice - middleware, used by Skype, great for large data graphs/binary data
<a href="http://www.zeroc.com/ice.html">http://www.zeroc.com/ice.html</a>
</li>
<li>rich java client- Omero insight
<ul>
<li>tree based files, thumbnail, region views</li>
<li>client-server architecture - 300G of data viewed across the wire</li>
<li>remote-access</li>
</ul>
</li>
<li>web based view (x-platform)</li>
<li>high content assays - modelled in data model</li>
<li>digital pathology - tile based viewer.
web based and java based on same api</li>
</ul>
<p>results</p>
<ol>
<li>treat result outputs as an annotation,</li>
<li>text based, indexed with Lucene</li>
<li>large tabular results - relational HDF5</li>
</ol>
<p><em>// accidentally closed my browser…//</em></p>
<h4 id="sharing-and-publishing-data_4">Sharing and Publishing data <a class="head_anchor" href="#sharing-and-publishing-data_4">#</a>
</h4>
<ul>
<li>sharing data: e.g. lab web page, few lines of js, embed viewer</li>
<li>institutional repo:
publish paper, release data based on Omero based system</li>
<li>public resources:
compiling dynamic data</li>
<li>PDB, EMDataBank - publishing with OMERO</li>
</ul>
<h4 id="directions_4">Directions <a class="head_anchor" href="#directions_4">#</a>
</h4>
<p>how do we build an application that can work in a rapidly changing field like imaging?</p>
<ul>
<li>leverage the OME model</li>
<li>meta-compute - </li>
</ul>
<p>example: using Galaxy, clinical data set - need a metadata management system</p>
<ul>
<li>uses Omero underneath to store metadata</li>
<li>problem: every time there’s a new gene release it needs to recalculate.
Changed the data model to handle metadata.
Also used Omero for histological images</li>
</ul>
<p>Uses of Omero</p>
<ul>
<li>Omero and ImageJ - plugins</li>
<li>MATLAB and Omero</li>
<li>Omero & u-track (custom object tracking software - MATLAB based)</li>
<li>Omero & FLIMfit - fluorescence lifetime</li>
<li>Omero.searcher</li>
<li>Omero & auto-tagging
<ul>
<li>user trying to access data - scan data and pick up tags</li>
<li>figure: when we submit figure to journals, wrestle with adobe illustrator</li>
<li>always remove from original data structure and create a jpeg - lose original context</li>
<li>js based viewer - to keep linkage between representation of data and data itself.
figure = js / not tiff</li>
</ul>
</li>
<li>Omero and bioformats
<ul>
<li>data import and access</li>
<li>digital pathology and high-content screenings</li>
<li>data will be written once (at multi TB scales) use Omero and pull image off directly - don’t copy data</li>
</ul>
</li>
</ul>
<h4 id="imaging-community_4">Imaging Community <a class="head_anchor" href="#imaging-community_4">#</a>
</h4>
<ul>
<li>Annual user meetings</li>
<li>active community of open source projects</li>
<li>working towards progress</li>
</ul>
<h4 id="publishing-large-imaging-datasets_4">Publishing Large Imaging Datasets <a class="head_anchor" href="#publishing-large-imaging-datasets_4">#</a>
</h4>
<p>publishing image data: PerkinElmer’s Columbus - OMERO in a box</p>
<p>Journal of Cell Biology - built the JCB viewer - JavaScript-based</p>
<ul>
<li>large image data</li>
<li>digital pathology to scale</li>
</ul>
<p>phenotypic screening - high-content screens</p>
<ul>
<li>many TB of data</li>
<li>published data, author calls, genomic information</li>
<li>authors listed free-text descriptions of the phenotypes they saw</li>
<li>cell phenotype database @ EBI
<ul>
<li>combines all published high-content screens</li>
<li>takes the manual author annotations</li>
<li>creates an ontology: a common way to annotate this data</li>
</ul>
</li>
</ul>
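<p>The ontology step above - collapsing authors’ free-text phenotype descriptions onto shared term ids - can be sketched with a simple synonym table. The term ids and phrases below are illustrative, not the EBI database’s actual vocabulary:</p>

```python
# Toy synonym table: free-text phenotype -> ontology term id (made-up ids)
SYNONYMS = {
    "round cells": "CMPO:0000001",
    "rounded cells": "CMPO:0000001",
    "cell rounding": "CMPO:0000001",
    "elongated cells": "CMPO:0000002",
}

def normalize(phrase):
    """Lowercase and collapse whitespace so spelling variants match."""
    return " ".join(phrase.lower().split())

def map_annotations(free_text_terms):
    """Map authors' free-text phenotypes onto shared ontology ids;
    report the terms that need manual curation."""
    mapped, unmapped = {}, []
    for term in free_text_terms:
        key = normalize(term)
        if key in SYNONYMS:
            mapped[term] = SYNONYMS[key]
        else:
            unmapped.append(term)
    return mapped, unmapped
```

<p>Anything landing in <code>unmapped</code> is exactly the kind of annotation that drives new ontology terms.</p>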
<p>More datatypes, more storage, more analysis</p>
<h4 id="a-nameswedlowqaq-amp-aa_4">
<a name="swedlow-qa">Q & A</a> <a class="head_anchor" href="#a-nameswedlowqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>A number of the image formats are copyrighted, etc. What is your experience as you reverse engineer these formats? Legal problems?</strong></p>
<p><strong>A:</strong> Almost every commercial vendor, when they build a new imaging system, builds a new image format. That’s just changing now. In general, if you look at the end-user license, it will forbid you from reverse engineering. It does not forbid you uploading it to us so we can reverse engineer it. That’s what we do. In the last few years, vendors have been coming to us: please make sure this file format is supported on the date we release it. Sometimes they take our metadata specs and drop them into theirs. A lot is opening up and people are more willing to work with us.</p>
<p><strong>Q: From a CS lab that does open source dev: you said you release everything GPL. We release everything Apache - a lot of people in industry like it better. Why choose GPL? Feedback?</strong></p>
<p><strong>A:</strong> Short version: when we started, there wasn’t the richness in licenses. To be blunt, we want people to contribute. As the guy who has to pay an enormous number of salaries: we’re fine when a company wants to use our software, but we need some way to keep the project going and feed everyone. We get a licensing fee from PerkinElmer (closed source) to help development.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namerussellshort-talk-decentralizing-image_2">
<a name="russell">Short Talk: Decentralizing Image Informatics</a> <a class="head_anchor" href="#a-namerussellshort-talk-decentralizing-image_2">#</a>
</h2><h3 id="douglas-pw-russell-university-of-oxford-uk_3">Douglas P.W. Russell, University of Oxford, UK <a class="head_anchor" href="#douglas-pw-russell-university-of-oxford-uk_3">#</a>
</h3>
<p>Department of Biochemistry<br>
member of the Open Microscopy consortium</p>
<blockquote>
<h4 id="a-namerussellabstractabstracta_4">
<a name="russell-abstract">Abstract</a> <a class="head_anchor" href="#a-namerussellabstractabstracta_4">#</a>
</h4>
<p>The Open Microscopy Environment (OME; <a href="http://openmicroscopy.org">http://openmicroscopy.org</a>) builds software tools that facilitate image informatics. An open file format (OME-TIFF) and software library (Bio-Formats) enable the free access to multidimensional (5D+) image data regardless of software or platform. A data management server (OMERO) provides an image data management solution for labs and institutes by centralizing the storage of image data and providing the biologist a means to manage that data remotely through a multi-platform API. This is made possible by the Bio-Formats library, extracting image metadata into a PostgreSQL database for fast lookup, and multi-zoom image previews enable visual inspection without the cost of transmitting the actual raw data to the user. In addition to the convenience for individual biologists, sharing data with collaborators becomes simpler and avoids data duplication.</p>
<p>Addressing the next scale of data challenges, e.g. at the national or international level, has brought the OME platform up against some hard barriers. Already, the data output of individual imaging systems has grown to the multi-TB level. Integrating multi-TB datasets from dispersed locations, and integrating analysis workflows will soon challenge the basic assumptions that underly a system like OMERO. This is particularly true for automated processing: OMERO.scripts provides a facility for running executables in the locality of the data. The use of ZeroC’s IceGrid permits farming out such tasks in Python, C++, Java, and in OMERO5 even ImageJ2 tasks to nodes which all use the same remote API. However, OMERO does not yet provide a solution for decentralised data and workflow management. </p>
<p>A logical next step for OMERO is to decentralize the data by increasing the proximity of data storage to processing resources, reducing bottlenecks through redundancy, and enabling vast data storage on commodity hardware rather than expensive, enterprise storage.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="how-omero-can-scale-with-big-data-higher-dema_5">How OMERO can scale with big data, higher demand <a class="head_anchor" href="#how-omero-can-scale-with-big-data-higher-dema_5">#</a>
</h5>
<p>1) as scope and # of users increase, total data increases</p>
<ul>
<li>one end: 1 user or a small group of users</li>
<li>a user with a minimal amount of sysadmin experience can install it and get it working</li>
<li>other end: national resources, institutes: need a serious sysadmin team</li>
<li>tradeoffs: </li>
</ul>
<p>2) Dataset size: high-content screen</p>
<ul>
<li>many images in each well, many dimensions</li>
<li>phenotypic data attached to each well</li>
<li>links to external genomic resources</li>
<li>all of this is a huge amount of data. One screen can be TBs in size</li>
</ul>
<p>Once data is in OMERO, it’s an excellent data management tool</p>
<ul>
<li>until you get it in there - need to make choices about how to put it in</li>
<li>smaller scale: import data and archive the original image; extract metadata for search</li>
<li>when analysis needs pixel data - extract at runtime</li>
<li>in reality, users need access to the filesystem where the raw data is; moving data around is infeasible.
Now: extract metadata and keep a reference to where the raw file is.
Helps with the data duplication problem</li>
<li>preferable to store data in a read-optimized format.
Trade some operational efficiency for some possible data loss</li>
</ul>
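<p>The “extract metadata, keep a reference to the raw file” pattern can be sketched with an in-memory SQLite table standing in for OMERO’s metadata store (the schema and paths are invented for illustration - OMERO actually uses PostgreSQL and a much richer model):</p>

```python
import sqlite3

# In-memory stand-in for a metadata store: searchable metadata lives in the
# database; only a *reference* to the raw file on the filesystem is kept.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    name TEXT, size_x INTEGER, size_y INTEGER, channels INTEGER,
    raw_path TEXT)""")

def register_image(name, size_x, size_y, channels, raw_path):
    """Record metadata and the raw file's location - without copying pixels."""
    cur = conn.execute(
        "INSERT INTO image (name, size_x, size_y, channels, raw_path) "
        "VALUES (?, ?, ?, ?, ?)",
        (name, size_x, size_y, channels, raw_path))
    return cur.lastrowid

def find_images(min_channels):
    """Metadata query; pixel data is only touched later, via raw_path."""
    rows = conn.execute(
        "SELECT name, raw_path FROM image WHERE channels >= ?",
        (min_channels,))
    return rows.fetchall()

register_image("plate1_A01", 2048, 2048, 4, "/data/screens/plate1/A01.ome.tiff")
```

<p>Search and browsing run entirely against the metadata; analysis code follows <code>raw_path</code> to the original file at runtime, avoiding duplication.</p>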
<h5 id="omero-services_5">OMERO services <a class="head_anchor" href="#omero-services_5">#</a>
</h5>
<ul>
<li>all run on Ice</li>
<li><a href="http://www.zeroc.com/ice.html">http://www.zeroc.com/ice.html</a></li>
<li>process, indexer, and more - all on Ice</li>
</ul>
<p>Ice gives us the capability to distribute some services to other hosts </p>
<ul>
<li>pretty seamless - can take advantage of local compute</li>
<li>can do this multiple times to access more compute resources</li>
<li>but then each has to communicate back to the original</li>
<li>=> decentralizing OMERO</li>
</ul>
<p>Decentralized</p>
<ul>
<li>access data directly - both servers can access resources (filesystem) directly</li>
<li>once we have that, we can scale - more servers</li>
<li>this has the potential to address image management at scale</li>
<li>can deploy many OMERO components on many hosts - make it more powerful, absorb volumes of data</li>
<li>can take advantage of cloud computing - can scale permanently or temporarily - spin up more hosts</li>
<li>will be necessary to augment OMERO’s resources with distributed filesystems - to store huge amounts of pixel or image data</li>
<li>can also make use of Cassandra clusters - caching frequently accessed data.
much bigger scale</li>
</ul>
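<p>A core question in the decentralized picture above is which server owns which image. One toy way to make that decision deterministic is hash-based placement - this is an illustrative sketch, not OMERO’s actual mechanism, and the host names are invented:</p>

```python
import hashlib

HOSTS = ["omero-a", "omero-b", "omero-c"]   # hypothetical server names

def owner(image_id, hosts=HOSTS):
    """Deterministically assign an image to a host by hashing its id.
    Every server computes the same answer with no central lookup."""
    digest = hashlib.sha256(str(image_id).encode()).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]

# Any node can route a request for image 42 to the right host:
target = owner(42)
```

<p>Real systems use consistent hashing (so adding a host moves only a fraction of the data) plus replication for redundancy, but the routing idea is the same.</p>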
<p>That’s how we’d like to cope with big data in OMERO while keeping it accessible for a single user who wants to install it locally</p>
<p><a href="https://github.com/openmicroscopy">github.com/openmicroscopy</a></p>
<h4 id="a-namerussellqaq-amp-aa_4">
<a name="russell-qa">Q & A</a> <a class="head_anchor" href="#a-namerussellqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>are you considering map-reduce or just storage?</strong></p>
<p><strong>A:</strong> we could definitely use them both, yes</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameoveringtonspanning-molecular-and-genomi_2">
<a name="overington">Spanning Molecular and Genomic Data in Drug Discovery</a> <a class="head_anchor" href="#a-nameoveringtonspanning-molecular-and-genomi_2">#</a>
</h2><h3 id="john-overington-european-molecular-biology-la_3">John Overington, European Molecular Biology Laboratory, UK <a class="head_anchor" href="#john-overington-european-molecular-biology-la_3">#</a>
</h3><blockquote>
<h4 id="a-nameoveringtonabstractabstracta_4">
<a name="overington-abstract">Abstract</a> <a class="head_anchor" href="#a-nameoveringtonabstractabstracta_4">#</a>
</h4>
<p>The link between biological and chemical worlds is of critical importance in many fields, not least that of healthcare and chemical safety assessment. A major focus in the integrative understanding of biology are genes/proteins and the networks and pathways describing their interactions and functions; similarly, within chemistry there is much interest in efficiently identifying drug-like, cell-penetrant compounds that specifically interact with and modulate these targets. The number of genes of interest is of the range of 105 to 106, which is modest with respect to plausible drug-like chemical space - 1020 to 1060. We have built a public database linking chemical structures (106) to molecular targets (104), covering molecular interactions and pharmacological activities and Absorption, Distribution, Metabolism and Excretion (ADME) properties (<a href="http://www.ebi.ac.uk/chembl">http://www.ebi.ac.uk/chembl</a>) in an attempt to map the general features of molecular properties and features important for both small molecule and protein targets in drug discovery. We have then used this empirical kernel of data to extend analysis across the human genome, and to large virtual databases of compound structures - we have also integrated these data with genomics datasets, such as the GWAS catalogue.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Chemistry. Mapping of Chemistry - interface of chemistry with genomic and drug discovery data.</p>
<h5 id="background_5">Background <a class="head_anchor" href="#background_5">#</a>
</h5>
<p>chemical space: how big is chemical space? GDB-13 - all possible (stable) molecules with up to 13 heavy atoms</p>
<ul>
<li>1B structures</li>
<li>the largest database of small organic molecules</li>
<li>GDB-17 - 166B structures - not available. Intellectual property issues</li>
</ul>
<p>not all molecules can be drugs - they need to be bioactive</p>
<ul>
<li>physical properties determine access to the ‘target’</li>
<li>ADMET - absorption, distribution, metabolism, excretion & toxicity</li>
</ul>
<p>Lipinski - a molecule within these parameters was likely to have good oral drug properties. <a href="http://en.wikipedia.org/wiki/Lipinski's_rule_of_five">http://en.wikipedia.org/wiki/Lipinski’s_rule_of_five</a></p>
<ul>
<li>different for topical and parenterally dosed drugs</li>
<li>pretty good guide</li>
</ul>
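<p>The rule of five itself is simple enough to state as code - a compound is flagged as likely orally active if it violates no more than one of the four criteria (a minimal sketch; real pipelines compute these descriptors from structures with a cheminformatics toolkit):</p>

```python
def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    """Rule of five: likely good oral drug properties if at most
    one of the criteria is violated."""
    violations = sum([
        mol_weight > 500,    # molecular weight <= 500 Da
        logp > 5,            # octanol-water partition coefficient <= 5
        h_donors > 5,        # <= 5 hydrogen bond donors
        h_acceptors > 10,    # <= 10 hydrogen bond acceptors
    ])
    return violations <= 1

# Aspirin-like descriptor values: MW ~180, logP ~1.2, 1 donor, 4 acceptors
ok = passes_lipinski(180.2, 1.2, 1, 4)
```

<p>As the notes say, the rule is a pretty good guide for oral drugs but doesn’t apply to topically or parenterally dosed compounds.</p>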
<p>10<sup>19</sup>-10<sup>23</sup> Lipinski-like small molecules - potential drugs</p>
<p>around 21-23: peak in the curve of heavy-atom counts for drugs.<br>
drug discovery - making molecules slightly larger than they need to be</p>
<p>GDB covers ~30% of all known drugs?</p>
<p>Targets: Homo sapiens, ~21K genes.<br>
Only ~1% of the genome is a drug target we’ve been able to develop drugs against.<br>
we’ve tried many, many more</p>
<h5 id="chemogenomics-chemistry-genome-derived-object_5">Chemogenomics = chemistry + genome derived objects <a class="head_anchor" href="#chemogenomics-chemistry-genome-derived-object_5">#</a>
</h5>
<ul>
<li>exploration of small-molecule bioactivity space at genomic scale</li>
<li>possible space: 10<sup>6</sup> (targets), drug target proteins 10<sup>2</sup>
</li>
<li>drugs: all reasonable 10<sup>22</sup>, screened: 10<sup>7</sup>
</li>
<li>similar compound structures have similar functions</li>
</ul>
<h5 id="chembl-training-set-largest-db-of-medicinal-c_5">ChEMBL - training set; largest db of medicinal chemistry data 1.4M compounds <a class="head_anchor" href="#chembl-training-set-largest-db-of-medicinal-c_5">#</a>
</h5>
<ul>
<li>adding plant data later this year</li>
<li>open</li>
<li>download/access - db dumps, semantic web rdf - SPARQL, virtualization (ChEMBL appliances)</li>
<li>ChEMpi - raspberry pi</li>
<li>data comes from the literature - extract structures from the text, link to assays, link to sequences, store functional data.
allows chaining targets to phenotypic effects</li>
<li>quantitative data</li>
<li>target types: single gene - all the way to - organisms</li>
<li>compound searching - matching structure space (2D blast)</li>
</ul>
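<p>The structure-space matching above ("2D blast") rests on the idea that similar compound structures have similar functions. A standard way to score that is the Tanimoto coefficient on structural fingerprints - sketched here with toy fingerprints as sets of "on" bit positions (real pipelines derive the bits from the molecule’s substructures):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprint bit sets:
    |A & B| / |A | B|; 1.0 means identical fingerprints."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy structural fingerprints (bit positions are made up)
query = {3, 17, 42, 99, 150}
hit   = {3, 17, 42, 99}
decoy = {5, 200}

similarity = tanimoto(query, hit)    # 4 shared bits / 5 total = 0.8
```

<p>A similarity search then ranks the database by this score against the query structure, much as BLAST ranks sequences.</p>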
<p>different drug structures - ligand efficiency</p>
<ul>
<li>drugs are efficient, every atom counts - avoid lipophilicity</li>
<li>interested in balance between binding efficiency and molecular size</li>
<li>target class data</li>
</ul>
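<p>The binding-efficiency-vs-size balance has a standard metric: ligand efficiency, the free energy of binding per heavy atom. A minimal sketch using the common approximation ΔG ≈ -2.303·R·T·pIC50 (so 2.303·R·T ≈ 1.37 kcal/mol at 300 K; example values are invented):</p>

```python
def ligand_efficiency(pIC50, heavy_atoms, temp_k=300.0):
    """LE = -dG / N_heavy, with dG approximated as -2.303*R*T*pIC50,
    in kcal/mol per heavy atom."""
    R = 1.987e-3  # gas constant, kcal/(mol*K)
    return 2.303 * R * temp_k * pIC50 / heavy_atoms

# A 10 nM compound (pIC50 = 8) with 25 heavy atoms:
le = ligand_efficiency(8.0, 25)   # ~0.44 kcal/mol per heavy atom
```

<p>Since every atom counts, two compounds with the same potency can have very different LE - the smaller one is usually the better starting point.</p>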
<p>assay organism data</p>
<ul>
<li>differences between animal model and the effects of compounds in humans</li>
<li>failure in pre-clinical - works in animal models, but not in humans</li>
<li>trying to understand systematic reasons</li>
</ul>
<p>SureChEMBL - acquired SureChem</p>
<ul>
<li>new public chemistry</li>
<li>extends coverage of chemical structures from full-text patents - 15M structures</li>
<li>adds target, sequences, disease, animal model, cell-line</li>
</ul>
<p>Compound Integration</p>
<ul>
<li>ChEMBL - literature</li>
<li>SureChEMBL- patent</li>
</ul>
<p>Different Types of Drugs</p>
<ul>
<li>2/3 drugs are small molecules</li>
<li>in late stage development - majority are small molecules</li>
<li>Therefore, focus on small molecules for drug discovery</li>
</ul>
<h5 id="visualizations_5">Visualizations <a class="head_anchor" href="#visualizations_5">#</a>
</h5>
<ul>
<li>Polypharmacology via binding sites:
majority of pharmacological activity focused on brain</li>
<li>Affinity of drugs for ‘Targets’:
drugs are weaker than we think - penalty for tight binding drugs</li>
<li>Clinical Candidates:
coverage of clinical development candidates - </li>
<li>Selectivity - circos plot:
map promiscuity across tree</li>
</ul>
<h5 id="pharma-productivity-problem_5">Pharma Productivity problem <a class="head_anchor" href="#pharma-productivity-problem_5">#</a>
</h5>
<ul>
<li>biotech boom</li>
<li>productivity has fallen off a cliff</li>
</ul>
<p>how many compounds does a company need to make before they develop a drug?</p>
<ul>
<li>100K compounds synthesized to develop a drug</li>
<li>now 32x that to get a potential drug</li>
<li>Now: pharma needs on average to synthesize and test 250K compounds for each launched drug.
not sustainable</li>
<li>Trying to be smarter, use db, to help with this</li>
</ul>
<h5 id="cancer-drugs-and-targets_5">Cancer Drugs and Targets <a class="head_anchor" href="#cancer-drugs-and-targets_5">#</a>
</h5>
<ul>
<li>taking ChEMBL and thinking of drug discovery in a cancer setting</li>
<li>huge investment in genomic studies looking for genomic variation - causes of cancer.
sequencing, find driver genes, look at other datasets, find overlaps</li>
</ul>
<p>come out with a list of potential targets</p>
<ul>
<li>how do you select from these?</li>
<li>we can compare against things we had in the past</li>
<li>majority of the success from the past we would not have discovered using genomic sequencing techniques</li>
<li>canSAR - large-scale integration of public and proprietary data built on top of ChEMBL - select compounds likely to be good <a href="https://cansar.icr.ac.uk/">https://cansar.icr.ac.uk/</a>
</li>
</ul>
<h4 id="a-nameoveringtonqaq-amp-aa_4">
<a name="overington-qa">Q & A</a> <a class="head_anchor" href="#a-nameoveringtonqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ouellette) <strong>finding out the chemical structures of various organisms; What about Micro-biome space?</strong></p>
<p><strong>A:</strong> Different animals have different physical space for drugs they like. Controversy in the literature - physical space for antibiotics. Microbiome - fascinating - orally, also bacteria and guts. Effect of gut bacteria on the compound - sometimes needed to activate the substance</p>
<p><strong>Q:</strong> (Stein) <strong>Curious about the 1B+ compounds in GDB-17. Can’t release because of IP? Algorithm or structures?</strong></p>
<p><strong>A:</strong> Just too big. Drug discovery community -<br>
Publishing the structures of all possible drugs => you can’t patent them - so it would destroy all possible intellectual property.</p>
<p><strong>Q: For compounds w/ rich sequence information (transcriptome wide/proteomic) is it integrated?</strong></p>
<p><strong>A:</strong> yes and no; transcript microarray data goes into GEO or ArrayExpress. Links to compounds in ChEMBL. In reality - very small numbers right now. ChEMBL is part of a suite of resources at EBI, linked to other resources.</p>
<p><strong>Q: Is there a way through ChEMBL to discover drugs that are potentially synergistic? Drugs with same structures and hit same targets. Connectivity map? X-ref between ChEMBL and connectivity map?</strong></p>
<p><strong>A:</strong> One of the most common uses of ChEMBL - combining drugs against the same targets. No links to the connectivity map, but people have done that.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-personal-genomes2014-04-01T06:05:22-07:002014-04-01T06:05:22-07:00Big Data in Biology: Personal Genomes<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken english</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="personal-genomes_1">Personal Genomes <a class="head_anchor" href="#personal-genomes_1">#</a>
</h1>
<p>Tuesday, March 25th, 2014 8:30am - 12:00pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288359">http://ks.eventmobi.com/14f2/agenda/35704/288359</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Lincoln D. Stein</strong>, Ontario Institute for Cancer Research, Canada<br><br>
<a href="#stein"><em>The International Cancer Genome Consortium Database</em></a> -<br>
[<a href="#stein-abstract">Abstract</a>]<br>
[<a href="#stein-qa">Q&A</a>]</p>
<p><strong>Ajay Royyuru</strong>, IBM T.J. Watson Research Center, USA<br><br>
<a href="#royyuru"><em>Genome Analytics with IBM Watson</em></a> -<br>
[<a href="#royyuru-abstract">Abstract</a>]<br>
[<a href="#royyuru-qa">Q&A</a>]</p>
<p><strong>Mark Gerstein</strong>, Yale University, USA<br><br>
<a href="#gerstein"><em>Human Genome Analysis</em></a> -<br>
[<a href="#gerstein-abstract">Abstract</a>]<br>
[<a href="#gerstein-qa">Q&A</a>]<br>
[<a href="http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/">slides</a>]</p>
<p><strong>Stuart Young</strong>, Annai Systems Inc., USA<br><br>
<a href="#young"><em>The BioCompute Farm: Colocated Compute for Cancer Genomics</em></a> -<br>
[<a href="#young-abstract">Abstract</a>]<br>
[<a href="#young-qa">Q&A</a>]</p>
<p><strong>Adam Butler</strong>, Wellcome Trust Sanger Institute, UK<br><br>
<a href="#butler"><em>Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets</em></a> -<br>
[<a href="#butler-abstract">Abstract</a>]<br>
[<a href="#butler-qa">Q&A</a>]</p>
<p><strong>Maya M. Kasowski</strong>, Yale University, USA<br><br>
<a href="#kasowski"><em>Short Talk: Extensive Variation in Chromatin States Across Humans</em></a> -<br>
[<a href="#kasowski-abstract">Abstract</a>]<br>
[<a href="#kasowski-qa">Q&A</a>]</p>
<p><strong>Robert L. Grossman</strong>, University of Chicago, USA<br><br>
<a href="#grossman"><em>Short Talk: An Overview of the Bionimbus Protected Data Cloud</em></a> -<br>
[<a href="#grossman-abstract">Abstract</a>]<br>
[<a href="#grossman-qa">Q&A</a>]</p>
<hr>
<h2 id="a-namesteinthe-international-cancer-genome-co_2">
<a name="stein">The International Cancer Genome Consortium Database</a> <a class="head_anchor" href="#a-namesteinthe-international-cancer-genome-co_2">#</a>
</h2><h3 id="lincoln-d-stein-ontario-institute-for-cancer_3">Lincoln D. Stein, Ontario Institute for Cancer Research, Canada <a class="head_anchor" href="#lincoln-d-stein-ontario-institute-for-cancer_3">#</a>
</h3><blockquote>
<h4 id="a-namesteinabstractabstracta_4">
<a name="stein-abstract">Abstract</a> <a class="head_anchor" href="#a-namesteinabstractabstracta_4">#</a>
</h4>
<p>The International Cancer Genome Consortium (ICGC; <a href="http://www.icgc.org">www.icgc.org</a>) <a href="http://www.icgc.org/">http://www.icgc.org/</a> is a multinational effort to identify patterns of germline and somatic genomic variation in the major cancer types. Currently consisting of 71 cancer-specific projects spanning 18 different countries, ICGC has sequenced the tumor and normal genomes of over 10,000 donors (>20,000 genomes). When the current phase of the project is completed in 2018, we expect to have sequenced more than 25,000 donors.</p>
<p>All analyzed data from the project is available to the public, including clinical information about the donors, somatic mutations identified in the tumors, and the potential functional significance of these mutations. The raw sequencing data and other potentially-identifiable information is available to researchers who have signed an agreement promising not to attempt to identify the donors. The total data set is now 500 terabytes in size, but growing rapidly as the project switches from exome sequencing (sequencing just the transcribed regions of the genome) to whole-genome sequencing. We anticipate that the full data set will be on the order of 10 petabytes.</p>
<p>To maximize the utility of the data to the public, the analyzed data is available at the ICGC data portal (dcc.icgc.org) <a href="http://dcc.icgc.org/">http://dcc.icgc.org/</a>, where users can browse donors, mutations and genes using an attractive highperformance web application based on Elastic Search at the backend and AngularJS and D3.js on the front end. The portal uses faceted search as its dominant user interface metaphor. This allows researchers to pose general queries, such as “find all non-synonymous mutations” and then successively refine them “…affecting genes in the hedgehog pathway”, “…affecting donors with stage I disease.” A series of interactive graphics allows researchers to readily compare different sets of mutations, donors and genes.</p>
<p>A limitation of ICGC is that the raw sequencing data must still be downloaded from a static file repository. We are addressing this limitation by moving the data into the compute cloud, where software and data can be co-resident. In the Whole Genome Pan-Cancer Analysis Project, which began earlier this year, 2000 whole genome pairs from ICGC are being placed into several compute cloud analysis facilities to allow for uniform mutation-calling and data mining by ICGC researchers. In the “Cancer Genome Collaboratory”, a project just approved in March 2014, we will be placing the entire ICGC data set into two compute cloud centers for access by the general research community. I will talk about the challenges and solutions that we are working on in connection to these two projects.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>ICGC Project</p>
<ul>
<li>International Cancer Genome Sequencing Consortium</li>
<li>5th year of operation</li>
<li>multi-national collaboration</li>
<li>Includes all of the TCGA projects</li>
<li><strong>Goal: Identify the common patterns of mutation in all major cancer types</strong></li>
</ul>
<p>Simple experimental design:</p>
<ul>
<li>take normal (blood) and tumour (biopsy) samples from a series of donors</li>
<li>sequence</li>
<li>identify cancer-related mutations</li>
<li>relate mutations to tumor bio</li>
<li>translate this knowledge to improved diagnosis and treatment & make avail</li>
</ul>
<p>ICGC db growing in size - moved from exome sequencing to whole genome</p>
<ul>
<li>10K+ donors</li>
<li>4M+ somatic mutations</li>
<li>49K CNVs</li>
<li>6K+ methylation profiles</li>
</ul>
<p>Available to public - Website @ <a href="http://dcc.icgc.org">http://dcc.icgc.org</a></p>
<ul>
<li>very nice data browser</li>
<li>faceted view of various data types and donor types</li>
<li>changes in a context sensitive way</li>
<li>updates list with dynamically updated graphs/summary</li>
<li>links to raw data @ CGHub</li>
<li>view most mutated genes in selected cancer subtype.
Can keep drilling down through stats/projects.
Or look at summary - transcript level / protein level.</li>
</ul>
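<p>The faceted drill-down described above - a general query successively refined facet by facet - can be sketched with plain dicts standing in for the portal’s index (the real portal does this with ElasticSearch; the records and facet values here are invented):</p>

```python
# Toy mutation records standing in for the ICGC portal's index
mutations = [
    {"id": "MU1", "type": "non-synonymous", "pathway": "hedgehog", "stage": "I"},
    {"id": "MU2", "type": "non-synonymous", "pathway": "wnt",      "stage": "II"},
    {"id": "MU3", "type": "synonymous",     "pathway": "hedgehog", "stage": "I"},
]

def refine(records, **facets):
    """Narrow a result set, faceted-search style: keep records
    matching every selected facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

step1 = refine(mutations, type="non-synonymous")   # "all non-synonymous mutations"
step2 = refine(step1, pathway="hedgehog")          # "...in the hedgehog pathway"
step3 = refine(step2, stage="I")                   # "...in stage I donors"
```

<p>Each refinement operates on the previous result set, which is why the portal’s summary graphs can update dynamically at every step.</p>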
<p>Original Database - based on BioMart</p>
<ul>
<li>MySQL-based data mart - developed and used by the Ensembl project</li>
<li>de-normalized data schema (reverse-star schema)</li>
<li>scaled well for human and other vertebrate genomes</li>
<li>worked well until release 12</li>
<li>One problem: as the data got larger, BioMart didn’t scale</li>
<li>Release 8 & 9: three-month release cycle (freeze, prep, load, QC)</li>
<li>by release 11 - load phase taking 2-3 months! Missing the release window. Were announcing a new freeze before the new db was released</li>
</ul>
<p>September - complete rewrite of the entire DCC (Ferretti). Heavy use of distributed computing. </p>
<p>Process:</p>
<ul>
<li>genome centres submit flat files + metadata</li>
<li>validation (Hadoop cluster - HDFS distributed filesystem)</li>
<li>loaded into MongoDB (on cluster)</li>
<li>Combined w/ other info (gene annotation from Ensembl, UniProt, COSMIC, etc)</li>
<li>Indexed by ElasticSearch (another cluster)</li>
<li>Indexed info stored in Mongo - drives the portal</li>
<li>Total time for loading release 15: 42 hours (not yet optimized)</li>
</ul>
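<p>The submit → validate → enrich → index flow above can be sketched as three small pure-Python stages (the field names and annotation source are invented stand-ins; in the real DCC these stages run on Hadoop, MongoDB, and ElasticSearch):</p>

```python
def validate(submission):
    """Reject records missing required fields (the Hadoop stage's job)."""
    required = {"donor_id", "gene", "mutation"}
    return [r for r in submission if required <= r.keys()]

def enrich(records, gene_annotations):
    """Join in external gene annotation (Ensembl/UniProt in the real pipeline)."""
    return [dict(r, annotation=gene_annotations.get(r["gene"], "unknown"))
            for r in records]

def index(records):
    """Build a gene -> records lookup (ElasticSearch's job in the DCC)."""
    idx = {}
    for r in records:
        idx.setdefault(r["gene"], []).append(r)
    return idx

submission = [
    {"donor_id": "DO1", "gene": "TP53", "mutation": "R175H"},
    {"donor_id": "DO2", "gene": "KRAS"},   # invalid: no mutation field
]
idx = index(enrich(validate(submission), {"TP53": "tumour suppressor"}))
```

<p>Because each stage is a pure transformation over record batches, the whole load can be parallelized across a cluster - which is how the rewrite brought a 2-3 month load down to 42 hours.</p>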
<p>What about raw read data?</p>
<ul>
<li>~10 PB Genome data by 2018</li>
<li>depositing all genome data in EGA.
In theory, researchers go to EGA and download the data.
In practice, the data is too large. Takes too long.</li>
<li>will soon be completely inaccessible - except maybe for some large groups, or those located in the UK</li>
<li>This is an important legacy dataset that can still be mined</li>
<li>Current mutation-calling algorithms are not perfect.
Different groups have low overlap. Different filtering systems. Many false positives (e.g. titan).
Our ability to predict gene rearrangements is quite poor.</li>
<li>want to go back to the data to get more info as our algorithms improve</li>
</ul>
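<p>One simple response to the low overlap between callers is consensus calling: keep only mutations reported by a minimum number of independent callers. A toy sketch (the call tuples are invented; real pan-cancer pipelines use far more sophisticated merging):</p>

```python
from collections import Counter

def consensus(call_sets, min_support=2):
    """Keep mutation calls made by at least `min_support` callers -
    one simple way to cut false positives when callers disagree."""
    counts = Counter(call for calls in call_sets for call in set(calls))
    return {call for call, n in counts.items() if n >= min_support}

# Calls as (chromosome, position, change) tuples from three hypothetical callers
caller_a = {("chr1", 12345, "A>T"), ("chr2", 555, "G>C")}
caller_b = {("chr1", 12345, "A>T"), ("chr9", 42, "T>G")}
caller_c = {("chr1", 12345, "A>T"), ("chr2", 555, "G>C")}

kept = consensus([caller_a, caller_b, caller_c])
```

<p>Raising <code>min_support</code> trades sensitivity for precision - which is exactly why the raw reads need to stay accessible, so calls can be redone as algorithms improve.</p>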
<h5 id="the-solution-gt-the-pancancer-whole-genome-an_5">The solution => The Pan-Cancer Whole Genome Analysis Project (PAWG/Pan-Can) <a class="head_anchor" href="#the-solution-gt-the-pancancer-whole-genome-an_5">#</a>
</h5>
<ul>
<li><strong>Goal: understand what’s going on in the 95% of the cancer genome that isn’t protein-coding</strong></li>
<li>Resources: 2K whole genome tumor/normal pairs from ICGC</li>
<li>Analytic issues: calling cancer mutations in non-coding regions is an evolving art.
Need uniform pipeline.
Dataset - 0.5PB.</li>
<li>Cloud based approach - six cloud compute centres in USA, Europe, Asia</li>
<li>
<strong>Phase 1:</strong> Partition data among the data centres.
Perform alignment and mutation calling in a distributed fashion</li>
<li>
<strong>Phase 2:</strong> Synchronize alignments and mutation calls.
Each centre will have the complete set of alignments and mutation calls</li>
<li>
<strong>Phase 3:</strong> Open up (a subset of) clouds to allow researchers to do analysis</li>
</ul>
<p>Technologies: OpenStack (5 centers) and vCloud (EBI)</p>
<ul>
<li>Vagrant - VM abstraction layer (makes clouds look similar)</li>
<li>network transfer and metadata - GNOS / GeneTorrent (from Annai Systems Inc) - commercial solution</li>
<li>Workflow management - SeqWare pipeline manager (OICR & UNC developed - O'Connor), Synapse from Sage</li>
</ul>
<p>Status</p>
<ul>
<li>Ethical approval, usage agreements signed - Legal</li>
<li>OpenStack/VMware, vagrant SeqWare installed</li>
<li>alignment workflows executed on some vms</li>
</ul>
<p>Challenges</p>
<ol>
<li>Legal - regional differences have not gone away.
Datasets from TCGA (US) can be hosted by certain US-based institutions trusted by the NIH.
The NIH has not approved phase II of the project due to the way the consent was written. It can be interpreted as ‘not allowed to use on the cloud’ (but the cloud didn’t exist when the consent was written).
Europe - some countries are sensitive about distributing their data to US-based data centres (Snowden & NSA).</li>
<li>Technical - adapting grid-based HPC to use cloud-based technologies.
Running 8 weeks behind</li>
</ol>
<p>Why not a commercial cloud? Amazon, Google, MS</p>
<ul>
<li>legal and ethical issues</li>
<li>preliminary ethics approval for ICGC. Some restrictions - can’t cross regulatory borders without notice</li>
<li>NIH reviewing approval for TCGA sets</li>
</ul>
<p>What happens when Pan-Can is done ~ 1 year? The group has received funding from Canadian funders: <strong>The Cancer Genome Collaboratory</strong></p>
<ul>
<li>long-lived private cloud compute centre, pre-populated with ICGC datasets</li>
<li>any individual can create an account and access the data via api</li>
<li>have an integrated benchmarking core, bioethics, community outreach</li>
<li>Initially two physical data centres: Chicago (w/ Grossman) & Toronto. <strong>Connected by high speed link</strong>
</li>
<li>Funded as of March 1</li>
</ul>
<h4 id="a-namesteinqaq-amp-aa_4">
<a name="stein-qa">Q & A</a> <a class="head_anchor" href="#a-namesteinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>Many of us have been using BioMart and the scalability - how portable is your new system as a replacement for BioMart?</strong></p>
<p><strong>A:</strong> On a scale of 0 - 100: -1. This is a highly specialized system designed just to work with our data. BioMart is alive and well in Italy</p>
<p><strong>Q: What cancer types were chosen for the pan-cancer analysis? And why?</strong></p>
<p><strong>A:</strong> Our criteria for inclusion: at least 30x coverage for whole genome, a tumor/normal pair, and proper consent from the donor.<br>
Of that, we have ovarian, breast, lung, pancreatic, liver, leukemias – about 13 in all<br>
The final list of tumor types won’t be selected till we’ve QC’ed all the data and know what the distribution is</p>
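<p>The inclusion criteria in the answer above amount to a simple conjunction of checks. A minimal sketch (field names are hypothetical, not the real ICGC metadata schema):</p>

```python
# Minimal sketch of the Pan-Cancer inclusion criteria described above.
# Field names are hypothetical; the real ICGC metadata schema differs.

def eligible(donor):
    """A donor qualifies with >= 30x whole-genome coverage,
    a matched tumor/normal pair, and proper consent."""
    return (donor["wgs_coverage"] >= 30
            and donor["has_tumor_normal_pair"]
            and donor["consent_ok"])

donors = [
    {"id": "D1", "wgs_coverage": 42, "has_tumor_normal_pair": True,  "consent_ok": True},
    {"id": "D2", "wgs_coverage": 25, "has_tumor_normal_pair": True,  "consent_ok": True},
    {"id": "D3", "wgs_coverage": 38, "has_tumor_normal_pair": False, "consent_ok": True},
]

included = [d["id"] for d in donors if eligible(d)]
print(included)  # only D1 satisfies all three criteria
```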
<p>*<em>Q: If the 10PB of data that will be generated will be harmful - look at quality compression and other *</em></p>
<p><strong>A:</strong> No chance that we’ll be storing and distributing the full uncompressed 10PB. Actively benchmarking compression systems. Hopefully get it down to a few PB without loss of information</p>
<p><strong>Q: What is the main objective of this project? Biological objective?</strong></p>
<p><strong>A:</strong> The main biological objective - focusing on patterns of alteration in non-coding regions. E.g. we know there are mutations in regulatory regions that we haven’t characterized.<br>
Groups are looking at: </p>
<ol>
<li>Looking at regulatory networks - interactions with coding regions.<br>
</li>
<li>Patterns of rearrangement<br>
</li>
<li>Evidence of insertion of known and unknown pathogens / virus that may be driving the tumours<br>
</li>
</ol>
<p>Looking at this in a uniform way, we’ll learn mechanisms that are common and mechanisms that are distinct</p>
<p><strong>Q: How willing are your users to get random samples in return as opposed to the full data? Plus confidence score</strong></p>
<p><strong>A:</strong> Key method of access - take slices of the raw data in the region that you’re interested in. Or extend and do a random sampling - a feature available on CGHub and widely used. Not a feature of EGA - an annoying deficit. One of the reasons we want to move away.</p>
<p><strong>Q: Majority of researchers - don’t need to develop alignment algorithms. Are processed data available to researchers?</strong></p>
<p><strong>A:</strong> The interpreted data (still large, but much smaller - in GB not TB) is available for browsing and download from <a href="http://dcc.icgc.org">http://dcc.icgc.org</a></p>
<p><strong>Q: Curious how you are designing your APIs? APIs for visualization are different from tools</strong></p>
<p><strong>A:</strong> Start with the user interface, figure out what it needs to display, and work back to the API. A genome browser has a very different api than the faceted browser where you’re looking at a particular biological pathway. Specialized APIs and indexes for each of those.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameroyyuruthe-genographic-projecta_2">
<a name="royyuru">The Genographic Project</a> <a class="head_anchor" href="#a-nameroyyuruthe-genographic-projecta_2">#</a>
</h2><h3 id="genome-analytics-with-ibm-watson_3">Genome Analytics with IBM Watson <a class="head_anchor" href="#genome-analytics-with-ibm-watson_3">#</a>
</h3>
<p>Ajay Royyuru, IBM T.J. Watson Research Center, USA<br><br>
Director of computational biology </p>
<blockquote class="short">
<h4 id="a-nameroyyuruabstractabstracta_4">
<a name="royyuru-abstract">Abstract</a> <a class="head_anchor" href="#a-nameroyyuruabstractabstracta_4">#</a>
</h4>
<p><em>// last minute topic change, no published abstract</em></p>
<p>Press release: <a href="http://www-03.ibm.com/press/us/en/pressrelease/43444.wss">http://www-03.ibm.com/press/us/en/pressrelease/43444.wss</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Research group at IBM - very focused on computational biology.<br><br>
Intersection of everything IT and Life Sciences.</p>
<p>3 pillars of work (IBM computational biology)</p>
<ol>
<li>managing and analyzing the data explosion - makes biology more amenable to quantitative outcomes</li>
<li>predicting biological outcomes with scale of computing</li>
<li>dealing with complexity. DREAM - the IBM team is heavily involved with the community</li>
</ol>
<p>Why:</p>
<ul>
<li>Intrigued by connections made yesterday (DH, JM)</li>
<li>Sequencing is reaching a point where we have to look at the translational aspects</li>
<li>beginning to make an impact in the clinic</li>
<li>takes a community</li>
<li>IBM Watson - can be used here</li>
<li>On IBM’s cloud system - rapidly scale. These sorts of analytics capabilities become scalable and accessible so they can have an impact on the clinic down the road</li>
</ul>
<p>What are we up to: gathering raw sequencing input, through a large number of steps, so that we will eventually get useful info that may lead to action</p>
<p>3 pillars in the journey of genomic medicine</p>
<ol>
<li>sequencing (includes downstream analysis - variant calling)</li>
<li>translational medicine (have VCF) ← will focus on this piece (VCF to actionable)</li>
<li>Actionable intelligence - Personalized healthcare. Something publishable is our goal</li>
</ol>
<h5 id="translational-medicine_5">Translational Medicine: <a class="head_anchor" href="#translational-medicine_5">#</a>
</h5>
<p>System that generates insights</p>
<p>Input:</p>
<ol>
<li>data coming from sequencing (VCF) - patient specific information</li>
<li>Entirety of what you can point Watson to - All available biological knowledge (PubMed, NCI PDQ)</li>
</ol>
<p>All this is ingested. Running on IBM’s cloud layer (SoftLayer) - large/global/scalable/acquired by IBM.<br>
Generates some actionable insights. <br>
Goal: this goes to tumor oncologists, look at data in context of decision trying to make. Hopefully make informed correct decision.</p>
<h5 id="ibm-watson_5">IBM Watson <a class="head_anchor" href="#ibm-watson_5">#</a>
</h5>
<ul>
<li>began 2008 - research project</li>
<li>Jeopardy - grand challenge (got attention)</li>
<li>Added genomics capabilities!</li>
</ul>
<h5 id="genomics-not-just-about-genes-how-we-connect_5">Genomics - not just about genes. How we connect that knowledge <a class="head_anchor" href="#genomics-not-just-about-genes-how-we-connect_5">#</a>
</h5>
<p>The traditional way: read papers, develop hypotheses -> interpretation -> actionable output. Can we automate this? Can we come up with new research approaches from the literature?</p>
<h5 id="p53-project-example-ingest-a-lot-mine-the-lit_5">p53 project example - ingest a lot - mine the literature. <a class="head_anchor" href="#p53-project-example-ingest-a-lot-mine-the-lit_5">#</a>
</h5>
<ul>
<li>lots of text, natural language, analytics happening</li>
<li>specific to diseases, compounds (drug molecules)</li>
<li>Human readable sentences - use Watson based technology to translate the information into machine readable.
‘the results show that EPK2 phosphorylated p53 at Thr55’ - extract info with Watson</li>
<li>Extraction is working</li>
</ul>
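<p>The sentence-to-triple step described above can be caricatured with a regular expression. Watson’s actual NLP pipeline is far richer, so treat this as a toy illustration only:</p>

```python
import re

# Toy caricature of turning a human-readable sentence into a
# machine-readable (kinase, substrate, site) triple; the real
# Watson extraction uses much more sophisticated NLP.
PATTERN = re.compile(
    r"(?P<kinase>\w+) phosphorylated (?P<substrate>\w+) at (?P<site>\w+)")

def extract(sentence):
    """Return (kinase, substrate, site) or None if no relation found."""
    m = PATTERN.search(sentence)
    return (m.group("kinase"), m.group("substrate"), m.group("site")) if m else None

print(extract("the results show that EPK2 phosphorylated p53 at Thr55"))
# ('EPK2', 'p53', 'Thr55')
```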
<p>Application to genomics: <br>
on SoftLayer, a physician managing cases (biopsy samples) submits by uploading a VCF.<br>
What analysis can be done - </p>
<ul>
<li>circos representation - where mutations occur, what they translate to</li>
<li>map to available info on pathways </li>
<li>what more can you find in the literature, Watson? - adds links (to literature) from text mining.
Can drill down and find out why links were generated</li>
<li>Drugs - targeting pathways: added in data model</li>
</ul>
<p>Summary: researcher can browse, print report for the record.</p>
<ul>
<li>see provenance of the data and keep a record of it</li>
<li>see all visualizations, records, summary</li>
<li>list of all possible drugs, status (approved?)</li>
<li>this insight is available to the researcher</li>
</ul>
<p>Looking for active collaborations - IBM doesn’t generate this data themselves</p>
<ul>
<li>last week: partnership with NY genome centre (collaboration of research centres in NY area).
Can take this technology and apply it with them.
Get practical use of this technology</li>
<li>Not exclusive to NY genome, can open collaborations with others</li>
</ul>
<p>Sample report- generated with early data</p>
<ul>
<li>TCGA GBM data - reshaped to put in system</li>
<li>generated report (many pages long)</li>
<li>list of drugs with reasons why the drug is contextually relevant</li>
</ul>
<p>e.g. Lidocaine in report: not prepared to see this in here</p>
<ul>
<li>showed to oncologists - click through to evidence.
Watson points to papers - Lidocaine assay on cancer cells (tongue, EGFR receptor). Lidocaine being tested in context of thyroid cancer cells</li>
<li>so this is not out of the realm of what we should be thinking about</li>
<li>helps us be current and comprehensive</li>
</ul>
<h4 id="a-nameroyyuruqaq-amp-aa_4">
<a name="royyuru-qa">Q & A</a> <a class="head_anchor" href="#a-nameroyyuruqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ouellette) <strong>Do you have any evidence on how Watson will do if it read full papers (not just abstracts)?</strong></p>
<p><strong>A:</strong> Not tested in this context. Watson does read full papers in a clinical context </p>
<p><strong>Q:</strong> (Mesirov) -<br><br>
<strong>1. Are you aiming with that package towards the practicing oncologists or the research physician?</strong><br><br>
<strong>2. To what extent have you compared what Watson is able to mine from the data with other approaches/algorithms/packages published and available to the community?</strong></p>
<p><strong>A:</strong> </p>
<ol>
<li>It’s a journey - early adopters, research clinicians who have the expertise and interest to be partners. A lot of learning. For example, Watson shows lots of evidence. You need a clinician researcher who understands the subtleties of the research and how to make decisions that will be useful</li>
<li>Not whole scale comparison yet - still in ingest and build mode. Some benchmarking and testing - working on the baseline. Full scale comparison for later. Watson can also do chemical extraction - full scale comparison here.</li>
</ol>
<p><strong>Q:</strong> </p>
<ol>
<li>
<strong>Is there any way to integrate other sources of information not text based? Images? Protein structures?</strong><br>
</li>
<li><strong>human value added in human curation databases?</strong></li>
</ol>
<p><strong>A:</strong> </p>
<ol>
<li>Image analytics is an interest to us. Study going on here. Working with some large medical institutions on this project.</li>
<li>Melding between machine and human curation -> this accelerates the process. Makes it more usable.</li>
</ol>
<p><strong>Q: Doubts whether a practicing physician will know what a VCF is, understand a Circos plot? Watson to user or user to Watson?</strong></p>
<p><strong>A:</strong> Initial set of end users - clinician researchers. They got the sample, they know what a VCF is. This is the community that will find this useful. What can we simplify to make this more usable. <br>
Right now, collaboration.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namegersteinhuman-genome-analysisa_2">
<a name="gerstein">Human Genome Analysis</a> <a class="head_anchor" href="#a-namegersteinhuman-genome-analysisa_2">#</a>
</h2><h3 id="mark-gerstein-yale-university-usa_3">Mark Gerstein, Yale University, USA <a class="head_anchor" href="#mark-gerstein-yale-university-usa_3">#</a>
</h3>
<p>Director: computational biology<br>
ENCODE, 1000 genomes</p>
<blockquote>
<h4 id="a-namegersteinabstractabstracta_4">
<a name="gerstein-abstract">Abstract</a> <a class="head_anchor" href="#a-namegersteinabstractabstracta_4">#</a>
</h4>
<p>Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in non-coding regions, is a central challenge for genomics.</p>
<p>One data science construct that is particularly useful for genome interpretation is networks. My talk will be concerned with the analysis of networks and the use of networks as a “next-generation annotation” for interpreting personal genomes. I will initially describe current approaches to genome annotation in terms of one-dimensional browser tracks. Here I will discuss approaches for annotating pseudogenes and also for developing predictive models for gene expression.</p>
<p>Then I will describe various aspects of networks. In particular, I will touch on the following topics: (1) I will show how analyzing the structure of the regulatory network indicates that it has a hierarchical layout with the “middle-managers” acting as information-flow bottlenecks and with more “influential” TFs on top. (2) I will show that most human variation occurs at the periphery of the network. (3) I will compare the topology and variation of the regulatory network to the call graph of a computer operating system, showing that they have different patterns of variation. (4) I will talk about web-based tools for the analysis of networks (TopNet and tYNA).</p>
<p><a href="http://networks.gersteinlab.org">http://networks.gersteinlab.org</a><br><br>
<a href="http://tyna.gersteinlab.org">http://tyna.gersteinlab.org</a> </p>
<p>Architecture of the human regulatory network derived from ENCODE data.<br><br>
Gerstein et al. Nature 489: 91</p>
<p>Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors.<br><br>
KY Yip et al. (2012). Genome Biol 13: R48.</p>
<p>Understanding transcriptional regulation by integrative analysis of transcription factor binding data.<br><br>
C Cheng et al. (2012). Genome Res 22: 1658-67.</p>
<p>The GENCODE pseudogene resource.<br><br>
B Pei et al. (2012). Genome Biol 13: R51.</p>
<p>Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks.<br><br>
KK Yan et al. (2010). Proc Natl Acad Sci U S A 107:9186-91.</p>
</blockquote><h4 id="slides_4">Slides <a class="head_anchor" href="#slides_4">#</a>
</h4>
<p><a href="http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/">http://lectures.gersteinlab.org/summary/Big-Data-in-Genome-Annotation-Using-Networks--20140325-i0keybdata/</a></p>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="my-perspective-on-big-data_5">My perspective on Big Data <a class="head_anchor" href="#my-perspective-on-big-data_5">#</a>
</h5>
<ul>
<li>buzz word, data science</li>
<li>HBR - data science the sexiest job of the 21st century (<a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1">http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1</a>)</li>
<li>transforming science</li>
<li>explosion of data in genomics - sequencing price going down faster than Moore’s law.
Cost is in management of data</li>
<li>Current state of large sequencing datasets: TCGA 910 TB in CGHub, + smaller datasets</li>
</ul>
<h5 id="what-do-people-do-with-big-data_5">What do people do with big data? <a class="head_anchor" href="#what-do-people-do-with-big-data_5">#</a>
</h5>
<p>Take this data to answer a question, make a prediction, modelling</p>
<p>Two ways to approach:</p>
<ol>
<li>don’t care about structure, just want answer (google search)</li>
<li>with explicit organization of dataset (google maps, google earth)</li>
</ol>
<p>In science - search for Higgs boson - searching through many for a few needles (fits in #1)</p>
<p>In genomics - we’re in #2</p>
<ul>
<li>we want to make a map of the molecular world we have</li>
<li>but we don’t have an immediate metaphor we can hang all our information on</li>
<li>but we don’t know what the structure of that map is</li>
<li>ENCODE - thought about the structure of the map. Layer information down</li>
<li>genomics has been around for a while - one of the first big data disciplines.
Inspired by pandora - music genome project which was inspired by how geneticists organize information.
We should learn from other disciplines</li>
</ul>
<h5 id="how-we-can-organize-information-in-genomics-n_5">How we can organize information in genomics - networks <a class="head_anchor" href="#how-we-can-organize-information-in-genomics-n_5">#</a>
</h5>
<ul>
<li>regulatory networks as a hierarchy</li>
<li>more connectivity - constraint</li>
</ul>
<h5 id="what-is-genome-annotation_5">What is genome annotation? <a class="head_anchor" href="#what-is-genome-annotation_5">#</a>
</h5>
<p>Tracks in a genome browser - a linear view of how to think of the genome.<br>
Will this scale with thousands of tracks? No</p>
<p>What type of information do we want? Actually thinking of 3D molecules - but not quite possible</p>
<p>Network diagram - middle ground</p>
<ul>
<li>works for cancers/biology pathways</li>
<li>compelling approach to big data</li>
<li>Example: we started off with linear annotation (ChIP-Seq experiments)</li>
<li>Then, created proximal edges at peaks.<br>
Generated a hairball of 0.5 million edges, pared down to 25K edges.<br>
Many edges far away from genes - distal sites.</li>
</ul>
<p>analyze networks - network science</p>
<ul>
<li>Hub - point with many neighbours</li>
<li>bottleneck - node carrying the max # of shortest paths</li>
<li>Identify bottlenecks & hubs (like roads, bridges can be bottlenecks)</li>
</ul>
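<p>The hub/bottleneck distinction above can be made concrete on a toy graph - two triangles joined through a low-degree bridge node. This is a hypothetical network for illustration; real analyses compute betweenness centrality over the full regulatory network:</p>

```python
from collections import deque
from itertools import combinations

# Toy illustration of "hub" (many neighbours) vs "bottleneck" (node
# lying on many shortest paths): two triangles joined through a
# low-degree bridge node "x". Hypothetical graph, not real TF data.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "x"),
         ("x", "f"), ("f", "g"), ("f", "h"), ("g", "h")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def shortest_path(src, dst):
    """One BFS shortest path from src to dst."""
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                queue.append(w)

degree = {n: len(adj[n]) for n in adj}     # hub score = # of neighbours
between = {n: 0 for n in adj}              # bottleneck score
for s, t in combinations(adj, 2):
    for n in shortest_path(s, t)[1:-1]:    # interior nodes only
        between[n] += 1

print("top hub:", max(degree, key=degree.get))
print("top bottleneck:", max(between, key=between.get))
```

<p>Note that the bridge node "x" has only two neighbours yet carries every path between the two triangles - a bottleneck without being a hub.</p>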
<p>Directed entities - regulatory networks</p>
<ul>
<li>one thing regulates another</li>
<li>Hierarchy - intuitive - people understand this</li>
<li>optimally arrange transcription factors (ENCODE) into 3 levels by simulated annealing, maximizing downward pointing edges</li>
<li>higher bottleneck-ness in centre layer - information flow</li>
<li>Can think about molecules - does this make sense for molecules?<br>
Integration of TF hierarchy with other ‘omic information.<br>
More connected and influential on top</li>
<li>Same thing with miRNA networks (bi directional)</li>
<li>Can look at how transcription factors are working together.
Pick two, can look at the degree they co-regulate the target</li>
</ul>
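<p>The level-assignment idea above (place regulators so that as many edges as possible point downward) can be shown on a toy network small enough to enumerate outright. The actual ENCODE analysis used simulated annealing because the full TF network is far too large for brute force; the edge list here is hypothetical:</p>

```python
from itertools import product

# Toy version of arranging regulators into 3 levels so that as many
# edges as possible point downward (regulator above its target).
# Hypothetical mini-network; the real ENCODE analysis optimized the
# same objective with simulated annealing on the full TF network.
edges = [("A", "B"), ("A", "C"), ("B", "D"),
         ("B", "E"), ("C", "E"), ("D", "B")]
nodes = sorted({n for e in edges for n in e})

def downward(levels):
    # an edge points downward when its source sits on a higher level
    # (level 0 = top of the hierarchy)
    return sum(levels[u] < levels[v] for u, v in edges)

# exhaustive search over all 3^5 level assignments
best = max((dict(zip(nodes, assignment))
            for assignment in product(range(3), repeat=len(nodes))),
           key=downward)
print(downward(best), "of", len(edges), "edges point downward")
```

<p>The D → B feedback edge can never point downward once B → D does, so the best layout satisfies 5 of the 6 edges.</p>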
<h5 id="other-organisms-yeast-genome_5">Other organisms: Yeast genome <a class="head_anchor" href="#other-organisms-yeast-genome_5">#</a>
</h5>
<p>Similar, but has four levels. Multi-regulated network with bottlenecks</p>
<p>Different types of hierarchies</p>
<ol>
<li>autocratic (military)</li>
<li>democratic (things at top mostly regulating, bottom mostly being regulated)</li>
<li>intermediate - between the two. Ease some information bottlenecks</li>
</ol>
<p>Developed a scheme to measure the degree of cross-linking structure. Degree of collaboration</p>
<ul>
<li>number of overlapping </li>
<li>find over many organisms: get a lot more confidence that conclusions are true</li>
<li>middle layer has highest degree of collaboration</li>
</ul>
<p>Compare humans w/ E. coli & yeast & rat: humans more collaborative nodes</p>
<p>Yeast network similar structure to government hierarchy w/ middle managers: matches gov’t of Macao</p>
<p>Social science - there is literature studying how important it is to have middle managers talking to each other</p>
<p>Variation network</p>
<ul>
<li>map all SNPs in 1000 genomes on network</li>
<li>more SNPs at bottom</li>
<li>higher parts of hierarchy more conserved, less variable</li>
<li>Trend: more hubs - less variation/ more connectivity, more constraint.<br>
Seen in many studies/organisms.<br>
Human protein-protein interaction network - rapidly changing on the outskirts</li>
</ul>
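<p>The “more connectivity, more constraint” trend above boils down to a negative correlation between a node’s degree and its variation. A sketch with made-up numbers (illustrative only, not real 1000 Genomes counts):</p>

```python
from math import sqrt

# Hypothetical per-gene numbers illustrating the trend described
# above: hub genes (high degree) tolerate fewer variants, so degree
# and variation correlate negatively. Values are made up.
degree = [12, 9, 7, 5, 3, 2, 1]   # network connectivity per gene
snps   = [ 1, 2, 2, 4, 6, 7, 9]   # variants observed per gene

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

print(round(pearson(degree, snps), 2))  # strongly negative
```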
<h5 id="analogy-to-understand-more-connectivity-gt-mo_5">Analogy to understand more connectivity -> more constraint <a class="head_anchor" href="#analogy-to-understand-more-connectivity-gt-mo_5">#</a>
</h5>
<p>Comparison between e. coli regulatory network and Linux OS</p>
<ul>
<li>call graph in linux compared to e. coli regulatory network</li>
<li>linux is top heavy in comparison</li>
<li>
<em>E. coli</em>: dominated by out degree hubs - turn on a lot of molecules</li>
<li>linux: dominated by in hubs - routines called by many programs</li>
<li>linux OS evolves - we can watch it through each of its releases</li>
<li>plot changes & compare.<br>
<em>E. coli</em>: less change.<br>
Linux: certain things don’t change, some things change constantly. Some releases are coupled to hardware, have to change</li>
<li>In the biological system - negative correlation: more connectivity is less change</li>
<li>In Linux - positive correlation: more connectivity is more change</li>
<li>Perspectives on random change v. Intelligent Design.<br>
Intelligent designer - they believe they can make changes where there is a lot of constraint and connectivity.<br>
If changes are random - best to not put them in central points</li>
</ul>
<p>Applications of more connectivity leads to more constraint - no time to talk today. Building a practical workflow & tool for disease genomes.</p>
<p>Network stuff available - <a href="http://encodenets.gersteinlab.org/">encodenets.gersteinlab.org</a></p>
<h4 id="a-namegersteinqaq-amp-aa_4">
<a name="gerstein-qa">Q & A</a> <a class="head_anchor" href="#a-namegersteinqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>you showed this relationship between hub-ness and the kernel call graph. Have you looked at the evolution of the call signature? Highly connected subroutines do not have their call signature changed frequently - more similar to bio</strong></p>
<p><strong>A:</strong> No, very interested in that. Evolution - even package dependencies.</p>
<p><strong>Q: Information flow: makes sense in regulatory networkers. What’s your reasoning with protein-protein networks?</strong></p>
<p><strong>A:</strong> Some types of protein-protein interaction networks, but other times not so much. Key network params - regulatory, focused on bottlenecks. Protein-protein - focus on hubs. When you do the correlations of connectivity with constraint - more on bottlenecks.</p>
<p><strong>Q: Interested in <em>E. coli</em> v. linux - we compare a lot to engineering ideas</strong></p>
<p><strong>A:</strong> Maybe not a lot of engineering ideas apply to biology. Sometimes people look at biological networks to apply to engineering problems</p>
<p><strong>Q: have you looked at hubs in organisms with recent genome duplications to see how they occur?</strong></p>
<p><strong>A:</strong> genome duplicates, suddenly have these two things interact with your hub or what’s there. Lots of network literature on scale free networks - plays into that.</p>
<p><strong>Q: What do you think about the cell type specificity - do you think different cells depending on their needs will have different hierarchies?</strong></p>
<p><strong>A:</strong> Controversy in how I present this. Cell type non-specific hierarchy - this is a global wiring diagram. In my mind, if you go to a certain cell type, certain lights turn on. Other view - cell type specific hierarchies. I think this doesn’t make sense - no one talks about gene lists</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameyoungthe-biocompute-farm-colocated-comp_2">
<a name="young">The BioCompute Farm: Colocated Compute for Cancer Genomics</a> <a class="head_anchor" href="#a-nameyoungthe-biocompute-farm-colocated-comp_2">#</a>
</h2><h3 id="stuart-young-annai-systems-inc-usa_3">Stuart Young, Annai Systems Inc., USA <a class="head_anchor" href="#stuart-young-annai-systems-inc-usa_3">#</a>
</h3><blockquote>
<h4 id="a-nameyoungabstractabstracta_4">
<a name="young-abstract">Abstract</a> <a class="head_anchor" href="#a-nameyoungabstractabstracta_4">#</a>
</h4>
<p>Petabyte-scale genomic data repositories such as the Cancer Genomics Hub (CGHub) require collocated compute resources to fully leverage the value of the genomic data. The traditional model of data download from a repository to a research center followed by local computational analysis suffers from high file transfer costs, significant delays and file storage problems. The BioCompute Farm, a highly-scalable computing resource colocated with CGHub, provides a 99.9% reduction in data storage and 120 times reduction in time for analysis of all 40TB of the current Cancer Genome Atlas (TCGA) RNA-Seq data set. The BioCompute Farm combines high-speed BAM slicing for DNA analysis and the latest in bioinformatics tools and standardized pipelines with the flexibility to customize pipelines and rapidly scale up computational capacity to meet the needs of cancer researchers. As data growth continues to outpace the growth of Internet bandwidth, the BioCompute Farm can serve as a model for the emerging paradigm of colocated compute resources serving the users of large genomic databases.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="motivation-for-talk-why-colocated-compute_5">Motivation for talk: why colocated compute <a class="head_anchor" href="#motivation-for-talk-why-colocated-compute_5">#</a>
</h5>
<ul>
<li>'07/'08 - next gen suddenly became a viable product</li>
<li>before this, fairly expensive Sanger sequencing</li>
<li>soon - began to overshoot the cost of storage and bandwidth</li>
<li>only will become worse</li>
<li>to address this: need to provide a solution to provide capacity and service</li>
</ul>
<h5 id="annai-systems-director-of-bioinformatics_5">Annai systems: director of bioinformatics <a class="head_anchor" href="#annai-systems-director-of-bioinformatics_5">#</a>
</h5>
<ul>
<li>Software underpinning CGHub - Annai-GNOS</li>
<li>server to GeneTorrent - download sequences</li>
<li>BioCompute - colocated w/ CGHub</li>
</ul>
<p>How big is this problem?</p>
<ul>
<li>TCGA data ~ 1PB -> 2.5PB in the next few years</li>
<li>download rates: several months to download it all. Store it. Need infrastructure.</li>
<li>researchers limited by financial and logistical constraints (IT)</li>
</ul>
<p>Survey by NCI - wish list for cancer genomics researchers</p>
<ul>
<li>#1 Run workflows on data in cloud (13%)</li>
<li>Annai covered about 50% of what they want. Maybe biased sample (online)</li>
</ul>
<p>NCI’s colocation model</p>
<ul>
<li>Genomic Data Commons - integrate multiple datatypes, provide API</li>
<li>Cloud Pilots - $20M, colocated compute.
The successful bidders will provide workflows and be scalable</li>
</ul>
<p>BioCompute Farm (TCGA data)</p>
<ul>
<li>what they’re doing with sequencing - shifts cost of sequencing to getting data and results out</li>
<li>upstream costs: technology development, pipelines, bioinfo tools</li>
<li>downstream costs: tools for sequence analysis, management of </li>
</ul>
<p><em>/// LOST CONNECTIVITY FOR A WHILE ///</em></p>
<p>HIPAA Compliance</p>
<ul>
<li>holistic expectation - bookkeeping where access is controlled</li>
<li>Physical security: Cage in SDSC - monitored, power, alarms</li>
</ul>
<p>Provide farms with subscription based access</p>
<p>Provide custom analysis</p>
<ul>
<li>farm loaded with standard pipelines: Broad GATK, PanCancer BWA alignment</li>
<li>Custom Pipelines - latest versions</li>
<li>Workflow tools: SeqWare (O’Connor), agua, synapse</li>
<li>Use Case Baylor - BAM-slicing of TCGA RNA-Seq data
<ul>
<li>would have taken 9 weeks of download time + storage (no capacity)</li>
<li>They used the BioCompute Farm, with BAM-slicing of CGHub BAM files via Annai’s GTFuse</li>
</ul>
</li>
<li>Pipeline Optimization - look at runtimes, will this benefit w/ parallelization or throwing more cpu?</li>
</ul>
<h5 id="collaborations_5">Collaborations <a class="head_anchor" href="#collaborations_5">#</a>
</h5>
<p>PanCancer project</p>
<ul>
<li>prototype of global federated colocated compute</li>
<li>setting up servers, SeqWare, </li>
</ul>
<p>DREAM challenge</p>
<ul>
<li>variant calling </li>
<li>Annai provides GNOS platform for data security and download</li>
</ul>
<p>ShareSeq</p>
<ul>
<li>hosting ICGC - common free access to download free data</li>
<li>provide colo-compute</li>
</ul>
<h5 id="conclusion_5">Conclusion <a class="head_anchor" href="#conclusion_5">#</a>
</h5>
<ul>
<li>colo compute is a no brainer</li>
<li>useful functionalities - fast access, flexible use, tools for workflow, and custom analysis and scalability</li>
</ul>
<h4 id="a-nameyoungqaq-amp-aa_4">
<a name="young-qa">Q & A</a> <a class="head_anchor" href="#a-nameyoungqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: Only 5 or 10 labs in the world are interested in whole PB scale data. I think if we make the VCF file available - this should be sufficient for most researchers.</strong></p>
<p><strong>A:</strong> I think with the way things are going, the issue is not only going to be huge data access, but secure access, and how can we search through the data to find the datasets you want.</p>
<p><strong>Q: Most of the pipelines are focused on variant calling, alignments - what are the priorities for what’s next?</strong></p>
<p><strong>A:</strong> Yes, it’s variant calling right now. One other area of interest- systems approach, pathways, integrating different types of data. Looking at different standards, read pathology or clinical data. Hospital data is very rich for researchers, but not very accessible. Looking at integrating with genomic data.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namegrossmanshort-talk-an-overview-of-the-b_2">
<a name="grossman">Short Talk: An Overview of the Bionimbus Protected Data Cloud </a> <a class="head_anchor" href="#a-namegrossmanshort-talk-an-overview-of-the-b_2">#</a>
</h2><h3 id="robert-l-grossman-university-of-chicago-usa_3">Robert L. Grossman, University of Chicago, USA <a class="head_anchor" href="#robert-l-grossman-university-of-chicago-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namegrossmanabstractabstracta_4">
<a name="grossman-abstract">Abstract</a> <a class="head_anchor" href="#a-namegrossmanabstractabstracta_4">#</a>
</h4>
<p>Bionimbus is a petabyte scale community cloud for managing, analyzing and sharing large genomics datasets that is operated by the not-for-profit Open Cloud Consortium. With a cloud computing model, large genomic datasets can be analyzed in place without the necessity of moving it to your local institution. Bionimbus contains a variety of open access datasets, including ENCODE and the 1000 Genomes dataset. In 2013, we updated Bionimbus so that researchers can analyze data from controlled access datasets, such as The Cancer Genome Atlas (TCGA) in a secure and compliant fashion. We describe some case studies using Bionimbus, some of the bioinformatics tools available with Bionimbus, some different ways of interoperating with Bionimbus, the Bionimbus architecture, and the security and compliance framework.</p>
<p>The Bionimbus Protected Data Cloud is supported in part by NIH/NCI (grant NIH/SAIC Contract 13XS021 / HHSN261200800001E), the Gordon and Betty Moore Foundation, and the National Science Foundation (Grants OISE - 1129076 and CISE 1127316). </p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>I’m going to pose a few questions. In the next 10 min I will not try to answer them. Hopefully your answers will be more interesting than mine. I will give you a framework of how we think of big data.</p>
<h5 id="four-questions_5">Four questions <a class="head_anchor" href="#four-questions_5">#</a>
</h5>
<ol>
<li>Is big data in bioinfo/biomed any different from big data in science? Is big data in science any different from big data in general?</li>
<li>What instrument should we use to make discoveries over big biomedical data?</li>
<li>Do we need new types of mathematical and statistical models for big biomedical data?</li>
<li>How do we organize our data?</li>
</ol>
<h5 id="bionimbus-protected-data-cloud_5">Bionimbus protected data cloud <a class="head_anchor" href="#bionimbus-protected-data-cloud_5">#</a>
</h5>
<p>Supporting Pan-Can analysis - open source core</p>
<ul>
<li>interoperate with as much proprietary software as they can</li>
<li>log in with NIH/eRA credentials - immediate access to TCGA data</li>
<li>pipelines, analysis, install your own software</li>
</ul>
<p>Right now in the process of scaling up</p>
<ul>
<li>10-20 projects a month</li>
<li>contains TCGA data - operates at PB scale</li>
<li>sometime next week, another PB of data & 16K cores, ICGC Pan-Can analysis</li>
<li>question: how do we make sure, on this limited resource, we get the most science out?
Traditionally handled by allocation committees</li>
<li>this month, would have cost >$3K on Amazon</li>
</ul>
<h5 id="open-science-data-cloud_5">Open science data cloud <a class="head_anchor" href="#open-science-data-cloud_5">#</a>
</h5>
<ul>
<li>support integrative analysis:
Can look at how disease is impacted by socio-economic factors and more.
Text analytics & geospatial analytics</li>
<li>4 years old (Bionimbus 1 year)</li>
</ul>
<p>biomedical commons cloud</p>
<ul>
<li>involves cancer centres, open source core but operates with proprietary software around it</li>
<li>want to peer at scale with other providers (biomed commons providers)</li>
<li>like how the internet started with tier-one ISPs</li>
<li>sometimes faster to get data over a high-performance network than from disk with certain protocols</li>
</ul>
<h5 id="new-era_5">New era <a class="head_anchor" href="#new-era_5">#</a>
</h5>
<ul>
<li>'05-'15: bioinformatic tools and integration (Galaxy, GenomeSpace, workflows, portals)</li>
<li>'10-'20: data center scale science (Bionimbus, CGHub, cancer collaboratory).
At that scale what changes and how do we build models</li>
<li>'15-'25: new modelling techniques</li>
</ul>
<p>What are the new models? In ’72 Phil Anderson wrote a piece: “More Is Different”</p>
<ul>
<li><a href="http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">http://robotics.cs.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf</a></li>
<li>up to us to decide if more is different at this scale and, if it is, how do we model that</li>
<li>backlash on Google Flu</li>
</ul>
<p>How do you scale machine learning to data centers?</p>
<ul>
<li>take large complex datasets and chop them up into small pieces you can analyze at scale</li>
</ul>
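<p>The chop-and-merge idea above can be sketched in a few lines of Python - a toy map-reduce that computes a global mean from per-chunk partial sums (the data and chunk size here are made up for illustration):</p>

```python
# Toy map-reduce: compute a global mean from per-chunk partial results,
# so no single worker ever needs to hold the full dataset.

def partial_stats(chunk):
    """Map step: reduce one small piece to (sum, count)."""
    return sum(chunk), len(chunk)

def global_mean(dataset, chunk_size=1000):
    chunks = (dataset[i:i + chunk_size] for i in range(0, len(dataset), chunk_size))
    total, count = 0, 0
    # each map call could run on a separate node; only tiny tuples come back
    for s, n in map(partial_stats, chunks):
        total += s
        count += n
    return total / count

print(global_mean(list(range(10_000))))  # 4999.5, same as a single-machine mean
```

The same shape (chunk, compute a small summary, merge summaries) covers many of the statistics you would want over data-center-scale genomics data.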
<p>Is more different at this scale? And if so, how do we discover it?</p>
<h4 id="a-namegrossmanqaq-amp-aa_4">
<a name="grossman-qa">Q & A</a> <a class="head_anchor" href="#a-namegrossmanqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>as you see these data centres emerging, do you think they’ll focus on specific questions? How do you see the data centres forming?</strong></p>
<p><strong>A:</strong> The ones I mentioned are around cancer genomics. Sustainability and payment - putting small taxes on some of our projects so that we can make larger amounts of our data available. Driven by some funding agencies. There’s a certain interest from private donors in funding certain parts of this. Some economic incentives. Some combination of that is going to change the way we do science.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namebutlershort-talk-pancancer-analysis-of_2">
<a name="butler">Short Talk: Pan-Cancer Analysis of Somatic Variation from Whole Genome ICGC / TCGA Datasets</a> <a class="head_anchor" href="#a-namebutlershort-talk-pancancer-analysis-of_2">#</a>
</h2><h3 id="adam-butler-wellcome-trust-sanger-institute-u_3">Adam Butler, Wellcome Trust Sanger Institute, UK <a class="head_anchor" href="#adam-butler-wellcome-trust-sanger-institute-u_3">#</a>
</h3><blockquote>
<h4 id="a-namebutlerabstractabstracta_4">
<a name="butler-abstract">Abstract</a> <a class="head_anchor" href="#a-namebutlerabstractabstracta_4">#</a>
</h4>
<p>The advent of massively parallel sequencing technology has revolutionised the way we characterise cancer genomes and provided new insights in our understanding of the mechanisms of oncogenesis. The International Cancer Genome Consortium (ICGC) was instigated in 2007 with the aim to systematically screen hundreds of Cancer Genomes for 50 distinct tumour types and catalogue the somatic variation present. This endeavor aims to prevent duplication of effort, ensure rare tumours are included and generate large datasets for the scientific community. A similar project is underway in the USA, The Cancer Genome Atlas (TCGA).</p>
<p>In late 2013 at the ICGC conference in Toronto, Peter Campbell announced an ambitious plan to undertake a Pan-Cancer analysis of whole genome data available from ICGC and TCGA. This would provide a comprehensive dataset of somatic variant calls with standardised output for 2,000 cancer genomes, which will be available for subsequent downstream analyses.</p>
<p>The primary analysis will include detection of somatic point mutations, small insertions and deletions, copy number changes, rearrangements and retrotransposon/viral integration sites. To ensure integrity of the dataset, three independent analysis pipelines, provided by the Broad Institute, DKFZ and the Sanger Institute, will be utilised. The data will be generated and stored at 6 data centres around the world: Spain, Germany, Japan, UK, and two centres in the USA. </p>
<p>The Sanger Institute’s contribution to this initiative is to provide our analysis pipeline as one of three to be run over the data. Consequently our algorithms have been assessed via rigorous comparison with comparable software and their performance optimised. The pipeline is currently being ported into a VM (Virtual Machine), automated and the code adapted for running all variant detection analyses within a cloud environment.</p>
<p>The primary analysis will deliver a high-quality catalogue of somatic variants in a standardised VCF format, which will be made available from the six centres for downstream investigation.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Going over our part in, and experience with, the Pan-Cancer analysis of large datasets</p>
<h5 id="the-cancer-genome-project_5">The Cancer Genome Project <a class="head_anchor" href="#the-cancer-genome-project_5">#</a>
</h5>
<ul>
<li>2000 - working through Sanger sequencing, then next-gen in '07</li>
<li>In order to handle different datasets - built analysis tools, pipelines and systems</li>
<li>used to this day for analysis</li>
<li>heavily integrated into Sanger infrastructure.
Now have to look at bigger-scale data</li>
</ul>
<p>Pipeline:</p>
<ul>
<li>BWA alignment</li>
<li>Tools: copy number caller - ASCAT - ins/del, rearrangements, transposon, RNA-Seq pipeline</li>
<li>generate VCF, BAM, allow researchers to get useful parts of info and drill down</li>
</ul>
<h5 id="pancancer-large-international-collaboration_5">PanCancer - large international collaboration <a class="head_anchor" href="#pancancer-large-international-collaboration_5">#</a>
</h5>
<ul>
<li>2K genome pairs (4K genomes) from multiple tumour types, 30x coverage</li>
<li>uniform dataset</li>
<li>analysed using 3 pipelines (Broad, DKFZ, Sanger)</li>
</ul>
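<p>The notes don’t say how the three pipelines’ calls get reconciled, but a common pattern for multi-caller setups is majority voting across call sets; a hypothetical sketch (the actual Pan-Cancer merging rules are not described here):</p>

```python
# Hypothetical consensus step: keep variants reported by at least 2 of 3 callers.
# Variants are (chrom, pos, ref, alt) tuples; all example calls are invented.
from collections import Counter

def consensus(call_sets, min_support=2):
    votes = Counter(v for calls in call_sets for v in calls)
    return {v for v, n in votes.items() if n >= min_support}

broad  = {("chr1", 12345, "A", "T"), ("chr2", 999, "G", "C")}
dkfz   = {("chr1", 12345, "A", "T")}
sanger = {("chr1", 12345, "A", "T"), ("chr3", 55, "T", "G")}

print(consensus([broad, dkfz, sanger]))  # only the call seen by all three survives
```

Running three independent pipelines and intersecting like this trades sensitivity (singleton calls are dropped) for a higher-confidence shared call set.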
<p>CGP -> PanCancer</p>
<ul>
<li>need to take out each part and make it Sanger-free</li>
<li>optimize for different versions of the aligner</li>
<li>pipeline the whole lot using SeqWare (O'Connor)</li>
<li>Just a few seconds - but they add up over a few billion bp</li>
</ul>
<p>Phase 1</p>
<ul>
<li>identify data for upload, align each sample pair</li>
<li>using GeneTorrent to download data from CGHub - works very well.
Personal concern was on getting data from where it was to where it needed to be.
Getting astonishing transfer rate.
Automatic data upload.</li>
</ul>
<h4 id="useful-outcomes_4">Useful outcomes <a class="head_anchor" href="#useful-outcomes_4">#</a>
</h4>
<ul>
<li>we moved over to using a version of BWA-MEM (from BWA)- significantly faster and smaller memory footprint.
May use for in-house pipelines</li>
</ul>
<p>optimized callers</p>
<ul>
<li>looked at where their code was spending time</li>
<li>made huge steps forward - substitution caller is 50% faster</li>
<li>indel caller 2x faster</li>
<li>ICGC benchmarking exercise - invaluable.
Allowed us to make much better judgements on how well we are doing</li>
<li>new sequencing technologies go faster still…</li>
</ul>
<h4 id="a-namebutlerqaq-amp-aa_4">
<a name="butler-qa">Q & A</a> <a class="head_anchor" href="#a-namebutlerqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>interested in optimization for indels - can you push that any further? Many of our bottlenecks are in aligners built for human (work in plant)</strong></p>
<p><strong>A:</strong> What’s it written in? Perl/Java - eyes roll back in heads and they start shaking. Joking aside, with Caveman (substitution caller) - giving someone the time to go back and just re-code proved to give us a massive improvement. Recoded in C. Not glamorous or groundbreaking - C really is faster.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namekasowskishort-talk-extensive-variation_2">
<a name="kasowski">Short Talk: Extensive Variation in Chromatin States Across Humans</a> <a class="head_anchor" href="#a-namekasowskishort-talk-extensive-variation_2">#</a>
</h2><h3 id="maya-m-kasowski-yale-university-usa_3">Maya M. Kasowski, Yale University, USA <a class="head_anchor" href="#maya-m-kasowski-yale-university-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namekasowskiabstractabstracta_4">
<a name="kasowski-abstract">Abstract</a> <a class="head_anchor" href="#a-namekasowskiabstractabstracta_4">#</a>
</h4>
<p>The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Chromatin variation among people</p>
<h5 id="what-makes-people-different_5">What makes people different? <a class="head_anchor" href="#what-makes-people-different_5">#</a>
</h5>
<ul>
<li>Level of DNA sequence - SNPs</li>
<li>But how do these variants translate to phenotypic differences</li>
<li>Look at gene expression. Look at differences in chromatin</li>
<li>Mapped NFkB</li>
</ul>
<h5 id="differences-in-histone-marks-differences-in-g_5">Do differences in histone marks mean differences in gene expression? <a class="head_anchor" href="#differences-in-histone-marks-differences-in-g_5">#</a>
</h5>
<p>Aim:</p>
<ul>
<li>Characterize variation in chromatin state</li>
<li>Genetic basis, functional consequences</li>
</ul>
<p>Used HapMap populations - 19 individuals</p>
<ul>
<li>9-13 histone marks - deeply sequenced data</li>
<li>Convenient - powerful tool for functionally annotating genome</li>
<li>Enhancers/promoters/ etc</li>
</ul>
<h5 id="how-much-variation-in-chromatin-among-individ_5">How much variation in chromatin among individuals? <a class="head_anchor" href="#how-much-variation-in-chromatin-among-individ_5">#</a>
</h5>
<p>There’s an enhancer that is active in Caucasians and 2 Asians, but not Africans - SNP in NFkB motif</p>
<p>Striking variation - more than 30% variation at some marks</p>
<p>Combinatorial - chromatin states based on combinations of the marks</p>
<ul>
<li>promoter states</li>
<li>transcribed states</li>
<li>variety of enhancer states</li>
<li>repressed states</li>
</ul>
<p>Found that it was more meaningful to ask whether a particular mark varies in the context of a particular state than overall</p>
<ul>
<li>looking at active enhancer mark - varied more in enhancer state than promoter state</li>
<li>state specific variability</li>
<li>enhancer states more variable than transcribed or promoter states</li>
<li>repressed mark - varies more in combination with active marks than on its own</li>
</ul>
<p>Do states switch among individuals?</p>
<ul>
<li>largely not the case - an enhancer is an enhancer across individuals</li>
<li>some reciprocal states</li>
</ul>
<p>Genetic basis of variation</p>
<ul>
<li>Active enhancer mark - evidence of a strong genetic basis.
Stronger correlation to genotype for variable than non-variable regions</li>
<li>Family trios: heritability. Found that the extent of variance in the daughter correlates with the parents</li>
</ul>
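<p>One way to quantify the trio observation above - correlating a child’s signal with the mid-parent average across regions. All numbers here are invented, just to show the shape of the calculation:</p>

```python
# Hedged illustration: correlate a child's per-region chromatin signal with the
# mid-parent average. Signal values below are made up.
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

mother = [2.0, 5.1, 3.3, 8.0, 1.2]   # signal per region
father = [2.4, 4.7, 2.9, 7.2, 1.6]
child  = [2.1, 5.0, 3.0, 7.8, 1.3]
mid_parent = [(m + f) / 2 for m, f in zip(mother, father)]

print(round(pearson(mid_parent, child), 3))  # close to 1 -> heritable signal
```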
<p>Possible mechanism - differences in TF binding motifs</p>
<ul>
<li>Strong evidence of this</li>
<li>Link variation to specific motif disruption</li>
<li>Looked at peaks, ENCODE</li>
</ul>
<p>Functional consequences:</p>
<ul>
<li>There’s a strong correlation with gene expression (active enhancer - RNA-Seq data).
For known enhancer-gene links (but imperfectly known)</li>
<li>Not all enhancer variation influences expression (but most of it does).
Why? - the enhancers are buffering each other.
Non-consequential enhancer variation</li>
<li>Chromatin variation is likely to influence phenotypes.
Variant regions enriched in eQTLs and GWAS SNPs</li>
</ul>
<h4 id="a-namekasowskiqaq-amp-aa_4">
<a name="kasowski-qa">Q & A</a> <a class="head_anchor" href="#a-namekasowskiqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Ware) <strong>epigenetic change- were you able to use those as biomarkers and retest GWAS? Uncover hidden variation?</strong></p>
<p><strong>A:</strong> Haven’t looked at that. This study had 19 individuals, but as we scale up, perhaps.</p>
<p><strong>Q: Did you look at the trios to see if there’s more concordance among their epigenetic marks than you would have expected on the basis of shared SNPs?</strong></p>
<p><strong>A:</strong> Didn’t look at that, we had two trios.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds">Big Data in Biology: Databases and Clouds</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Parmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
tag:blog.abigailcabunoc.com,2014:Post/big-data-in-biology-databases-and-clouds2014-04-01T06:05:00-07:002014-04-01T06:05:00-07:00Big Data in Biology: Databases and Clouds<p>Series Introduction: I attended the <a href="http://www.keystonesymposia.org/14F2">Keystone Symposia Conference: Big Data in Biology</a> as the Conference Assistant last week. I set up an Etherpad during the meeting to take <a href="http://ksbigdata.titanpad.com/3">live notes</a> during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to <a href="https://twitter.com/dkuo">David Kuo</a> for helping edit the notes.</p>
<p><em>Warning: These notes are somewhat incomplete and mostly written in broken English</em></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Parmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>
<h1 id="databases-and-clouds_1">Databases and Clouds <a class="head_anchor" href="#databases-and-clouds_1">#</a>
</h1>
<p>Monday, March 24th, 2014 9:30am - 2:15pm<br><br>
<a href="http://ks.eventmobi.com/14f2/agenda/35704/288348">http://ks.eventmobi.com/14f2/agenda/35704/288348</a></p>
<h2 id="a-nameschedulespeaker-lista_2">
<a name="schedule">Speaker list</a> <a class="head_anchor" href="#a-nameschedulespeaker-lista_2">#</a>
</h2>
<p><strong>Laura Clarke</strong>, European Bioinformatics Institute, UK<br><br>
<a href="#clarke"><em>The 1000 Genomes Project, Community Access and Management for Large Scale Public Data</em></a> -<br>
[<a href="#clarke-abstract">Abstract</a>]<br>
[<a href="#clarke-qa">Q&A</a>]</p>
<p><strong>Dan Stanzione</strong>, University of Texas at Austin, USA<br><br>
<a href="#stanzione"><em>The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology</em></a> -<br>
[<a href="#stanzione-abstract">Abstract</a>]<br>
[<a href="#stanzione-qa">Q&A</a>]</p>
<p><strong>Jill P. Mesirov</strong>, Broad Institute, USA<br><br>
<a href="#mesirov"><em>GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools</em></a> -<br>
[<a href="#mesirov-abstract">Abstract</a>]<br>
[<a href="#mesirov-qa">Q&A</a>]</p>
<p><strong>Ronald C. Taylor</strong>, Pacific Northwest National Laboratory, USA (replaced by Francis Ouellette)<br><br>
<a href="#taylor"><em>FGED: The Functional Genomics Data Society</em></a> -<br>
[<a href="#taylor-abstract">Abstract</a>]<br>
[<a href="#taylor-qa">Q&A</a>]</p>
<p><strong>Andrew Carroll</strong>, DNAnexus, USA<br><br>
<a href="#carroll"><em>Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific</em></a> -<br>
[<a href="#carroll-abstract">Abstract</a>]<br>
[<a href="#carroll-qa">Q&A</a>]</p>
<p><strong>Michael Schatz</strong>, Cold Spring Harbor Laboratory, USA<br><br>
<a href="#schatz"><em>The Next 10 Years of Quantitative Biology</em></a> -<br>
[<a href="#schatz-abstract">Abstract</a>]<br>
[<a href="#schatz-qa">Q&A</a>]<br>
[<a href="http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf">slides</a>]</p>
<hr>
<h2 id="a-nameclarkethe-1000-genomes-project-communit_2">
<a name="clarke">The 1000 Genomes Project, Community Access and Management for Large Scale Public Data</a> <a class="head_anchor" href="#a-nameclarkethe-1000-genomes-project-communit_2">#</a>
</h2><h3 id="laura-clarke-european-bioinformatics-institut_3">Laura Clarke, European Bioinformatics Institute, UK <a class="head_anchor" href="#laura-clarke-european-bioinformatics-institut_3">#</a>
</h3><blockquote>
<h4 id="a-nameclarkeabstractabstracta_4">
<a name="clarke-abstract">Abstract</a> <a class="head_anchor" href="#a-nameclarkeabstractabstracta_4">#</a>
</h4>
<p>The 1000 genomes data continues to be the largest public variation resource available to the community. Providing coherent and useful resources based on this data continues to be a key goal for the project Data Coordination Center (DCC). </p>
<p>The resource now stands at more than 500 Tbytes in size and nearly 500,000 files on the FTP site; this presents challenges both for us to manage and for users to discover what data we have available.</p>
<p>Here I both describe these challenges and present the solutions and tools the project has created to enable the widest level of usefulness for the 1000genomes project data.</p>
<p><a href="http://www.1000genomes.org/">http://www.1000genomes.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="1000-genomes-project_5">1000 genomes project <a class="head_anchor" href="#1000-genomes-project_5">#</a>
</h5>
<ul>
<li>Largest human project</li>
<li>Aims:
<ul>
<li>complete a baseline of human variation</li>
<li>all variation - at 1% MAF or higher genome-wide.</li>
<li>0.1%-0.5% MAF in exonic regions</li>
<li>structural variations as well as SNVs</li>
</ul>
</li>
<li>BAM and VCF formats started on this project</li>
<li>99% of all variation in an individual is already present in the public catalogue</li>
<li>sequenced 26 populations around the globe.
Started with HapMap, NHGRI helped get more</li>
<li>collaboration - 10 different sequencing centres.
many analysis groups</li>
</ul>
<p>Strategy</p>
<ul>
<li>collect shotgun reads, align to reference </li>
<li>detect variations based on alignment from all samples.
statistical issues for allowing errors in sampling</li>
<li>in 2008 this was impossible at scale</li>
</ul>
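<p>The detect-from-alignment step can be sketched as a toy pileup caller; real callers model sequencing error statistically (the statistical issues mentioned above), so this is illustration only, with invented reads:</p>

```python
# Toy SNV caller: call a variant where most aligned reads disagree with the
# reference base. Real 1000 Genomes callers use error models, not a threshold.
from collections import Counter

def call_snvs(reference, aligned_reads, min_fraction=0.8):
    """aligned_reads: list of (start_position, read_sequence) pairs."""
    pileup = {}  # position -> Counter of observed bases
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup.setdefault(start + offset, Counter())[base] += 1
    variants = []
    for pos, bases in sorted(pileup.items()):
        base, count = bases.most_common(1)[0]
        if base != reference[pos] and count / sum(bases.values()) >= min_fraction:
            variants.append((pos, reference[pos], base))
    return variants

ref = "ACGTACGT"
reads = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC")]  # all reads support T->A at position 3
print(call_snvs(ref, reads))  # [(3, 'T', 'A')]
```

Calling jointly across all samples, as the project does, gives the caller far more reads per site than this single-sample toy.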
<p>Analysis Approach</p>
<ul>
<li>final phase 70bp+ Illumina.
Take much more complicated variations and create phased genomes</li>
<li>multiple centres, multiple technologies</li>
</ul>
<p>In final phase now</p>
<ul>
<li>technologies progressed so rapidly, can change aims in the duration of the project</li>
<li>0.5 PB of data</li>
</ul>
<h5 id="challenges_5">Challenges <a class="head_anchor" href="#challenges_5">#</a>
</h5>
<p>Data Transfer</p>
<ul>
<li>FTP site growing</li>
<li>20TB 2009 – 580 TB today</li>
<li>synchronizing challenging</li>
<li>download speeds.
Aspera (proprietary).
Download and upload clients</li>
</ul>
<p>Within Consortium Data Exchange</p>
<ul>
<li>Data Freezes
<ul>
<li>stable release of sequence data</li>
<li>dated sequence index file</li>
<li>alignments based on this index</li>
<li>variant set calls created from these BAMs</li>
</ul>
</li>
<li>Machine Readable FTP Site: Text file which points to FTP</li>
<li>Standardized naming formats: used sample and population names and what programs/technologies used</li>
<li>Regular communication</li>
</ul>
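<p>A dated, machine-readable index like the one described above is easy to consume programmatically. A sketch with an invented, simplified column layout (the real 1000 Genomes sequence index files have many more columns):</p>

```python
# Parse a (simplified) sequence index: tab-separated lines mapping an FTP path
# to a sample and population. Column layout here is invented for illustration.
import csv, io

INDEX = """\
ftp/data/HG00096/seq1.fastq.gz\tHG00096\tGBR
ftp/data/HG00097/seq1.fastq.gz\tHG00097\tGBR
ftp/data/NA19625/seq1.fastq.gz\tNA19625\tASW
"""

def files_for_population(index_text, population):
    reader = csv.reader(io.StringIO(index_text), delimiter="\t")
    return [path for path, sample, pop in reader if pop == population]

print(files_for_population(INDEX, "GBR"))  # the two GBR file paths
```

Because the index is dated and frozen, every downstream alignment and variant set can name exactly which index it was built from.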
<p>Public Accessibility</p>
<ul>
<li>FTP site - raw data files <a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/">ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/</a>
</li>
<li>AWS Amazon Cloud</li>
<li>web site</li>
<li>ensembl browser</li>
</ul>
<p>Tools to Assist Data Use</p>
<ul>
<li>Data slicer
<ul>
<li>slicing remote BAM or VCF files</li>
<li>web front end of samtools</li>
<li>returns subsection of given file - subset by population, individual</li>
</ul>
</li>
<li>Variant Pattern Finder</li>
<li>VCF to PED: haploview (PED)</li>
<li>Ensembl Variant Effect Predictor
<ul>
<li>Predicts functional consequences of variants - SNPs, Indels, Structural Variation</li>
<li>Web & API based</li>
<li>Can provide Sift and PolyPhen, HGVS, Refseq gene name</li>
</ul>
</li>
<li>Population Allele Frequency Tool (coming soon!): range of variations</li>
</ul>
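<p>The Data Slicer described above returns a subsection of a VCF. The same idea in pure Python over an inline toy VCF (real slicing works on indexed files via samtools/tabix, not a linear scan):</p>

```python
# Toy version of the Data Slicer: return VCF records inside a genomic region.
# The inline VCF and its records are invented; headers are kept as-is.

TOY_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT
1\t10000\trs1\tA\tG
1\t10500\trs2\tC\tT
2\t10100\trs3\tG\tA
"""

def slice_vcf(vcf_text, chrom, start, end):
    out = []
    for line in vcf_text.splitlines():
        if line.startswith("#"):
            out.append(line)          # keep header lines
            continue
        c, pos, rest = line.split("\t", 2)
        if c == chrom and start <= int(pos) <= end:
            out.append(line)
    return "\n".join(out)

print(slice_vcf(TOY_VCF, "1", 10000, 10400))  # headers plus rs1 only
```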
<h4 id="a-nameclarkeqaq-amp-aa_4">
<a name="clarke-qa">Q & A</a> <a class="head_anchor" href="#a-nameclarkeqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: 1000 genomes project - many 340bp all deletions without insertions?</strong></p>
<p><strong>A:</strong> Quality - false discovery rate <5%. Structural variants are very difficult. Weren’t sufficiently confident in structural variations that aren’t deletions - did not include in db. Structural variations will always be more limited.</p>
<p><strong>Q: Idea of a data freeze and recall - uuid, public key trust network - possible route?</strong></p>
<p><strong>A:</strong> sounds like a good idea</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namestanzionethe-iplant-collaborative-cyber_2">
<a name="stanzione">The iPlant Collaborative: Cyberinfrastructure for 21st Century Biology</a> <a class="head_anchor" href="#a-namestanzionethe-iplant-collaborative-cyber_2">#</a>
</h2><h3 id="dan-stanzione-university-of-texas-at-austin-u_3">Dan Stanzione, University of Texas at Austin, USA <a class="head_anchor" href="#dan-stanzione-university-of-texas-at-austin-u_3">#</a>
</h3><blockquote>
<h4 id="a-namestanzioneabstractabstracta_4">
<a name="stanzione-abstract">Abstract</a>: <a class="head_anchor" href="#a-namestanzioneabstractabstracta_4">#</a>
</h4>
<p>iPlant is a new kind of virtual organization, a cyberinfrastructure (CI) collaborative created to catalyze progress in computationally-based discovery in plant biology. iPlant has created a comprehensive and widely used CI, driven by community needs, and adopted by a number of large-scale informatics projects and thousands of individual users. iPlant holds more than 1.5 petabytes of user data comprising several hundred million files today, and is thus deeply involved in the “Big Data” challenges of biologists, from storing to analyzing to sharing rapidly growing amounts of data. </p>
<p>This talk will outline the iPlant CI, and discuss what iPlant is doing today to address data challenges, as well as plans for the future. The talk will also address trends the project sees in how users are handling data, and the potential technological solution on the horizon to address them. </p>
<p>iPlant is supported by the National Science Foundation via Award #DBI-1265383. </p>
<p><a href="https://www.iplantcollaborative.org/">https://www.iplantcollaborative.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>iPlant - co-director (until 8 weeks ago). Passed co-director to <a href="http://www.iplantcollaborative.org/connect/staff-collaborators/matthew-w-vaughn-phd">Matthew W. Vaughn</a></p>
<p>What is iPlant:<br>
community-driven organization building cyberinfrastructure for the plant (and animal) sciences</p>
<h5 id="cyberinfrastructure_5">cyberinfrastructure <a class="head_anchor" href="#cyberinfrastructure_5">#</a>
</h5>
<p>combination of computing, data storage, networking and humans.<br>
to achieve some scientific goal</p>
<h5 id="iplant_5">iPlant <a class="head_anchor" href="#iplant_5">#</a>
</h5>
<ul>
<li>6th year</li>
<li>14K researchers access services or data - from ecology to epigenomics</li>
</ul>
<p>Achievements through iPlant’s open infrastructure</p>
<ul>
<li>BIEN - generate range maps for species</li>
<li>1KP project - 100M sequence reads - richer tree of plant data.
blast annotation</li>
<li>animal mandate - cattle/buffalo pipelines</li>
<li>GWAS and more</li>
</ul>
<h5 id="iplant-services_5">iPlant Services <a class="head_anchor" href="#iplant-services_5">#</a>
</h5>
<ul>
<li>Atmosphere - on demand cloud computing:
friendly front end for cloud - web interface.
pick images.
can log in via shell to image</li>
<li>iPlant data store</li>
<li>discovery environment.
rich catalog of bioinformatics machines/tools you can choose from.
put together pipelines - gui</li>
<li>iPlant APIs: embed iplant CI capabilities</li>
<li>foundation of computation by TACC </li>
<li>TACC: one of the world’s largest data providers.
provides a comprehensive cyberinfrastructure ecosystem.
not just machines, tools, apis, team</li>
</ul>
<p>Powered by iPlant </p>
<ul>
<li>build your own informatics project!</li>
<li>rPlant - r project built on iPlant</li>
<li>Araport - uses iPlant services</li>
</ul>
<p>Workflow Optimization and Consulting</p>
<ul>
<li>12 year analysis - down to 3 days on cluster, working with iPlant</li>
<li>Code optimization: PINT - code written in R, rewritten; now done in 4h</li>
</ul>
<p><strong>Democratizing access to high-throughput genome annotation</strong></p>
<h5 id="data-store_5">Data store: <a class="head_anchor" href="#data-store_5">#</a>
</h5>
<ul>
<li>federated sources iRODS (DFC) - AWS</li>
<li>geographic replication - U of Austin and TACC</li>
<li>600 TB user data and growing<br>
700 TB Galaxy<br>
200 TB special projects</li>
<li>community collections</li>
<li>100GB in 27min - UC Berkeley to UA</li>
<li>Evolving the Data Strategy: open file storage, few roles. iDS - some filetype detection, manual metadata tagging, elastic search</li>
<li>Scaling for team science: easy scaling when too large for laptop to open</li>
</ul>
<h5 id="big-data-observations_5">Big Data Observations <a class="head_anchor" href="#big-data-observations_5">#</a>
</h5>
<ul>
<li>About 5B files at TACC - 3.5x more than Jan 2013</li>
<li>We delete at least 300M files per month</li>
<li>About 30PB in use</li>
<li>file count and size increasing rapidly</li>
<li>95% of I/O operations don’t actually move data</li>
</ul>
<p>Soap Box</p>
<ul>
<li>Average practice is getting worse in data transfer, file i/o and programming</li>
<li>best practice- amazing! - 1,024 core job, generate 1PB in 2h, reanalyzed dozen times < day.
good user, know what they’re doing</li>
<li>worst practice - 128-core job - generated 80x the metadata traffic of the above job and crashed the filesystem.
moving 1PB over a 10GB/s network via http will take about 1.4 years<br>
<strong>c:</strong> f=fopen("file.txt", "w"); <em>// 3 metadata writes</em><br>
<strong>python:</strong> f=open('file.txt', 'w') <em># 17 metadata writes</em>
</li>
<li><strong>Cloud lets us do stupid things we do in software and run it on a large scale</strong></li>
</ul>
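<p>The metadata-write counts quoted above would come from a tracer like strace; the underlying batching point can be shown with a toy counter - many tiny writes vs one big one (in-memory file, no real I/O):</p>

```python
# Toy illustration of why batching I/O matters: count .write() calls on a
# wrapped in-memory file. Real metadata/syscall counts come from a tracer.
import io

class CountingFile(io.StringIO):
    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, s):
        self.write_calls += 1
        return super().write(s)

lines = [f"record {i}\n" for i in range(1000)]

naive = CountingFile()
for line in lines:             # one write call per record
    naive.write(line)

batched = CountingFile()
batched.write("".join(lines))  # a single write call, same bytes

print(naive.write_calls, batched.write_calls)  # 1000 vs 1
```

Same data on disk either way; the difference is how many operations the filesystem has to absorb, which is exactly what separates the best-practice and worst-practice jobs above.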
<p>Speed things up</p>
<ul>
<li>Technological solutions are coming that can meet demand</li>
<li>machine learning, data transfer can help speed things up. But we still need good software</li>
</ul>
<h4 id="a-namestanzioneqaq-amp-aa_4">
<a name="stanzione-qa">Q & A</a> <a class="head_anchor" href="#a-namestanzioneqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (illumina) <strong>Are there tools to analyze applications to determine their lack of efficiencies?</strong></p>
<p><strong>A:</strong> Yes, there are. Caveats: some tools - perfexpert (tooling and analysis) - low level performance tools. Not as useful with non-low level languages. Not great for python.<br>
Build job stats on system - can tell you efficiencies of your code on their system.</p>
<p><strong>Q:</strong> (Mesirov) <strong>What’s your process on who gets to use it, who doesn’t?</strong></p>
<p><strong>A:</strong> iPlant: all resources NSF funded. Some XSEDE. XRAC - any open-science-funded researcher. Must be US and published.<br>
iPlant - will open up under 10K hours. tiers on higher use, compare with other users.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namemesirovgenomespace-a-community-web-envi_2">
<a name="mesirov">GenomeSpace: A Community Web Environment for Genomic Analysis Across Diverse Bioinformatic Tools</a> <a class="head_anchor" href="#a-namemesirovgenomespace-a-community-web-envi_2">#</a>
</h2><h3 id="jill-p-mesirov-cio-at-broad-institute-usa_3">Jill P. Mesirov, CIO at Broad Institute, USA <a class="head_anchor" href="#jill-p-mesirov-cio-at-broad-institute-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namemesirovabstractabstracta_4">
<a name="mesirov-abstract">Abstract</a> <a class="head_anchor" href="#a-namemesirovabstractabstracta_4">#</a>
</h4>
<p>Over the last two decades genomics has accelerated at an exponential pace, driven by new sequencing and other genomic technologies, promising to transform biomedical research. These data offer a new era of potential for the understanding of the basic mechanisms of disease and identification of novel treatments. Concurrently, there has been a growing emphasis on integrating all of the available data types to better inform scientific discovery. There are now thousands of bioinformatic analysis and visualization tools for this wealth of data. To leverage these tools to make biomedical discoveries, biologists must be empowered to access them and combine them in creative ways to explore their data. However, this vision has been out of reach for almost all biomedical researchers.</p>
<p>We will describe and give example applications of GenomeSpace, <a href="http://www.genomespace.org">http://www.genomespace.org</a>, an open environment that brings together a community of 14 diverse computational genomics tools and data sources, and enables scientists to easily combine their capabilities without the need to write scripts or programs. Begun as a collaboration of six core bioinformatics tools - Cytoscape (UCSD), Galaxy (Penn State University), GenePattern (Broad Institute), Genomica (Weizmann Institute), the Integrative Genomics Viewer (Broad Institute), and the UCSC Genome Browser (UCSC) - the GenomeSpace community continues to grow. GenomeSpace features support for cloud-based data storage and analysis, multi-tool analytic workflows, automatic conversion of data formats, and ease of connecting new tools to the environment.<br>
Funding provided by NHGRI and Amazon Web Services</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="a-hrefhttpwwwgenomespaceorggenomespacea-fairl_5">
<a href="http://www.genomespace.org/">GenomeSpace</a> - fairly recent project <a class="head_anchor" href="#a-hrefhttpwwwgenomespaceorggenomespacea-fairl_5">#</a>
</h5>
<p>Background</p>
<ul>
<li>accelerated rate at which biological data acquired. enabled us to do all sorts of global analysis projects</li>
<li>Swamped by development of next gen sequencing technologies</li>
<li>availability of this data has led to progress towards goals to understand disease at the molecular level and understand the genetic basis and mechanisms for disease</li>
<li>now know ~3K Mendelian disease genes; 5K loci have been associated with over 6K common diseases and traits</li>
<li>ENCODE- all functional elements of genome and dark matter</li>
<li>ICGC/TCGA tumour types</li>
</ul>
<h5 id="new-trends_5">New Trends <a class="head_anchor" href="#new-trends_5">#</a>
</h5>
<ul>
<li>cost down, methods up</li>
<li>more types of data are acquired</li>
<li>mRNA, copy number, microRNA, epigenetic methylation, RNAi.
more sensitive and less messy data</li>
<li>increase in integrative approaches.
leveraging all these kinds of data</li>
<li>more large-scale projects (x-lab, x-institution)</li>
<li>moved from single gene analysis -> pathway/network view.
how genes <em>really</em> work</li>
</ul>
<h5 id="what-do-we-need-to-take-advantage_5">What do we need to take advantage? <a class="head_anchor" href="#what-do-we-need-to-take-advantage_5">#</a>
</h5>
<p>integrate large data sets and multiple data types.<br>
data management/identification - how do I find what helps me?</p>
<p>more complex workflows and algorithms</p>
<ul>
<li>increasing computational complexity</li>
<li>compute power demands</li>
<li>need to interoperate methods and tools</li>
<li>available and accessible to biologists:
in a more friendly way.
can’t be just the computational cadre - but whole community</li>
</ul>
<p>visualize large integrated data sets:<br>
viewers, help us look at reads and see if that call makes sense</p>
<p>validate computational results</p>
<h5 id="will-focus-on-gt-more-complex-workflowsalgori_5">Will focus on -> More complex workflows/algorithms <a class="head_anchor" href="#will-focus-on-gt-more-complex-workflowsalgori_5">#</a>
</h5>
<ul>
<li>interoperate methods and tools</li>
<li>available to all</li>
</ul>
<p>Integrative genomics</p>
<ul>
<li>tremendous advances last 10 years</li>
<li>by integrating lots of different kinds of data</li>
</ul>
<p>Difficulty of getting these tools to work together - need to develop infrastructure.<br>
<strong>Challenge:</strong> flood of data & proliferation of tools</p>
<ul>
<li>tools don’t always play well together, want to use them all in one place</li>
<li>2012: 7-10K bioinformatics tools on the web.
just Broad - 60-70 tools. not counting internal tools</li>
<li>5K public databases</li>
<li>use case (breast cancer): 12 steps, 6 tools, 7 transitions
<ul>
<li>transitions -> data formats different between tools</li>
<li>how can we democratize this data analysis and bring to the rest of the community?</li>
</ul>
</li>
</ul>
<p>One monolithic tool OR a cooperative approach</p>
<ul>
<li>lightweight layer for interoperability with automatic data transfer.
lightest weight possible - do data transfer for the users</li>
<li>leverage multiple groups and existing tools</li>
<li>access to familiar tools with usual look and feel.
so users don’t have to learn how to use them again</li>
</ul>
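<p>A toy sketch of what such a lightweight interoperability layer might look like - a registry of format converters that moves data between tools without the user writing scripts. All names here are illustrative, not GenomeSpace’s actual API:</p>

```python
# Hypothetical converter registry: tools declare the formats they
# understand, and the layer converts data automatically in between.

converters = {}

def converter(src, dst):
    """Register a function that converts format `src` to `dst`."""
    def register(fn):
        converters[(src, dst)] = fn
        return fn
    return register

@converter("tsv", "csv")
def tsv_to_csv(text):
    # naive conversion, fine for illustration
    return "\n".join(line.replace("\t", ",") for line in text.splitlines())

def convert(text, src, dst):
    """Convert `text` from `src` to `dst`, if a converter is registered."""
    if src == dst:
        return text
    try:
        return converters[(src, dst)](text)
    except KeyError:
        raise ValueError(f"no converter from {src} to {dst}")

print(convert("gene\tvalue\nTP53\t3.1", "tsv", "csv"))
```

The real system also has to handle richer formats (GCT, BED, VCF, …); the point is that the layer stays as light as possible and does the transfer for the user.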
<h5 id="genomespace_5">GenomeSpace: <a class="head_anchor" href="#genomespace_5">#</a>
</h5>
<ul>
<li>shared vision of 6 bioinformatics tools.
get them to talk to each other very easily</li>
<li>have it live in the cloud - server in cloud.
talks to GS data sources or components</li>
<li>14 tools right now (4 or 5 on the way).
infrastructure at a place where the new tools were enabled in ~1 programmer day.
portals: access portals from genome space (eg IM)</li>
<li>Use GenomeSpace S3 storage or add your own Amazon account.
Dropbox can be connected.
in development: OpenStack & Google Drive</li>
</ul>
<h5 id="how-do-i-use-it_5">How do I use it? <a class="head_anchor" href="#how-do-i-use-it_5">#</a>
</h5>
<p>Go to the cookbook for: how to build a more complex analysis,<br>
how to leverage these different tools</p>
<p>GenomeSpace recipe collection</p>
<ul>
<li>summary of what the recipe does & high level steps and tools</li>
<li>summary of workflow and steps in recipe</li>
<li>video of someone going through the recipe</li>
<li>more detail on recipe - real biological use case</li>
<li>walk through a protocol of all detailed steps</li>
<li>easy to use!</li>
</ul>
<h5 id="join-the-community-a-hrefhttpwwwgenomespaceor_5">Join the community! <a href="http://www.genomespace.org/">http://www.genomespace.org/</a> <a class="head_anchor" href="#join-the-community-a-hrefhttpwwwgenomespaceor_5">#</a>
</h5>
<p>open source, on bitbucket <a href="https://bitbucket.org/genomespace/">https://bitbucket.org/genomespace/</a></p>
<h4 id="a-namemesirovqaq-amp-aa_4">
<a name="mesirov-qa">Q & A</a> <a class="head_anchor" href="#a-namemesirovqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>loved the recipes. Regular recipes still work 50 years later (broccoli doesn’t change). Bioinformatics paper 10 years ago will not work. How much time and effort is required to create a recipe in an environment where tools will be updated? Will it work in 5 years?</strong></p>
<p><strong>A:</strong> Tried to limit the scope of the recipes - not a beginning-to-end paper. More simple - just 2 or 3 tools. Committed to setting up a steering committee for the recipe collection to keep them honest.<br>
RNASeq - many are beginning to use it in their work. Yet the methods for analyzing RNASeq haven’t been settled. A challenge they recognize. Community resource - users can report when recipes aren’t working. Go to the forum.</p>
<p><strong>Q:</strong> (illumina) <strong>Data from different sources, does GenomeSpace provide info on challenges on combining different data?</strong></p>
<p><strong>A:</strong><br>
Can do: put warnings. Watch out for the follow… etc. People who develop these recipes must understand the workflow fairly well so they know the gotchas.<br>
Can’t do: cannot anticipate all the ways in which a biologist will misuse the resource.<br>
People misuse tools. Try to give enough info and warnings to keep the probability low.</p>
<p><strong>Q:</strong> followup: <strong>Account for differences in platforms?</strong></p>
<p><strong>A:</strong> Don’t have funding for all, but we do contact vendors. </p>
<p><strong>Q: Thank you for making something more user friendly!</strong></p>
<p><strong>Q: Clinical data - do you have the security to handle this?</strong></p>
<p><strong>A:</strong> Security that Amazon Cloud provides. New round of funding: agreed to put warnings for ppl who are uploading data. If you have data that needs to be kept private - can use your own Amazon S3/Dropbox.<br><br>
GenomeSpace does not do analysis - it’s on the tools.</p>
<p><strong>Q:</strong> (IBM - Royyuru) <strong>Reproducibility - read about a tool in a paper, but can’t reproduce. Can GenomeSpace add machine readable script to run the tool?</strong></p>
<p><strong>A:</strong> Can’t go into tools themselves - lightweight. Will talk offline.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nametaylorfged-the-functional-genomics-data_2">
<a name="taylor">FGED: The Functional Genomics Data Society</a> <a class="head_anchor" href="#a-nametaylorfged-the-functional-genomics-data_2">#</a>
</h2><h3 id="francis-ouellette-ontario-institute-for-cance_3">Francis Ouellette, Ontario Institute for Cancer Research, Canada <a class="head_anchor" href="#francis-ouellette-ontario-institute-for-cance_3">#</a>
</h3>
<ul>
<li>(Replaced: Ronald C. Taylor, Pacific Northwest National Laboratory, USA)</li>
</ul>
<p><em>Selected on merit - not invited talk. Ron has laryngitis - Francis Ouellette is presenting slides.</em></p>
<blockquote>
<h4 id="a-nametaylorabstractabstracta_4">
<a name="taylor-abstract">Abstract</a> <a class="head_anchor" href="#a-nametaylorabstractabstracta_4">#</a>
</h4>
<p>The Functional Genomics Data Society (FGED), founded in 1999 as the MGED Society, is a registered international society that advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our mission is to be a positive agent of change in the effective sharing and reproducibility of functional genomic data. Our work on defining minimum information specifications for reporting data in functional genomics papers (e.g., MIAME) has already enabled large data sets to be used and reused to their greater potential in biological and medical research. The FGED Society seeks to promote mechanisms to improve the reviewing process of functional genomics publications. We also work with other organizations to develop standards for biological research data quality, annotation and exchange. We actively develop methods to facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by biological research efforts in data integration and meta-analysis.</p>
<p><a href="http://fged.org/">http://fged.org/</a></p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>Spirit of openness - share everything</p>
<h5 id="functional-genomics-data-society-amp-its-miss_5">Functional Genomics Data Society & Its Mission <a class="head_anchor" href="#functional-genomics-data-society-amp-its-miss_5">#</a>
</h5>
<p>In the beginning there were microarrays - MGED</p>
<p>MIAME - standard for exchanging raw microarray data</p>
<ul>
<li>too much to ask - researchers should publish fully documented code</li>
<li>do reviewers check these?</li>
<li>ArrayExpress and GEO have >6M high throughput assays from 30K functional genomic studies.
use MIAME, so it’s working for this group</li>
<li>Many studies have shown the reusability of these data</li>
</ul>
<p>MINSEQE - minimal standards for a nucleotide sequencing experiment.<br>
General description of the aim, metadata, raw reads, processed data</p>
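<p>A rough illustration of how a MINSEQE-style completeness check could work - the field names below are hypothetical, not the standard’s actual vocabulary:</p>

```python
# The standard asks for a general description of the aim, metadata,
# raw reads, and processed data; model those as required fields.
REQUIRED_FIELDS = {"aim", "sample_metadata", "raw_reads", "processed_data"}

def missing_fields(submission):
    """Return the required fields absent from a submission dict, sorted."""
    return sorted(REQUIRED_FIELDS - submission.keys())

sub = {"aim": "RNA-seq of treated vs. control cells", "raw_reads": "SRR000001"}
print(missing_fields(sub))  # fields still needed before the data are reusable
```

This is the kind of mechanical check a repository or reviewer tool could run, which is where "do reviewers check these?" above becomes automatable.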
<p>FGED Standards: big data needs standards, and FGED creates and aids the development of such standards</p>
<p>FGED is an open society, welcome feedback, input and volunteers</p>
<h4 id="a-nametaylorqaq-amp-aa_4">
<a name="taylor-qa">Q & A</a> <a class="head_anchor" href="#a-nametaylorqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Stein) <strong>What is the journal policy in the continued evolution of this effort?</strong></p>
<p><strong>A:</strong> Publishers in general have great interest and support. They are looking for things like this. PLoS - new data release policy. Publishers are keen to see what the community-agreed-upon standards are.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-namecarrollinsights-from-the-genomic-analys_2">
<a name="carroll">Insights from the Genomic Analysis of 10,940 Exomes and 3,751 Whole Genomes Demystifying Running at Scale and the Scientific Results</a> <a class="head_anchor" href="#a-namecarrollinsights-from-the-genomic-analys_2">#</a>
</h2><h3 id="andrew-carroll-dnanexus-usa_3">Andrew Carroll, DNAnexus, USA <a class="head_anchor" href="#andrew-carroll-dnanexus-usa_3">#</a>
</h3><blockquote>
<h4 id="a-namecarrollabstractabstracta_4">
<a name="carroll-abstract">Abstract</a> <a class="head_anchor" href="#a-namecarrollabstractabstracta_4">#</a>
</h4>
<p>As one of five institutions participating in the global CHARGE Consortium, the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine needed a compute and data management infrastructure solution to handle the massive amount of data (3,751 whole genomes and 10,940 exomes) they would be processing for this project. The large burst computational demands for this project would have unacceptably taxed existing resources, requiring either many months of using spare capacity or forcing other users off the cluster for 4-5 weeks to complete it faster. To address this challenge, HGSC, DNAnexus, and Amazon Web Services (AWS) teamed up to deploy a cloud-based infrastructure that could handle this ultra large-scale genomic analysis project quickly and flexibly, with zero capital investment. At the project’s peak, HGSC was able to spin up more than 20,000 cores on-demand in order to run the analysis pipeline of the CHARGE data. During this period, HGSC was running one of the largest genomics analysis clusters in the world.</p>
</blockquote><h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4>
<p>DNAnexus - a 2009 spin-out from Stanford. A darling of successful startups. Applies the cloud at scale</p>
<p>Two parts:</p>
<ol>
<li>Philosophy of the Cloud</li>
<li>Application to large project (10-11K exomes)</li>
</ol>
<h5 id="what-is-dnanexus_5">What is DNAnexus <a class="head_anchor" href="#what-is-dnanexus_5">#</a>
</h5>
<ul>
<li>scalable solution deploys on AWS (Amazon Web Services) cloud</li>
<li>handles spinning up lots of nodes, sharing data across users</li>
<li>publish own tools - external or internal</li>
</ul>
<h5 id="scientific-vision_5">Scientific Vision: <a class="head_anchor" href="#scientific-vision_5">#</a>
</h5>
<p>Challenges looming over data @ scale</p>
<p>Science is like driving</p>
<ul>
<li>car = bioinfo tool</li>
<li>these come out we can do things we couldn’t do before</li>
<li>car accidents (user error, car itself)</li>
<li>improving tools is important -> need to think about the infrastructure used to make these run</li>
</ul>
<p>Tool development - profile runtimes and cost</p>
<ul>
<li>optimize for resources (cpu, memory, bandwidth)</li>
<li>now: your tools don’t work on all platforms - configuration headaches</li>
<li>cloud: configure once, run where you want it to run</li>
</ul>
<p>Benchmarking</p>
<ul>
<li>Need good benchmark sets - prevent scientific degradation (unit test).
Know that you are correct</li>
<li>drive scientific innovation</li>
<li>extend visualizations to reach to more basic biologists.
expert bioinformaticians working with basic biologists</li>
<li>deploy at scale</li>
<li>collaboration - prevent data duplication, contribution</li>
</ul>
<p>Tool Optimization</p>
<ul>
<li>resource optimization - profile through</li>
<li>DNAnexus - waterfall view of tools! see parallelization</li>
</ul>
<p>Benchmark sets</p>
<ul>
<li>compile benchmarks and tools in a single place.
can run all tools and benchmark sets.
see differences between sets</li>
<li>Configure workflow ui - run 6 variant callers and compare.<br>
visualization - how basic biologists will access the data</li>
</ul>
<p>Collaborations</p>
<ul>
<li>managing access to data - admin, viewer, collaborator (roles).
can restrict</li>
<li>delivering the data - shipping large-scale data on physical drives will always be faster and more robust than network transfer.
local sftp works for small projects.
likely true forever</li>
</ul>
<h5 id="dnanexus-hgsccharge-collaboration_5">DNAnexus - HGSC-CHARGE Collaboration <a class="head_anchor" href="#dnanexus-hgsccharge-collaboration_5">#</a>
</h5>
<p>Analysis of 11K exomes and 4K whole genomes for the CHARGE consortium.<br>
Compute scale and distribution of results across 300 investigators</p>
<p>Baylor: 20 HiSeqs ~25TB of sequence per month</p>
<ul>
<li>growth at an exponential rate</li>
<li>load on cluster - pretty much fully booked (w/ some planned down time) </li>
<li>Mercury DNAseq pipeline
<ul>
<li>BWA + GATK realign + variant calling</li>
<li>They took out the most computationally intensive parts of the pipeline and put in DNAnexus</li>
<li>10K exomes in 5 days</li>
<li>2K nodes, 3.5M cpu hours over 10 days</li>
</ul>
</li>
<li>How much more do you get as you increase the scale?
<ul>
<li>new variants as you increase the exome scale - plot sqrt(x)</li>
<li>as we continue to sequence more and more we are going to find more and more rare variants</li>
</ul>
</li>
<li>compared with variants found in the first exome, more likely to be synonymous.
variants found in the latest 5K+ - less synonymous</li>
<li>SIFT - tolerant at first, damaging later</li>
<li>Novel - exome 1, most found in dbSNP, exome 5K+ - not found in dbsnp</li>
</ul>
<h4 id="a-namecarrollqaq-amp-aa_4">
<a name="carroll-qa">Q & A</a> <a class="head_anchor" href="#a-namecarrollqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q:</strong> (Schatz) <strong>On projects like this the first half is well structured, but gets very ad-hoc by the end. How is this structured in DNAnexus for ad-hoc queries?</strong></p>
<p><strong>A:</strong> We take advantage of the expertise of the ppl working with us. Relying on the CHARGE consortium in collaboration. Directed hypothesis generated by partners. </p>
<p><strong>Q: Can you elaborate on the datasets you’re using as benchmarks?</strong></p>
<p><strong>A:</strong> An opportunity for the community to come together - benchmarking sets are the way to go, and DNAnexus gives us an opportunity to go into this space. We are not curators of benchmark sets.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<hr>
<h2 id="a-nameschatzthe-next-10-years-of-quantitative_2">
<a name="schatz">The Next 10 Years of Quantitative Biology</a> <a class="head_anchor" href="#a-nameschatzthe-next-10-years-of-quantitative_2">#</a>
</h2><h3 id="michael-schatz-cold-spring-harbor-laboratory_3">Michael Schatz, Cold Spring Harbor Laboratory, USA <a class="head_anchor" href="#michael-schatz-cold-spring-harbor-laboratory_3">#</a>
</h3><blockquote class="short">
<h4 id="a-nameschatzabstractabstracta_4">
<a name="schatz-abstract">Abstract</a> <a class="head_anchor" href="#a-nameschatzabstractabstracta_4">#</a>
</h4>
<p>Topic change, no abstract </p>
</blockquote><h4 id="slides_4">Slides <a class="head_anchor" href="#slides_4">#</a>
</h4>
<p><a href="http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf">http://schatzlab.cshl.edu/presentations/2014.03.24.Keystone%20BigData.pdf</a></p>
<h4 id="notes_4">Notes <a class="head_anchor" href="#notes_4">#</a>
</h4><h5 id="questions-in-biology-some-broad-some-focused_5">Questions in Biology - some broad, some focused <a class="head_anchor" href="#questions-in-biology-some-broad-some-focused_5">#</a>
</h5>
<p>The interesting thing about these questions: there is no single instrument that answers any of them</p>
<p>Answer these questions:</p>
<ul>
<li>big stack of technologies</li>
<li>raw sensors at the bottom</li>
<li>then systems, compute systems, algorithms, machine learning -> results</li>
<li>Will walk through this pyramid and see what major trends</li>
</ul>
<h5 id="bottom-tier-sensors-cost-per-genome-drives-mu_5">Bottom tier - sensors : Cost per Genome - drives much of the talks today. need scalability <a class="head_anchor" href="#bottom-tier-sensors-cost-per-genome-drives-mu_5">#</a>
</h5>
<ul>
<li>map of where the major sequencing instruments are across the planet</li>
<li>interesting thing: how widely distributed they are (not like other fields)</li>
<li>worldwide capacity exceeds 15 Pbp/year… 25 Pbp/year on Jan 15 (Illumina X10 systems announcement)</li>
<li>How much is a PB? Sequencing human genomes to 30x - 10K genomes - stacked on DVDs: 787 feet of DVDs (~1/6 of a mile tall).
Or 500 2 TB drives, ~$500K</li>
</ul>
<p>DNA data tsunami - growth of sequencing around 3x per year</p>
<ul>
<li>not too distant future: ~1 exabyte by 2018</li>
<li>~1 zettabyte by 2024.
<ul>
<li>How big? zettabyte is 1M PB</li>
<li>stack of DVDs = 10B genomes = halfway to moon</li>
<li>YouTube and astronomy datasets - roughly ~100PB today, growing exponentially</li>
</ul>
</li>
</ul>
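<p>A quick back-of-envelope check of the projections above, compounding the ~25 Pbp/year capacity (early 2014) at ~3x per year:</p>

```python
# Simple exponential extrapolation of worldwide sequencing capacity.
PB_PER_EB = 1_000        # petabytes per exabyte
PB_PER_ZB = 1_000_000    # petabytes per zettabyte

def capacity_pb(year, base_pb=25, base_year=2014, growth=3):
    """Projected capacity in PB for a given year, under 3x/year growth."""
    return base_pb * growth ** (year - base_year)

print(f"2018: ~{capacity_pb(2018) / PB_PER_EB:.1f} EB")  # on the order of an exabyte
print(f"2024: ~{capacity_pb(2024) / PB_PER_ZB:.1f} ZB")  # on the order of a zettabyte
```

So the "~1 exabyte by 2018" and "~1 zettabyte by 2024" figures in the notes fall straight out of the 3x/year growth assumption.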
<p>Sequencing Centres map - will be roughly the same</p>
<ul>
<li>see widespread network of sequencing networks across the planet</li>
<li>biological sensor network nanopore - @ewanbirney <a href="https://twitter.com/ewanbirney/status/448423540472422400">https://twitter.com/ewanbirney/status/448423540472422400</a>
mobile - can embed in many remote locations (hospitals, schools, …)</li>
<li>the rise of a digital immune system - Schatz.
<a href="http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf">http://www.biomedcentral.com/content/pdf/2047-217X-1-4.pdf</a>
</li>
</ul>
<p>compression will help - need to be aggressive about throwing out data</p>
<ul>
<li>particle physics - strength here. massive amount of data produced is discarded</li>
<li>resequencing will be negligible </li>
<li>preciousness of the data/sample: cancer is the high watermark of complexity.
in principle we may want to hold on to every read</li>
</ul>
<p>major applications: </p>
<ul>
<li>human health - where $$ available</li>
<li>widespread distributed mobile sensors</li>
<li>digital immune system - constantly monitoring what’s coming up (microbes, etc)</li>
</ul>
<h5 id="next-phase-compute-algorithms_5">Next phase - compute, algorithms <a class="head_anchor" href="#next-phase-compute-algorithms_5">#</a>
</h5>
<ul>
<li>the compute will be everywhere - Cloud</li>
<li>I had the distinction of having the first paper in PubMed that ever used AWS for sequence analysis</li>
<li>will be multi-cloud - specialized for geographic or political reasons.
centric on model organism or disease of study.
makes sense to have concentrated system</li>
</ul>
<p>compute - parallel algorithm spectrum</p>
<ul>
<li>better parallelization</li>
<li>
<strong>embarrassingly parallel</strong>: problems most easy to run on cluster.
building a city? hire 100s of crews, build in parallel</li>
<li>
<strong>loosely coupled algorithms</strong>: MapReduce.
building a skyscraper - can’t build every floor at the same time.
a lot of the work is independent but then is aggregated together</li>
<li>
<strong>Tightly coupled</strong>: graphs and MD simulations.
growing one massive tree - more farmers will not help.
<em>“nine women cannot make a baby in one month”</em>
</li>
</ul>
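<p>A minimal sketch of the first two points on the spectrum - counting bases per read chunk is embarrassingly parallel (chunks are independent), and merging the partial counts is the loosely coupled MapReduce-style aggregation. Toy data, not a real pipeline:</p>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_bases(chunk):
    """'Map' step: count bases in one chunk of reads, independently."""
    return Counter("".join(chunk))

reads = ["ACGT", "AACC", "GGTT", "ACGG"]
chunks = [reads[:2], reads[2:]]  # each chunk can run on its own worker

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_bases, chunks))

# 'Reduce' step: aggregate the independent partial results.
total = sum(partials, Counter())
print(total)
```

The tightly coupled end of the spectrum (graphs, MD simulations) has no such clean split, which is why "nine women cannot make a baby in one month".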
<p>Better hardware:</p>
<ul>
<li>MUMmerGPU </li>
<li>specialized hardware (GPU)</li>
</ul>
<p>Crossbow - algorithm on map reduce</p>
<ul>
<li>using many commodity computers - run algorithm in parallel (map reduce)</li>
<li>use Bowtie and SOAPsnp</li>
<li>compelling example of cloud computing in genomics.
transfer time and cost – improving</li>
<li>challenge: requires more applications!</li>
<li>each algorithm requires customization - need skilled developers</li>
</ul>
<p>PanGenome alignment and assembly</p>
<ul>
<li>shifting to paradigm where raw input is set of complete genomes</li>
<li>emerging long read sequencing technologies</li>
<li>can assemble entire microbial / yeast genomes into complete assemblies</li>
<li>could be the case we have complete human genomes - get started now</li>
<li>start with a set of individual genomes - segments of genomes in a graph.
get context from the graph - a De Bruijn graph</li>
</ul>
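<p>A minimal De Bruijn graph construction over k-mers, as a sketch of the graph structure described above:</p>

```python
from collections import defaultdict

def de_bruijn(seqs, k):
    """Edge from each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Toy example: one short segment; real pan-genome graphs are built
# from many complete genomes at once.
g = de_bruijn(["ACGTACG"], k=3)
print(dict(g))
```

Each node is a (k-1)-mer shared across input genomes, which is how the graph provides context when the raw input shifts from reads to sets of complete genomes.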
<p>Will see major informatics centers organized around topics</p>
<ul>
<li>moving code to data</li>
<li>driven by parallel algorithms/hardware</li>
<li>shift to large populations</li>
<li>applications: read mapping will fade out, new problems (at population level) will replace it</li>
</ul>
<h5 id="top-of-slice-results-work-at-cshl-genetics-of_5">Top of slice: Results: work at CSHL - genetics of autism <a class="head_anchor" href="#top-of-slice-results-work-at-cshl-genetics-of_5">#</a>
</h5>
<p>Sample set: 3K families - simplex families</p>
<ul>
<li>one child has autism but rest of siblings not autistic</li>
<li>sequence exomes of all individuals across families</li>
<li>what do we observe relative to siblings/parents?</li>
<li>focus: gene-killing mutations.
loss-of-function mutations specific to autistic children, to find genes associated with the disease</li>
<li>identifying SNPs is quite mature - GATK (Broad), handles biases</li>
</ul>
<p>SCALPEL - find indels from short read sequencing data</p>
<ul>
<li>combine best of alignment and assembly</li>
<li>use standard aligner to map reads to genome.
purpose of this alignment is to localize the problem (locally, not globally - one exon/region at a time)</li>
<li>extract out reads that localize to a particular part</li>
<li>on the fly assembly with de Bruijn graph</li>
<li>find end to end haplotype paths spanning graph</li>
<li>align assembled sequence to region</li>
</ul>
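<p>A toy sketch (not SCALPEL’s actual implementation) of the “find end-to-end haplotype paths” step - enumerating simple source-to-sink paths in a small assembly graph:</p>

```python
def haplotype_paths(graph, source, sink):
    """All simple paths from source to sink in a dict-of-lists graph (DFS)."""
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))
    return paths

# Toy graph with a bubble: two branches = two candidate haplotypes
# (e.g. reference vs. an indel allele in the localized region).
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(haplotype_paths(g, "A", "D"))
```

Each spanning path is a candidate haplotype to align back against the region, which is the final step in the list above.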
<p>Experimental analysis and validation</p>
<ul>
<li>selected one deep coverage exome for deep analysis</li>
<li>GATK, SCALPEL, SOAPindel</li>
<li>99% accuracy where all overlap</li>
<li>specific to SCALPEL - 77% (more than others)</li>
</ul>
<p>de novo genetics of autism - same number of mutations as siblings</p>
<ul>
<li>but gene killers - enrichment in autistic kids</li>
<li>2:1 enrichment in nonsense</li>
<li>2:1 enrichment of frameshift</li>
<li>4:1 splice site mutations</li>
<li>correlation to age of father</li>
</ul>
<p>paper available on bioRxiv, code available on SourceForge</p>
<h4 id="potential-for-big-data_4">Potential for big data <a class="head_anchor" href="#potential-for-big-data_4">#</a>
</h4>
<ul>
<li>folks from Google: flu trends in nature - 2009</li>
<li>google searches for flu like symptoms - then outbreaks occur</li>
<li>Fallacy of big data? - They’ve gotten it wrong.
‘big data hubris’ - the assumption that big data are a substitute for traditional data collection and analysis.
pipelines are extremely important</li>
<li>risks of big data - given birthday and hometown - can predict SSN with good accuracy</li>
</ul>
<h4 id="power-from-data-aggregation-champion-ourselve_4">Power from data aggregation - champion ourselves and the future <a class="head_anchor" href="#power-from-data-aggregation-champion-ourselve_4">#</a>
</h4>
<ul>
<li>mindful of risks - over-fitting, reproducibility</li>
<li>caution is prudent</li>
<li>data aggregation isn’t going to solve everything - keep being critical - does this make sense? continuous feedback loop</li>
</ul>
<p>What is a data scientist? Many fields. To be really successful, you need strengths, experience and expertise in these fields.</p>
<h4 id="a-nameschatzqaq-amp-aa_4">
<a name="schatz-qa">Q & A</a> <a class="head_anchor" href="#a-nameschatzqaq-amp-aa_4">#</a>
</h4>
<p><strong>Q: Observation: Talking about the sequencing coming down in price - What happens when sequencing becomes so cheap and democratized that any can do this? How do we as a community get the legislature to start thinking of these privacy concern? We need to look at this data</strong></p>
<p><strong>A:</strong> No simple answer. Part of it will come through scientific discoveries - congressmen pay attention when there are big breakthroughs. Lobby - we need to talk to the rest of the world. Part of it is going to come in response to outbreaks - when data is abused. There’s already some legislation in place so you can’t get discriminated against for, say, insurance. But there are implicit discriminations. Don’t know how to fix this outside of education and reaching out to the next generation.</p>
<p><strong>Q:</strong> (Mesirov)</p>
<ol>
<li><strong>Congratulations: terrific meeting!</strong></li>
<li><strong>30+ years ago I heard Grace Murray Hopper speak - made a comment about how we are all going to be drowning in data. All kinds of data. I appreciated your comment on what we keep. Important: we have some kind of metric of utility - huge amounts of it not touched for long periods of time. Think about what happens with this data that is never used again. Otherwise we’re all going to drown</strong></li>
</ol>
<p><strong>A:</strong> The utility of data is certainly something to be considered. We’re bad at estimating it. We’re all hoarders. A system failure recently - couldn’t copy off a PB of data fast enough. Trying to assess the preciousness of data and time. Some metrics are hard to measure. I anticipate the storage vendors will get better at providing tools to assess what is on a filesystem. Tools today are crude; I hope these will improve. At the very least we can identify if there are big datasets we haven’t accessed in years</p>
<p><strong>Q:</strong> (Swedlow) <strong>At Dundee, hierarchical filesystems backed up by tape. Primary data is images and proteomics - 95% of it is not touched again 3 months later. Graph representations of sequences - we will be doing the same thing with images. Concerned with the computational cost of recalculating these graphs. How expensive will recalculation be?</strong></p>
<p><strong>A:</strong> today it’s expensive - but this is an opportunity for research. For example: at level of suffix trees - construction methods. We can dust those off and improve algorithms.</p>
<p><a href="#schedule">back to the speaker list →</a></p>
<h3 id="other-posts-in-this-series_3">Other posts in this series: <a class="head_anchor" href="#other-posts-in-this-series_3">#</a>
</h3>
<ul>
<li><a href="/big-data-in-biology">Big Data in Biology</a></li>
<li><a href="/big-data-in-biology-largescale-cancer-genomics">Big Data in Biology: Large-scale Cancer Genomics</a></li>
<li><a href="/big-data-in-biology-big-data-challenges-and-solutions-control-access-to-individual-genomes">Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes</a></li>
<li><a href="/big-data-in-biology-personal-genomes">Big Data in Biology: Personal Genomes</a></li>
<li><a href="/big-data-in-biology-imagingparmacogenomics">Big Data in Biology: Imaging/Pharmacogenomics</a></li>
<li><a href="/big-data-in-biology-databases-and-clouds#schatz">Big Data in Biology: The Next 10 Years of Quantitative Biology</a></li>
</ul>