Biocuration 2014: Battle of the new curation methods
Biocuration is incredibly important to progress in science. The process of sorting through and annotating scientific data to make it available and searchable to the public is at the heart of the ideas behind open science. I work at WormBase because I believe in its mission to curate our knowledge of nematode biology to make it freely available to the scientific community.
#isb2014 great turnout pic.twitter.com/V14if4ecgr
— Paul Davis (@bayamo2003) April 7, 2014
The Seventh International Biocuration Conference (ISB2014) was held at the University of Toronto last week. The theme of the conference this year was “Bridging the gap between genomes and phenomes”, focusing on bringing the results of the biocuration efforts to the clinicians. However, a slightly different theme stood out to me during the meeting - the tension between different methods for improved curation.
There was a clear consensus that we’ve come to an inflection point in this field. It’s no longer worthwhile, or even possible, to have detailed manual curation for each piece of biological information. Data is being generated (see NGS) and papers are being published at a tremendous rate (>100 publications/hour). Human eyes can’t keep up.
“That is the slide”. Not a mistake, showing mess abundant data #isb2014 pic.twitter.com/uRG6wAShEB
— Marc RobinsonRechavi (@marc_rr) April 7, 2014
We need to look at data as a whole. Many groups have come up with ways to automate or distribute the biocuration process, with a focus on information extraction from text. Three main approaches were presented: dictionary-based text mining, machine learning and crowdsourcing. While biocurators are civil individuals, there’s still a sense of competition between the different methods and tools.
Big Data Curation panel. My current status #ISB2014 pic.twitter.com/y2QYjX21WC
— Abigail Cabunoc (@abbycabs) April 8, 2014
‘Best Tweet’ award winner at #ISB2014 by yours truly. Disclaimer: I did not create this meme. I saw @escalant3 RT it a while ago
We’re in a period of unrest while the community decides how to handle the ‘Big Data’ we’re faced with. In the coming years, we’ll see best practices and standard tools emerge. In the meantime, here’s a brief overview of the work presented in this area.
Text mining
Text mining based on a knowledge dictionary has been a friend of biocuration for a long time. Everyone and their uncle has a text mining tool and strategy they love and support! Most tools focused on being a first pass over a paper or abstract, calling out or screening information before an expert curator takes a closer look. The BioCreative workshop in particular demonstrated some of the recent work in this area.
Overall, the community is generating much more usable and intuitive text mining tools meant to be used as a first pass for biocuration. A couple of tools that stood out to me:
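To make the "first pass" idea concrete, here is a minimal sketch of dictionary-based tagging in Python. It is not how any of the tools presented at the meeting work internally. The tiny dictionary, the identifier labels and the `first_pass` helper are all made up for illustration, and a real tool would draw on a curated vocabulary with many thousands of entries.

```python
import re

# Hypothetical mini-dictionary: lowercase surface form -> placeholder label.
# The labels are invented for this example, not real database identifiers.
DICTIONARY = {
    "daf-16": "GENE:daf-16",
    "dauer": "PHENOTYPE:dauer",
    "insulin signaling": "PROCESS:insulin-signaling",
}


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any dictionary term."""
    # Longest terms first, so multi-word entries win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds instead of \b, so hyphenated names only match as whole tokens.
    return re.compile(r"(?<![\w-])(" + alternation + r")(?![\w-])", re.IGNORECASE)


def first_pass(text, dictionary=DICTIONARY):
    """Return (offset, mention, label) tuples for a curator to review."""
    pattern = build_pattern(dictionary)
    return [
        (match.start(), match.group(1), dictionary[match.group(1).lower()])
        for match in pattern.finditer(text)
    ]


abstract = "Loss of daf-16 suppresses dauer formation downstream of insulin signaling."
for offset, mention, label in first_pass(abstract):
    print(offset, mention, label)
```

The output is only a list of candidate mentions with their positions. Deciding whether each one is a real, relevant annotation is still the expert curator's job.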
- PubTator: If you have a list of articles, this helps you sort through and find the most relevant publications to focus on
- Factoid - Bader Lab: Turns an abstract into an editable model of biological processes. Really nice UI, using Cytoscape.
Notably missing from the meeting: Textpresso from WormBase.
The tools in this space are getting more usable and accurate. However, they still require an expert curator to look at the results. We saw in some talks that this approach may not perform as accurately as some machine learning algorithms. I’m interested to see if the research and development focus will shift away from text mining in the coming years.
Machine learning
Just a few years ago, machine learning algorithms weren’t performing as well as text mining on biological data. However, with larger and larger datasets becoming normal, machine learning has begun to surpass text mining in accuracy in some literature-based curation tasks. (Gobeill, Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years [notes - line #582])
More researchers are writing machine learning algorithms to extract information from their data. So far, these are generally ad-hoc and highly specialized algorithms, with some exceptions (GOCat). We are beginning to see some user-centred tools powered by machine learning algorithms, and I hope to see even more in the future.
- GOCat: Offers both dictionary-based and machine learning models for extracting Gene Ontology terms from text.
- GIST: Uses machine learning to provide improved annotations for species from reads [notes - line #130]
Crowdsourcing
Crowdsourcing is the cool kid in this space. Science has a history of failing where Wikipedia and others have succeeded. But this meeting showed a couple of promising approaches to crowdsourcing in biocuration.
Ben Good deservedly won the ‘Best Presentation’ award for his talk, Microtask crowdsourcing for disease mention annotation in Pubmed abstracts. An interesting use of Amazon Mechanical Turk applied to science. [slides]
Apollo: less about information extraction from text, more about community genome annotation [slides]
This approach is shiny and full of potential. I think many scientists in the audience were inspired by the idea of microtask crowdsourcing in particular. While I think the ‘microtask’ of interpreting biomedical literature is unusually difficult, there are huge possibilities in this space if the right tools and approaches are developed.
Conclusion
These are the unsolicited opinions of one web developer with a particular interest in WormBase on the state of biocuration today. There is a lot of innovation in this space - I’m excited to see what happens in the next few years. Don’t worry about being replaced, biocurators! Even with all the automation going on, everyone agrees that biocurators are needed more than ever.
#ISB2014 L Stein says there are in fact, jobs for biocurators pic.twitter.com/BGzpk21rbP
— Melissa Haendel (@ontowonka) April 9, 2014
Biocuration jobs: (metadata) massage therapist, (data) wrangler, (complex data) modeler