Biocuration 2014: Battle of the new curation methods

Biocuration is incredibly important to progress in science. The process of sorting through and annotating scientific data to make it available to and searchable by the public is at the heart of the ideas behind open science. I work at WormBase because I believe in its mission to curate our knowledge of nematode biology and make it freely available to the scientific community.

The Seventh International Biocuration Conference (ISB2014) was held at the University of Toronto last week. The theme of the conference this year was “Bridging the gap between genomes and phenomes”, focusing on bringing the results of biocuration efforts to clinicians. However, a slightly different theme stood out to me during the meeting: the tension between different methods for improved curation.

There was a clear consensus that we’ve come to an inflection point in this field. It’s no longer worthwhile, or even possible, to do detailed manual curation for each piece of biological information. Data are being generated at a tremendous rate (next-generation sequencing alone guarantees that), and papers are being published faster than anyone can read them (>100 publications/hour). Human eyes can’t keep up.

We need to look at data as a whole. Many groups have come up with ways to automate or distribute the biocuration process, with a focus on information extraction from text. Three main approaches were presented: dictionary-based text mining, machine learning, and crowdsourcing. While biocurators are civil individuals, there’s still a sense of competition between the different methods and tools.

‘Best Tweet’ award at #ISB2014, won by yours truly. Disclaimer: I did not create this meme; I saw @escalant3 retweet it a while ago.

We’re at a period of unrest while the community is deciding how to handle this ‘Big Data’ we’re faced with. In the coming years, we’ll see best practices and standard tools emerge. In the meantime, here’s a brief overview of the work presented in this area.

Text mining

Text mining based on a knowledge dictionary has been a friend of biocuration for a long time; everyone and their uncle has a text mining tool and strategy they love and support! Most tools focus on making a first pass over a paper or abstract to flag candidate information before an expert curator takes a closer look. The BioCreative workshop in particular demonstrated some of the recent work in this area.
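To make the idea concrete, here’s a minimal sketch of what a dictionary-based first pass looks like. The three-entry dictionary is a toy I invented for illustration; real systems use curated lexicons with thousands of entries and much smarter matching.

```python
# Minimal sketch of dictionary-based text mining: scan an abstract for
# terms from a curated dictionary and report candidate mentions for a
# curator to review. The dictionary below is a toy, not a real lexicon.
import re

DICTIONARY = {
    "unc-22": "twitchin (muscle gene)",
    "daf-2": "insulin/IGF-1 receptor ortholog",
    "apoptosis": "programmed cell death",
}

def first_pass(text):
    """Return (term, concept, position) tuples for each dictionary hit."""
    hits = []
    for term, concept in DICTIONARY.items():
        # Word-boundary match, case-insensitive: a deliberately naive
        # first pass, meant to over-generate for a curator to screen.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            hits.append((term, concept, m.start()))
    return sorted(hits, key=lambda h: h[2])

abstract = "Mutations in unc-22 cause twitching; daf-2 modulates apoptosis."
for term, concept, pos in first_pass(abstract):
    print(f"{pos:3d}  {term:10s} -> {concept}")
```

Everything downstream of a sketch like this is curator time, which is exactly why the field is pushing on accuracy.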

Overall, the community is generating much more usable and intuitive text mining tools meant to serve as a first pass for biocuration. A couple of tools stood out to me:

Notably missing from the meeting: Textpresso from WormBase.

The tools in this space are getting more usable and accurate. However, they still require an expert curator to look at the results. We saw in some talks that this approach may not perform as accurately as some machine learning algorithms. I’m interested to see if the research and development focus will shift away from text mining in the coming years.

Machine learning

Just a few years ago, machine learning algorithms weren’t performing as well as text mining on biological data. However, with larger and larger datasets becoming the norm, machine learning has begun to surpass text mining in accuracy on some literature-based curation tasks. (Gobeill, Supervised text mining for functional curation of gene products: how the data and performances have evolved in the last ten years [notes - line #582])

More researchers are writing machine learning algorithms to extract information from their data. So far, these are generally ad-hoc and highly specialized algorithms, with some exceptions (GOCat). We are beginning to see some user-centred tools powered by machine learning algorithms, and I hope to see even more in the future.
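For the curious, here is roughly what the supervised approach looks like in miniature: a toy multinomial naive Bayes classifier, in plain Python with invented training sentences, that learns to separate curatable statements from boilerplate. Real systems train on far larger curated corpora with richer features; this only sketches the shape of the idea.

```python
# Toy supervised classifier for triaging sentences: "curatable" vs "skip".
# Multinomial naive Bayes with add-one smoothing; training data invented.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    def fit(self, examples):
        """examples: list of (text, label) pairs."""
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, text):
        best_label, best_score = None, -math.inf
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in tokenize(text):
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train = [
    ("unc-22 mutants show a twitching phenotype", "curatable"),
    ("daf-2 regulates lifespan in C. elegans", "curatable"),
    ("we thank the reviewers for helpful comments", "skip"),
    ("this work was funded by an NIH grant", "skip"),
]
clf = NaiveBayes()
clf.fit(train)
print(clf.predict("daf-2 mutants show extended lifespan"))
```

The appeal over a dictionary is that nothing here was hand-listed: the model picks up which words signal curatable content from the labelled examples alone.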


Crowdsourcing

Crowdsourcing is the cool kid in this space. Science has a history of failing where Wikipedia and others have succeeded. But this meeting showed a couple promising approaches to crowdsourcing in biocuration.

Ben Good deservedly won the ‘Best Presentation’ award for his talk, Microtask crowdsourcing for disease mention annotation in Pubmed abstracts, an interesting application of Amazon Mechanical Turk to science. [slides]

Apollo: less about information extraction from text, more about community genome annotation [slides]

This approach is shiny and full of potential. I think many scientists in the audience were inspired by the idea of microtask crowdsourcing in particular. While I think the ‘microtask’ of interpreting biomedical literature is unusually difficult, there are huge possibilities in this space if the right tools and approaches are developed.
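A microtask pipeline in miniature: several non-expert workers label the same abstract, and a simple majority vote filters out spurious labels. The worker responses below are invented for illustration; Good’s actual Mechanical Turk pipeline is considerably more sophisticated about worker quality.

```python
# Sketch of microtask aggregation: keep a disease mention only if
# enough independent workers reported it. Worker answers are invented.
from collections import Counter

def aggregate(annotations, min_votes=2):
    """annotations: one set of mention strings per worker.
    Keep a mention if at least min_votes workers reported it."""
    votes = Counter()
    for worker_mentions in annotations:
        votes.update(worker_mentions)
    return {m for m, n in votes.items() if n >= min_votes}

workers = [
    {"type 2 diabetes", "obesity"},
    {"type 2 diabetes"},
    {"type 2 diabetes", "obesity", "hypertension"},  # one spurious label
]
print(sorted(aggregate(workers)))  # ['obesity', 'type 2 diabetes']
```

The redundancy is the point: no single worker needs to be an expert, because disagreement washes out in the vote.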


These are the unsolicited opinions on the state of biocuration today from one web developer with a particular interest in WormBase. There is a lot of innovation in this space, and I’m excited to see what happens in the next few years. Don’t worry about being replaced, biocurators! Even with all the automation going on, everyone agrees that biocurators are needed more than ever.

Biocuration jobs: (metadata) massage therapist, (data) wrangler, (complex data) modeler
