Extracting Information in Life Sciences



By Kevin Davies

Through the BioCreative competition, MITRE's Lynette Hirschman helps raise the performance and relevance of academic text mining solutions.

Dec. 17, 2007 | The MITRE Corporation is a non-profit, public-interest organization that manages three federally funded research and development centers (FFRDCs), and as part of its mission, conducts its own R&D programs. One FFRDC supports the Federal Aviation Administration, another the Internal Revenue Service. The third, and oldest, supports the Department of Defense (known as the DOD Command, Control, Communications and Intelligence FFRDC). MITRE's main sites are in Virginia and Massachusetts, with smaller sites colocated with key sponsors.

"We do not make products, we cannot compete, because that would interfere with our primary mission which is to provide unbiased advice to the government. We are a very different kind of entity," explains MITRE's Lynette Hirschman.

Hirschman joined MITRE in 1993 and works at the Center for Intelligent Information Systems (CIIS), part of the DOD FFRDC. She admits to having a "strange background": an undergraduate degree in chemistry, a Master's in German literature, and a Ph.D. in computational linguistics. She joined MITRE after stints at Unisys, NYU, and MIT, working primarily in language understanding.

Since 2000, Hirschman's interests have hovered "increasingly on the intersection of natural language and bioinformatics."

"Biology is a great place to apply information extraction - the ability to pull out facts from free text material - because unlike computer scientists, biologists actually read the literature!" says Hirschman. "They're very dependent on knowing who's doing what and what the state of the art is. If you can give them tools to make that easier, then you don't have to have post docs spend their first six months extracting information from the literature on their projects."

The other advantage, she says, is the rise of genomics: "Biology is a language; it's the language of DNA and protein. These are linear strings of molecules, they encode information. That is the principle you can apply to determine where to find genes and find repeating patterns - extremely similar to the techniques used in modern natural language processing."

Creative Vision
Today, Hirschman leads MITRE's efforts to bring its know-how in information management and data mining to the life sciences community, developing tools to help researchers manage databases through text mining of the scientific literature. "MITRE tends to work at the infrastructure levels, we are typically system engineers and chief architects on large projects. On the research side, we tend to provide an architecture or evaluation," she says. "You get to leverage everybody else's progress, you get to watch and see what works... putting out the milestones for the researchers and reaping the results for the end users."

To that end, in 2003 Hirschman launched a community challenge cup for the systematic evaluation of biological text mining solutions. Her inspiration was hearing a report on CASP (the structural biology competition to evaluate methods of protein structure prediction) at a conference in 2001. Text mining systems were not directly comparable at that point, because they had not been trained on the same data sets or applied to real biological applications.

In collaboration with Spain's Alfonso Valencia and Christian Blaschke (now at bioalma), Hirschman created BioCreative: Critical Assessment of Information Extraction for Biology (with funding from the National Science Foundation). The first BioCreative workshop, in 2004 in Madrid, attracted 27 groups from 10 countries, including a few private companies, applying text mining tools to real biological challenges. "We tried to frame this as a collective endeavor to do well, as opposed to [focus on] winning," says Hirschman.

The groups could choose any or all of three tasks:

  1. Extracting gene/protein names from text;
  2. Translating those names into standardized gene identifiers for inclusion in three model organism databases (the most successful systems extracted gene names from Medline abstracts with about 80-90 percent accuracy);
  3. Finding evidence for Gene Ontology (GO) annotations for protein function, biological process, and localizations (this task was the hardest, at 20-30 percent accuracy, although follow-up results have been encouraging).
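
To make the first two tasks concrete, here is a minimal sketch, not any BioCreative entrant's actual method. It assumes a tiny hand-built synonym table (the `SYNONYMS` dictionary and its `GENE:` identifiers are made up for illustration) and uses plain dictionary lookup, which is far simpler than the systems that competed.

```python
import re

# Hypothetical synonym table: lowercase surface form -> made-up identifier.
# Real systems draw on the model organism databases' own gene lists.
SYNONYMS = {
    "p53": "GENE:0001",
    "tp53": "GENE:0001",
    "tumor protein 53": "GENE:0001",
    "brca1": "GENE:0002",
}


def find_gene_mentions(text, lexicon=SYNONYMS):
    """Task 1 (gene mention): return (start, end, surface) for each match."""
    mentions = []
    for term in lexicon:
        # Whole-word, case-insensitive match of each known name.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((m.start(), m.end(), m.group(0)))
    return sorted(mentions)


def normalize(mentions, lexicon=SYNONYMS):
    """Task 2 (gene normalization): map surface forms to identifiers."""
    return {lexicon[surface.lower()] for _, _, surface in mentions}


if __name__ == "__main__":
    abstract = "Mutations in TP53 and BRCA1 were reported; p53 loss was common."
    found = find_gene_mentions(abstract)
    print([surface for _, _, surface in found])
    print(sorted(normalize(found)))
```

Two different spellings (TP53 and p53) collapse to one identifier, which is the point of the normalization step. A lookup table like this breaks down on ambiguous or novel names, which is part of why the task is hard.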

In September 2007, Hirschman presented the results of BioCreative II (see sidebar: Top Systems at BioCreative II) at the Fraunhofer Text Mining Symposium in Germany (see sidebar: The Language of Text Mining). The first two tasks in the second workshop were the same; the third was reproducing the steps of a biological database curation pipeline for protein-protein interactions, working with two databases, MINT (University of Rome) and IntAct (European Bioinformatics Institute).

The benefits of BioCreative include fostering collaborations and a detailed comparison of approaches. Says Hirschman: "One of the very interesting results in the gene mention and gene normalization tasks was if you pool the results from multiple systems, you got better results than from any one system." Even low-performing systems occasionally found gene mentions that higher-ranking systems didn't find, boosting the tally to approximately 90 percent.
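
The pooling effect Hirschman describes can be illustrated with a small sketch. The gene names, the three system outputs, and the helper names `pool_mentions` and `recall` below are all invented for illustration; the actual BioCreative pooling experiments may have combined systems differently.

```python
from collections import Counter


def pool_mentions(system_outputs, min_votes=1):
    """Combine the outputs of several systems.

    min_votes=1 takes the union; higher values keep only mentions
    that at least that many systems agree on.
    """
    votes = Counter()
    for mentions in system_outputs:
        votes.update(set(mentions))  # each system votes once per mention
    return {m for m, n in votes.items() if n >= min_votes}


def recall(predicted, gold):
    """Fraction of the gold-standard mentions that were found."""
    return len(predicted & gold) / len(gold) if gold else 0.0


if __name__ == "__main__":
    # Toy gold standard and three toy systems (illustrative data only).
    gold = {"BRCA1", "TP53", "MDM2", "EGFR"}
    systems = [
        {"BRCA1", "TP53"},   # system A
        {"TP53", "MDM2"},    # system B
        {"EGFR", "XYZ"},     # system C: weakest, but alone finds EGFR
    ]
    for name, output in zip("ABC", systems):
        print(name, recall(output, gold))
    print("pooled", recall(pool_mentions(systems), gold))
```

In this toy case the weakest system contributes the one mention the others miss, so the pooled set covers more of the gold standard than any single system. The union also admits every system's false positives (here, "XYZ"), so a voting threshold such as `min_votes=2` trades some of that coverage for precision.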

In another lesson from CASP, Valencia and Hirschman are trying to make text-mining tools more widely available to biologists, and easier to insert into data pipelines. That's not a trivial undertaking, particularly in biopharma organizations (see Commercial/Open Source Pros and Cons).

Hirschman chuckles when asked about the poor participation of commercial text mining providers in BioCreative. "They have not expressed interest at this point," she says. "There is limited financial incentive unless they're sure they will do well." But a few companies including bioalma did participate in the most recent contest.

Poised for Success
Hirschman believes that "text mining is poised to become an important tool," although she acknowledges that "the uses of text mining on the pharma side are much bigger than the research side." She was particularly impressed by a Fraunhofer symposium presentation by Novartis' Thérèse Vachon on deploying a powerful system able to index and text-mine various data (see Novartis' Answer to Harry Potter, Bio-IT World, Dec 2006).

"On the research side, especially if you're dealing with database curation, the information has to be really correct if it's going to be used and reused," says Hirschman. "On the pharma side, getting partial information is OK. If you can get it cheaply, that's good. So text mining tools in some senses may be a better match for certain kinds of pharma applications than they are for database curation applications."

Hirschman continues: "Pharma is ahead of the research community, and the reason why academic groups don't use commercial tools is they're very, very expensive. They've got a good market in pharma. We need to figure out how to do a better job of leveraging the progress on the commercial side (and the commercial tools that are available), to help the research community and the publicly funded databases - which are heavily used by pharma."

"The NIH needs to worry about this [disparity]," says Hirschman. "Expert curated biological databases provide a critical resource - NIH needs to invest in affordable tools to cut the cost of curation."