Extracting Information in Life Sciences


By Kevin Davies

Through the BioCreative competition, MITRE's Lynette Hirschman helps raise the performance and relevance of academic text mining solutions.

Dec. 17, 2007 | The MITRE Corporation is a non-profit, public-interest organization that manages three federally funded research and development centers (FFRDCs), and as part of its mission, conducts its own R&D programs. One FFRDC supports the Federal Aviation Administration, another the Internal Revenue Service. The third, and oldest, supports the Department of Defense (known as the DOD Command, Control, Communications and Intelligence FFRDC). MITRE's main sites are in Virginia and Massachusetts, with smaller sites colocated with key sponsors.

"We do not make products, we cannot compete, because that would interfere with our primary mission which is to provide unbiased advice to the government. We are a very different kind of entity," explains MITRE's Lynette Hirschman.

Hirschman joined MITRE in 1993, and works at the Center for Intelligent Information Systems (CIIS), part of the DOD FFRDC. She admits to having a "strange background": an undergraduate degree in chemistry, a Master's in German literature, and a Ph.D. in computational linguistics. She joined MITRE after stints at Unisys, NYU, and MIT, working primarily in language understanding.

Since 2000, Hirschman's interests have hovered "increasingly on the intersection of natural language and bioinformatics."

"Biology is a great place to apply information extraction - the ability to pull out facts from free text material - because unlike computer scientists, biologists actually read the literature!" says Hirschman. "They're very dependent on knowing who's doing what and what the state of the art is. If you can give them tools to make that easier, then you don't have to have post docs spend their first six months extracting information from the literature on their projects."

The other advantage, she says, is the rise of genomics: "Biology is a language; it's the language of DNA and protein. These are linear strings of molecules, they encode information. That is the principle you can apply to determine where to find genes and find repeating patterns - extremely similar to the techniques used in modern natural language processing."

Creative Vision
Today, Hirschman leads MITRE's efforts to bring its know-how in information management and data mining to the life sciences community, developing tools to help researchers manage databases through text mining of the scientific literature. "MITRE tends to work at the infrastructure levels, we are typically system engineers and chief architects on large projects. On the research side, we tend to provide an architecture or evaluation," she says. "You get to leverage everybody else's progress, you get to watch and see what works... putting out the milestones for the researchers and reaping the results for the end users."

To that end, in 2003 Hirschman launched a community challenge cup for the systematic evaluation of biological text mining solutions. Her inspiration was hearing a report on CASP (the structural biology competition to evaluate methods of protein structure prediction) at a conference in 2001. Text mining systems were not directly comparable at that point, because they had not been trained on the same data sets or applied to real biological applications.

In collaboration with Spain's Alfonso Valencia and Christian Blaschke (now at bioalma), Hirschman created BioCreative: Critical Assessment of Information Extraction for Biology (with funding from the National Science Foundation). The first BioCreative workshop, in 2004 in Madrid, attracted 27 groups from 10 countries, including a few private companies, applying text mining tools to real biological challenges. "We tried to frame this as a collective endeavor to do well, as opposed to [focus on] winning," says Hirschman.

The groups could choose any or all of three tasks:

  1. Extracting gene/protein names from text;
  2. Translating those names into standardized gene identifiers for inclusion in three model organism databases (the most successful systems extracted gene names from Medline abstracts with about 80-90 percent accuracy);
  3. Finding evidence for Gene Ontology (GO) annotations for protein function, biological process, and localizations (this task was the hardest, at 20-30 percent accuracy, although follow-up results have been encouraging).
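To make the first two tasks concrete, here is a minimal Python sketch of gene mention extraction followed by gene normalization. It is an illustration under stated assumptions, not any BioCreative entrant's method: the lexicon, the `TOY:` identifiers, the function names, and the sample sentence are all invented for this example, and real systems rely on far richer resources than a lookup table.

```python
import re

# Toy synonym lexicon: lowercase surface form -> made-up identifier.
# The "TOY:" identifiers are hypothetical and come from no real database.
LEXICON = {
    "tp53": "TOY:0001",
    "p53": "TOY:0001",   # synonym of TP53, so it shares the identifier
    "brca1": "TOY:0002",
}

# Tokens are runs of letters, digits, and hyphens.
TOKEN = re.compile(r"[A-Za-z0-9-]+")


def find_gene_mentions(text):
    """Task 1 (gene mention): return (start, end, surface) for known names."""
    return [
        (m.start(), m.end(), m.group())
        for m in TOKEN.finditer(text)
        if m.group().lower() in LEXICON
    ]


def normalize(mentions):
    """Task 2 (gene normalization): map mentions to a set of identifiers."""
    return {LEXICON[surface.lower()] for _, _, surface in mentions}


# Invented sample sentence standing in for a Medline abstract.
abstract = "We examined TP53 (also called p53) and BRCA1 in this study."
mentions = find_gene_mentions(abstract)
gene_ids = normalize(mentions)

for start, end, surface in mentions:
    print(f"{surface!r} at {start}-{end}")
print(sorted(gene_ids))
```

The sketch finds three mentions but only two identifiers, because "TP53" and "p53" are synonyms for the same gene. Collapsing such variants is the point of the normalization task.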

In September 2007, Hirschman presented the results of BioCreative II (see sidebar: Top Systems at BioCreative II) at the Fraunhofer Text Mining Symposium in Germany (see sidebar: The Language of Text Mining). The first two tasks in the second workshop were unchanged from the first; the third involved reproducing the steps of a biological database curation pipeline for protein-protein interactions, working with two databases, MINT (University of Rome) and IntAct (European Bioinformatics Institute).

The benefits of BioCreative include fostering collaborations and a detailed comparison of approaches. Says Hirschman: "One of the very interesting results in the gene mention and gene normalization tasks was if you pool the results from multiple systems, you got better results than from any one system." Even low-performing systems occasionally found gene mentions that higher-ranking systems didn't find, boosting the tally to approximately 90 percent.
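The pooling effect Hirschman describes can be sketched in a few lines of Python. The article does not say how the BioCreative organizers combined system outputs, so the voting scheme below is an assumption, and the gold standard and the three system outputs are invented toy data chosen only to show the mechanism.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def pool_by_votes(system_outputs, min_votes=1):
    """Keep every mention reported by at least min_votes systems."""
    votes = Counter(m for output in system_outputs for m in set(output))
    return {mention for mention, n in votes.items() if n >= min_votes}


# Invented toy data: four true gene mentions and three imperfect systems.
gold = {"geneA", "geneB", "geneC", "geneD"}
systems = [
    {"geneA", "geneB", "geneX"},  # finds two genes, plus one false positive
    {"geneA", "geneC"},           # finds two genes
    {"geneD", "geneX"},           # weak, but the only system to find geneD
]

union = pool_by_votes(systems, min_votes=1)     # anything any system found
majority = pool_by_votes(systems, min_votes=2)  # needs two systems to agree

for name, predicted in [("union", union), ("majority", majority)]:
    p, r = precision_recall(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy data no single system recalls more than half of the gold mentions, yet the union recalls all of them because the weakest system contributes the one mention the others miss. Raising `min_votes` trades that recall for fewer pooled false positives, so the threshold is a design choice rather than a fixed rule.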

In another lesson from CASP, Valencia and Hirschman are trying to make text-mining tools more widely available to biologists, and easier to insert into data pipelines. That's not a trivial undertaking, particularly in biopharma organizations (see Commercial/Open Source Pros and Cons).

Hirschman chuckles when asked about the poor participation of commercial text mining providers in BioCreative. "They have not expressed interest at this point," she says. "There is limited financial incentive unless they're sure they will do well." But a few companies including bioalma did participate in the most recent contest.

Poised for Success
Hirschman believes that "text mining is poised to become an important tool," although she acknowledges that, "The uses of text mining on the pharma side are much bigger than the research side." She was particularly impressed by a Fraunhofer symposium presentation from Novartis' Thérèse Vachon, deploying a powerful system able to index and text mine various data (see Novartis' Answer to Harry Potter, Bio-IT World, Dec 2006).

"On the research side, especially if you're dealing with database curation, the information has to be really correct if it's going to be used and reused," says Hirschman. "On the pharma side, getting partial information is OK. If you can get it cheaply, that's good. So text mining tools in some senses may be a better match for certain kinds of pharma applications than they are for database curation applications."

Hirschman continues: "Pharma is ahead of the research community, and the reason why academic groups don't use commercial tools is they're very, very expensive. They've got a good market in pharma. We need to figure out how to do a better job of leveraging the progress on the commercial side (and the commercial tools that are available), to help the research community and the publicly funded databases - which are heavily used by pharma."

"The NIH needs to be worried about this [disparity]," says Hirschman. "Expert curated biological databases provide a critical resource - NIH needs to invest in affordable tools to cut the cost of curation."