Through the BioCreative competition, MITRE's Lynette Hirschman helps raise the performance and relevance of academic text mining solutions.
Dec. 17, 2007 | The MITRE Corporation is a non-profit, public-interest organization that manages three federally funded research and development centers (FFRDCs), and as part of its mission, conducts its own R&D programs. One FFRDC supports the Federal Aviation Administration, another the Internal Revenue Service. The third, and oldest, supports the Department of Defense (known as the DOD Command, Control, Communications and Intelligence FFRDC). MITRE's main sites are in Virginia and Massachusetts, with smaller sites colocated with key sponsors.
"We do not make products, we cannot compete, because that would interfere with our primary mission which is to provide unbiased advice to the government. We are a very different kind of entity," explains MITRE's Lynette Hirschman.
Hirschman joined MITRE in 1993, and works at the Center for Intelligent Information Systems (CIIS), part of the DOD FFRDC. She admits to having a "strange background": an undergraduate degree in chemistry, a Master's in German literature, and a Ph.D. in computational linguistics. She joined MITRE after stints at Unisys, NYU, and MIT, working primarily in language understanding.
Since 2000, Hirschman's interests have hovered "increasingly on the intersection of natural language and bioinformatics."
"Biology is a great place to apply information extraction - the ability to pull out facts from free text material - because unlike computer scientists, biologists actually read the literature!" says Hirschman. "They're very dependent on knowing who's doing what and what the state of the art is. If you can give them tools to make that easier, then you don't have to have post docs spend their first six months extracting information from the literature on their projects."
The other advantage, she says, is the rise of genomics: "Biology is a language; it's the language of DNA and protein. These are linear strings of molecules, they encode information. That is the principle you can apply to determine where to find genes and find repeating patterns - extremely similar to the techniques used in modern natural language processing."
Today, Hirschman leads MITRE's efforts to bring its know-how in information management and data mining to the life sciences community, developing tools to help researchers manage databases through text mining of the scientific literature. "MITRE tends to work at the infrastructure levels, we are typically system engineers and chief architects on large projects. On the research side, we tend to provide an architecture or evaluation," she says. "You get to leverage everybody else's progress, you get to watch and see what works... putting out the milestones for the researchers and reaping the results for the end users."
To that end, in 2003 Hirschman launched a community challenge cup for the systematic evaluation of biological text mining solutions. Her inspiration was hearing a report on CASP (the structural biology competition to evaluate methods of protein structure prediction) at a conference in 2001. Text mining systems were not directly comparable at that point, because they had not been trained on the same data sets or applied to real biological applications.
In collaboration with Spain's Alfonso Valencia and Christian Blaschke (now at bioalma), Hirschman created BioCreative: Critical Assessment of Information Extraction for Biology (with funding from the National Science Foundation). The first BioCreative workshop, in 2004 in Madrid, attracted 27 groups from 10 countries, including a few private companies, applying text mining tools to real biological challenges. "We tried to frame this as a collective endeavor to do well, as opposed to [focus on] winning," says Hirschman.
The groups could choose any or all of three tasks:
- Extracting gene/protein names from text;
- Translating those names into standardized gene identifiers for inclusion in three model organism databases (the most successful systems extracted gene names from Medline abstracts with about 80-90 percent accuracy);
- Finding evidence for Gene Ontology (GO) annotations for protein function, biological process, and localizations (this task was the hardest, at 20-30 percent accuracy, although follow-up results have been encouraging).
In September 2007, Hirschman presented the results of BioCreative II (see sidebar: Top Systems at BioCreative II) at the Fraunhofer Text Mining Symposium in Germany (see sidebar: The Language of Text Mining). The first two tasks in the second workshop were the same as in the first; the third was reproducing the steps of a biological database curation pipeline for protein-protein interactions, working with two databases, MINT (University of Rome) and IntAct (European Bioinformatics Institute).
The benefits of BioCreative include fostering collaborations and a detailed comparison of approaches. Says Hirschman: "One of the very interesting results in the gene mention and gene normalization tasks was if you pool the results from multiple systems, you got better results than from any one system." Even low-performing systems occasionally found gene mentions that higher-ranking systems didn't find, boosting the tally to approximately 90 percent.
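The pooling effect Hirschman describes can be illustrated with a small sketch. This is not the BioCreative evaluation code; the gene lists, system names, and helper functions below are invented for illustration, and real gene-mention scoring works on text spans rather than bare names. Under those assumptions, the sketch shows how taking the union of several systems' outputs can recover mentions that no single system finds, and how a stricter voting rule trades that coverage for fewer false hits.

```python
from collections import Counter


def pool_union(system_outputs):
    """Return every mention reported by at least one system."""
    pooled = set()
    for mentions in system_outputs:
        pooled |= set(mentions)
    return pooled


def pool_vote(system_outputs, min_votes):
    """Return mentions reported by at least `min_votes` systems."""
    counts = Counter(m for mentions in system_outputs for m in set(mentions))
    return {m for m, n in counts.items() if n >= min_votes}


def recall(predicted, gold):
    """Fraction of gold-standard mentions that were found."""
    return len(set(predicted) & set(gold)) / len(gold)


def precision(predicted, gold):
    """Fraction of predicted mentions that are in the gold standard."""
    predicted = set(predicted)
    return len(predicted & set(gold)) / len(predicted) if predicted else 0.0


# Toy data, invented for illustration only.
gold = {"BRCA1", "TP53", "EGFR", "MYC", "KRAS"}
system_a = {"BRCA1", "TP53", "EGFR", "cell"}   # higher-ranking system
system_b = {"BRCA1", "TP53", "MYC"}
system_c = {"KRAS", "BRCA1", "protein"}        # weaker, but alone in finding KRAS
systems = [system_a, system_b, system_c]

for name, output in zip("ABC", systems):
    print(f"system {name}: recall={recall(output, gold):.2f} "
          f"precision={precision(output, gold):.2f}")

union = pool_union(systems)
voted = pool_vote(systems, min_votes=2)
print(f"union:  recall={recall(union, gold):.2f} "
      f"precision={precision(union, gold):.2f}")
print(f"vote>=2: recall={recall(voted, gold):.2f} "
      f"precision={precision(voted, gold):.2f}")
```

In this toy setup the weakest system is the only one that finds one of the gold-standard genes, so the union covers more of the gold standard than any individual system does. The cost is that the union also inherits every system's false hits, which is why a voting threshold is sometimes used as a compromise.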
In another lesson from CASP, Valencia and Hirschman are trying to make text-mining tools more widely available to biologists, and easier to insert into data pipelines. That's not a trivial undertaking, particularly in biopharma organizations (see Commercial/Open Source Pros and Cons).
Hirschman chuckles when asked about the poor participation of commercial text mining providers in BioCreative. "They have not expressed interest at this point," she says. "There is limited financial incentive unless they're sure they will do well." But a few companies including bioalma did participate in the most recent contest.
Poised for Success
Hirschman believes that "text mining is poised to become an important tool," although she acknowledges that "the uses of text mining on the pharma side are much bigger than the research side." She was particularly impressed by a Fraunhofer symposium presentation from Novartis' Thérèse Vachon on deploying a powerful system able to index and text mine various data (see Novartis' Answer to Harry Potter, Bio-IT World, Dec 2006).
"On the research side, especially if you're dealing with database curation, the information has to be really correct if it's going to be used and reused," says Hirschman. "On the pharma side, getting partial information is OK. If you can get it cheaply, that's good. So text mining tools in some senses may be a better match for certain kinds of pharma applications than they are for database curation applications."
Hirschman continues: "Pharma is ahead of the research community, and the reason why academic groups don't use commercial tools is they're very, very expensive. They've got a good market in pharma. We need to figure out how to do a better job of leveraging the progress on the commercial side (and the commercial tools that are available), to help the research community and the publicly funded databases - which are heavily used by pharma."
"The NIH needs to worry about this [disparity]," says Hirschman. "Expert curated biological databases provide a critical resource - NIH needs to invest in affordable tools to cut the cost of curation."