Dec. 17, 2007
"Commercial [text mining] software is usually easier to use, often easier to integrate with other tools, and generally has someone to call for support issues," says Biogen Idec's William Hayes. "Academic software, on the other hand, is often leading edge but very difficult to use in a lot of cases, with limited recourse regarding bug fixes other than doing it yourself."
Academic software can be targeted to needs that are not commercially viable, and is "not often designed to interoperate with other tools," says Hayes. Of course, academic software is usually open-source, which allows for customized fixes and changes. This leads to the standard bio-IT question: pay upfront in commercial licensing fees, or pay for more staff to do custom development with academic tools?
Another issue, says Hayes, is scalability. "It can be very difficult to get the commercial solutions to scale, much less academic solutions. Many of the academic solutions I've seen are not designed for scalability but only to run against a research data set."
In competitions such as BioCreative, strong results require significant tuning to a specific use case. "A commercial entity could get bragging rights if they came in first," says Hayes, "but it doesn't result in a strong competitive edge."
Most commercial products stress "features, functionality, and subjective performance" over accuracy, says Hayes. "I'm okay with that, because accuracy is incredibly dependent on the corpus (document collection) and what one is trying to extract from the literature. It's often more the accessibility of the information we are after - interface design coupled with reasonable accuracy without requiring a degree in computational linguistics to use it - and whether, with limited resources, we can get and keep an application up and running."
Nevertheless, Hayes says several academic/open-source technologies are either integrated with commercial applications for text analytics at Biogen Idec or under consideration. These include:
- Cytoscape for network/graph visualization of extracted information;
- Exhibit (Semantic Web Simile project) for visualizing integrated competitive intelligence data mined from the web and databases;
- iHOP for protein co-occurrence analyses (Hayes says he's also interested in a commercial application, BioVista, which compares billions of co-occurrence relations between proteins, drugs, methods, cell lines, companies, adverse events, etc. from Medline, full-text journals, and patents);
- Protege for collaborative thesauri management;
- ABNER (A Biomedical Named Entity Recognizer).
"We'd like to use quite a few academic tools we read about, we just don't have enough bandwidth to integrate them," says Hayes. "The biomedical text mining community needs to encourage more interoperability of tools and applications." He points to the excellent example of the Jena University Language and Information Engineering Lab (JULIE Lab; http://www.julielab.de), which wraps its tools in IBM's open-sourced UIMA text analytics framework for interoperability.
Hayes appreciates the trend of leading academic work that migrates into the commercial space, "giving us the best of both worlds."