Oct. 16, 2006 | The Harvard chemistry professor Frank Westheimer once said, "A month in the laboratory can often save an hour in the library." At Biogen Idec, Phoebe Roberts uses the quote as a tagline on her outgoing emails, reminding her colleagues of the immense value of the text-mining group, headed by William Hayes, that now is the biopharma's library.
"Libraries are getting severely curtailed in the pharma industry because they just haven't changed," says Hayes. "Handing out books and documents is nice, but they should be much more aggressive in delivering targeted information in an accessible format for their customers. Almost all our information comes from external sources. It's a weird dichotomy - getting rid of the group that should be transforming drug discovery."
Rainer Fuchs lured Hayes from AstraZeneca (see "Digging into Digital Quarries," Bio-IT World, October 2003, pp 38-42) because "he wasn't happy with the way the library was going," says Hayes, who now heads the Library and Information Services with Roberts and a couple of other staffers. "We rarely get anybody coming in - it's nearly all E-library," says Hayes.
The Biogen Idec library has been transformed into a den of literature informatics and text mining, with Hayes and colleagues not content just to handle complicated queries, but anxious to seek out and tackle questions researchers didn't know could be asked. "We help raise awareness of previously unanswered questions," says Roberts. "We sit in on group meetings - people won't always come to us, so approaching them and saying, 'We can answer this,' is a big deal." In the process, Hayes believes Biogen Idec management can gain a significant competitive advantage, for example by identifying in-licensing opportunities.
Teams such as Hayes' are few and far between in the pharma industry. There are efforts at GSK, Pfizer, and Novartis, which has a sizeable effort dedicated to semantic search and improving internal accessibility. But Hayes says he doesn't know of any libraries that have embarked upon text mining.
Biogen Idec has some 1,500 professionals, including almost 1,000 scientists in R&D. "At the 'guru' level we haven't had much take up, but junior through senior-level scientists have been very interested," says Hayes. The gurus, he says, are local experts who aren't expecting to learn anything new from literature analyses. "But that's who you want perusing the information. It will happen, but probably after everyone else picks it up."
It Takes a Village
"The questions don't really change from company to company, just the ability to answer them," says Hayes. Many research questions tend to be comprehensive queries, such as: What are all the pathogenic processes associated with a particular target? "They're not quick, easy searches. Text mining won't get rid of people, but it makes them roughly tenfold more effective."
Delivering literature informatics requires a variety of skill sets. "It takes a village," Roberts jokes. The work involves maintaining electronic journal subscriptions and ensuring systems stay up, and it calls for personnel with a strong library background as well as domain experts inside the library. At Biogen Idec, that role belongs to Roberts, a molecular biologist previously at Proteome. "Librarians know where information is buried and how to search for it, but don't have critical domain knowledge to evaluate the results," Hayes points out.
In addition, there are technical requirements to keeping the literature informatics group functioning. "None of the tools are plug and play," says Hayes. "Some tools are still at a high informatics expert level. A lot of what we put out is bleeding-edge production."
So what is literature informatics? Hayes puts it succinctly: "Just like every other informatics capability, you have to be able to collect data, organize it and manage it, store it, filter it, analyze it, reduce it to something consumable, and then deliver it." Text mining, he says, is "facts, relationships, entities. Think of it as pulling out concepts that are related."
Hayes' team uses several terabytes of storage to house copies of Medline, the US Patent and Trademark Office (US PTO) database, and BioMed Central. They'll need more to store the full-text archives of NIH's PubMed Central as well as EDGAR (SEC filings). "Since we service more than R&D, we're trying to pull whatever text resources we can," says Hayes.
Search and Query
Hayes' group uses many tools, but leans heavily on QUOSA for full-text search, Linguamatics for text mining, and Inforsense for workflow management.
"A couple of years ago, we did a survey of our researchers, [asking] what hurt the most?" says Roberts. "QUOSA hits on almost all the major pain points - easier access to electronic articles, federated searching, and alerting services. It's a wonderful wedding of familiar interfaces with a lot of extra turbo power behind the scenes."
"We generate sub-corpora from this, pull down the full text, then build the index or use that as a source for Linguamatics or [Inforsense] KDE analyses," says Hayes.
QUOSA, based in Brighton, Mass., was founded in 1996 by a group of Harvard and MIT mathematicians. The name originally stood for Query, Update, Organize, Share, and Analyze. Since Malcolm MacKenzie's arrival in 2000, the company has focused on life sciences. "We don't think of ourselves as a text mining software company," says CEO MacKenzie. "There's the literature retrieval and management aspect; full text search; and text mining."
The biggest concentration of QUOSA users is at the NIH, although 75 percent of the top biopharmas use it. QUOSA offers full-text searching, working with organizations' electronic journal subscriptions [see "Full Text Firewall," page 32]. "Full-text search is something that provides lots of value. Text mining has come a long way but it's still a specialist craft," says MacKenzie.
QUOSA integrates with OVID, Thomson's Web of Knowledge, and PubMed, as well as with organizations' electronic journal subscriptions. Users can then process the retrieved set of articles through other software, such as Linguamatics. "We make grabbing the articles possible, then pass them off to other tools," says MacKenzie. "You need to have the actual article to have the raw material to feed into a text-mining system."
For groups producing predictive toxicology models or searching for information on hepatoinjury, access to the full text is essential. MacKenzie cites the example of a top-ten pharma research team looking for kinetic parameters for proteins in a particular metabolic pathway. Routine PubMed searches produced nothing, but QUOSA retrieved and indexed relevant articles providing parameters that other groups had determined for numerous protein pathways. "This saved them an enormous amount of work," says MacKenzie.
Unlike other systems, which scrape citations and abstracts off the screen, QUOSA retrieves the full PDF. "And we do it across the universe of the life science publishers," says MacKenzie.
Most of the usage - and revenues - within pharma comes from clients wanting to do searches, have their PDF collection indexed, integrated, and passed onto bibliography programs such as EndNote and Reference Manager. MacKenzie mentions an executive director of clinical research who uses QUOSA to pull PDFs. "It's not a text-mining application at all."
Some organizations create virtual libraries to facilitate lab sharing, annotation, and alerts. "There's a copyright issue in stashing articles on a server," says MacKenzie. "We have special architecture to create these virtual libraries in a copyright-compliant manner. It's seamless from the user point-of-view. You can have a browser interface into these library collections - so you don't need QUOSA on every desktop."
The NLP software from Linguamatics performs feature extractions - taking advantage of parts of speech - from the QUOSA results. "Text mining just sets you up to do data mining," says Hayes. "It may be manual data mining or it may be statistical clustering, data classification, figuring out the most interesting/prevalent bits of information."
A major application is co-occurrence. "If two things happen a lot, it's likely to be real. We don't search for a protein and another protein, but a protein and every other protein in the literature, or protein tyrosine kinases and a set of cancer types, to see which are most linked in the literature."
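The co-occurrence idea Hayes describes can be sketched in a few lines of Python. This is only an illustration of the counting principle, not Linguamatics' method or Biogen Idec's workflow: the function names (`mentions`, `cooccurrence_counts`), the term lists, and the three toy abstracts are all invented here, and plain whole-word matching stands in for real entity recognition.

```python
import re
from collections import Counter
from itertools import product


def mentions(term, text):
    """True if term occurs in text as a whole word or phrase, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def cooccurrence_counts(abstracts, group_a, group_b):
    """Count, per (a, b) pair, the abstracts that mention both terms."""
    counts = Counter()
    for text in abstracts:
        found_a = [a for a in group_a if mentions(a, text)]
        found_b = [b for b in group_b if mentions(b, text)]
        # Every pairing found in the same abstract scores one co-occurrence.
        counts.update(product(found_a, found_b))
    return counts


# Toy abstracts written for this sketch; they are not real Medline records.
abstracts = [
    "EGFR is overexpressed in lung cancer and in colorectal cancer.",
    "SRC kinase activity was elevated in colorectal cancer samples.",
    "EGFR mutations predict response in lung cancer.",
]
kinases = ["EGFR", "SRC"]                      # hypothetical term set
cancers = ["lung cancer", "colorectal cancer"]  # hypothetical term set

counts = cooccurrence_counts(abstracts, kinases, cancers)
for pair, n in counts.most_common():
    print(pair, n)
```

Ranking the pairs by count gives a first pass at "which are most linked in the literature"; a production system would presumably also normalize synonyms and correct for how often each term appears on its own.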
Linguamatics was founded in 2001 by four Cambridge University computer scientists, who conceived the company after working on NLP while at SRI. Much of Linguamatics' business is based around customers mining the scientific literature. "But it doesn't have to be the scientific literature," says business development director Phil Hastings. "It can be any documents, there's tons of information out there in an unstructured or partly structured form."
NLP interprets words in free text and can understand meaning and context. Hastings explains: "NLP recognizes nouns, verbs, and so on - what a 'thing' or an action is. It can spot relationships between proteins in a paper in a way a standard keyword search couldn't do." Moreover, NLP can extract factual relationships and present nuggets of information without users having to download the whole document.
"With information extraction tools like Linguamatics, you run a pattern, it creates a database, so you don't have to cut-and-paste," Hayes explains. "What's unique is, Linguamatics is an agile NLP (in a class by itself), as opposed to standard NLP, such as TEMIS or GeneWays."
"With Linguamatics, you're searching for facts and relationships. You can make it highly contextualized, you can look for transcription factors and cofactors related to diabetes. With standard NLPs, you spend a lot of time building a database."
Linguamatics software offers exceptional speed and flexibility. "It combines a traditional text mining approach with a search engine approach," says Hastings. NLP queries over the entire Medline database - some 17 million abstracts - can return answers in a matter of seconds. A sample query might be: "Give me any protein that interacts with protein A in documents that mention a particular type of lung cancer." Or, "Find me people in a particular position in a company." Or, "Find me a sentence that mentions a relationship between a compound and a side effect with the associated dosage level."
Another nifty application of Linguamatics software is to map the co-occurrence of authors from PubMed. Hayes displays a map of authors showing nodes of collaboration and co-authorship. "We want to look for the supernodes to make sure we're tied down with the research community," says Hayes. A "supernode" objectively identifies domain experts. "This person is going to be a good person to know," says Hayes.
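One simple way to picture a "supernode" is as the author with the most distinct co-authors. The Python sketch below assumes that definition, which the article does not spell out; the function names and the placeholder author lists are hypothetical, and real input would come from parsed PubMed author fields.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(papers):
    """Map each author to the number of distinct people they have published with."""
    neighbors = defaultdict(set)
    for authors in papers:
        # Link every pair of authors that share a paper.
        for a, b in combinations(sorted(set(authors)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}


def supernodes(papers, top_n=1):
    """Return the top_n best-connected authors (ties broken alphabetically)."""
    degrees = coauthor_degrees(papers)
    return sorted(degrees, key=lambda a: (-degrees[a], a))[:top_n]


# Placeholder author lists invented for this sketch.
papers = [
    ["Author A", "Author B"],
    ["Author A", "Author C", "Author D"],
    ["Author B", "Author A"],
    ["Author D", "Author E"],
]

print(supernodes(papers, top_n=2))
```

Degree is only the crudest measure of connectedness; a map like the one Hayes displays could equally be built on weighted edges or other centrality scores.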
A frequent next step involves importing the Linguamatics results into a workflow management tool such as Inforsense's KDE. "This is where we can share and develop workflows," says Hayes. Applications include normalization, text mining, clustering, stemming, tagging, filtering, and picking up phrases that are over-represented in the literature.
"It's a whole additional layer of analysis, statistical or otherwise...To make sure we have the most comprehensive search, you do the search, then you iterate, look at what [terms] seems to be critical for that body of literature, and use those for your secondary search terms." Hayes calls it a "no-brainer."
Hayes' team composes a number of filters to treat the datasets, including a comprehensive protein thesaurus to remove gene symbols with only two characters, or names that are purely numbers, or separating "tumor" from "tumor necrosis factor." Such cleaning steps could be done in Perl, "but it takes much less time to do it in a dedicated workflow environment and it's much easier to tweak as we learn."
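The three cleaning rules mentioned above are easy to express in code, whichever environment they end up running in. The Python sketch below is an assumption-laden illustration rather than the team's actual filters: the function names are hypothetical, "only two characters" is read as "two characters or fewer," and longest-match tagging is one plausible way to keep "tumor" apart from "tumor necrosis factor."

```python
import re


def clean_protein_terms(terms):
    """Drop thesaurus entries that are too short or purely numeric."""
    kept = []
    for term in terms:
        stripped = term.strip()
        if len(stripped) <= 2:   # assumption: two characters or fewer is too ambiguous
            continue
        if stripped.isdigit():   # names that are purely numbers
            continue
        kept.append(stripped)
    return kept


def tag_longest(text, thesaurus):
    """Tag thesaurus terms in text, preferring the longest term at each position."""
    ordered = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    # finditer yields non-overlapping matches, so "tumor" inside
    # "tumor necrosis factor" is not tagged a second time.
    return [m.group(0).lower() for m in pattern.finditer(text)]


print(clean_protein_terms(["TNF", "AR", "42", "p53", "T", "1433"]))
print(tag_longest(
    "Tumor necrosis factor levels rose as the tumor grew.",
    ["tumor", "tumor necrosis factor"],
))
```

Keeping each rule as a small, separate step mirrors the point Hayes makes about workflows: individual filters can be adjusted or swapped out as the team learns which ones help.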
Text analytics fits handily into Inforsense's workflow paradigm, says Chief Science Officer Jonathan Sheldon. "Scientists have this common fear that what they're doing has already been done - it's been published! So they need automated methods to enable them to keep up to date."
Inforsense's Textsense has an intuitive interface that enables users to build custom text analysis applications in a drag-and-drop environment. "With the portal technology, you can wrap a complex workflow and deliver it as a simple web application to the rest of the organization - and they need know nothing about the lower level," says Sheldon.
Sheldon says KDE helps span the worlds of bioinformatics, cheminformatics and clinical informatics. "It's a nice way to bring the power of text information into the context of high-throughput approaches such as gene expression and proteomics. We put that text information into the context of the lab experiments."
Other search tools are also gaining popularity with life scientists. SPSS is being used by Michael Liebman (see "Sending Out an SPSS") and also featured in the text-mining studies of Eric Bremer, which earned a Bio-IT World Best Practices award in 2005.
Jerome Pesenti, co-founder of Vivisimo, says his company's Velocity search tool is deployed in almost a dozen top pharmas, who use it for a combination of three things. "They have some internal documents, e.g. Documentum, to search. They also use PubMed on top of collections of scientific content from articles, patents, or maybe genomic data. We also provide federated searching on top of those sources," says Pesenti.
Vivisimo's clustering capability extracts major themes and helps users identify relevant subthemes. An alert function flags new articles, which can be exported into programs such as Reference Manager. Ease of use is helping users conduct their own searches without relying on librarians. "Clustering is a text mining tool for the masses," Pesenti says. "There's no need for training. We're really putting that in the hand of the end users."
With a low-end deployment costing "in the low hundreds of thousands of dollars," Insightful's InFact software is only for the well-heeled. Last year, Insightful inked major deals with the NIH and the US Air Force. The NIH project is to help users access biological information in databases by entering requests in explicit sentences or questions. The Air Force deal is to develop new diagnostic methods for screening personnel exposed to toxic agents in the field, using InFact to link gene expression patterns to different interactions.
Ron Faith, Insightful's InFact VP and general manager, says one (unnamed) pharma company is using InFact for functional genomics and proteomics studies for pathway construction. The text mining results feed into Insightful's data mining products such as S-PLUS and can be combined with other data for a more holistic view. "You can consolidate multiple government databases in InFact and discover correlations in the public domain and internally by your own R&D teams," says Faith. InFact is also valuable for competitive analyses. "We can look at internal research reports and published research and patent filings, create a heat map of diseases or areas of the body along with companies working in that area," says Faith.
Nervana's founder, CEO and Chairman, Nosa Omoigui, was frustrated in his previous job at Microsoft. "I never knew what was going on. We would find out about projects after they were canned, executives doing the same project as other divisions. The problems kept getting worse. It was impossible to keep up."
It took three years to get his text analytics technology baked and go to market. Nervana's niche is what Omoigui calls "the semantic processing of data." Major clients include MGI Pharmaceuticals, Cubist, and a pilot with Lilly. "Research productivity is at an all time low," says Omoigui. "Lots of data is still fragmented physically and semantically."
Nervana can offer an ASP service, federating public sources with internal data, or place its software behind a big pharma's firewall. Nervana will also host pharma ontologies that only company scientists can access, as well as importing external ontologies such as BioWisdom. "With Nervana, I can cross match this query with a different ontology to get multiple perspectives," Omoigui says.
Omoigui demonstrates by tracking "muscular diseases" on PubMed, and scientific literature, news stories, and scientific intelligence on the internet. Questions can be asked with different levels of granularity and emphasis. Results are returned and grouped contextually - by breaking news, headlines, best bets, and so on. A semantic search for "bone diseases" and a specific chemical on Google might yield a solitary hit, but Nervana returns 170 unique results.
A prime asset is Nervana's "drag-and-drop" capability, whereby entire documents become the search query. "This took three years to invent, it's the most difficult thing we've accomplished. No-one else has anything like this," says Omoigui. Examples might include patient health records, press releases, or patent documents. "Imagine being able to drag and drop an earnings report, a business plan, etc. with results streaming live across the desktop," gushes Omoigui.
Spreading the Word
Back at Biogen Idec, Hayes' team recognizes that part of its job is to "keep the trains running on time," as Roberts put it. His team is also trying to forge a collaboration to put a text-mining thesaurus management tool in the open where the entire community can curate it. "We'd be happy to deposit several of our lexicons for that purpose," says Hayes.
But his goals for Biogen Idec are much more ambitious. "The ideal is that there's a literature informatics researcher on every drug project team - that's the appropriate model." That model will make text mining an integral part of the drug development pipeline. In the near future, Hayes hopes researchers will approach his group with the following demands: "I want to know all the pathological processes associated with my target from the get-go. I want to know all the potential diseases that might give me alternative indications. I want to know all the right people to talk to. And if there are some in-licensing opportunities, why shouldn't I know it at the beginning of the project? They're just too resource-intensive to do without text mining technologies."
SIDEBAR: The Case for Curation
Ingenuity Systems founder and CEO Jake Leschly doesn't have anything against natural language processing (NLP) - he just thinks manual curation is better, and the tenfold increase in peer-reviewed publications using Ingenuity's Pathway Analysis (IPA) tool in the past two years supports that contention.
"Text mining is just taking a different approach where breadth is more important," says Leschly. "It'll bring a lot of stuff back that isn't necessarily structured and is probably inferior to manual curation. But for certain purposes, that may be OK."
"Our customers are thinking in terms of biology, not documents or publications," says Ingenuity's Ramon Felciano, CTO and VP of Research. "It's a fundamental difference compared to text mining approaches, which help you get to the right document."
Ingenuity has mined the biomedical literature since the late 1990s using a global team of Ph.D. biologist curators, who siphon data from the premier science and medical journals using web-based ontology tools. "They're doing it 24 hours a day. They do it from full text, figure captions, diagrams, tables, etc. Many automated approaches struggle from this because they can't read a diagram," says Felciano.
IPA delivers content from some 250 journals, including 40 core journals, although customers can request certain journals be included. "We'll only model findings from these publications if they're supported by the experimental evidence," says Felciano. The curated data range from molecular biology relationships to information tying drugs to their molecular targets, both approved and investigational.
Whereas NLP might identify a binary relationship, Felciano says that, "Our curators will have 20-50 concepts associated with those findings - wild type or mutant, which cell type, was anything done to the cell, was a chemical added, which species? That 'biological context' is absolutely critical if you want to look at that finding and decide, Is this finding relevant to the system that I'm studying?"
To the Contrary
But Windber Institute's Michael Liebman says that despite observing manual curation in several settings, he has formed an "admitted bias" towards NLP. "To truly develop an objective and extensible approach, one needs to use machine-based learning or NLP," Liebman says. "Regardless of training and the use of standards, manual curation cannot achieve the same level of objectivity, and potentially more important, if we are trying to identify new knowledge, the use of existing knowledge to create the standards for manual curation cannot keep up with the changes that pervade the biological and clinical sciences."
QUOSA's MacKenzie says it would be "very interesting to run the [NLP vs curation] bake-off." He says text mining comes into its own when looking broadly for liver injury or adverse events. "It's not cost effective to boil the whole ocean using the human curation approach," says MacKenzie.
In fact, Phil Hastings of Linguamatics says his company has done case studies with a customer "looking for protein interactions, comparing a manually curated database and text mining of Medline. A team of humans can never read all the articles. Text mining systematically revealed a number of relationships that weren't in the commercial database." K.D.
Email Kevin Davies.