Transcending first-generation text-mining tools, life scientists with industrial-strength software are letting computers discover unexpected connections in the scientific literature. Get ready for bibliomics.

By Mark D. Uehling

October 10, 2003 |  So you think you've found a cure for psoriasis. But first you need to check 12 million journal article abstracts on the National Library of Medicine's PubMed database, piling up at the rate of 40,000 new citations a month from 4,600 journals. You missed the latest issue of the Journal of Investigative Dermatology? PubMed's Medical Subject Headings (MeSH) vocabulary might help — a little. It has 300,000 synonyms for 19,000 basic medical terms. But if you type in "epidermopoiesis," a key concept in the MeSH entry for psoriasis, you will find ... nothing.

Can software put an end to tortured searching? Researchers and vendors say text mining in the life sciences is on the verge of a long-sought dream: distilling oceans of inchoate data into insights and hypotheses. An ordinary scientist should be able to extract exactly the right dipper out of that sea without requiring a doctorate in library science or years to pore over all 18,000 "psoriasis" hits retrieved from PubMed.

Some call this burgeoning field "bibliomics." Under the hood, the frontier of text mining relies on well-established algorithms and retrieval strategies from computer-science disciplines such as information extraction, text categorization, and natural language processing. Scientifically, the field intersects everything from semantic concerns about alternative gene names to the maturity of systems biology.

While that shakes out, customers in industry are paying just to have problems solved. "We're in the business of bridging the gap between information and action," says Barak Pridor, CEO of ClearForest, which partnered with Celera Genomics to win the KDD Cup at the 2002 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. That competition required entrants to build a system for automatic analysis of scientific papers on the fruit fly.

Pridor believes that users in a variety of industries are fed up with the difficulty of finding what they need. "There is no single person who can read the entire literature, analyze the entire literature, and make all of the possible connections within the different pieces of literature," he says. "What people really require is synthesized information."

Like many companies working with Himalayas of text, Pridor's company has identified the life sciences as a key vertical. The rule of thumb is that in most companies, just 20 percent of all the data is in a structured format like a database. The rest, basically, is a pile of Word files stashed on a thousand desktops and servers all over God's creation. Says Pridor: "Just mining for gold is not sufficient. You also want to turn that gold into something that has utility for the end-user." His company takes text and applies "intelligent, hybrid tagging," converting the documents into XML. A license for the server to do the job is in the $100,000 to $200,000 range. Pridor says the multiplying document management solutions in the life sciences are part of the problem. Extracting value from information in Documentum, say, is another matter within pharma. Says Pridor: "They implemented these sophisticated document management systems to capture a lot of the content, and realized that except for putting it into a vault, and versioning it, and getting access to it, they were not exploiting the strategic value that is hidden inside."

On a more profound level, it's not only that computers and laboratories have produced too much information to absorb. It's that scientists have not sufficiently appreciated the ability of computers to serve as digital postdocs or medical residents.

Automatic Connections 
Christian Blaschke and Alfonso Valencia, for example, wondered if a computer program called Suiseki could automatically troll through published abstracts on the topic of protein-protein interactions. Their program, developed in the protein design group at Universidad Autonoma in Madrid, analyzed some 44,000 PubMed abstracts. Most of the interactions the computer detected were already known. But a few were brand-new, and one, between two transcription factors and the TATA box (a promoter sequence), was later verified experimentally.

Another team, in Norway, created a program called PubGene. Eivind Hovig and colleagues at the Norwegian University of Science and Technology took aim at 10 million PubMed records and produced a computer-generated gene-to-gene citation network for 13,712 genes. The study was published in the May 2001 issue of Nature Genetics. Hovig, a molecular geneticist, developed the software after spending three weeks sifting through 6,000 Medline abstracts for "metast*", a task he never wanted to repeat.
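
For readers who want to see what a gene-to-gene citation network looks like in practice, here is a minimal Python sketch. It is not PubGene's code. It assumes that links are drawn by co-occurrence (two genes are connected when the same abstract mentions both) and that each abstract has already been reduced to a set of gene symbols. The four toy abstracts and the function names `build_cocitation_network` and `neighbors` are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy data (an assumption): each abstract reduced to the gene symbols it mentions.
abstracts = [
    {"TP53", "MDM2"},
    {"TP53", "MDM2", "CDKN1A"},
    {"BRCA1", "TP53"},
    {"BRCA1", "BRCA2"},
]

def build_cocitation_network(gene_sets):
    """Count, for every gene pair, how many abstracts mention both genes."""
    edges = Counter()
    for genes in gene_sets:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(genes), 2):
            edges[(a, b)] += 1
    return edges

def neighbors(edges, gene, min_count=1):
    """Return genes linked to `gene` by at least `min_count` shared abstracts."""
    linked = {}
    for (a, b), n in edges.items():
        if n >= min_count and gene in (a, b):
            linked[b if a == gene else a] = n
    return linked

network = build_cocitation_network(abstracts)
print(neighbors(network, "TP53"))
print(neighbors(network, "TP53", min_count=2))
```

Raising `min_count` is one simple way to trade coverage for confidence: a pair that shares many abstracts is less likely to be a chance co-mention than a pair that shares one.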

Testing the validity of their software, the Norwegians applied the program to a standard set of lymphocyte gene expression data. Of the top 50 genes linked to malignant lymphocytes, their program found 21 already in its network. "The current PubGene index may be used to identify gene networks not identified by clustering and to classify genes according to biological processes," the program's creators wrote.

Now, PubGene is available under commercial license worldwide on Unix, Linux, Mac, and Windows platforms. Pfizer, deCODE Genetics, Millennium Pharmaceuticals, and Peoples Genetics have licensed it. The latest version, 2.1, has discovered 6 million associations among genes, proteins, functions, GenBank accession numbers, and other pairings.

Hovig says that users seem to like the ability to lower the odds of missing these associations between genes in the literature. "It is important to define as many relevant relationships as possible within the total possible association-spaces of the many associations possible with more than 10,000 genes," he says. He adds that his software is routinely used with other statistical techniques for gene clustering, such as hierarchical clustering and self-organizing maps.

And at Stanford University, geneticists and medical informaticists led by Russ Altman have developed a program called Neighbor. It can use text in scientific abstracts to determine which genes share a common biological function. Stanford tested its program by trying to find 19 known gene groups amid 1,900 bogus groups. The program found 15 (79 percent) of the real gene groups, which, while not perfect, shows that a computer can scan abstracts as well as a trained researcher can.

 Hidden connections: Ingenuity Systems software searches hundreds of thousands of articles, revealing previously obscured relationships. 

Shifting Gears 
These are just some early examples of a shift toward the integration of text mining to identify hypotheses. "We've gone from the model of where the biologists had a clear model in their heads and did an experiment to prove or disprove their mechanistic view of what's going on," says Daniel R. Masys, a physician at the University of California at San Diego, "to a model that's quite open-ended. We don't know, a priori, what those things might be. There must be tools to discover things that the investigators didn't have in mind when they did the experiment."

A former director of the National Center for Biotechnology Information (NCBI), Masys is a self-taught programmer who has personally hand-coded another promising application. In a paper in the April 2001 Bioinformatics, Masys used his software, HAPI, to parse the genes identified by the Broad Institute's Todd Golub in acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML). HAPI automatically generated a list of papers for both sets of genes and corresponding references in GenBank and PubMed. Masys' software revealed that the candidate ALL genes are also linked to multiple sclerosis and inherited combined immunodeficiency.

"Those kinds of things happen in these cluster analysis experiments," Masys says. "It gets your biological sensibilities revved up to see an entirely different disease turn up than the one you were focusing on in your experiment. Computers will be able to reason by unknown pathways and connect new pathways. There are so many relationships; it is impossible to detect them and prove them. You need a computer!"

Or a company like Convera. Claude Vogel, the company's chief technology officer (and founder of Semio, a developer of information-categorization software), believes that the mountain of data is going to have to come to the scientist — that deeply ingrained ways of searching will not be altered by one application. Amgen, Johnson & Johnson, Novartis, and Wyeth are customers.

Ideally, Vogel says, scientists would like to build a vast multidimensional grid: a conceptual matrix in which one dimension might contain rows of diseases; another, a particular protein; and a third, a vast library of compounds that might interact with that protein. Finding the places where a possible therapy intersects with a protein and a given disease will require more than just indexing the data, Vogel says.
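
The grid Vogel describes can be pictured with a small Python sketch. This is not Convera's software. It assumes documents have already been tagged with the diseases, proteins, and compounds they mention, and it fills each cell of a three-dimensional grid with the number of documents that mention all three coordinates together. The documents, the placeholder entity names, and the functions `build_grid` and `slice_grid` are all hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical documents, already tagged with the entities they mention.
docs = [
    {"diseases": {"psoriasis"}, "proteins": {"protein_P1"},
     "compounds": {"compound_A"}},
    {"diseases": {"psoriasis", "arthritis"}, "proteins": {"protein_P1"},
     "compounds": {"compound_A", "compound_B"}},
    {"diseases": {"arthritis"}, "proteins": {"protein_P2"},
     "compounds": set()},
]

def build_grid(documents):
    """Map each (disease, protein, compound) cell to its document count."""
    grid = Counter()
    for d in documents:
        # A document fills a cell only if it mentions all three coordinates.
        for cell in product(d["diseases"], d["proteins"], d["compounds"]):
            grid[cell] += 1
    return grid

def slice_grid(grid, disease=None, protein=None, compound=None):
    """Return the cells matching whichever coordinates are fixed."""
    wanted = (disease, protein, compound)
    return {
        cell: n
        for cell, n in grid.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    }

grid = build_grid(docs)
print(slice_grid(grid, disease="psoriasis"))
```

Counting co-mentions is only indexing. As Vogel notes, deciding which intersections are worth a scientist's attention takes more than that.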

His software is already searching out the most promising junctions in such matrices. Scientists in drug discovery tend to ask predictable questions as they peer at the grids. Vogel says his tools work "so fast that you can be interactively playing and mining in your data. You get a global, holistic view of what you're doing. You can build a mental model of how your space is behaving against your query."

One of the toughest issues within a life science organization is that isolated business units need both localized eccentricity (to reflect deep domain knowledge of a disease) and global conformity (to collaborate with peers half a world away). "Large companies ... have to maintain the focus of the small teams," Vogel says. "But they also have to maintain global consistency. It's a bit of a nightmare ... It's a major problem."

The solution is not yet apparent, either with text mining generally or in the life sciences. The field is too young. But a key issue is the role and prominence of a particular disease "ontology," used not in the dictionary's philosophical sense, but meaning "controlled, structured vocabulary."

The best known is the open-source Gene Ontology, or GO, which is incorporated into a variety of freeware packages such as Manatee (from The Institute for Genomic Research) or Genes2Diseases (from Peer Bork's group at the European Molecular Biology Laboratory). Coming up with a scientific ontology that a medicinal chemist and a physical chemist agree on is not trivial. Devising cross-disciplinary ontologies has proved more difficult still.

With a Little Ingenuity ... 
Even so, Ingenuity Systems is dedicated to better data management in the life sciences. Its ontology is huge — 30 times larger than anyone else's, the company says. Frank Mara, senior vice president of marketing, says that selected customers (Millennium Pharmaceuticals, Aventis, GlaxoSmithKline) know that agreeing on the same name for the same gene is a starting point that could flummox other organizations.

With no legacy reservoirs of documents, and no epochal political infighting, Millennium was able to jump into next-generation text mining ahead of its peers (garnering the company a Bio·IT 2003 Best Practices award in the process). Now Ingenuity is taking some of the tools developed with Millennium and offering them as standalone or Web-ready packages. "This is the year of literature mining," Mara says. "People are going to understand how incredibly important it is — as well as how difficult it is."

Just don't accuse Ingenuity of being a souped-up version of some generic search engine. "If you're interested in just a gene, you can go to Medline and type in the name," Mara says. "What you really want to do is starting to pop up to a systems biology level." Put another way, a better goal is understanding and visualizing what regulates a set of genes of interest, then clicking on that set and reading what the literature has to say about a particular gene or protein, even if it was previously not on your radar screen.

"It's not about 'Let's pull data in from 12 different databases' and then allow you to search that," Mara explains. "It's being able to structure all this information so that the inter-relation between the findings becomes exciting. It's not gene X and gene Y. It's what gene X does to gene Y."

Even the best algorithms to mine the data won't find those insights if the scientific ore hasn't been processed first. "Structure, data modeling, and enforced consistency are absolutely critical in order to run algorithms across the content," Mara says.

Mara is well aware of the GO approach to ontologies but questions whether its structure is deep and rich enough for industrial mining in pharma. "The GO ontology or, more accurately, any controlled vocabulary, is probably the most well-known from a 'brand' standpoint, though it is insufficient for creating a systems biology solution from text-mining efforts," he says. GO "lacks the richness of both number of concepts and structure." Mara touts the Ingenuity Pathways Knowledge Base as "the ideal content that gets [users] critical mass for their algorithms to run across while they tune their natural language processing (NLP) engines."

To create its knowledge base, Ingenuity invested in both manual expert curation of scientific citations and traditional programmers. The company has built a quickly searchable Mount Everest of hundreds of thousands of articles, describing millions of individually modeled relationships between proteins, genes, tissues, drugs, and diseases. In some respects, it's becoming a service business — a Lexis-Nexis for drug companies.

"We can go out and do literature mining that is directed by our customers," Mara says. "They pay us to go out and do specific publications or articles or keywords or specific areas of cancer research. We take it beyond text mining and search. And because we structure it, you can now compute across it."

That capability was sufficiently alluring to persuade Affymetrix to invest $5 million in Ingenuity and co-market its arrays with Ingenuity's ability to extract insight from them. Mara admits that the ability to do systems biology seems risky to some tradition-bound customers. "It is systems biology," he says. "That scares people."

Brute Force
The approach at Recommind is as brute-force as Ingenuity's is nuanced. Recommind's algorithms can rapidly categorize any mass of documents. Customers are coming because of growing frustration with the searching process. There is no manual curation by experts in biochemistry. "It's fully automatic," says chief technology officer Jan Puzicha. "There is no manual intervention involved. [Our system] can automatically detect concepts from text. It does so by analyzing a large number of documents."

He says a typical project might have the algorithms digest multiple document repositories in a company — perhaps Lotus Notes, Documentum, separate file systems, Oracle databases, internal Web sites, or some combination of all of the above. The company says it can handle hundreds of millions of documents. Its algorithms can differentiate between ambiguous terms: Is "ER" an estrogen receptor, or an emergency room? Does "Alzheimer's" have an apostrophe or not? His software also efficiently sorts out that some concepts may have multiple names.
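
The "ER" problem can be illustrated with a toy Python sketch. It is not Recommind's algorithm, which the company says learns concepts automatically from a large body of documents. Here the cue words for each sense are written by hand (an assumption made to keep the example short), and the function `disambiguate` simply picks the sense whose cue words overlap most with the sentence. When no sense clearly wins, it reports the ambiguity instead of guessing.

```python
import re

# Hand-written cue words per sense (an assumption for illustration);
# a production system would learn these associations from a corpus.
SENSE_CUES = {
    "estrogen receptor": {"estrogen", "tamoxifen", "receptor", "breast", "ligand"},
    "emergency room": {"patient", "admitted", "triage", "ambulance", "hospital"},
}

def disambiguate(sentence, cues=SENSE_CUES):
    """Pick the sense of an ambiguous term from the words around it."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & vocab) for sense, vocab in cues.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # Flag the term as ambiguous when there is no evidence or a tie.
    if best == 0 or len(winners) > 1:
        return "ambiguous"
    return winners[0]

print(disambiguate("ER expression predicts response to tamoxifen in breast tumors"))
print(disambiguate("The patient was admitted to the ER by ambulance"))
print(disambiguate("ER was measured"))
```

The third call has no context to go on, so the sketch returns "ambiguous" rather than choosing a sense at random.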

Recommind does not yet have any pharma customers, but was asked to help Medline Plus improve its search engine. Lay hypochondriacs typing in online searches for "yoga, underage drinking" or "thrombophilia, Sjogren, germline" were coming up with no hits.

Recommind's algorithms were able to "save" more than half of such failed searches. (An independent expert reviewed the results before and after Recommind's tools were added to the site.) Before Recommind got involved, 40 percent of queries produced no results. The company was able to improve the percentage of searches returning "relevant" results from 68 percent (with the old Web site) to 96 percent (with Recommind's MindServer technology).

"By analyzing the corpus, we understand the multiple meanings and deliver different results lists or resolve that ambiguity — tell the user that this is ambiguous," Puzicha says. "Now people can find content even if the documents themselves don't contain the keywords." In a sense, Recommind has coded a Zen koan: a program that can find and retrieve information that was never specifically requested.

As nearly magical as all current and next-generation text-mining capabilities may seem, they are being applied to only a fraction of the most tantalizing text: the abstract. The full, unabridged text of scientific articles is almost always locked away from the clutching paws of software. Generating those abstracts is, by definition, an art. That means that far more unexplained connections could emerge from text-mining the entire mountain of life science data, not just the summit.

Fortunately, even that is changing, thanks to new online reservoirs of insight such as the Public Library of Science and BioMed Central, where Matthew Cockerill is doing text mining with new software built into Oracle databases. Says Cockerill: "The full text articles are locked in prisons in publishers' Web sites. We make our whole corpus of data available. People can download it and work with it whenever they want."