The semantic web doesn’t exist, but Pfizer’s Slater believes semantic technologies are paving the way.
By Kevin Davies and Phillips Kuhl
March 24, 2009 | The theoretical benefits of the semantic web for life sciences have been debated for a few years now (see “Masters of the Semantic Web,” Bio•IT World, October 2005), but practical examples within pharma remain scarce. Pfizer’s Ted Slater is an interesting exception. Slater heads a small group of four informatics scientists in St. Louis called the Indications and Pathways Center of Emphasis (IPCoE), which supports Pfizer research efforts in identifying and validating inflammation targets.
Slater trained as a molecular biologist, but made the mistake of buying his first computer shortly after starting his Ph.D. at UC Riverside. “72 straight hours later, I realized that I’d made a vocational error and maybe I wanted to learn more about computer science instead.” He took a Master’s in computer science, and went on to work for a string of genomics companies in the ’90s, including Sequana Therapeutics, GCG, and Paradigm Genetics. He was the founding vice president for knowledge engineering at Genstruct, joining Pfizer in 2004.
Slater tends not to use the term “semantic web.” “There just isn’t a semantic web,” he says candidly. “It doesn’t mean there won’t be one in the future, but there isn’t one now.” He prefers “semantic technologies,” so as not to provoke critics who would otherwise argue that he should move on to something else.
At CHI’s Bridging Pharma & IT conference last October, Slater outlined a project focused on pathway data analysis that showed how structuring data in a semantic network could provide substantial benefits, such as automated hypothesis generation, over traditional pathways solutions. The effort has gone from idea to reality over the past six months.
Using semantic technologies in this way helps to eliminate long-standing problems in informatics like data silos, where information is not interoperable with other necessary information, and data tombs, which simply make information very difficult to retrieve. Semantic technologies shift the focus from collecting information and making it safe to actually using the information in its proper context to solve research problems. “The computer should be a way of enhancing our own natural ability to reason, in the same way a bicycle enhances our ability to move ourselves around,” says Slater.
Ideally, Slater says, users should be able to reason with the data in silico: using “if-then rules,” the computer generates a hypothesis, and the scientist then weighs its potential implications and draws on existing knowledge to test whether the data support it.
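To make the idea of “if-then rules” concrete, here is a minimal sketch of rule-based hypothesis generation in Python. It is not Pfizer’s implementation: the fact names (GeneA, PathwayX, Sample1), the predicate names, and the single rule are all hypothetical, chosen only to illustrate how a computer might propose a hypothesis for a scientist to evaluate.

```python
# Hypothetical facts stored as (subject, predicate, object) tuples.
# All names are illustrative placeholders, not real biology.
facts = {
    ("GeneA", "is_upregulated_in", "Sample1"),
    ("GeneA", "participates_in", "PathwayX"),
}


def rule_pathway_active(known):
    """IF a gene is upregulated in a sample AND that gene participates
    in a pathway, THEN hypothesize the pathway may be active there."""
    derived = set()
    for gene, pred, sample in known:
        if pred != "is_upregulated_in":
            continue
        for gene2, pred2, pathway in known:
            if gene2 == gene and pred2 == "participates_in":
                derived.add((pathway, "may_be_active_in", sample))
    return derived


def forward_chain(initial, rules):
    """Apply every rule repeatedly until no new statements appear."""
    known = set(initial)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(known)
        if derived <= known:  # nothing new was inferred
            return known
        known |= derived


# Hypotheses are whatever the rules added beyond the original facts.
hypotheses = forward_chain(facts, [rule_pathway_active]) - facts
for statement in sorted(hypotheses):
    print(statement)
```

The output is a candidate statement, not a conclusion; in the workflow Slater describes, the scientist still decides whether it is worth testing against data.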
“We constantly hear that the Holy Grail is complete data integration,” says Slater. “I have bad news—it will never happen! Users are able to set up and start building new, independent repositories of data faster than we can integrate existing data. You will never be able to get it all in one place where it is integrated and useable. The goal instead should be data that are interoperable, even if they are not integrated.”
Slater’s group helps scientists to study gene expression and signaling pathways in order to identify alternative indications for drugs in development. “There is no easy way to understand what is going on if you look at a list of 1,000 genes that are significantly up- or down-regulated,” he says. Even if those genes are mapped onto pathways using a commercial pathways tool, one is forced to work with what amounts to a reference tool. Much information is available on individual genes, but you have to try to tell a story about physiology by painstakingly going through the annotation for each gene one at a time.
An alternative approach lets the computer generate hypotheses based on available data, which would distill the range of possibilities to a few key relationships. If the data are represented correctly, you can use data from disparate databases, such that users can create ‘boutique’ knowledge bases for their own needs and easily link them together.
Adapting the familiar “triple” semantic RDF format (representing information as a subject, predicate, and object), Slater represents the data as a mathematical graph, with subject and object as nodes and the predicate (the relationship between them) as an edge. One triple’s subject can be another triple’s object, and so on, until a very large graph of everything known in some domain is created. In this format, the information can be handled with software to build inferences and test hypotheses. His group uses open-source ontology development tools to build OWL ontologies and another open-source tool, Cytoscape, to view the data in graph format. For persistent storage, knowledge graphs can be managed in Oracle’s built-in RDF data model.
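The triple-as-graph idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of Slater’s system: it uses plain tuples instead of a real RDF store, and the entities (DrugD, ProteinP, PathwayX, ConditionC) and predicates are hypothetical. It shows how triples that share nodes chain together, so that software can follow edges from one entity to another.

```python
from collections import defaultdict, deque

# Hypothetical triples: (subject, predicate, object).
# The object of one triple is the subject of the next.
triples = [
    ("DrugD", "inhibits", "ProteinP"),
    ("ProteinP", "activates", "PathwayX"),
    ("PathwayX", "is_implicated_in", "ConditionC"),
]


def build_graph(statements):
    """Index triples as a directed graph: node -> [(predicate, node)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the chain of triples linking
    start to goal, or None if no directed path exists."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for predicate, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, predicate, neighbor)]))
    return None


graph = build_graph(triples)
path = find_path(graph, "DrugD", "ConditionC")
for subject, predicate, obj in path:
    print(subject, predicate, obj)
```

Because every statement has the same three-part shape, triples from a second knowledge base could be appended to the same list and would link up wherever they share a node name, which is one simple way to picture interoperability without full integration.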
One of the goals in data analysis, says Slater, is to use heuristics over the knowledge bases to tell a story. Semantic representations of knowledge allow you to apply expert reasoning to experimental data, which may help explain a particular outcome and in turn suggest a testable hypothesis. “You don’t get inferences in a traditional structured database,” says Slater. “We have our share of traditional databases, and we are getting better at data warehousing. For many scientific applications, representing the data as an RDF graph and building for interoperability make the information much more usable. If the description of your problem solution ends with, ‘and then the user can query it,’ then you haven’t thought it through enough.” How you structure the information can either lock up the information in a data tomb or set it free.
The experimental system that Slater and his group have developed is called the “Pfizer Environment for Knowledge Engineering,” or PEKE. “PEKE handles all of the usual storage and querying capacities of traditional databases, but because of its architecture it has some surprising emergent properties,” says Slater. Among these is the ability to create, with just a couple of mouse clicks, new knowledge bases that interoperate essentially automatically with other PEKE knowledge bases.
Another capability of PEKE is that, because the semantics of each knowledge base are explicit in its OWL ontology rather than implicit in a relational database schema, PEKE supports knowledge bases containing any kind of knowledge with no changes to the architecture. While most PEKE knowledge bases are currently about molecular pathways, the ontology Slater uses to demonstrate how easy it is to create PEKE knowledge bases is the OWL pizza ontology from Stanford’s Protégé Team.
Slater says, “PEKE is world-class stuff. We think we can now build knowledge bases faster and cheaper than anyone else in the industry, and do much more with them once they’re built.” We may still be waiting on the semantic web, but semantic technologies are already paving the way for the next wave of informatics innovation.
This article also appeared in the March-April 2009 issue of Bio-IT World Magazine.