Sept. 18, 2006 | A paper in Nature Biotechnology with the eye-popping title “Global Mapping of Pharmacological Space” is sure to attract a modicum of attention. Andrew Hopkins, Gaia Paolini, and their colleagues at Pfizer Global Research and Development in Sandwich, United Kingdom, have compiled a massive virtual library of almost 5 million small molecules, which they see as instrumental to conducting truly predictive biology.
Hopkins is best known for promoting the concept of the “druggable genome” a few years ago with his colleague Colin Groom. The druggable genome signifies the portion of the human genome that encodes proteins that are directly involved in disease pathogenesis and amenable to modulation by small molecules or other therapeutic agents.
The genesis for the Pfizer KnowledgeBase began about five years ago, Hopkins told Bio-IT World, in part from the realization that cheminformatics researchers did not have the same advantages as their bioinformatics brethren.
“For the past 20 to 30 years, with the wealth of collated data resources such as the Protein Data Bank, GenBank, and so on, bioinformatics has developed into a largely public field with widespread dissemination, open-source software. The existence of these vast biological databases has enabled the creation of new fields such as structural informatics, evolutionary and comparative genomics,” says Hopkins.
“But in cheminformatics, it’s a very different position. The wealth of published chemical structure and screening data is not collated in any public, open, computer-accessible databases but scattered amongst journal articles and patents, and it was only recently that companies started to integrate their databases. So we asked, ‘How can we start thinking about integrating huge amounts of data, not just to access data on a specific compound but to data-mine all pharmacological space?’ If you really want to investigate and ask deep questions about chemical structure and biological space, you need to create these chemogenomic knowledge bases.”
As the authors open their paper, “The foundation for developing drug discovery into a knowledge-based predictive science lies, in part, in the assembly and integration of all medicinal chemistry structure-activity information.” Despite the profusion of public repositories of chemical data, including Harvard’s ChemBank Initiative and the NCI’s Screening Database, such databases are limited in comparison to various proprietary databases held by industry.
So the Pfizer group, blending researchers from departments of Knowledge Discovery, Computational Chemistry, and Research Informatics, decided to build a unified database. Gathering all the data in one place offered greater control for indexing and data retrieval and management, enabling Pfizer scientists to perform global mapping.
The underlying objective for Hopkins’ team was to build what they call “the foundation of probabilistic approaches to drug discovery.” Hopkins identifies two broad applications of the global database. One is in polypharmacology — the rational identification of drugs with multiple targets. The other is probabilistic modeling.
Hopkins explains: “We started off on the assumption that if we start with chemical structures, with all the nuances of chemical structure representation, then we could integrate screening data with a biological ontology, link to competitive intelligence and many attributes related to the compound or target, then you can really start to explore global trends and build a database for predictive science. There’s lots of advanced predictive algorithms and knowledge discovery methods that can be used in cheminformatics, but what has been missing is great data sets to build models for using those algorithms. Absence of comprehensive and integrated data for model building was one of the key drivers.”
The Pfizer data warehouse currently contains some 4.8 million chemical structures. Most of the compounds are derived from various Pfizer sources, including collections acquired during the big pharma’s mergers with Warner-Lambert and Pharmacia. Another major source was Inpharmatica’s StARLITe repository of compounds culled from more than two decades of papers in a pair of key chemistry journals. The other compounds were incorporated from a variety of public repositories. About 6 percent of the chemicals are known to be biologically active, half of which come from the public side.
“Inpharmatica’s StARLITe product was absolutely essential,” says Hopkins (see Box). “It’s a huge wealth of information. They manually curated these data.” Other companies, including GVK Bioscience and Jubilant Biosys (Bangalore, India), have gone down the same path of curating data for private commercial outfits, and Hopkins says those may well be integrated into the KnowledgeBase in the future.
The most critical part of the operation was vetting the data in Oracle staging tables with the help of SciTegic’s Pipeline Pilot prior to integrating them into the KnowledgeBase. Pipeline Pilot was also used to aid in data queries, manipulation, filtration, and access. Aside from canonical representations of chemical structure, the data warehouse includes SAR and assay data, disease relevance, and competitive intelligence on approved drugs and those under investigation. Their analysis suggests that more than 700 human proteins are targeted by at least one compound that has a binding affinity below 10 µM and satisfies Lipinski’s Rule-of-Five.
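To make that filter concrete, here is a minimal Python sketch of how one might count targets hit by potent, drug-like compounds. It is an illustration only, not Pfizer's pipeline: the `Compound` record, its field names, and the four sample compounds are invented for this example. The thresholds are the standard Rule-of-Five limits (molecular weight of 500 or less, logP of 5 or less, no more than 5 hydrogen-bond donors and 10 acceptors) together with the 10 micromolar affinity cutoff quoted above.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record; field names are assumptions for this sketch."""
    name: str
    mol_weight: float       # daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int
    affinity_um: float      # binding affinity in micromolar (lower = tighter)
    target: str


def rule_of_five_violations(c: Compound) -> int:
    """Count how many of Lipinski's four criteria the compound breaks."""
    return sum([
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_bond_donors > 5,
        c.h_bond_acceptors > 10,
    ])


def druggable_targets(compounds, max_affinity_um=10.0, max_violations=0):
    """Targets with at least one potent compound that passes the rule."""
    return {
        c.target
        for c in compounds
        if c.affinity_um < max_affinity_um
        and rule_of_five_violations(c) <= max_violations
    }


# Made-up sample data, for illustration only.
compounds = [
    Compound("cpd-A", 350.4, 2.1, 2, 5, 0.05, "kinase-1"),
    Compound("cpd-B", 720.9, 6.3, 6, 12, 0.50, "protease-2"),  # breaks all four limits
    Compound("cpd-C", 410.0, 3.8, 1, 6, 45.0, "gpcr-3"),       # too weak a binder
    Compound("cpd-D", 298.3, 1.2, 3, 4, 8.0, "gpcr-3"),
]

print(sorted(druggable_targets(compounds)))
```

Many practitioners tolerate a single violation; passing `max_violations=1` would apply that looser reading. The article does not say which reading the Pfizer analysis used.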
Fully one-third of the biologically active drugs in the Pfizer KnowledgeBase target more than one protein. In a quarter of those cases, compounds are active against proteins in different classes. The targets that are most attractive include cytochrome P450, protein kinases, and G-protein coupled receptors. The integration of the data enables the polypharmacology of compounds across all the targets to be represented as a network of chemical space.
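One simple way to picture such a network, sketched here in Python under stated assumptions: treat each target as a node and connect two targets whenever a single compound is active against both. The activity records below are invented for illustration, and the function name `build_network` is hypothetical; the paper's actual network construction may differ.

```python
from collections import defaultdict
from itertools import combinations

# Made-up activity records: (compound, target, target class).
activities = [
    ("cpd-1", "CYP3A4", "cytochrome P450"),
    ("cpd-1", "CYP2D6", "cytochrome P450"),
    ("cpd-2", "ABL1", "protein kinase"),
    ("cpd-2", "DRD2", "GPCR"),
    ("cpd-3", "DRD2", "GPCR"),
    ("cpd-4", "ABL1", "protein kinase"),
    ("cpd-4", "SRC", "protein kinase"),
    ("cpd-4", "DRD2", "GPCR"),
]


def build_network(records):
    """Return (targets per compound, edge weights between target pairs).

    An edge's weight is the number of compounds active against both targets.
    """
    targets_by_compound = defaultdict(set)
    for compound, target, _target_class in records:
        targets_by_compound[compound].add(target)

    edges = defaultdict(int)
    for targets in targets_by_compound.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(targets), 2):
            edges[(a, b)] += 1
    return dict(targets_by_compound), dict(edges)


targets_by_compound, edges = build_network(activities)

# Fraction of compounds that hit more than one target.
promiscuous = sum(len(t) > 1 for t in targets_by_compound.values())
fraction = promiscuous / len(targets_by_compound)

for pair, weight in sorted(edges.items()):
    print(pair, weight)
print("multi-target fraction:", fraction)
```

Edges that join targets from different classes, such as a kinase and a GPCR in this toy data, correspond to the cross-class activity the article describes.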
Using Pipeline Pilot and some clever Bayesian algorithms, Hopkins’ team set about predicting the primary pharmacology of its compound library. In 64 percent of cases, they correctly predicted a known experimental target. “These initial studies indicate that probabilistic models built from integrated medicinal chemistry SAR data are a promising approach for predicting primary pharmacology across a large number of protein targets,” Hopkins’ group writes. Moreover, classifying compounds on the basis of basic physicochemical properties enabled one-third of ligands to be correctly assigned to their major target class.
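The article does not spell out the Bayesian method, so the following Python sketch should be read as a generic stand-in rather than the team's algorithm: a Bernoulli naive Bayes classifier with Laplace smoothing that assigns a compound to the target class whose known ligands share the most structural features with it. The feature names and the four training examples are invented; a real system would use computed substructure fingerprints and far more data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: iterable of (set of structural features, target label)."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return class_counts, feature_counts, vocabulary


def predict(model, features):
    """Return (best label, log-score per label) for a set of features."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_score = math.log(n / total)  # class prior
        for f in vocabulary:
            # Laplace-smoothed probability that a ligand of this class has f.
            p = (feature_counts[label][f] + 1) / (n + 2)
            log_score += math.log(p if f in features else 1 - p)
        scores[label] = log_score
    return max(scores, key=scores.get), scores


# Made-up training set: hypothetical feature names, illustrative labels.
training = [
    ({"hinge-binder", "aromatic-amine"}, "protein kinase"),
    ({"hinge-binder", "halogen"}, "protein kinase"),
    ({"basic-amine", "aromatic-ring"}, "GPCR"),
    ({"basic-amine", "halogen"}, "GPCR"),
]

model = train(training)
label, scores = predict(model, {"hinge-binder", "halogen"})
print(label)
```

Scoring in log space avoids numerical underflow when the feature vocabulary is large, and the smoothing keeps a feature that has never been seen for a class from driving that class's probability to zero.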
Hopkins welcomes the growing public repositories of chemical libraries and academic screening data. “They’re great initiatives that need to be encouraged.” Hopkins says the great challenge facing the community is to increase their value for data mining by incorporating the wealth of published literature and patent data. At present there is no facility or culture to deposit published or patented pharmacological data in open, public databases, as there is for many types of biological data, such as genome sequences and protein structures. “The advantage would be to build more predictive tools, [but] you need to access the data.”
Despite a surge in data on ligands for new targets, particularly for kinases, Hopkins notes, “We have yet to see the increase in new targets with leads translating into proportionate increase in the number of approved first-in-class drugs.” But that is changing rapidly. “As an industry, we have chemical tools for a much larger fraction of the genome than we previously thought,” says Hopkins.
The Pfizer group closes its paper on an optimistic note: “As the vast majority of all drug discovery projects and clinical candidates fail the exacting criteria for safe human medicines, what we are left with are the learning and data that can contribute to the refinement of predictive models, for the benefit of all.”
Sidebar: StARLITe, Star Bright
Inpharmatica has enjoyed a long commercial relationship with Pfizer — chiefly with the company’s BioPendium proteomics resource. “They knew of us and our work,” says John Overington, Inpharmatica’s vice president of drug discovery. “The StARLITe database was a logical progression.”
StARLITe is a highly curated collection of more than 400,000 compounds (and 1.5 million data activity points) culled from more than two decades of published papers in the two most prestigious medicinal chemistry journals. “The papers have the highest quality of compounds,” says Overington. “It’s where everyone would want to publish quality medicinal chemistry.”
Work on the database began about six years ago, Overington explains, taking about three years to get just right. “We brought a very strongly controlled vocabulary to the description of the data, lots of care and attention to cleaning and curation; it’s not just raw abstraction.” That database includes protein targets, pharmacokinetic parameters, and then access to the synthetic route to most of the compounds.
“We built StARLITe to do our own drug discovery and drug design projects,” says Overington. “It’s been successful; we’ve developed candidate molecules using that technology. We also saw an opportunity to license this technology to pharmaceutical companies.”
Following the early success of the Pfizer partnership and other StARLITe licensees, Overington says, “We’re talking to a number of companies around technology development.” Many of Inpharmatica’s current research interests, Overington says, focus on leveraging the database: “How do you leverage the thousands of years of medicinal chemistry experience in designing a new drug?” --K.D.
Paolini, G.V. et al. “Global mapping of pharmacological space.” Nat Biotechnol 24, 805-15; 2006.
Email Kevin Davies at: email@example.com.