Sept. 18, 2006 | With the bureaucracy endemic in the drug industry, it’s no surprise it takes a decade and about $1 billion to develop a viable drug candidate, let alone a successful drug. A ray of hope to improve this appalling situation is the industry’s increasing use of complex computer algorithms, sophisticated mathematics (statistics, data mining), huge biologic and chemical databases, and tons of commercial software to help analyze and integrate knowledge. But to what end? What can mathematics offer to speed up the drug discovery pipeline?
Researchers are increasingly using DNA and other microarrays to probe gene and cellular activity. These tools produce disparate genetic and biochemical data that must be viewed as an integrated whole. By grouping genes into categories related to chemical pathways or cellular processes, and analyzing the results in light of physiologic pathways, researchers may glean mechanisms of action and identify viable drug candidates.
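One common way to ask whether a pathway category is over-represented among the genes a microarray flags is a hypergeometric test. The column does not name a specific test, so treat the following as a minimal sketch of that general idea and not as the method used by any particular tool. The function name, its parameters, and the gene counts are all hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """Probability of seeing at least `overlap` pathway genes among the
    selected genes if the selection were random (hypergeometric upper tail)."""
    upper = min(pathway_genes, selected_genes)
    # Sum the probability mass for every overlap from the observed value upward.
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total_genes, selected_genes)

# Hypothetical numbers: a 10,000-gene chip, a 100-gene pathway,
# 200 genes flagged as regulated, 8 of which sit in the pathway.
p = enrichment_p_value(10_000, 100, 200, 8)
print(f"enrichment p-value: {p:.2e}")
```

A small p-value suggests the pathway shows up more often than chance alone would predict. The test assumes genes are drawn independently, which real biology rarely honors.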
Dramatic increases in the density of gene/protein surrogates on a single chip fuel the need for data warehousing and analysis. The science of bioinformatics has largely met the first challenge. Databases and software now catalogue, organize, and store vast quantities of microarray data. Examples include the GO Charts, KEGG Charts, and Domain Charts available from the National Institutes of Health. How far these tools can go to actually address the main problems of biological complexity remains to be seen.
Comparative studies have resulted in new procedures for standardization to help reproduce data between laboratories and mitigate nomenclature problems. Meanwhile, mathematicians have developed algorithms to process the data ever more quickly, and statisticians have produced more robust models to analyze the data and reveal further questions.
Of course, gene expression analysis is just one part of a broader picture. Mathematics and software play key roles in the analysis of high-throughput screening, genomics (including toxicogenomics), proteomics, metabolomics, and integrative pathway analysis. Excluding the mathematics used for purely algorithmic functions, most of the techniques fall under statistics and data mining.
Complex Code
In the area of high-throughput screening and toxicity studies, database tools such as DrugMatrix (Iconix Pharmaceuticals), ArrayTrack (National Center for Toxicological Research, FDA), and ToxExpress (Gene Logic) use standard statistical techniques as well as more specialized data-mining techniques, including perceptrons, genetic algorithms, support-vector machines, and self-organizing maps. All of these implementations involve complex, and sometimes highly theoretical, mathematics and complex code because of the huge databases used during computation.
One of the most actively pursued areas in drug discovery concerns classification and prediction: for example, the discovery of gene sets (notice the plural!) that will robustly predict a class of disease, disease progression, or drug susceptibility. The first step, discovering significantly regulated genes, is accomplished by common statistical methodologies such as the t-test and ANOVA, which have been widely implemented in attempts to cope with the “large p/small n” problem (too many variables, not enough samples). The harder questions are which genes, and how many, to carry forward. Including too few risks missing an important, perhaps vital, input; including too many obscures the true importance of the more significant contributors.
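To see why “large p/small n” bites, it helps to run a per-gene t-test on data that contains no real effect at all. The sketch below is a toy simulation, not drawn from any package named in this column. The gene count, group size, random seed, and function names are assumptions made for illustration. The cutoff of 2.776 is the two-sided 5 percent critical value of Student's t with 4 degrees of freedom (three samples per group).

```python
import random
import statistics

def student_t(a, b):
    """Two-sample Student's t statistic using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    std_err = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / std_err

def count_false_positives(n_genes=1000, n_per_group=3, t_crit=2.776, seed=0):
    """Count 'significant' genes when every gene is pure noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Both groups come from the same distribution: no true regulation.
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(student_t(control, treated)) > t_crit:
            hits += 1
    return hits

false_positives = count_false_positives()
print(f"{false_positives} of 1000 noise-only genes pass the 5% cutoff")
```

With 1,000 genes tested at the 5 percent level, on the order of 50 will look regulated by chance alone. That is why an uncorrected gene list cannot be taken at face value.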
These questions are further addressed by techniques such as clustering, discriminant analysis, and principal components analysis, which are implemented in software packages including Rosetta Resolver, SAS-MAS (SAS), Bioconductor (a collection of R packages), Partek Genomics Suite, and others. These tools not only predict and classify but also offer quality control features, multiple comparative statistical features, and sophisticated graphics interfaces that help users more fully appreciate the complexity of the data, and perhaps tease out the desired effects.
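Of the techniques just listed, principal components analysis is perhaps the easiest to sketch. The version below finds only the first component, using power iteration on the sample covariance matrix. It is a bare-bones illustration under stated assumptions, not how any of the named packages implement it. The function name and the tiny expression table are hypothetical, and the code assumes the data are not constant and that the starting vector is not orthogonal to the answer.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit-length direction of greatest variance.

    `rows` holds one list per sample, with one value per gene.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix between genes (p x p).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Eigenvectors have an arbitrary sign; make the first loading non-negative.
    if v[0] < 0:
        v = [-x for x in v]
    return v

# Hypothetical table: 4 samples x 3 genes. Genes 1 and 2 rise together,
# gene 3 barely changes.
samples = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 5.1],
    [3.0, 6.2, 4.9],
    [4.0, 7.8, 5.0],
]
pc1 = first_principal_component(samples)
print([round(x, 3) for x in pc1])
```

The loadings show which genes drive the dominant pattern. Here the two co-varying genes carry nearly all the weight, and the flat gene contributes almost nothing.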
Specialized packages such as DecisionSite (Spotfire) extend graphics capabilities to large sample sizes while performing some analytic functions. Others such as Ingenuity’s Pathway Analysis connect the output of these analyses to large databases to perform the gene annotation and pathway analysis necessary to interpret the biological components of the software output. Such tools have driven rapid growth in the number of statistical and bioinformatics papers attempting to address these problems.
The results of these mathematical routines have the potential to save millions of dollars in drug development. And yet the flow of successful drugs is dwindling. The cause goes beyond bureaucracy and lies in the complexity of the science itself. Scientists are layering the full complexity of the human genome on top of exceedingly complex software modeled with on-the-fly mathematics. We have yet to appreciate the full complexity of the systems under study and the mathematics needed to address it. Consider the magnitude and diverse nature of the physical and chemical reactions occurring simultaneously, the constantly changing environment of those reactions, and the number of interactions and irregular confounding effects of feedback loops, all exquisitely related. Together, these conspire to quickly destroy the statistician’s cherished notion of independence.
The problem lies not with lack of data on the biology side, nor lack of imaginative techniques on the mathematical side, but rather in the techniques to integrate the two areas accurately.
John A. Wass, Ph.D., is in the Advanced Technology unit of Abbott Laboratories Global Pharmaceuticals R&D. The author’s opinions are his own. E-mail: john.wass@abbott.com.