

By John P. Helfrich

Sept. 9, 2002 | ONE OF THE MOST DRAMATIC recent advances in drug discovery has been the increase in screening throughput. Throughput has risen from 10,000 assays per year to current ultra-high-throughput levels of, in some cases, more than 100,000 assays per day. High-throughput screening (HTS) is also evolving to include more target identification and validation in addition to lead identification and optimization. Target identification and validation are achieved either within the initial screens or through collaborative downstream, high-throughput ADME (absorption, distribution, metabolism, elimination/excretion) and toxicity testing. The common element throughout this process is the enormous amount of data that must be processed, correlated, and communicated in order to reach the ultimate decision: to proceed or to stop.

Drug discovery is also fast becoming a "parallel process" where the goal is early lead identification and optimization. Levels of lead compound attrition in development are around 80 to 90 percent and account for 70 percent of the cost of R&D and drug development; this compares with 40 percent and 30 percent, respectively, for discovery research. Clearly, identifying more high-quality leads in early discovery is a necessity. Now that combinatorial chemistry programs and proteomic-based targets have taken hold to create an ultra-HTS environment, the lab must effectively manage all the disparate data streams that support discovery decisions and knowledge transfer throughout the life of a quality drug candidate in clinical trials.

This "data avalanche" will likely grow exponentially over the next decade, fueled by advances in modern ultra-HTS tools and the drive to move from 96-well plates to 384- and 1536-well plates. Factor in the thousands of new protein targets revealed by proteomics, and the number of data points per unit time escalates even further. Indeed, one pharmaceutical company predicted in the summer of 2001 a 40-fold increase in the number of data points in HTS — from about 4 million in 1998 to more than 150 million this year.
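The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above; the assumptions that "this year" means 2002, that growth compounds evenly from year to year, and that every well holds exactly one assay with no control wells are illustrative simplifications, not claims from the survey or the pharmaceutical company cited.

```python
# Figures quoted in the text.
points_1998 = 4_000_000      # HTS data points in 1998
points_2002 = 150_000_000    # predicted data points "this year" (assumed to be 2002)

# Overall growth: 37.5-fold, which the text rounds to "40-fold".
fold_increase = points_2002 / points_1998

# Assumption: growth compounds evenly over the four years.
years = 2002 - 1998
annual_growth = fold_increase ** (1 / years)

# Plate formats named in the text, and their well density relative to 96 wells.
plate_formats = [96, 384, 1536]
density_gain = {wells: wells // 96 for wells in plate_formats}


def plates_needed(assays: int, wells: int) -> int:
    """Plates required for a number of assays, rounding up.

    Assumption: one assay per well and no wells reserved for controls.
    """
    return -(-assays // wells)  # ceiling division


# Plates per day at the 100,000-assays-per-day level mentioned earlier.
plates_per_day = {wells: plates_needed(100_000, wells) for wells in plate_formats}

print(f"fold increase: {fold_increase}")
print(f"annual growth factor: {annual_growth:.2f}")
print(f"density gain: {density_gain}")
print(f"plates per day: {plates_per_day}")
```

Under these assumptions, the data volume more than doubles each year, while moving from 96-well to 1536-well plates cuts the daily plate count by a factor of about 16 for the same number of assays.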

This knowledge base of data contains large volumes of analytical information from HTS, as well as structure and purity data, determined by technologies like HPLC/mass spectrometry and NMR spectroscopy. Target validation data (from gel electrophoresis and microarrays) are also increasingly common. Because each instrument relies on its own proprietary software, the challenge is to find an application-independent storage system to aggregate all the disparate data sources and provide a means to electronically store, retrieve, extract, and communicate the data to colleagues at the bench and in the boardroom. This aggregation used to be performed manually, by cutting, copying, pasting, and scanning paper documents. The information would then be collated at the scientists' desks, stored in spreadsheets, and so on. More recently, departments can save the "flat" files within a LIMS (laboratory information management system) or enterprise data system. But finding and sharing the data is still cumbersome, slowing the interpretation of raw data into information to support critical decisions.

ALL TOGETHER NOW: A 384-well HTS plate assay stored and catalogued in the NuGenesis SDMS database is visualized using the DecisionSite software from Spotfire Inc.

To understand and address this challenge, we at NuGenesis are conducting a survey of scientists involved in drug discovery within the biopharmaceutical industry regarding their data management practices. Based on preliminary results from 137 respondents, we observe the following:

  • 88 percent of respondents expect an increase in data management needs;

  • 82 percent are involved with collaborations outside the immediate corporation;

  • 56 percent store data on paper or a local PC; only one-third use a commercially supplied data system.

Our results illustrate that current data management practices in the drug discovery arena are fragmented at best. The trend, with respect to increases in the amount of data and the overwhelming collaborative nature of today's research efforts, emphasizes the need for an effective and comprehensive data capture and communication-friendly data management system.

Data Management Specification Issues 
Existing data system solutions are often too rigid to allow automated collection, archiving, retrieval, and transfer of disparate instrument-readable data (e.g., from plate readers, LC/MS, and NMR) as well as human-readable data (instrument reports and business software). Ideally, all data sources in the lab would be seamlessly unified into a common repository that could also integrate with other niche software tools, such as statistics and visualization packages, to provide a secure, convenient, and rapid framework for data access and storage.

A major problem is that most scientists actually work with printed reports produced by modern analytical instrumentation, termed "human-readable data." These reports contain the processed raw binary data results that a scientist can evaluate. Both vendor-specific, instrument-generated reports (e.g., LC/MS printouts) and human-generated reports (spreadsheets and presentations) containing parts of the machine-generated report are archived by the data management platform and securely disseminated among other team members internationally through Web-enabled processes. Equally important is the ability to find text, even when it is embedded within graphics.
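The idea of a common, application-independent repository that holds both instrument-generated and human-generated reports, and lets team members find text across them, can be illustrated with a minimal sketch. This is not the NuGenesis SDMS design or any vendor's API; the table layout, the function names (`open_catalog`, `capture`, `search`), and the sample records are all hypothetical, and an in-memory SQLite database stands in for a real shared store.

```python
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create a hypothetical in-memory catalog for lab reports."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE records (
            id          INTEGER PRIMARY KEY,
            source      TEXT NOT NULL,  -- instrument or application name
            kind        TEXT NOT NULL CHECK (kind IN ('instrument', 'human')),
            project     TEXT NOT NULL,
            report_text TEXT NOT NULL   -- extracted, searchable report text
        )
        """
    )
    return conn


def capture(conn, source, kind, project, report_text) -> int:
    """Store one report, regardless of which application produced it."""
    cur = conn.execute(
        "INSERT INTO records (source, kind, project, report_text) "
        "VALUES (?, ?, ?, ?)",
        (source, kind, project, report_text),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Return (source, kind, project) for every report containing the term."""
    rows = conn.execute(
        "SELECT source, kind, project FROM records "
        "WHERE report_text LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()


# Made-up sample records, for illustration only.
catalog = open_catalog()
capture(catalog, "plate reader", "instrument", "P-001",
        "384-well kinase inhibition assay, plate 17")
capture(catalog, "LC/MS", "instrument", "P-001",
        "Purity report for compound 42")
capture(catalog, "spreadsheet", "human", "P-001",
        "Summary of kinase hits for team review")

hits = search(catalog, "kinase")
print(hits)
```

The point of the design is that the catalog records where a report came from without depending on the software that produced it, so a single text query spans plate-reader output and a scientist's spreadsheet alike. Finding text embedded within graphics would need an additional extraction step that this sketch does not attempt.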

The new discovery process consists of many parallel scientific disciplines all converging on a final goal — to find and characterize the next new chemical entity for clinical trials faster and cheaper than the last one using qualified information from vast amounts of high-throughput raw and processed data. The current HTS environment will continue to increase the number of data points per unit time to new heights. Today, many drug discovery operations lack the data automation to effectively capture and catalog all the necessary data; we found that 75 percent of drug discovery departments have no systematic data management strategy in operation.

There is a clear need for a data management platform that automatically captures both instrument- and human-created data to allow authorized team members to view, use, and communicate this information on a global basis. This platform must integrate directly within existing IT environments (LIMS and enterprise data systems) and be deployed on a lab, department, site, or multi-site basis.

The ability to aggregate all data sources to communicate results will become critical for deciding the fate of a new chemical entity. Having an optimal data management platform is a fundamental requirement for effective knowledge transfer in the modern high-throughput drug discovery and development field.

John P. Helfrich is the program manager of the drug discovery and development group at NuGenesis Technologies Corp. in Westborough, Mass., and can be reached at
