By Hermann A.M. Mucke
September 15, 2009 | Data are the indispensable material for all analyses, but if they cannot be put into proper context and interpreted, their value remains only potential. The famous statement by futurologist John Naisbitt, “We are drowned in data but starved for knowledge,” may have been cited too often, but it remains fundamentally true.
The structuring process that distills information from raw data and refines it until (ideally) a full understanding of the mechanisms underlying the observations is developed can be thought of as a pyramid. Layers of abstraction are added until the grand picture emerges; however, the process ultimately remains rooted in the raw data.
Ferreting out connections between data, and thereby extracting the information that is relevant for a particular purpose, is reminiscent of mining, which identifies ore and selectively removes it from rock so that it can be processed for the desired metal. This approach is therefore commonly referred to as data mining or knowledge discovery.
Learning About the Unexpected
Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” (AI Magazine. 1992;13:57–70). Indeed, exploratory analysis of large data sets that looks for emergent, fortuitous clustering (i.e., serendipity) without preconceived assumptions or quantitative models is the “high realm” of data mining. In this case, the data are, to the extent possible, allowed to speak for themselves, and the analyst has to be prepared for surprises from emergent patterns.
Pattern discovery in large data sets requires a tremendous amount of methodological discipline. Most of the patterns that emerge during an exploratory analysis are meaningless, uninteresting in the context of the search goal, or even outright misleading, because a correlation (even a statistically significant one) does not automatically indicate a causal connection between the factors. Only a thorough analysis of covariates and confounding variables can reveal whether an actual connection exists.
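The point about confounding can be illustrated with a small simulation. The sketch below is not taken from the report. It assumes a hypothetical confounder (labeled `age`) that drives two otherwise unrelated variables (labeled `drug_use` and `event`), and it uses a simple partial correlation, computed from regression residuals, as one possible way to adjust for that confounder. All names and numbers are invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Residuals of ys after a least-squares linear fit on zs."""
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


# Simulated (hypothetical) data: the confounder influences both variables,
# but the two variables have no direct effect on each other.
rng = random.Random(0)
n = 2000
age = [rng.gauss(0, 1) for _ in range(n)]
drug_use = [a + rng.gauss(0, 1) for a in age]
event = [a + rng.gauss(0, 1) for a in age]

# Raw correlation: looks like a real association.
raw = pearson(drug_use, event)

# Partial correlation: correlate what is left after removing the
# linear influence of the confounder from both variables.
partial = pearson(residuals(drug_use, age), residuals(event, age))

print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {partial:.2f}")
```

In this setup the raw correlation is clearly positive, while the partial correlation is close to zero, which is the pattern one would expect when the apparent association is produced entirely by the shared confounder. Real data sets rarely separate this cleanly, and a linear adjustment is only one of several possible approaches.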
Seeking Specific Signatures
The second approach to data mining is modeling according to a preexisting hypothesis. It differs from “simple” information extraction, which aims to identify predefined, specific classes of entities that contain explicit information of interest. In a broader sense, information extraction of any type is referred to as data mining as long as it includes a strong exploratory component.
In many cases, researchers already have concrete ideas about the structure of the data and the behavior of the system described by the data. This allows them to build a defined hypothesis that can be tested through a targeted analysis of the mined data, which searches for specific known (or specifically suspected) patterns. If the comparison shows significant misalignment between the expected and the observed patterns, new hypotheses can be derived from this disparity. In this case, data mining amounts essentially to an iterative process of targeted data extraction and processing.
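One common way to compare an expected pattern with an observed one is a chi-square goodness-of-fit test. The sketch below is an illustration only and is not described in the report. The category counts are invented, and the threshold of 7.815 is the standard tabulated critical value for 3 degrees of freedom at the 5% significance level.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for observed vs. expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Tabulated critical value for 3 degrees of freedom (4 categories - 1)
# at a significance level of 0.05.
CRITICAL_DF3_ALPHA05 = 7.815

# Hypothetical counts per category predicted by the working hypothesis.
expected = [40, 30, 20, 10]

# Two hypothetical sets of mined counts, 100 observations each.
consistent = [50, 30, 15, 5]
divergent = [60, 25, 10, 5]

stat_consistent = chi_square_statistic(consistent, expected)
stat_divergent = chi_square_statistic(divergent, expected)

for label, stat in (("consistent", stat_consistent),
                    ("divergent", stat_divergent)):
    verdict = ("revise hypothesis" if stat > CRITICAL_DF3_ALPHA05
               else "hypothesis not rejected")
    print(f"{label}: chi-square = {stat:.2f} -> {verdict}")
```

The first set of counts stays below the threshold, so the hypothesis is not rejected. The second exceeds it, which corresponds to the "significant misalignment" that would prompt a new hypothesis and another round of targeted extraction.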
In today’s extremely competitive environment of academic and commercial life science applications, it is essential to know as much of the prior art as possible before it can even be determined whether a project can be undertaken and how it should be structured. Published information will both guide and confound any planned efforts, and a better analysis of the prior art will allow the self-directed efforts to be targeted more accurately.
A new Insight Pharma Report examines the emerging role of the various flavors of data mining in translational drug development (i.e., the formal stages of preclinical and clinical investigations) and pharmacovigilance (i.e., the surveillance for potential side effects in the postmarketing stage). We present brief profiles of software and service providers that cater to the pharmaceutical and biotechnology industries, and point to the directions that the use of data mining in translational drug development and postmarketing pharmacovigilance might take during the next decade. We conclude that the use of data mining in these fields of the life science industry is still emerging but has huge potential, which will be increasingly realized during the 2010s.
Data Mining In Drug Development and Translational Medicine, by Hermann A.M. Mucke, PhD, was published by Insight Pharma Reports in July 2009. For more information, visit www.insightpharmareports.com/data_mining