Predictive Analysis Is Data Mining’s Future

While it took human society more than 300,000 years to create 12 exabytes of data (1 exabyte equals 1 billion gigabytes), we are expected to double that amount in the next three years, according to the School of Information Management and Systems at the University of California, Berkeley. [1] In the life sciences realm, Medline alone adds between 1,500 and 3,500 references to its database each day. [2]

Researchers, as well as healthcare practitioners, often feel as though they are drowning in a sea of gathered data that continues to swell. To help the life sciences community sustain the pace of discovery -- and deliver next-generation treatments more quickly and affordably -- the industry is embracing advanced data-mining techniques and technologies.

Data mining refers to “the semiautomatic exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.” [3] It uses statistical analysis, artificial intelligence, and machine learning technologies to identify patterns that could not be found by manual analysis alone.

Early data-mining techniques -- associated with statistical theory and rudimentary artificial intelligence algorithms -- often yielded disappointing results. Several factors were to blame: unnecessarily complex tools, a lack of data-mining knowledge, and too few publicly visible successes. Continued maturation of the technology in recent years, however, has expanded its capabilities and raised its stature as a vital tool in the drug discovery process. New data-mining applications feature expanded analytics, user-friendly interfaces, and powerful algorithms that allow researchers to analyze structured and unstructured data. “With training, data mining can now be done on a personal computer with a number of different commercial and open source data mining software packages.” [4] Today, scientific researchers can mine their data to find new factors, promising new leads and targets, and new medical insights, without requiring a second Ph.D. in data mining.

Progress continues at an amazing pace. In this article, we look at three data-mining advances that are poised to transform life science discovery: predictive analytics; improved unstructured text data mining; and semantic and image data mining.

Predicting Future Events

In its nascent years, data mining was used primarily for unsupervised learning, in which the goal of the analysis is to discover patterns or trends without making assumptions about the structure of the data. [5] Unsupervised learning involves identifying clusters (groups of facts not previously known) and associations (patterns in which one event is in some way linked to another). The earliest examples often dealt with esoteric problems, such as finding clusters of stars or uncovering hidden associations. Data mining was often labeled as experimental and unproven.
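To make the idea of finding clusters concrete, here is a minimal sketch of k-means, one common clustering method. It is an illustration only, not the algorithm of any product mentioned in this article; the function name `kmeans`, the choice of the first k points as starting centroids, and the patient measurements are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Group points into k clusters by alternating two steps:
    assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    # Assumption for this sketch: start from the first k points
    # so that the result is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids

# Hypothetical data: two measurements per patient (not real values).
patients = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0),
            (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centroids = kmeans(patients, k=2)
print(labels)
```

No group labels are supplied in advance; the two groups emerge from the data alone, which is what makes the analysis unsupervised.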

Things have changed dramatically. Today, for example, a researcher might use data mining to find clusters of disease subtypes, in the hope of identifying a subtype to target or of enabling a more precise course of treatment. Attribute-importance algorithms now help researchers select, for instance, the subset of genes most useful for discriminating among types of cancer. [4]
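As a rough illustration of attribute importance, the sketch below ranks attributes by a simple signal-to-noise score: the gap between the two class means divided by the combined spread within each class. This score is a stand-in chosen for brevity, not the method used by any particular product; the function name `rank_attributes`, the gene names, and the expression values are hypothetical.

```python
from statistics import mean, pstdev

def rank_attributes(samples, labels):
    """Rank attributes, most discriminating first, using
    |difference of class means| / (sum of class standard deviations)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = {}
    for name in samples[0]:
        a = [s[name] for s, lab in zip(samples, labels) if lab == classes[0]]
        b = [s[name] for s, lab in zip(samples, labels) if lab == classes[1]]
        # Small constant avoids division by zero when a class has no spread.
        spread = pstdev(a) + pstdev(b) + 1e-9
        scores[name] = abs(mean(a) - mean(b)) / spread
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical expression levels for three genes in four samples.
samples = [
    {"gene_a": 9.1, "gene_b": 4.0, "gene_c": 2.2},
    {"gene_a": 8.7, "gene_b": 6.1, "gene_c": 2.0},
    {"gene_a": 2.3, "gene_b": 5.2, "gene_c": 2.1},
    {"gene_a": 1.9, "gene_b": 4.8, "gene_c": 2.4},
]
labels = ["type_1", "type_1", "type_2", "type_2"]
ranking = rank_attributes(samples, labels)
print(ranking)
```

In this made-up data, gene_a separates the two types cleanly and so ranks first, while gene_b varies widely within each type and ranks last.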

At the most recent Oracle Life Sciences User Group meeting, held last month in Boston in conjunction with the 2005 Bio-IT World Conference + Expo, presentations on data-mining implementations addressed a wide range of topics, including clustering proteins; finding factors in clinical settings associated with better medical outcomes, for example by analyzing physicians’ unstructured text notes; analyzing the mosaic nature of genomes using phylogenetics; and clustering and classifying Medline documents.

“The future of data mining lies in predictive analytics,” declares Forrester Research analyst Lou Agosta in the August 2004 issue of DM Review. [6] Predictive analytics -- or “one-click data mining” through the simplification and automation of the data-mining process -- enables advanced analytics to be applied across the entire life sciences spectrum, from drug discovery through marketing. Researchers, for example, can use predictive analytics to find factors associated with a disease or predict which patients might respond best to an experimental treatment.
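Predicting which patients might respond to a treatment is a supervised learning task: a model is built from past cases with known outcomes and then applied to new cases. The sketch below uses a nearest-centroid classifier, one of the simplest such models, purely as an illustration. The function names `train` and `predict`, the two features (age and a baseline marker), and all values are assumptions for the example, not real clinical data.

```python
import math
from collections import defaultdict

def train(records, outcomes):
    """Build a model holding the average record (centroid) per outcome."""
    grouped = defaultdict(list)
    for record, outcome in zip(records, outcomes):
        grouped[outcome].append(record)
    return {
        outcome: tuple(sum(col) / len(rows) for col in zip(*rows))
        for outcome, rows in grouped.items()
    }

def predict(model, record):
    """Return the outcome whose centroid lies closest to the new record."""
    return min(model, key=lambda outcome: math.dist(record, model[outcome]))

# Hypothetical history: (age, baseline marker level) with known outcomes.
# A real analysis would scale the features first; this sketch does not.
records = [(34, 1.2), (41, 1.0), (67, 3.1), (72, 2.8)]
outcomes = ["responder", "responder", "non_responder", "non_responder"]

model = train(records, outcomes)
print(predict(model, (39, 1.3)))
print(predict(model, (70, 3.0)))
```

The automation described above would wrap steps like these, along with data preparation and model selection, behind a single action.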

In a healthcare delivery environment, providers can use predictive analytics to forecast medical outcomes; find factors associated with the highest-cost patients, doctors, and care facilities; and even detect insurance fraud. In addition, pharmaceutical manufacturers can use the technology to increase the efficiency and effectiveness of their marketing initiatives, predicting which doctors are most likely to prescribe new drugs and identifying patients most likely to be high-value customers.

Predictive analytics isn’t a panacea; “garbage in, garbage out” still applies. But it rests on the concept that some advanced (and highly automated) analysis is better than no analysis.

Text Mining: Liberation At Last

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data -- commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, and Web pages.” [7] The vast majority of pertinent life sciences data is unstructured, residing in the text of scientific papers and journals, Web pages, electronic medical records, e-mails, textual information embedded in structured databases, and product registration documents and filings. The inability to search text for relevant information has led, no doubt, to countless missed opportunities for faster and less expensive drug discovery and development.

Until recently, data-mining applications offered only marginal assistance on this front because they were limited to queries of structured data. Today, emerging text-mining applications are helping life sciences organizations expand their ability to track, assess, and analyze the vast sea of pertinent scientific and clinical data. Oracle’s approach to text mining embeds data- and text-mining capabilities natively in the Oracle Database so that structured data and unstructured data may be mined together using the same tool.

Text-mining applications in life sciences include classification (placing a document in a number of predefined categories); clustering (unsupervised division of patterns into groups); information extraction; and question/answering. [8]
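Of these, classification is the easiest to sketch. The example below assigns a new note to the category of the most similar labeled note, comparing word counts by cosine similarity. It is a toy illustration under stated assumptions, not how Oracle Text or any other product works; the function names, the notes, and the category labels are all hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic words, ignoring punctuation and digits."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(text, labeled_docs):
    """Return the category of the labeled document most similar to text."""
    query = bag_of_words(text)
    best = max(labeled_docs,
               key=lambda item: cosine(query, bag_of_words(item[0])))
    return best[1]

# Hypothetical labeled notes standing in for predefined categories.
labeled_docs = [
    ("patient reports elevated blood pressure and headaches", "hypertension"),
    ("blood glucose poorly controlled, insulin dose adjusted", "diabetes"),
]

print(classify("insulin started after high glucose reading", labeled_docs))
print(classify("headaches and high blood pressure", labeled_docs))
```

Clustering text follows the same pattern without the labels: documents are turned into word-count vectors and grouped by similarity.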

Text mining enables, for the first time, automated integration of unstructured clinical data and structured genomic data. Patient files hold a tremendous amount of data that -- when anonymized and harvested, combined with genomic, proteomic, and phenotypic data, and then data-mined -- will yield amazing new discoveries. For example, a healthcare facility may have extensive information on a local population’s health histories and eating and exercise habits. With text and data mining, it could harvest and analyze this information, potentially identifying the patients at greatest risk for chronic long-term hypertension, cancer, and many other diseases.

Walter Reed Army Medical Center in Washington, D.C., as reported at the recent Oracle Life Sciences User Conference, has been using data and text mining to model patient populations and define chronic disease groups, a process that has improved its ability to help diabetic patients manage the disease and reduce its complications.

Semantic and image mining will be the next frontiers of data mining. These emerging technologies will enable researchers to find semantic meaning hidden in data and documents, share and integrate information with others, and find more valuable insights. As software becomes more aware of the “meaning” of data and concepts, a researcher might, for example, search for a certain concept, such as retinal macular degeneration, and discover related knowledge about a seemingly unrelated cell apoptosis phenomenon.

Combine this semantic analysis with other emerging technologies, such as image mining and searching images for “similar” patterns, and researchers could look at cell histology samples and search vast and barely tapped image databanks for similarities or changes in the structure or colors of other images. The capability of integrating data to extract new information and tapping previously unmined, “unstructured” data (text and images) opens the door to many exciting possibilities for new research discoveries.
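One simple way to score images as “similar” is to compare how their pixel intensities are distributed. The sketch below does this with histogram intersection on tiny grayscale images. It is only an illustration of the idea; real image mining relies on far richer features, and the function names, bin count, and pixel values here are assumptions.

```python
def histogram(image, bins=4):
    """Fraction of pixels falling in each intensity bin (values 0-255)."""
    pixels = [p for row in image for p in row]
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def similarity(img_a, img_b, bins=4):
    """Histogram intersection: 1.0 for identical distributions,
    0.0 when the two images share no intensity bin."""
    hist_a = histogram(img_a, bins)
    hist_b = histogram(img_b, bins)
    return sum(min(x, y) for x, y in zip(hist_a, hist_b))

# Hypothetical 2x2 grayscale images (rows of pixel intensities).
dark = [[10, 20], [30, 40]]
mostly_dark = [[15, 25], [35, 70]]
bright = [[200, 220], [240, 250]]

print(similarity(dark, mostly_dark))
print(similarity(dark, bright))
```

A search over an image databank would compute such a score between a query sample and every stored image, then return the closest matches.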




3 Berry, M.J.A. and Linoff, G. Data Mining Techniques for Marketing, Sales and Customer Support. Wiley, 1997.

4 Berger, A. and Berger, C.R. “Data mining as a tool for research and knowledge development in nursing.” CIN May/June 2004.

5 Stephens, S. and Tamayo, P. “Supervised and unsupervised data mining techniques for life sciences.” Curr Drug Disc June 2003.

6 Agosta, L. “The future of data mining -- predictive analytics.” DM Rev August 2004.

7 Blumberg, R. and Atre, S. “The problem with unstructured data.” DM Rev February 2003.

8 Alonso, O. and Ford, R. Text Mining with Oracle Text. January 2004 (PDF at

Charles R. Berger is senior director of product management, life sciences and data mining, at Oracle Corp. E-mail:

Read more about data mining in Bio-IT World’s SourceBook.


