
Data Mining and the Euredit Project


By Stephen Langdell

Sept 15, 2005 | By definition, practitioners in multidisciplinary fields such as bioinformatics need to keep abreast of the latest research trends in several areas. However, the relevance of research carried out in fields outside their normal specialty may not be immediately obvious.

For instance, a multimillion-dollar, four-year research project on statistical methods for data mining, called Euredit and funded by the European Union, is creating interesting possibilities for many bioinformatics endeavors, including microarray analysis and forecasting trends.

Euredit's original focus was census surveys. Because of human nature, these surveys usually contain missing or incorrect data. European governments funded the research to develop and improve statistical techniques for cleaning such data, either by filling in gaps or by flagging errors. The work was carried out by national statistics offices, universities, and research-based companies, and the cleaned data were then used to better estimate the need for services and programs in communities. The result of this intensive study was a number of new algorithms relevant to any discipline where data mining is important, with bioinformatics perhaps one of the top beneficiaries.

For example, whether one is attempting to identify population subsets with certain characteristics as part of demographic analyses or looking for interesting groups of genes in microarray studies, the application of cluster analysis techniques is key. Cluster analysis functions in commercial packages typically require an n-by-n matrix of similarities (or differences) between the n genes to be held in computer memory. This storage requirement limits the number of genes that can be studied at any one time.
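To see why memory becomes the bottleneck, consider a back-of-the-envelope calculation in Python; the array sizes below are illustrative placeholders rather than figures from the Euredit project:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical microarray: 20,000 genes measured across 60 arrays.
    n_genes, n_arrays = 20_000, 60
    expression = np.random.rand(n_genes, n_arrays)

    # A full n-by-n distance matrix in double precision needs n*n*8 bytes:
    # roughly 3.2 GB for 20,000 genes, far more than a workstation of the
    # time could hold.
    print(f"full matrix: {n_genes * n_genes * 8 / 1e9:.1f} GB")

    # Materialising it, as many packages do internally, would look like this
    # (left commented out because it can exhaust memory):
    # distances = squareform(pdist(expression))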

Similar restrictions on data set size were problematic for demographics researchers applying cluster analysis to European population data. Consequently, methods that avoid storing the n-by-n similarity matrix were developed during the Euredit project, making it possible to study much larger data sets. These developments could equally be applied in bioinformatics.
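The Euredit algorithms themselves are distributed through NAG's components rather than reproduced here, but the general idea can be sketched: centre-based methods such as k-means only ever compare each case with k cluster centres, so memory grows linearly with n rather than quadratically. A minimal illustration, assuming NumPy/SciPy and placeholder data:

    import numpy as np
    from scipy.spatial.distance import cdist

    def kmeans_lowmem(X, k, n_iter=50, seed=0):
        """Plain k-means: each iteration stores an n-by-k distance array,
        never an n-by-n similarity matrix. Illustrative sketch only."""
        rng = np.random.default_rng(seed)
        centres = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            d = cdist(X, centres)          # n-by-k, not n-by-n
            labels = d.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    centres[j] = members.mean(axis=0)
        return labels, centres

    # Placeholder data: 20,000 "genes" with 60 measurements each.
    X = np.random.rand(20_000, 60)
    labels, centres = kmeans_lowmem(X, k=10)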

A similar Euredit development concerned logistic regression techniques. In bioinformatics, logistic regression is used to classify data, for example assigning patients to risk levels or performing amino acid coding in DNA analyses. Here also, data sets are often too large to be analyzed with traditional algorithms, which makes for challenging research problems. To address this, regression models with out-of-core optimization, sometimes called data chunking, were developed during the Euredit project, dispensing with the need to hold the entire data set in computer memory.
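As an illustration of the chunking idea (not NAG's implementation), a logistic regression can be fitted by gradient descent while reading the data one block at a time, so that only a single block is ever in memory. The chunk generator, dimensions, and learning rate below are hypothetical:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def chunked_logistic_fit(chunk_iter, n_features, lr=0.1, n_epochs=5):
        """Fit a logistic regression by gradient descent, one chunk at a time,
        so only a single chunk is ever held in memory (illustrative sketch)."""
        w, b = np.zeros(n_features), 0.0
        for _ in range(n_epochs):
            for X, y in chunk_iter():
                p = sigmoid(X @ w + b)
                grad_w = X.T @ (p - y) / len(y)
                grad_b = (p - y).mean()
                w -= lr * grad_w
                b -= lr * grad_b
        return w, b

    # Hypothetical chunk source: in a real out-of-core setting each chunk
    # would be streamed from disk or a database, not generated on the fly.
    def make_chunks(n_chunks=20, chunk_size=5_000, n_features=8, seed=0):
        rng = np.random.default_rng(seed)
        true_w = rng.normal(size=n_features)
        def chunk_iter():
            for _ in range(n_chunks):
                X = rng.normal(size=(chunk_size, n_features))
                y = (X @ true_w > 0).astype(float)
                yield X, y
        return chunk_iter

    w, b = chunked_logistic_fit(make_chunks(), n_features=8)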

Another example of the new data-mining techniques developed during the Euredit project that can be applied in bioinformatics research is the set of methods for identifying unusual cases hidden in data, known in data-mining terms as outliers. After four years of studying various algorithms for handling outliers, the project produced methods that identify outliers in both categorical and continuous data. The algorithms were designed to handle very large data sets and to yield more accurate results than earlier approaches. These new data-cleaning algorithms have yet to be applied extensively in bioinformatics research.
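The Euredit methods are more sophisticated than anything that fits in a few lines, but two simple stand-ins show the flavour of outlier detection in continuous and categorical data respectively; the thresholds and data below are illustrative only:

    import numpy as np

    def continuous_outliers(x, threshold=3.5):
        """Flag values far from the median in robust (MAD-based) z-score
        terms. A simple stand-in, not a Euredit algorithm."""
        med = np.median(x)
        mad = np.median(np.abs(x - med)) or 1e-12
        robust_z = 0.6745 * (x - med) / mad
        return np.abs(robust_z) > threshold

    def categorical_outliers(values, min_freq=0.01):
        """Flag categories whose relative frequency falls below min_freq."""
        values = np.asarray(values)
        cats, counts = np.unique(values, return_counts=True)
        rare = set(cats[counts / len(values) < min_freq])
        return np.array([v in rare for v in values])

    # Placeholder data: heavy-tailed expression values and a rare category.
    expression = np.random.standard_t(df=3, size=10_000)
    blood = np.random.choice(["A", "B", "AB", "O"],
                             p=[0.4, 0.1, 0.005, 0.495], size=10_000)
    cont_flags = continuous_outliers(expression)
    cat_flags = categorical_outliers(blood)     # flags the rare "AB" rows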

Until Euredit, decision trees - methods used to discover diagnostic rules (i.e., rules in human-readable form) in data - were very susceptible to outliers. To counter this, Euredit developed a regression tree that is robust with respect to outliers. It differs from other regression trees by automatically weighting the data at each node so that outlier effects are either removed or minimized.
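A rough sketch of the weighting idea (a single-split stand-in, not the Euredit regression tree): cases with large robust residuals are down-weighted before the split is chosen, so a handful of gross outliers cannot drag the split point around.

    import numpy as np

    def huber_weights(residuals, k=1.345):
        """Huber-style weights: 1 inside +/- k robust standard deviations,
        tapering off outside."""
        mad = np.median(np.abs(residuals - np.median(residuals))) or 1e-12
        scaled = np.abs(residuals) / (1.4826 * mad)
        return np.where(scaled <= k, 1.0, k / scaled)

    def robust_split(x, y):
        """Choose the split on one predictor that minimises the weighted
        residual sum of squares, with outliers down-weighted."""
        w = huber_weights(y - np.median(y))
        best_sse, best_cut = np.inf, None
        for cut in np.unique(x)[:-1]:
            sse = 0.0
            for mask in (x <= cut, x > cut):
                mu = np.average(y[mask], weights=w[mask])
                sse += np.sum(w[mask] * (y[mask] - mu) ** 2)
            if sse < best_sse:
                best_sse, best_cut = sse, cut
        return best_cut

    # Placeholder data with a step at x = 0.5 plus a few gross outliers.
    x = np.random.rand(500)
    y = np.where(x > 0.5, 2.0, 0.0) + np.random.normal(scale=0.1, size=500)
    y[:5] += 50
    print(robust_split(x, y))   # split stays near 0.5 despite the outliers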

At the conclusion of the Euredit project, the Numerical Algorithms Group (NAG) undertook to disseminate Euredit's findings. To that end, NAG created the first commercially available data-mining application toolkit that uses the new algorithms. These algorithms (along with many others) are provided as components that can be easily incorporated into a user's existing applications.

To learn more about the functionality in NAG's new data-mining components, please see www.nag.com/dr.

 

Stephen Langdell, Ph.D., is a member of the Numerical Algorithms Group, a worldwide organization dedicated to developing quality mathematical and statistical components and 3-D visualization software. E-mail: stephen.langdell@nag.co.uk.

 

 

