COMPANY TO WATCH
By Michael Gross
June 15, 2003 | The evolution of computers has progressed exponentially over the past three decades, as Gordon Moore so famously predicted. And yet, the complexity of computational problems seems to keep pace. Indeed, many problems are difficult or impossible to handle because their complexity rises (at least) exponentially with the number of variables. Thus, a problem that contains a hundred or more variables is essentially intractable on any timescale less than the age of the universe.
In the long term, the development of quantum computers may allow scientists to overcome this exponential complexity. But for now, the only hope lies in computer programs that use Charles Darwin's principles of mutation and selection to cut complex problems down to size. This is the tack taken by the Welsh company Aber Genomic Computing (AGC; www.abergc.com), a spinoff from the University of Wales at Aberystwyth.
A decade ago, the major problems in computation arose from complex manmade systems. An example is train timetables, where a large number of parameters are constrained by numerous connections and inter-relationships. The scientific prototype of such challenges is the traveling salesman problem — finding the shortest route to link a number of cities. Even for this simple challenge, the compute time rises exponentially with the number of cities, soon becoming impractical.
Nowadays, the biggest computational problem comes from nature, via genome sequencing projects, which have yielded a flood of data and related proteomic data. For example, if a scientist studies a number of genes and asks which of them are pairwise related to each other (perhaps in terms of evolutionary similarity or synchronized activity), the number of possible solutions quickly becomes astronomical — long before the researcher has even reached the number of genes in the simplest bacterial genome.
The ABCs of AGC
So, fighting fire with fire, can one use evolutionary computing to beat the complexity that evolution has created in the genomes? AGC has developed a data-mining platform that does just that.
AGC was founded in December 2000 on the research expertise of two Aberystwyth professors: Jem Rowland from the Department of Computer Science, and Douglas Kell, who moved recently to University of Manchester Institute for Science and Technology (UMIST). Both have collaborated closely in the field of applying machine learning to the increasingly complex problems of analyzing biological data. They hired biotech executive Stuart Webb as CEO with special responsibility for sales and marketing. Rowland now serves as a nonexecutive director, while Kell continues to provide scientific input to the company and promote its activities. Kell is also a co-founder of Aber Instruments Ltd., a leading manufacturer of biomass monitoring equipment for industrial fermentation.
Initial financial support came from the Wales Spinout Programme (www.spinoutwales.co.uk) in 2001. The Aberystwyth Challenge Fund (www.aber.ac.uk/challenge) supplied further funding in 2002. AGC is actively seeking further investment to recruit key personnel and boost product development.
In September 2001, AGC launched its first major product, the genomic computing program gmax-bio. Gmax-bio evolved from a tool originally created for direct marketing (gmax-dm), applying the same mathematical approach to problems in life sciences and social sciences.
Gmax is a data-mining and analysis tool that uses genetic programming methods to encourage the spontaneous emergence of natural models. Natural models capture the unsuspected, latent structure and relationships between variables that conventional methods fail to reveal. "Natural models are robust and have unparalleled predictive power," says mathematician Michael Sinclair, gmax architect and original programmer.
Essentially, the package aims to create simple rules (algorithms) using Darwinian evolution among a population of possible rules. These algorithms combine the input data in specific ways to create output data. A simple example of such a rule would be: "add all input data, write out sum," that is, y = x1 + x2 ... + xn.
The Simple Facts
While the program can handle complex algorithms, it favors the simplest solution. The user can allow or disallow specific kinds of mathematical relations for inclusion in the algorithms. In a typical program run, the program would start from a first generation of up to 10,000 algorithms and test them on the available data sets. Specifically, the program is designed to minimize a measure of entropy, so it goes for the simplest set of rules that fits the data.
While most of the first-generation algorithms produced at random will not do anything useful to the data, the program will identify those that perform best and breed a new generation from their combinations. This process is reiterated, and the user can watch the progress the program makes along its decision trees (see "Tree of Knowledge"). The end result is a simple equation that users can apply to their input parameters to generate predictions.
For the skeptics, Sinclair has some examples that can serve as control experiments. "We can feed the program with a few rectangular triangles," he explains, "and it will discover Pythagoras' theorem (a2 + b2 = c2) in a split second. If we disallow the square function, it will still get there, but it takes a little longer before it figures out a*a + b*b = c*c."
In the life sciences, gmax-bio can perform data mining in complex data sets including genomic sequences and microarray data, and produce algorithms that will provide simple, predictive answers. For example, researchers at Glasgow and Aberystwyth, including Kell, have used this approach to develop a method for rapid multisample screening for the presence of bacterial spores. Analyzing the complex data sets resulting from spectroscopic analysis of pyrolysed cells, the evolutionary algorithm developed simple rules that linked the presence of spores to spectroscopic parameters.
This approach arrives at a solution quickly and easily. With other methods, neural networks tend to be "poor at explaining how they arrive at their answer," Kell says, while "traditional rule-based systems often perform poorly on complex data sets."
Webb has attracted a diverse range of customers from food science to pharmaceutical, biotechnology, and instrumentation companies. As for new directions, he says: "AGC's data-mining approaches attracted the, for once, welcome attention of police and security agencies, on the national and international level. They are urgently seeking superior tools to study criminal behavior."
Michael Gross is a science writer in residence at the School of Crystallography, Birkbeck College, University of London. See www.proseandpassion.com for details about his latest book, Light and Life.