By Eric Fairfield
May 7, 2002 | Before the Revolutionary War, the American colonies had different cultures and languages. They traded only goods that were easy to trade. After the war, the newly formed states adopted common definitions and forged collaborations, resulting in a huge payoff.
In the three market segments where biologists and IT professionals have primarily interacted to date, there has been little need for common definitions. Hardware suppliers and ISPs can transport files while database makers can store information, all without being fluent in biological jargon. When calculating properties of biological data — cell size, protein structure, or spot intensities — specialty software makers speak physics, not biology.
But the next generation of bio-IT products will require some shared definitions between IT and biology — simplified cross-disciplinary languages. There must be an accurate IT equivalent to: "Predict the aftermarket toxicity of these six potential drugs." At present, "six" is the only word that translates easily between these groups. Similarly, "class inheritance" and "instantiation" have little meaning to most biologists. A DNA sequence is not really a "string" in the computer sense but is a particular one-dimensional projection of a four-dimensional object (the organism and its development). The coding of this string is very nonlinear, includes a number of languages, and implicitly embeds physical laws. So, though the words of the different fields may be English, the definitions differ strongly, even within a field.
Defining the Value
Microarray studies to monitor gene expression provide an object lesson. Microarrays are capable of generating tens of thousands of observations from a single chip. IT specialists store the vast collections of annotated quantified array images — spots — and associated information in relational databases. But determining whether the spot intensities are accurate and precise requires additional languages — genetics, enzymology, biophysics, and spectroscopy, as well as mathematics and statistics.
In some cases, technological advances can lessen the translation burden. For example, as the quality of microarray production becomes more consistent, users need to understand less about how the arrays work, and the burden of translating into and out of biophysics lessens. But these improvements have, in turn, revealed inconsistencies in the biological part of the experiments. Traditionally, biologists have given spot intensities to IT folk and hoped for the best, with mixed results. Even slight variations in the time at which cells are harvested can cause significant fluctuations in gene expression, compromising the analysis. But because of the way the databases are structured, harvesting differences are often not recorded, so the statistical analysis and the endpoint definitions built on those data are inadequate.
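As a rough illustration of the harvesting problem, the sketch below stores the harvest time alongside the spot intensities so that an analyst can check whether two samples are comparable before comparing expression levels. The class name, field names, helper functions, and the 15-minute tolerance are all hypothetical choices made for this example; they do not describe any particular database or product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ArraySample:
    """One hybridized sample; every field name here is a hypothetical choice."""
    sample_id: str
    harvest_time: datetime   # when the cells were harvested
    spot_intensities: dict   # gene name -> quantified spot intensity


def harvest_gap(a: ArraySample, b: ArraySample) -> timedelta:
    """Absolute difference between the two harvest times."""
    return abs(a.harvest_time - b.harvest_time)


def comparable(a: ArraySample, b: ArraySample,
               tolerance: timedelta = timedelta(minutes=15)) -> bool:
    """True if the samples were harvested within `tolerance` of each other.

    The 15-minute default is an arbitrary placeholder, not a biological rule.
    """
    return harvest_gap(a, b) <= tolerance
```

The point of the sketch is only that the harvest time travels with the intensities; what tolerance is acceptable is exactly the kind of definition the biologists and IT specialists would have to agree on.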
How might IT bring more value to biology? One tack is to create a simplified two-way language specific to those working at this interface. Such languages can be created during a project, within a company, or across an industry. For example, dynamic object-oriented databases possess a structure that more accurately reflects the biology than relational databases can. The key to success is making sure that the definitions are well thought out and provide a stable language for both the biologists and IT specialists working with them.
In many key areas such as microarrays or drug discovery, the language of the overall project must include concepts not expressible in any of the original languages or in a simplified language. On a practical level, experimental errors may not be treated properly because they are not expressible in the original or simplified languages. In drug discovery, evaluating the relative merits of two drug candidates involves knowledge of manufacturing, profit margins, intellectual property, and chemistry, not merely biology or IT.
Spelling It Out
In our work, we have created an augmented language that includes the drug discovery languages: pathology, cell biology, computer science, transcriptomics, databases, Bayesian statistics, and high-dimensional mathematics. The heart of this new language is mathematics; the body is the other drug discovery languages. The mathematics has provided a compact, consistent, testable way of expressing the other languages.
Despite widely different cultures, biology and informatics will learn to speak a common language because the drug discovery forces driving them together, profit margins and time to market, are huge.
It is easy to create new words for new concepts; biologists and IT folk do it constantly. Effective simplified languages and the larger language, however, require accurate definitions on which all agree. This is much harder. Even with the help of experts in each field, the definition of "genome" took one and a half days to create. Precise definitions and constant translation are critical; with them, difficult questions become clearly and quantitatively expressible. This new language could address a major difficulty in microarray studies: propagation of error. Key questions about the data structure (How "different" are two experiments or two sets of experiments? Or, in mathematical terms, "What is the distance between experiments?") should be resolved so that a valid analysis can be made.
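To make the two mathematical questions concrete, here is a minimal sketch of one way they could be written down. It assumes independent, positive intensity measurements with known standard deviations, uses first-order (linear) error propagation for a log2 ratio, and takes Euclidean distance over shared genes as the "distance between experiments." Both the function names and the choice of Euclidean distance are assumptions made for illustration; the article does not say which definitions were actually adopted.

```python
import math


def log_ratio_with_error(x, sx, y, sy):
    """Return log2(x / y) and its standard error by first-order propagation.

    Assumes x and y are positive, independent measurements with
    standard deviations sx and sy.
    """
    ratio = math.log2(x / y)
    # d/dx log2(x/y) = 1 / (x ln 2) and d/dy log2(x/y) = -1 / (y ln 2)
    sigma = math.sqrt((sx / x) ** 2 + (sy / y) ** 2) / math.log(2)
    return ratio, sigma


def experiment_distance(a, b):
    """Euclidean distance between two experiments over their shared genes.

    a, b: dicts mapping gene name -> log2 expression value.
    Genes present in only one experiment are ignored (an assumption).
    """
    shared = sorted(a.keys() & b.keys())
    if not shared:
        raise ValueError("experiments share no genes")
    return math.sqrt(sum((a[g] - b[g]) ** 2 for g in shared))
```

Even this small example forces decisions that need agreed definitions: whether errors are independent, how genes missing from one experiment are handled, and whether Euclidean distance is the right notion of "different" at all.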
Creating a common language for biologists and information technologists involves discussing and writing (and rewriting) a number of key definitions to tackle immediately soluble problems. The definitions of this new two-way language are often straightforward and stable over time. For definitions that apply to more than two disciplines, a translator (a person fluent in each discipline) can help create the original definitions. Eventually, much of the translation could become software.
With simplified or larger languages, it becomes easier to avoid false starts. Over a two-year project, learning the simplified language might cost one week of training, yet could save six months of project time per person. People are more likely to accept this approach because they know that the terms are unclear and create frustration, especially if unclear definitions mean that they are not all working on the same problem.
Bio and IT can help each other substantially by transforming from colonies to states.
Dr. Eric Fairfield is CEO of Fairfield Enterprises, which is based in New Mexico and specializes in translating the languages of drug discovery. For the past four years, he has focused on extracting value for drug discovery from microarray data. He can be reached at email@example.com.
Horizons is a new section debuting this month in Bio-IT World. Horizons will offer an eclectic collection of stories, profiles, commentaries, and conversations that paints a broad canvas of thoughts, trends, and technologies that are poised to define the bio-IT landscape.