By Jeff Augen
November 15, 2003
For centuries, doctors have treated patients with a combination of knowledge, personal experience, and clinical observation. During the past 50 years, an accelerated understanding of sickness and health at the molecular level has dramatically advanced healthcare by providing new and powerful diagnostics that extend the art of clinical observation.
In most sciences, as the body of knowledge expands, so does the complexity of the overall picture. The life sciences are no exception. In the early days of the genomic revolution, little was known about the relationships between genes and protein sequences. Much of the puzzle has been assembled, but much remains scattered. Unfortunately, many of the most recent discoveries have accelerated the increase in complexity, and we still have a long way to go.
Some of the most recent discoveries have also been the most surprising. For example, only three years ago the one-gene-one-protein view of molecular biology was still intact, and it wasn't yet apparent that both DNA strands contain protein-coding regions running in opposite directions. There have been other fundamental realizations as well, including:
· A complex regulatory network based on RNA interference
· The role of translational variation and infidelity as a control point for protein structure
· The presence of many of a cell's most important messages in very low copy counts
· The profound impact that very small variations can have on health and disease
· The multifunctional, context-dependent nature of protein activity
· A surprising abundance of single nucleotide polymorphisms (SNPs)
Other discoveries include prions (misfolded proteins capable of transmitting disease) and autocatalytic RNA molecules that can splice themselves into new forms. Each of these findings has added complexity to the overall picture, while clarifying previous observations.
Technological advances have enabled many of these discoveries and made possible new approaches to clinical medicine. Steady increases in the density, and decreases in the price, of microarrays and other mRNA-profiling technologies have been significant. So have the appearance of several promising high-throughput gene-sequencing technologies and the availability of fully digital diagnostic imaging systems such as CAT, MRI, PET, X-ray, and diagnostic ultrasound.
These lab-instrument advances have been accompanied by several key trends in information technology, such as ultra-low-latency networks, computer grids, advanced data-mining and pattern-discovery algorithms, and ultra-dense inexpensive storage.
The past several years have witnessed unprecedented growth in the amount of available information about metabolic processes and the gene-expression patterns that control them. This information is having profound effects on the way patients are stratified and categorized, and diseases that originally seemed phenotypically homogeneous are turning out to be very complex at the genetic level.
Unfortunately, this information explosion has barely affected the treatment of individual patients because there is no infrastructure yet to deliver the data, and virtually no end-user tools to make it accessible to researchers and clinicians. Even if the tools and infrastructure were available, there are few standards for storing the information, and no agreed-upon mechanisms for sharing data between medical institutions and research organizations.
Fortunately, the expanding body of biological knowledge and the technology driving that knowledge suggest both a data-management framework and a method for its growth.
Data Class Struggle
The taxonomy for this data-management framework should be predicated on two general classes of information:
· Industrial-scale, high-throughput observation
· Hypothesis-driven scientific research
Both categories are essential because many different sources of information are required to build a complete biochemical picture.
The first major class of information, high-throughput observation, involves managing enormous amounts of data. Included are databases composed of gene and protein sequences, 3-D protein structures, mRNA expression profiles (microarray data), demographic data, and clinical medical records, including medical images. Much of this information is generated using industrialized processes for gene sequencing, mRNA profiling, and X-ray crystallography; the majority of the clinical information is available as a consequence of medical record keeping.
The second major class, hypothesis-driven scientific research, is composed of the constantly growing body of basic life science research results.
The differences between the two classes are subtle but important. Industrialized processes are designed to systematically document a complete system (e.g., whole genome sequences). Conversely, quantification of the up- and down-regulation of a small population of messages in response to a specific treatment regimen would fall under hypothesis-driven research. Such experiments often generate a large amount of associated data and analysis; organizing and storing this information so it can be easily found later represents one of the greatest challenges in modern biology.
Implementing a management framework that supports the needs of both data classes is logistically complex for several reasons:
· The data sources are heterogeneous in the sense that they include many different formats, structures, and data representations.
· Constructing a single interface into such a system is complex.
· Without some type of abstraction layer, individual researchers would need to become facile with several different query systems.
· The data sources that make up such an infrastructure are likely to be very large and geographically dispersed.
· The size and growth rate of these data sources make duplication and distribution an unattractive solution.
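The abstraction layer mentioned above can be sketched as a simple adapter pattern: each heterogeneous source is wrapped behind one common interface, so a researcher issues a single query instead of learning several native query systems. The source names and fields below are hypothetical illustrations, not part of any real system:

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Common interface that hides each source's native query system."""
    @abstractmethod
    def find_records(self, criteria: dict) -> list:
        ...

class SequenceArchive(DataSource):
    """Adapter for a hypothetical gene-sequence archive."""
    def find_records(self, criteria):
        # In practice this would translate `criteria` into the archive's
        # native query language and execute it remotely.
        return [{"id": "P001", "source": "sequence_archive"}]

class ClinicalRecords(DataSource):
    """Adapter for a hypothetical clinical-records database."""
    def find_records(self, criteria):
        return [{"id": "P001", "source": "clinical_records"},
                {"id": "P002", "source": "clinical_records"}]

def unified_search(sources, criteria):
    """One call fans out to every source; the researcher sees a single
    result set rather than several query systems."""
    results = []
    for source in sources:
        results.extend(source.find_records(criteria))
    return results
```

Adding a new data source then means writing one adapter, without touching the researcher-facing interface.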
The distribution of clinical information across a widely available infrastructure presents security challenges, too. For example, the system must be designed to defend against devious query strategies. A query that retrieves all patients over a certain age, with a certain illness, and living in a particular ZIP code might refer to a single individual. Anonymity in such a scenario could be compromised regardless of the level of identity protection assigned to the record.
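One common defense against such devious queries is a minimum-result-size threshold: any non-empty result set smaller than some value k is suppressed, since it could point to an identifiable individual. A minimal sketch (the threshold value is illustrative; a real policy would set it):

```python
MIN_RESULT_SIZE = 5  # hypothetical k; any smaller non-empty result is refused

def guarded_query(records, predicate, k=MIN_RESULT_SIZE):
    """Run a query, but suppress any non-empty result set smaller than k,
    since such a set could single out an identifiable patient."""
    matches = [r for r in records if predicate(r)]
    if 0 < len(matches) < k:
        raise PermissionError(
            f"query matched only {len(matches)} record(s); "
            "results suppressed to protect anonymity")
    return matches
```

A query for "patients over 80 in ZIP 02118" that matches one record is refused, while a broader query over the same data succeeds.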
The infrastructure must also support computationally intense applications. Examples include such diverse areas as medical image comparison, microarray analysis, and data mining of clinical records. Since various parts of the infrastructure have different computational requirements, the most effective solution is a grid design that can leverage the power of many machines, regardless of physical location.
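The fan-out/gather pattern behind such a grid can be sketched with a local worker pool standing in for remote machines (the image-comparison task here is a toy pixel-difference score, purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def compare_images(pair):
    """Stand-in for a compute-heavy task such as medical image comparison;
    here, a toy pixel-by-pixel difference score between two images."""
    a, b = pair
    return sum(abs(x - y) for x, y in zip(a, b))

def run_on_grid(pairs, workers=4):
    """Fan tasks out to a pool of workers and gather the results in order.
    On a real grid the pool would dispatch to machines anywhere on the
    network; a local thread pool shows the same pattern."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compare_images, pairs))
```

Because each comparison is independent, the work scales out to however many machines the grid makes available, regardless of their physical location.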
Many individual data sources are themselves distributed and must be combined into a single virtual view. A common example involves the virtualization of dispersed clinical databases that reside at multiple medical institutions. A researcher who needs to search these data sources must be presented with a single view that makes the geographical boundaries transparent.
Building this integrated infrastructure may be accomplished in three phases:
Phase I: Define and deploy a fully paperless medical record. Despite the well-meaning efforts of many skilled architects and database experts, electronic health record (EHR) initiatives haven't moved nearly fast enough (see "Integrated EHR: The Final Frontier").
The financial industry, for example, has already adopted a well-structured set of standards that allows for recording, retrieving, and disseminating both financial information and money. Without these standards, it would be impossible to retrieve cash from an ATM anywhere you like. The medical community needs a similar set of standards for exchanging clinical and genomic information.
Phase II: Build electronic links between and among institutions. These electronic links are best described in terms of a communications and data-processing grid. Then it will be possible to share large volumes of demographic and clinical information.
While information sharing has always been a core value in the research community, its virtues haven't been as obvious on the treatment side. However, the ability to search large volumes of clinical information to compare treatment outcomes, demographic and clinical histories, and current disease parameters has the same potential to add value to the treatment side as it has on the research side. Fully realizing this value will require new pattern-discovery and related data-mining tools.
Phase III: Link clinical and research databases. A great many information resources need to be included. All areas of biomedical research would be appropriate because they are all inter-related. Many of these data sources include information about basic cellular processes in the form of gene sequences and expression profiles, protein interactions, known polymorphisms, protein structures, the results of ligand-binding experiments, and a variety of other biochemical and genetic details.
At first glance, it might appear that most of this information has limited clinical significance. However, that view is based on today's treatment regimens and patient stratifications, rather than on tomorrow's more comprehensive metabolic-level views.
The scope and complexity of such systems, coupled with a basic lack of standards, necessitates that deployments proceed slowly. Nevertheless, many academic research medical centers have begun to construct their own systems for delivering information-based medicine. Today, these systems are important differentiators. As they become ubiquitous, collaboration among institutions will be the norm.
Indeed, connectivity among multiple institutions will be crucial to success. Even today's largest medical centers cannot treat a broad enough patient population to make their own records the sole basis for information-based medicine. Such treatments must cull from millions of records spanning hundreds of disease categories. Patients must come from diverse backgrounds that span a broad range of genotypes, and for each genotype the system must include enough records to generate statistically significant information.
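The last point can be made concrete with the standard two-proportion sample-size formula: to detect a given difference in treatment-response rates within a single genotype, each treatment arm needs roughly the following number of records. This is a sketch; the 5%-alpha/80%-power constants are conventional defaults, not figures from the article:

```python
from math import ceil

def min_records_per_genotype(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate per-arm sample size needed to detect the difference
    between two response rates p1 and p2
    (defaults: 5% two-sided alpha, 80% power)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2)

# Detecting a 50% vs. 60% response rate requires roughly 385 records per
# arm -- for *each* genotype subgroup, which is why a single institution's
# record pool falls short and multi-institution connectivity is essential.
```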
Redefining Health and Disease
The ability to mine numerous databases for clinical and molecular-level information will be invaluable in creating new patient stratifications. Today, most diseases are described phenotypically when, in reality, they are complex polygenic disorders with many subclasses. Psychiatric illnesses, such as schizophrenia and depression, should be thought of not as diseases but as phenotypes displayed by patients with many different gene-expression profiles, genome sequences, and medical histories.
A specific combination of these parameters will someday form the basis of a new classification scheme that will dramatically alter the meaning of health and disease, with far-reaching implications for drug discovery and treatment. The first steps have already been taken, and it's time to begin building a shared infrastructure that promotes the practice of information-based medicine.
Jeff Augen is president of TurboWorx, in Boston. He may be reached at firstname.lastname@example.org.
ILLUSTRATION BY DAVID WINK