March 10, 2003 | THE RESPONSE to the question of standards is always, "They're a good thing in the right situation." Informatics in life sciences research and development would appear to be that right situation (lots of tools combined in many different ways, manipulating similar types of data). Let's take a look under the hood to see when standards succeed and when they fail.
The problems created by the lack of consistency and standards in R&D informatics are particularly acute. Redundancy and inefficiency are rampant. The errors introduced by constantly massaging data into the appropriate format, and by maintaining applications as formats change, are a bioinformatician's nightmare. The result is a significant drag on the promise of the genomics revolution, slowing the pace of progress in research and drug development.
When it comes to standards efforts, the real issue is a pragmatic one. Some things clearly should have standards, such as electricity, weights and measures, and microarray data. But playing a meaningful role in any standards effort typically involves considerable work over a prolonged period, perhaps years. Is it worthwhile? The answer is not so straightforward.
The growing list of standards-focused initiatives means that even relatively small organizations may encounter several standards that are more or less relevant. Their potential impact on long-term strategy and operations cannot be ignored. The first task is to assess whether the possible results of each standards effort will affect the organization, positively or negatively, enough to justify further action. If the impact is potentially significant, the next task is to estimate the effort's likelihood of success. History suggests the majority will be unsuccessful. Finally, if an effort looks likely to succeed, what level of participation makes the most sense? The range of possibilities is broad.
The new microarray gene expression (MAGE) standards serve as a case in point. The prospect for widespread adoption was enhanced considerably when several major scientific journals, including Nature, The Lancet, and Cell, recently announced that for publication of microarray experiments, they would require MIAME (minimum information about a microarray experiment)-compliant data to be accessible from the Internet or deposited in public databases.
Publications and Standards
The establishment of a central repository that now contains all publicly available DNA sequences exemplifies the role of research journals concerning data standards and interchange. Despite widespread agreement that a central repository would benefit the entire scientific community, there was resistance. Standards issues were part of it, along with concerns about the effort required to comply. Journals were pushed to publish papers only if the sequence data in the papers were submitted to GenBank. Recognizing their powerful role, journal publishers moved cautiously. Concerned that adopting the wrong standard too early might result in one that was too difficult to meet or didn't contain enough information, journals adopted this requirement only after the community reached a consensus.
In this way, the industry has taken another major step toward standardizing the storage and exchange of microarray studies, benefiting a broad range of organizations. But a smaller group will be negatively affected. Consider that the MAGE initiative has several parts: MIAME, the new information standard; MAGE-OM (Object Model), which provides a data exchange model; MAGE-ML, an XML-based data format; and MAGE-STK (Software Toolkit), a collection of packages that act as converters between MAGE-OM and MAGE-ML.
With hindsight, it is easier (though still somewhat early) to project the local impact of the MAGE initiative's success. Early on, when there was the dreaded alphabet soup — GATC, GEML, MAML, and so on — of potential standards for microarray data and exchange, it was not so easy. Every producer, processor, or consumer of microarray data potentially had a stake in these activities. Some who were involved probably think their time was wasted, others probably consider it a good investment, and perhaps some of the uninvolved groups regret their absence.
In each effort, many questions needed to be addressed, such as: Whose data and what data (internal or public) was the standard geared toward? Which gene expression technologies might the standard ultimately address? Less straightforward questions included: Does the adopted standard address issues relevant to an organization such that it would end up adopting or opposing it? And what infrastructure changes are required to achieve compliance (at least at a cursory level)?
To Get Involved... or Not
Once an organization is convinced that a standards effort is relevant, the next issue is the likelihood of adoption. Software and data standards have long been of interest to biologists and bioinformaticians. Such standards are both intellectually and practically attractive, but few useful examples have emerged through official standardization efforts. Most of the standards widely used in bioinformatics are de facto ones, established by successful programs and databases. Familiar examples from the sequence world include FASTA, phred/phrap, GenBank, Swiss-Prot, and UniGene.
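One reason de facto formats such as FASTA spread so readily is that very little code is needed to read them. The following is a minimal sketch in Python, not a reference implementation. The function name parse_fasta and the sample record names are hypothetical, and it assumes the simplest form of the format: a header line beginning with ">" followed by one or more lines of sequence. Real-world files vary in ways this sketch ignores.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its full sequence.

    Assumption: the identifier is the first whitespace-delimited
    token after the ">" on a header line.
    """
    chunks: Dict[str, List[str]] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Sequence data may wrap across several lines.
            chunks[current_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Hypothetical sample data, for illustration only.
sample = [">seq1 demo record", "ACGT", "TTGA", "", ">seq2", "GGCC"]
records = parse_fasta(sample)
```

A format this easy to consume lowers the cost of adoption for every tool author, which is part of why it became a standard without any committee behind it.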
The low success rate has many causes. In some cases, what seems simple at a high level becomes incredibly complex in the details, and the pressure is then on to explore every one of them. That lengthens the time needed to reach a consensus and adds to the complexity of the resulting standard, both of which are significant barriers to success. The size and composition of the group of active participants are also key factors. In the end, as with MAGE, research journals can play a critical role in the rate and degree of adoption (see "Publications and Standards").
If an organization is likely to be significantly affected by a standards effort, it may be inclined to get involved even if the chances for success are slim. Or it may choose to stand on the sidelines despite optimism concerning the outcome. The exchange of ideas, integration in a community with similar interests, and the work product may be sufficiently valuable to warrant the effort, regardless of the prospects. Alternatively, the organization may trust the parties involved and have no particular stake in the process, in which case minimal involvement may be most appropriate. If all of the organization's main competitors are involved, participating may be its only realistic option.
The informatics challenges of biotechs, pharmas, and research institutions suggest that standards in many discrete areas are worth pursuing. But much will continue to be accomplished without them, given the challenges to effective adoption. In the end, we must remain realistic about the role standards will play in addressing many of the big integration and interoperability challenges — and direct our resources accordingly.
Richard Dweck is the founder and CEO of 3rd Millennium, a bioinformatics consultancy based in Cambridge, Mass. He can be reached at firstname.lastname@example.org.