By Malorye A. Branca
April 7, 2002 | Data standards for microarrays may be inching toward acceptance, or at least data exchange may become easier, thanks to independent efforts by some key groups, including the Microarray Gene Expression Data Group (MGED), Affymetrix and the University of California at Santa Cruz’s Golden Path Web site.
“I’m close to 100 percent optimistic that this [adoption of standards] will come about,” says Alvis Brazma, president of the MGED, which will be spearheading an effort aimed at universal standards. “To achieve this, we will probably need the stick as well as the carrot,” Brazma says, “but that is coming.”
The amount of gene expression data generated by DNA microarrays is steadily increasing, but because of the diverse ways researchers collect, store, and publish that information, its use is greatly limited. A universal format would allow the easy exchange, pooling, and comparison of results.
The MGED and other champions of standards are uniting in support of the Object Management Group’s proposed standard, dubbed MAGE (MicroArray and Gene Expression), which was announced at the MGED’s February meeting. MAGE stipulates a markup language -- MAGE-ML -- and an object model, MAGE-OM. Researchers would be able to share MAGE-compatible data seamlessly. Along with the MAGE format comes the MIAME (Minimum Information About a Microarray Experiment) standard, which applies to scientific content. Both specifications contain many requirements.
MAGE has support from leading organizations such as the Sanger Institute, Stanford University’s microarray group, and The Institute for Genomic Research (TIGR), which have agreed to pipeline MAGE-compatible data to the ArrayExpress database at the European Bioinformatics Institute (EBI). (The EBI is a key force behind MGED.) In February, Sanger began testing the pipeline, submitting MAGE-ML-formatted data from an experiment using its S. pombe chip. TIGR and Stanford will be following suit as they reach MAGE-ML compatibility. The three groups will also encourage their contributors to become MIAME-compliant. To help users work with the standards, EBI has developed MIAMExpress, a “biologist-friendly,” Web-based tool, now in testing, for entering MIAME-compliant data.
Success depends largely on motivation: Stanford, for example, has more than 23,000 arrays in its database, none of which contains the MIAME-required information. Software developers such as LION Bioscience, Iobion Informatics, and Rosetta Inpharmatics have agreed on the standards, but there is no deadline for implementation. As a result, other steps may be needed, such as encouraging leading scientific journals to require the use of these formats for publication.
Affymetrix, meanwhile, is taking steps to help its users more easily exchange and integrate information. As part of its “Open Access Initiative” announced in February, the company is giving software developers better access to its interfaces and is releasing array information, including probe sequences and annotation. “We see this as critical to the next stage for the field, in which scientists are building a worldwide body of expression data that is comparable and verifiable,” says Peter Dansky, senior director of informatics marketing for Affymetrix.
The academic community quickly hailed the release of the Affymetrix array information. The EBI will be one of the first groups to exploit the descriptions. It plans to fully incorporate these into ArrayExpress -- in MAGE-ML format -- by summer.
The array information, as well as some actual expression data, is also going into the Golden Path assembly of the human genome. Accessible through UC Santa Cruz’s genome browser, Golden Path is one of the three major public sites offering access to the up-to-date sequenced human genome. The browser contains approximately 30 tracks, or classes of features, including several for different gene expression analysis platforms. “On the Affymetrix track, we show where they have picked the actual probe and, in collaboration with Novartis, the expression of the gene from that chip,” says David Haussler, professor of computer science at UC Santa Cruz and co-director of the project. The Genomics Institute of the Novartis Research Foundation provided the gene expression data from 31 human samples.
As more gene expression data becomes publicly available -- and more researchers pick up the standards -- proponents hope that a trickle-down effect will occur, leading to gradual improvement in the overall quality of the data. “Time and time again, the Internet community has shown it is capable of setting up checks and balances so that the valuable stuff survives and the rest gets discarded,” Haussler says. “This should not be an exception.”
Getting so many scientists to agree on anything may be expecting too much. But without some progress on standards, the data sets will remain too small for the type of pooling and validation required to fuel the discoveries anticipated from the field.