By Mark D. Uehling
November 15, 2002 | Suppose every thermometer had a different amount of mercury, different hash marks, and different temperature units. That chaos is a simplified version of the current state of affairs with microarrays. "Hundreds of papers have now been published, but no more than a handful have presented data in the same format, and none has provided adequate contextual information to allow reproduction of experiments," University of California--Berkeley biologist Paul Spellman noted recently in the journal Genome Biology.
Microarray data are finally about to clean up their act. The Microarray Gene Expression Data Society (MGED) is driving the change, and software and hardware vendors are on board, with the early participation and blessing of major chip suppliers such as Affymetrix and Agilent.
That means scientists will soon be able to analyze the microarray data of colleagues as MIAME-ready data trickle off instruments and into databases. (MIAME stands for "minimum information about a microarray experiment.") Even better, as of early October, three major scientific journals – Nature, Cell and The Lancet – announced that they will insist that MIAME-compliant microarray data be released on the Internet or deposited in public databases.
Science is also poised to embrace MIAME. Barbara Jasny, the journal's supervisory senior editor and an MGED advisory board member, says the publication is studying the standard and supports its intent. "We see this as a laudable process and want to participate," she says.
However, Science will not insist its authors submit their data in MIAME-compliant form until some of the details are worked out. Says Jasny: "We need to find out people's experiences when they do submit, and we need to find out if the databases are ready to take all the data."
To hear Alvis Brazma of the European Bioinformatics Institute tell it, the new databases and software developer kits to take advantage of the MIAME standard are in good shape now. At the 14th Annual Genome Sequencing and Analysis Conference (GSAC) in Boston, Brazma stated emphatically: "The MIAME requirements now have to be met. This is no longer an academic exercise."
Brazma points out that, aside from a species' name, there is precious little agreement on what to call anything that goes into or comes off a microarray. Context -- or the metadata of microarrays -- is everything, because both the chips and the experiments can be customized or homemade, as well as purchased from major suppliers such as Affymetrix.
Assume, said Brazma, there are 30,000 human genes, 320 cell types, and 2,000 druggable compounds. If a microarray experiment mixes and matches the above in five separate organisms at two different compound concentrations at three points in time, it would result in a petabyte of information.
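Brazma's figures can be multiplied out as a rough check on the scale involved. The sketch below is only an illustration: the talk as reported does not say how many bytes each measurement occupies, so the code counts the combinations and then works backward to the storage per measurement that a petabyte total would imply. The decimal definition of a petabyte (10^15 bytes) and all variable names are assumptions, not part of Brazma's presentation.

```python
from math import prod

# Dimensions quoted in the talk.
dimensions = {
    "human_genes": 30_000,
    "cell_types": 320,
    "druggable_compounds": 2_000,
    "organisms": 5,
    "concentrations": 2,
    "time_points": 3,
}

# Assumption: decimal petabyte (10**15 bytes); the article does not define it.
PETABYTE_BYTES = 10**15


def count_measurements(dims):
    """Number of distinct measurements if every combination is assayed."""
    return prod(dims.values())


measurements = count_measurements(dimensions)

# Storage per measurement that a one-petabyte total would imply.
implied_bytes_each = PETABYTE_BYTES / measurements

print(f"measurements: {measurements:,}")
print(f"implied bytes per measurement: {implied_bytes_each:,.0f}")
```

The product comes to 576 billion measurements, so a petabyte total corresponds to roughly 1.7 kilobytes per measurement. That figure covers whatever annotation accompanies each value, which fits Brazma's emphasis on context.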
In addition, microarrays generate images. If primary image data are added to the experimental load for a given microarray, Brazma noted, the data from a single chip could mushroom into the petabyte range, "which would fill this room with DVDs very easily."
"There is a clear need for establishing infrastructure for data sharing," Brazma said. "But microarray data are much more complex than sequence data, and it is not obvious which data you want to share."
That's part of the reason the MGED is sponsoring a third M-word: MAGE-ML, a flavor of XML for microarray and gene expression. A database of sanitized data is in the works, Brazma noted. "We are trying to develop not a data cemetery, but someplace you can look up the data."
Brazma went on to say that acceptance of MIAME -- which is being adopted at the Wellcome Trust Sanger Institute, the National Cancer Institute, the Whitehead Institute, and Stanford University -- was moving faster than he predicted. Even the vendors are coming around. "Affymetrix, for a while, had been a bit cautious," says Brazma. "But because of pressure from the scientific community, they have decided to become very open."
Indeed, the audience at GSAC learned that microarrays contain so much data, in so many dimensions, and are used in so many ways, that it can literally require a doctorate in statistics to know if you have interesting results or not.
Case in point: Gary A. Churchill of the Jackson Laboratory, who also spoke at GSAC. "It's very messy data," Churchill said after more than 30 minutes of intricate descriptions of how to set up microarray experiments. "As a statistician, I am patiently waiting for folks like you to clean it all up."