By Vivien Marx
May 9, 2003
"I call it sneaker mail," says Scott Fraser, director of the Caltech Biological Imaging Center at the Beckman Institute. "I burn a CD with images and carry them down the hall or to the building next door to the colleague who wants them. Not efficient!"
Fraser's frustration is emblematic of the IT challenges facing academic labs with significant image output. A hint of the solution, however, may be found 2,000 miles away: a paperless, wireless hospital without central nursing stations, without film-based medical records or storage areas for them. The fully digital environment at Indiana Heart Hospital, which opened last December in Indianapolis, ensures that all patient records (images, charts, and medications) can be entered and retrieved at the bedside. This hospital has integrated information systems into its caregiving, diagnostics, medical imaging, and communications. That makes Indiana Heart's biomedical researchers and clinicians IT managers, too.
Biology and medicine are rapidly becoming number- and data-crunching disciplines. Images are a key component of biomedical data, and as more sophisticated imaging methods are developed, a glut of microscopic, photographic, radiological, and computer-based images has to be stored, mined, and used for further research that sheds new light on gene-signaling cascades, the docking of a drug to its target, or the effects of medication on disease progression.
Some IT innovations need to be invented twice. Such is the case with a community-based, ambitious approach to image storage and retrieval called BioImage, funded by the European Commission.
Aside from their research value, images form part of the records that might be submitted to the FDA during a drug approval process. Researchers lose them or manage them poorly at their own peril.
Even supposedly easy-to-find images are out of reach. Research labs publish only a fraction of the images they generate. And because researchers cannot query the full text of the scientific literature to track images, the images remain essentially invisible. Given how much time it takes to generate a multiwavelength confocal image, image invisibility is deplorable (see "BioImage Conscious").
In both the academic and corporate worlds, a search for a solution is under way, so that images can be stored, found, shared, and mined. That means changing the way biologists view their data. Forbes Dewey, professor of bioengineering at MIT, explains: "Biologists will typically do their experiments, extract the information using some piece of software, save the data in an Excel spreadsheet, which has little information about the context of an experiment, then throw away the image."
Increasingly, however, investigators appreciate that an image's value stretches beyond one-time data extraction and that it must be stored for further use. That adds up to a lot of data. As Dewey says, the current MIT paradigm in biological imaging is that a good post-doc can create several terabytes of data a year. Perlegen Sciences produces 1.5 terabytes of data per week. And some 3-D two-photon imaging sets can run as high as a terabyte a day.
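To put those rates on a common footing, the following back-of-the-envelope sketch annualizes the figures quoted above. It assumes decimal units, a 52-week year, and an instrument running every day, none of which the sources specify; the 700 MB CD capacity is likewise an assumption made for illustration.

```python
# Back-of-the-envelope annualization of the data rates quoted in the text.
# Assumptions (not from the article): 52 weeks or 365 days of continuous
# output per year, decimal units (1 TB = 1,000,000 MB), 700 MB per CD.

WEEKS_PER_YEAR = 52
DAYS_PER_YEAR = 365
MB_PER_TB = 1_000_000
CD_CAPACITY_MB = 700


def tb_per_year_from_weekly(tb_per_week: float) -> float:
    """Annual volume in TB for a steady weekly rate."""
    return tb_per_week * WEEKS_PER_YEAR


def tb_per_year_from_daily(tb_per_day: float) -> float:
    """Annual volume in TB for a steady daily rate."""
    return tb_per_day * DAYS_PER_YEAR


def cds_needed(terabytes: float) -> int:
    """Number of CDs needed to hold the given volume, rounded up."""
    total_mb = int(terabytes * MB_PER_TB)
    return -(-total_mb // CD_CAPACITY_MB)  # ceiling division


weekly_rate_annualized = tb_per_year_from_weekly(1.5)  # the 1.5 TB/week figure
daily_rate_annualized = tb_per_year_from_daily(1.0)    # the 1 TB/day figure
cds_for_one_terabyte = cds_needed(1.0)
```

Under these assumptions, 1.5 terabytes a week comes to 78 terabytes a year, and a single terabyte would fill 1,429 CDs, which puts sneaker mail in perspective.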
When it comes to storing digitized images, there are no established best practices. "Everyone is, to varying degrees, re-creating the wheel right now," says Eric Neumann, vice president of bioinformatics at Beyond Genomics.
Medicine, the Image Part of It All
A CT scan of the human chest can run on the order of 70 MB. Hospitals and medical researchers who rely on imaging such as MRI, X-rays, and PET scans face huge image storage and analytical challenges as digital media replace film.
According to Jacquelyn Fetrow, co-founder of GeneFormatics (now part of Structural Bioinformatics), the banking industry and the military have as much data to juggle as the biotech industry. Defense contractor Northrop Grumman, for example, is working on adapting military pattern-recognition technologies to medical imaging. "But the understanding of a need for data management and process engineering is not happening in biotechnology," Fetrow says (see "Medicine, the Image Part of It All").
At Structural Bioinformatics (SBI), the focus is on proteomics-driven drug discovery; processing X-ray crystallography or computationally derived data on protein structures; and performing pattern recognition and data fusion. Large-scale data crunching is home turf to company subsidiary SBI-Moldyn, formerly part of a defense and aerospace firm. It applies techniques for analyzing data-heavy space and satellite images, which contain information gathered at various wavelengths, to the life sciences. "This type of data analysis is new to the life sciences but old for those industries," CEO Edward Maggio says. "A problem may confound our life scientists, but the mathematicians and physicists at Moldyn will identify exactly the mathematical approaches one can take."
SBI offers accelerated ways to determine and predict structures that are central to drug design and development. StructureBank, an object-oriented database that stores and manages SBI's collection of protein structures, also permits comparative analysis of hundreds of thousands of them. For example, company researchers are evaluating the family of tyrosine phosphatases as therapeutic targets for cancer, obesity, diabetes, and inflammation. "A challenge is to make a drug that works toward only the target and does not interact with similar proteins," Maggio explains. "With our database, we can say, 'Here is my drug target of interest; pull out all the ones with the 3-D structure that are closest to our target of interest.'"
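The kind of query Maggio describes can be pictured as a nearest-neighbor search over structures. The sketch below is not StructureBank's method, which the article does not describe. It assumes each structure has been reduced to an equal-length list of pre-aligned 3-D coordinates and ranks candidates by root-mean-square deviation (RMSD), a common but simplified similarity measure. The function names and the toy coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coords = List[Tuple[float, float, float]]


def rmsd(a: Coords, b: Coords) -> float:
    """Root-mean-square deviation between two pre-aligned coordinate sets."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(a, b)
    )
    return math.sqrt(total / len(a))


def closest_structures(
    target: Coords, library: Dict[str, Coords], k: int
) -> List[Tuple[str, float]]:
    """Return the k library entries with the smallest RMSD to the target."""
    scored = [(name, rmsd(target, coords)) for name, coords in library.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


# Toy data (hypothetical): two-atom "structures" on a common frame.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
library = {
    "structure_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to target
    "structure_b": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # shifted 1 unit in z
    "structure_c": [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)],  # shifted 5 units
}
ranked = closest_structures(target, library, k=2)
```

A production system would first superimpose the structures and would likely index them rather than scan every entry; the linear scan here only illustrates the shape of the query.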
Into the great beyond: At Beyond Genomics, Vice President of Bioinformatics Eric Neumann and colleagues set up a relational database and associated knowledge tool for "systems biology informatics."
The IT tools in StructureBank were developed as a joint effort between modelers, biologists, and computational chemists, and they were designed to allow analysis and data sharing across the enterprise, including information on sequence, 3-D coordinates, and electrostatic charge. A combination of virtual and biochemical screening tools allows researchers to examine the intricacies of target binding with a given protein. Asking questions of a protein is not trivial, Maggio notes, given the complex structures and the file size of 10 MB to 14 MB.
StructureBank unites information generated at the bench and the keyboard. Having this kind of information management in place, Maggio says, requires an upstream investment that makes downstream querying easier. "You can't extract knowledge unless the information is in a condition in which you can interrogate it properly," he says.
At a "sample-driven" company such as Beyond Genomics, where scientists may be using laser capture techniques to study a morsel of tissue, image correlation must have a context. Neumann says his researchers extract as much information as possible from the sample and store it — source of sample, antibody stain, gene-expression data, fluorescence distribution — as a seamlessly connected data set. "You need to keep track of [not only] that information, but also the knowledge it gives rise to," he says. "Those data need to be tagged to the image, and the image correlation with all your other work is crucial. That is where there is a big, big gap."
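One way to picture data "tagged to the image" is a record that travels with the image file and carries its experimental context. The sketch below is an assumption-laden illustration, not Beyond Genomics' schema: the `ImageRecord` class, its field names, and the sample values are all hypothetical, chosen to mirror the items Neumann lists.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ImageRecord:
    """Hypothetical record tying an image to its experimental context."""

    image_path: str
    sample_source: str
    antibody_stain: str
    expression_dataset_id: str
    fluorescence_summary: Dict[str, float] = field(default_factory=dict)
    derived_findings: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the image."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ImageRecord":
        """Rebuild a record from its JSON form."""
        return cls(**json.loads(text))


# All values below are made up for illustration.
record = ImageRecord(
    image_path="tissue_slice_001.tif",
    sample_source="laser-captured tissue section",
    antibody_stain="example-antibody",
    expression_dataset_id="expr-0001",
    fluorescence_summary={"mean_intensity": 0.42},
)
# Knowledge derived later is attached to the same record, not kept elsewhere.
record.derived_findings.append("staining concentrated around the nucleus")

restored = ImageRecord.from_json(record.to_json())
```

Keeping the findings in the same record as the acquisition context is the design point: the image, its provenance, and the conclusions drawn from it stay queryable together.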
One tricky element is the intrinsic variance in samples. In cell biology experiments, researchers might pool samples — normal cells in one pool, diseased cells in another batch — to cancel out noise. "You can't do that with a clinical image — you can't pool an image, make a morphed image," Neumann says. "We have found a way to ... extract meaning from the variance." This mining can apply at the level of gene expression or other cellular phenomena or clinical information, such as heart rate and blood pressure.
Neumann and colleagues have set up a relational database and associated knowledge tool to permit "systems biology informatics." Part of the company's pattern-recognition, clustering, and data-mining software, known as BioSystematics, is being used for in-house R&D as well as collaborations with, for example, GlaxoSmithKline to study disease pathways.
"We store the rich clinical information from a slice of cancer tissue in a quantized form and connect it back to the micrograph," Neumann says. Keeping the structural integrity of the tissue intact, company researchers might be looking at how malignancy relates to the amount of staining around the nucleus, or checking if fluorescence fluctuates with any of the thousands of players in the company archive. "If it does with 10 of them, bingo, we have a pathway ... and that has IP potential," he explains.
The knowledge representation tool allows the information patterns mined from thousands of samples to be assessed in light of what is already published about a given pathway. "We go across database silos, and we can see how strong the relations are between the elements," Neumann says.
Like many companies, Beyond Genomics develops its data-modeling solutions in-house. "We do keep kicking around the idea of releasing our software into the public domain," Neumann says. "But we have not reached a decision on that yet."
Companies may view the same data differently depending on their scientific premise. But in principle, all would welcome a set of standards that apply to the digitization as well as the metadata attached to images. This would enable images to be stored and retrieved easily. According to Neumann, so much new technology is coming out that standards should be kept expressive and open: "That is the difficult challenge with standards, and that discussion is only just beginning."
The Interoperable Informatics Infrastructure Consortium (I3C) is assessing how that might be accomplished. A new I3C working group on images — including mass spectroscopy, microscopy, microarray fluorescence data, and even radiology — headed by Joyce Peng, senior product manager for life sciences at Oracle, and Forbes Dewey of MIT, convened last November. The members set out by agreeing to disagree.
"There are two ways of thinking about images in this group," Peng explains. "Some people would like to take the images and describe their metadata and relationships in XML. Others want to explore how the DICOM [Digital Imaging and Communications in Medicine] standard could be expanded."
DICOM is a standard for medical imaging (CT, MRI, ultrasound, digitized X-rays) that describes data structure, storage, and file formats. It uses a hierarchical system to identify an image; data elements can be added when describing an image, defining what information must be retained with it. The latest version, DICOM 3.0, extends the standard to a networked environment. Dewey believes DICOM has made life easier but is not perfect. XML is attractive but tends to be verbose. There might be a way to use the strengths of both for biological image storage, retrieval, and querying.
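The hierarchy and data elements mentioned above can be sketched with plain dictionaries. DICOM identifies each data element by a (group, element) tag, and an image is located through the patient, study, series, and instance levels. The tag numbers below are the standard ones for those identifiers; everything else, including the function names, the placeholder identifier values, and the choice of which elements to treat as required, is an assumption for illustration and not the standard's conformance rules.

```python
from typing import Dict, List, Tuple

Tag = Tuple[int, int]  # (group, element)

# Standard DICOM tags for the identifiers at each level of the hierarchy.
PATIENT_ID: Tag = (0x0010, 0x0020)
STUDY_INSTANCE_UID: Tag = (0x0020, 0x000D)
SERIES_INSTANCE_UID: Tag = (0x0020, 0x000E)
SOP_INSTANCE_UID: Tag = (0x0008, 0x0018)
MODALITY: Tag = (0x0008, 0x0060)

# Patient -> study -> series -> image instance.
HIERARCHY: List[Tag] = [
    PATIENT_ID,
    STUDY_INSTANCE_UID,
    SERIES_INSTANCE_UID,
    SOP_INSTANCE_UID,
]


def format_tag(tag: Tag) -> str:
    """Render a tag in the conventional (gggg,eeee) hexadecimal form."""
    return f"({tag[0]:04X},{tag[1]:04X})"


def missing_elements(dataset: Dict[Tag, str]) -> List[Tag]:
    """List the hierarchy identifiers that the dataset does not carry."""
    return [tag for tag in HIERARCHY if tag not in dataset]


def hierarchy_path(dataset: Dict[Tag, str]) -> Tuple[str, ...]:
    """Return the identifiers that locate an image, top level first."""
    missing = missing_elements(dataset)
    if missing:
        names = ", ".join(format_tag(tag) for tag in missing)
        raise KeyError(f"dataset lacks required elements: {names}")
    return tuple(dataset[tag] for tag in HIERARCHY)


# Placeholder values, not real identifiers.
dataset: Dict[Tag, str] = {
    PATIENT_ID: "PAT-001",
    STUDY_INSTANCE_UID: "1.2.3.1",
    SERIES_INSTANCE_UID: "1.2.3.1.1",
    SOP_INSTANCE_UID: "1.2.3.1.1.1",
}
dataset[MODALITY] = "CT"  # a further data element describing the image

path = hierarchy_path(dataset)
```

A real application would read and write these elements with a DICOM toolkit instead of hand-built dictionaries; the sketch only shows how tagged elements and a fixed hierarchy combine to identify one image.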
Complicating the standards issue is that equipment from different manufacturers usually captures images in incompatible formats. "This makes it very hard to extract information from these images and find the common set of attributes. You need special software for each one," Peng says. The imaging group's objective is to find those common attributes that apply to the capture, storage, and retrieval of metadata. The I3C group can then approach manufacturers when a final agreement has been reached.
The lack of tried-and-true methods in handling, storing, and mining images is a void waiting for problem solvers. One candidate is Oracle interMedia, a potentially expandable feature of Oracle's object-relational database, which already lets users store, manage, and process multimedia data such as images, audio, and video. "It allows users to do integrated searches across the media content, annotations, and other relational data to facilitate searches and analysis," Peng says.
The program extracts metadata from various media formats, such as TIFF, GIF, and JPEG, and performs automatic processing such as format conversion, thumbnail generation, cropping, and compression/decompression.
Image Management and 21 CFR Part 11
Images and accompanying data are part of a drug approval submission to the FDA.
Several companies are building on Oracle's database system. Scimagix, for example, offers software image management systems for pharmaceutical companies seeking to integrate image and nonimage data. The 2.0 version of its Scientific Image Management System (SIMS) includes the SIMS Regulatory Module, which integrates image and nonimage data and is intended to help companies meet FDA standards for electronic records (see "Image Management and 21 CFR Part 11").
Open Microscopy Environment (OME) is another contender for image management. It is the open-source software brainchild of three scientists: Ilya Goldberg and Peter Sorger of MIT and Jason Swedlow of the School of Life Sciences Research Biocentre/Wellcome Trust Centre at the University of Dundee, Scotland.
Capturing an image of a live cell labeled a certain way is no longer a serious challenge, in Swedlow's view, but extracting and storing the information connected with that image is. The OME relational database permits quantitative analysis of large sets of biological images generated by chemical and genetic screening as well as real-time analysis of living cells. This software, currently in beta-testing phase, turns images into lab tools.
For example, under a microscope with a motorized stage, a researcher who might be looking at how different drugs are affecting cells can take a time-lapse movie of 10 different cells at once. "Imaging has become an assay," Swedlow explains. "What you will want to know is how these cells differ, measure what has moved, measure relative fluorescence; you will want to quantify those differences."
Extracting information from images requires myriad manipulations, usually involving several software packages. They are not interoperable. Moreover, they erase the metadata generated during the previous manipulation. "That is the dirty secret among cell biologists," Swedlow says. "Migrating between software packages is a nightmare."
Losing metadata is counterproductive when you want to store images and mine them along with metadata. So OME is trying to solve the problem by developing an OME XML file format, which targets equipment and software manufacturers. It attempts to encompass all image data and metadata available at the time of image acquisition.
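The idea of a single XML file that keeps acquisition metadata with the image can be sketched in a few lines. This is not the OME XML schema, which the article does not spell out: the element and attribute names (`Image`, `Pixels`, `AcquisitionMetadata`, `Item`) and the sample values are hypothetical, and the sketch uses only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_image_xml(
    name: str, size_x: int, size_y: int, acquisition: Dict[str, str]
) -> str:
    """Write image dimensions and acquisition metadata into one XML string."""
    root = ET.Element("Image", {"Name": name})
    ET.SubElement(root, "Pixels", {"SizeX": str(size_x), "SizeY": str(size_y)})
    meta = ET.SubElement(root, "AcquisitionMetadata")
    for key in sorted(acquisition):
        item = ET.SubElement(meta, "Item", {"Key": key})
        item.text = acquisition[key]
    return ET.tostring(root, encoding="unicode")


def read_acquisition(xml_text: str) -> Dict[str, str]:
    """Recover the acquisition metadata from the XML string."""
    root = ET.fromstring(xml_text)
    return {item.get("Key"): item.text or "" for item in root.iter("Item")}


# Sample values are made up for illustration.
acquisition = {
    "Objective": "60x",
    "ExcitationWavelengthNm": "488",
    "ExposureMs": "150",
}
xml_text = build_image_xml("cell_timelapse_01", 512, 512, acquisition)
recovered = read_acquisition(xml_text)
```

Because the metadata is written once at acquisition and read back by name, a downstream package that preserves the file also preserves the context, which is the failure Swedlow describes when tools hand images to one another.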
In drug discovery and development, data collections can no longer simply be warehoused. Storing and retrieving images, sharing work, and collecting the conclusions drawn from experiments with visual results are sorely needed capabilities in both academia and the corporate arena. "We know it is important to make data available to others: We share results, materials, or experimental protocols all the time," Caltech's Fraser says. "We really need to figure out how to do that with images, too." Not before time, informatics is beginning to enhance the visual side of medicine and biotechnology.
Vivien Marx is a science writer who splits her time between Boston and Frankfurt, Germany. She can be reached at firstname.lastname@example.org.
ILLUSTRATION BY JOHN WEBER; PHOTO OF ERIC NEUMANN BY KATHLEEN DOOHER; GRAPH SOURCE: E-BIOSCI