
Managing the Microarray Data Mountain

By Deb Janssen

July | August 2006 | It was just 15 years ago that a group of researchers at Affymax in California, led by Steven Fodor, published a landmark paper in Science outlining the principles of photolithography and microarray analysis. Five years later, Fodor and colleagues, now at a spin-off called Affymetrix, reported the first mitochondrial DNA chip. A revolution had been born.

Today, microarrays have become a mainstay for a vast range of genomic applications, helping to answer fundamental biological questions in gene expression in disease, high-throughput genotyping operations, mapping gene regulatory pathways, and much more. For example, large-scale whole-genome disease association studies often involve thousands of samples and require hundreds of thousands of assays per sample. Not surprisingly, microarray manufacturers are scurrying to satisfy researcher demands for increased array density, sample number, and content flexibility.

This, in turn, is increasing demands on the software programs to support these new applications and massive data sets in a more open-ended type of system. The challenge is to make sure data analysis does not become a crippling bottleneck in the research process.

The Growing Density
“Industrywide, both dimensions of the informatics problem are growing,” says Jordan Stockton, informatics marketing manager at Agilent Technologies. “People are using higher-density approaches, which equates to higher-density arrays and/or smaller tiles in a tiling array. Also, due to the heterogeneity of the diseases people are exploring, you need to look at a larger number of samples.”

Companies are finding that researchers using older software reach a point where developing an understanding from high-throughput experiments takes longer than collecting the data in the first place. Agilent is constantly working to ensure that users can store the raw data efficiently.

“We want to make sure the speed that people are accustomed to when analyzing 10,000 markers from 50 people is matched when they start analyzing 500,000 markers from 2,000 people,” says Stockton. “At some level, there are physical limitations on raw processing powers of computers, but we are constantly in an efficiency game. We measure the success of our GeneSpring software, in part, on how well it keeps pace with the experimental designers who create bigger and bigger experiments.”
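To put Stockton's two experiment sizes side by side, the arithmetic can be sketched in a few lines. This is only a back-of-the-envelope illustration: the function names are hypothetical, and the figure of 4 bytes per stored value is an assumption, not something any vendor quoted.

```python
# Back-of-the-envelope comparison of the two experiment sizes Stockton mentions.
# ASSUMPTION: one stored value per marker per sample, 4 bytes each (illustrative only).
BYTES_PER_VALUE = 4


def data_points(markers, samples):
    """Number of marker-by-sample measurements in an experiment."""
    return markers * samples


def raw_bytes(markers, samples, bytes_per_value=BYTES_PER_VALUE):
    """Raw storage needed if every measurement is kept as one fixed-size value."""
    return data_points(markers, samples) * bytes_per_value


small = data_points(10_000, 50)       # 500,000 measurements
large = data_points(500_000, 2_000)   # 1,000,000,000 measurements
growth = large // small               # 2,000-fold increase

print(f"small: {small:,} values, {raw_bytes(10_000, 50):,} bytes")
print(f"large: {large:,} values, {raw_bytes(500_000, 2_000):,} bytes")
print(f"growth: {growth:,}x")
```

Under that assumption the larger design holds 2,000 times as many measurements, which is the gap analysis software has to absorb without feeling slower to the user.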

Scott Kahn, the former Accelrys executive who is now chief information officer at Illumina, says there are real issues that need to be addressed regarding the physical limitations of computers. The challenge is navigating around the enormous number of samples researchers are generating.

Illumina’s BeadStudio software package addresses these challenges in several ways. “We wanted to keep the process minimalistic and opportunistic on how the data is displayed. That allows you to store the minimal amount of data that is required and provide the other things that are needed ‘just in time’ so you can manipulate larger data sets,” says Kahn.

His team also explored alternative architectures of both hardware and software components. “You can now buy computers having fewer inherent limitations, since the newer computers are built on 64-bit technology and have in this context a limitless amount of memory accessible,” says Kahn.

BeadStudio has started to explore how best to use these larger types of computers for data analyses. But ultimately, people want to be able to use their existing computers for existing analysis needs. “BeadStudio enables the use of other resources of their computer to store analysis data on a disk and to access things on an as-needed basis. This opens up the possibility of analyzing as large a data set as you want without memory constraints,” says Kahn.
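The general idea Kahn describes, keeping data on disk and pulling in only what is needed at the moment, can be illustrated with a minimal sketch. This is not BeadStudio's implementation; the file layout (a flat file of 32-bit floats, one row per sample) and all function names are assumptions made for the example.

```python
import os
import tempfile
from array import array


def write_intensities(path, rows):
    """Write one row of float32 intensities per sample to a flat binary file."""
    with open(path, "wb") as fh:
        for row in rows:
            array("f", row).tofile(fh)


def marker_means(path, n_markers):
    """Mean intensity per marker, reading one sample row from disk at a time.

    Only a single row is ever held in memory, so the file can be far larger
    than available RAM.
    """
    row_bytes = n_markers * array("f").itemsize
    sums = [0.0] * n_markers
    n_samples = 0
    with open(path, "rb") as fh:
        while True:
            buf = fh.read(row_bytes)
            if len(buf) < row_bytes:  # end of file (or a truncated last row)
                break
            row = array("f")
            row.frombytes(buf)
            for j, value in enumerate(row):
                sums[j] += value
            n_samples += 1
    if n_samples == 0:
        raise ValueError("no complete sample rows found")
    return [s / n_samples for s in sums]


def demo():
    """Tiny made-up data set: 3 samples by 2 markers."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intensities.bin")
        write_intensities(path, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        return marker_means(path, n_markers=2)


print(demo())
```

The trade-off is the usual one: memory use stays flat as the number of samples grows, at the cost of reading from disk on every pass over the data.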

Steve Lincoln, VP of informatics for Affymetrix, sees improved scalability and IT flexibility.
Affymetrix’s VP of informatics, Steve Lincoln, agrees: “The size of the average experiment continues to grow, putting pressure on the small desktop applications that many researchers like to use to analyze their data. We are already seeing many experiments where customers need to analyze multiple different types of microarray data within the same software application. For example, applications that can handle both gene expression and copy number data simultaneously are very much in demand in cancer research.”

Edward Fox, director of the Molecular Diagnostics Lab and Microarray Core at the Dana-Farber Cancer Institute in Boston, says, “The one large advantage of using the Affymetrix platform is its very-high-density format that they are able to manufacture their chips on. That is something that is allowing the whole genome tiling array to become available.”

Affymetrix’s core microarray technology is rapidly scaling in ways that provide vastly increasing data volumes and resolution. While the studies enabled by this technology allow increasingly precise and high-resolution views of the molecular state of cells, this produces a corresponding increase in data volume and complexity.

“Changes to our core software infrastructure are enabling ongoing improvements in scalability and IT flexibility,” says Lincoln. “The outlicensing of Affymetrix-developed analysis algorithms and code has already led to the creation of one of the broadest portfolios of integrated software tools that we are aware of in life science research laboratory platforms.” Affymetrix is involved in several consortia and collaborative research efforts. One of them, the transcriptome project, a collaboration with the National Cancer Institute, seeks to build whole-genome tiling arrays that interrogate the genome at a resolution approaching every nucleotide and at throughput rates never before attempted.

Open to Flexibility
Many companies provide software packages with their platforms, while others choose to take a more hands-off approach and offer a more open-ended system, leaving users to decide on the best analysis tools for their data.

“During the development of our platform, we looked across the board and did an exhaustive survey as to what different researchers were using,” says Christopher Streck, product manager/gene expression arrays, at Applied Biosystems. “We found that it was very widespread and diverse in terms of homebrew — people who have built their own analysis packages — using open-source tools, and also commercially available packages.”

Rather than dictating how microarray data should be analyzed, Applied Biosystems facilitates the export of data coming off its platform into open-source analysis tools, such as Bioconductor, or other commercial software packages, so that the data are easily integrated and users can choose their individualized algorithms.

“What is derived from your statistical analysis is basically a list of genes that were differentially regulated in the experiment. What we thought was important was that no matter which statistics someone uses to generate that gene list, the big task is that interpretation of that data is made easier,” says Streck. In other words, what does the spot on the array actually mean to a biologist?

To that end, Applied Biosystems has written and developed publicly available tools for assistance with the bioinformatics and biological understanding of an experiment. One of these is the Panther Classification System, a resource that classifies genes by predicted function, based on published experimental evidence and evolutionary relationships.

“It looks at the actual sequence of a gene and resulting protein sequence and compares that back to known protein functions so you’re easily able to identify and classify the actual protein. It’s a way to very quickly and efficiently add biological understanding directly to your gene list,” says Streck.

Through the Panther Web site, users are able to streamline the process, taking the gene list from their microarray experiment and immediately mapping it back to the genome structure, and also to show exactly where the primer/probe designs are located for their TaqMan assays. “From that same Web site, you can also see the different gene expression assays that are available, so rather than having to design the different probes, you can use this resident tool for your gene expression and validation target selection,” says Streck.

It also allows users to cross-reference all publicly available gene identification systems, such as GenBank, download updated annotation files for microarray targets, and access probe sequences. “When you have a tool where all of the genomic information is easily accessible, it can be used across the board in any application in gene expression,” says Streck.

GE Healthcare has also stepped up to the plate by offering CodeLink iCenter, a free online resource that provides researchers with instant access to CodeLink Bioarray content and corresponding genomic information, including probe sequences, array content maps, probe design to sequences, and the latest gene annotations. With sorting, filtering, pagination, and customizable output, CodeLink iCenter makes it easy for researchers to explore and gain knowledge from the exclusive CodeLink repository of 125,000 probes.

Affymetrix also puts considerable effort into working with its third-party academic and commercial software partners that add value to its GeneChip microarray system by building tools specialized for specific R&D workflows, customer environments, and data analysis approaches. In 2004, Affymetrix introduced the GeneChip Consortia Program, which has successfully helped facilitate community interactions and standardize array data among community researchers. Designs for the Consortia program are selected based on community support for creating a GeneChip array, availability of sequence information, and market demand for the finished array. Affymetrix subsidizes design fees and assists the research community in designing arrays that are then made available for purchase. Later this year, Affymetrix plans to launch several new designs through the Consortia Program. Through this effort, many of the available software platforms have been integrated with Affymetrix data, providing customers with a competitive set of best-of-breed software platform choices.

“With each of our various GeneChip microarrays, we provide algorithms and software tools for analyzing the signals from the chip and ascertaining assay quality,” says Lincoln. The company’s software allows estimation of signal intensities for genes, SNP genotype calls, sequence calls, or binding footprints for ChIP-on-chip arrays.

“As much as possible, these tools are developed in an open manner, allowing both sophisticated customers and software partners to leverage the results and methods, tailoring workflows and applications to specific needs. This allows customers with diverse needs to standardize on the GeneChip microarray platform without being locked in to any software environment,” says Lincoln.

According to Kahn, Illumina has built the foundation of its BeadStudio software package with a premium on openness. “We are definitely not trying to monopolize or dictate the kinds of software tools that people use. We work to ensure we can gather data from all sources and output data in flexible formats that users can select and customize,” says Kahn.

The company also customizes its software with support from several third-party software vendors so that users can work more easily with those applications. “We’re not going to supplant the large amounts of academic and commercial software that is available. We’re looking to leverage what’s there and add value by taking the raw assay data that our technologies produce and transforming it to a form that’s useful for the researcher,” he says.

New Applications
Tiling arrays, aCGH, LOH (loss of heterozygosity), and ChIP-on-chip are among the buzz words in the current research lexicon. They are tools for genomewide analysis of important biological functions, including transcription, protein binding sites, sites of chromatin modification including DNA methylation, and chromosomal deletion. Ultimately, many users want to scan the genome, chromosome by chromosome, looking for abnormalities and inputting the results into an analysis program.

Illumina has been investing in the broad scientific coverage of its BeadStudio software package, adding ways to analyze data and look for trends that would otherwise go unseen. “This is very pertinent to LOH and copy number work,” says Kahn. The newest version of BeadStudio, launched in May, added functionality for LOH analysis in paired tissue samples.

Illumina's BeadStudio software enables detection and visualization of chromosomal aberrations such as loss of heterozygosity (LOH), duplications, and amplifications.

“If you have diseased and normal tissue from an individual, you can look for specific differences — whether they are insertions, deletions, or copy-neutral alterations of the genomic DNA — and can come to understand something more about the disease state that otherwise would have remained undetected,” says Kahn.
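The paired-sample comparison Kahn describes can be sketched in its simplest form: a marker that is heterozygous in normal tissue but homozygous in the matched tumor is a candidate LOH site. The code below is an illustrative sketch under that textbook definition, not BeadStudio's algorithm; the genotype encoding ("AA", "AB", "BB") and the function names are assumptions, and the genotypes are made up.

```python
def is_heterozygous(genotype):
    """True for a two-allele call whose alleles differ, e.g. "AB"."""
    return len(genotype) == 2 and genotype[0] != genotype[1]


def find_loh(normal, tumor):
    """Indices of markers heterozygous in normal tissue but homozygous in tumor.

    `normal` and `tumor` are genotype calls for the same markers, in the same
    order, from one individual.
    """
    if len(normal) != len(tumor):
        raise ValueError("normal and tumor must cover the same markers")
    return [
        i
        for i, (n, t) in enumerate(zip(normal, tumor))
        if is_heterozygous(n) and not is_heterozygous(t)
    ]


def loh_fraction(normal, tumor):
    """Share of informative (normal-heterozygous) markers that show LOH."""
    informative = sum(1 for n in normal if is_heterozygous(n))
    if informative == 0:
        return 0.0
    return len(find_loh(normal, tumor)) / informative


# Made-up genotypes for five markers from one paired sample.
normal_calls = ["AB", "AA", "AB", "BB", "AB"]
tumor_calls = ["AA", "AA", "AB", "BB", "BB"]

print(find_loh(normal_calls, tumor_calls))
print(loh_fraction(normal_calls, tumor_calls))
```

Markers that are already homozygous in normal tissue carry no information about LOH, which is why the fraction is computed over heterozygous markers only. A production tool would also look for runs of adjacent LOH markers and account for genotyping error.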

It’s about collecting more information about the underlying cause of a disease to give a deeper understanding that can be acted upon. According to Kahn, the existing algorithms are highly visual and can be best equated to data-mining approaches. Illumina now provides an orientation scheme within BeadStudio so that researchers can observe which parts of the chromosome or genome are affected.

“On one level, the visualization is powerful because you can survey, interact, and use your own perception skills. On the other hand, the challenge of providing tools that can automatically analyze the data and provide the annotations is required to fully unlock the potential of the visualizations,” says Kahn.

“The notion of being able to detect LOH and use that to provide automatically generated annotations is a particularly interesting avenue of research given the kind of data sets people are generating,” he says. While initial algorithms are included in BeadStudio, further research is warranted, and this seems to be an area where algorithms will continue to become more sophisticated to the benefit of the entire scientific community.

“The number of these new applications being used in the cancer research community is amazingly exciting. If you look back on the last six months and what has been done, you start to appreciate how much people are beginning to use these genetic tools to more deeply understand the profiles of these diseases,” says Kahn.

BeadStudio also supports other application areas through shared user interface components and interaction paradigms: data for all assay types are organized into tables, and tables and graphs are dynamically linked. The suite supports genotyping, gene expression, and LOH/copy number applications, and will support additional assays in the future.

Agilent also targets its software packages to the end user who is trying to make that biological pathway assessment. “It’s not until you can start to characterize cellular processes in the context of pathways that you’re doing something useful from a discovery science standpoint,” says Agilent’s Stockton.

Agilent Technologies' GeneSpring GX screen showing how genes fit into a biological pathway and their expression levels.

Agilent has a software application, Chip Analytics, to analyze data from ChIP-on-chip experiments. “It’s a statistical tool to help you figure out the likelihood that a binding event occurred and determine what genes are near that binding event,” says Stockton. The company is currently integrating this into its larger GeneSpring analysis platform framework. For examining aberrant numbers of chromosomal segments, Agilent’s CGH Analytics software allows users to find common aberrations across populations.
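The second half of what Stockton describes for ChIP-on-chip data, finding which genes lie near a binding event, can be sketched as a simple lookup on sorted coordinates. This is not Agilent's method; the function name, the window size, and the gene names and positions are all hypothetical placeholders.

```python
from bisect import bisect_left, bisect_right


def genes_near(binding_pos, genes, window):
    """Names of genes whose start lies within `window` bases of a binding event.

    `genes` is a list of (start_position, name) tuples on one chromosome,
    sorted by start position.
    """
    starts = [start for start, _ in genes]
    lo = bisect_left(starts, binding_pos - window)
    hi = bisect_right(starts, binding_pos + window)
    return [name for _, name in genes[lo:hi]]


# Hypothetical gene starts on a single chromosome (placeholder names and positions).
example_genes = [(1000, "GENE_A"), (5000, "GENE_B"), (9000, "GENE_C")]

print(genes_near(5200, example_genes, window=1000))
print(genes_near(7000, example_genes, window=2000))
```

Binary search keeps each lookup fast even when the annotation holds tens of thousands of genes. Estimating the likelihood that a binding event actually occurred is a separate statistical step that this sketch does not attempt.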

The newest version, CGH Analytics 3.3, facilitates tumor population studies to make that correlation. Agilent has a large program looking at cytogenetics to build new FDA-validated diagnostics. “We have always been as technologically agnostic as possible,” says Stockton. “In the real world, researchers are using a plethora of technologies. Not all of them are even high-throughput. Often the data that feeds these experiments comes from gels, RT-PCR, or physical observations. You need to be able to mix this data with data from microarrays.”

Beyond Expression
Content flexibility is an important feature of any microarray platform that aims to support newer, in-demand applications. Thomas Volkert, director of the Center for Microarray Technology at the Whitehead Institute, uses the Agilent platform for ChIP-on-chip experiments to monitor transcription factor binding across the genome. Unlike traditional expression arrays, the content is geared towards detecting genomic regions that are involved in transcription regulation. The ChIP-on-chip technology has been extensively developed by the Computational Biology Corp., a biotech pioneer that was founded by Richard Young and David Gifford of MIT and the Whitehead Institute for Biomedical Research, as well as biotechnology executive Heidi Wyle. In 2005, Agilent acquired the company, giving it exclusive access to patents and intellectual property for ChIP-on-chip analysis.

Says Volkert: “The ability to design our own content is a major factor in running these experiments and in choosing Agilent’s platform. Not just being able to design it but to be able to change the content frequently as well. The most important thing for going beyond expression analysis and breaking out of the mold of what arrays have been is to be able to design your own content. Companies with array platforms will need to address this issue and make this easy to do in a reasonably cost-effective way.”

The bottom line is that a chip is a chip. “Whether companies say they are for gene expression or for hybridizing DNA, it doesn’t really matter,” says Dana-Farber’s Fox. “We can use them for whatever application makes sense. There is some interesting biology to be learned by using several different approaches — approaches other than what these chips were originally designed for.” 

Deb Janssen is a writer based in Michigan.


For reprints and/or copyright permission, please contact Angela Parsons, 781.972.5467.