By Rabiya S. Tuma
July 11, 2002 | Academic researchers have already demonstrated the value of microarrays in cancer biology. To date, the largest number of tumor samples scientists have analyzed in one microarray study is just over 200.
Now the International Genomics Consortium (IGC), a freshly minted public-private consortium, aims to generate gene expression profiles for 10,000 tumor samples during the next three years, along with 1,000 disease-free control samples and 1,000 intraepithelial neoplasia samples. The nonprofit group plans to make its data freely available to the public with the idea that other research teams will find creative and valuable ways to use them.
To complete this massive microarray effort, the organizers of IGC, including Jeffrey Trent and Andy Baxevanis, both at the National Human Genome Research Institute (NHGRI), and Daniel Von Hoff, director of the Arizona Cancer Center, have built a large, diverse team. Nineteen academic hospitals and community cancer clinics have signed on to collect patient samples. Numerous pharmaceutical and biotech companies have jumped on board as well, promising to supply reagents, microarray technology, and the $42 million necessary for the group's first three years. (The state of Arizona is close to raising enough public and private funding to house the IGC headquarters, and to lure Trent back to his home state.)
[Photo: Daniel Von Hoff, director of the Arizona Cancer Center, advocates informed patients.]
In return these academic and industry groups will get access to the data generated by the consortium along with everyone else.
In fact, modeling the effort after other public-private consortia such as the Human Genome Project and the SNP Consortium Ltd., Trent and company say that one of IGC's main goals is to ensure that the generated data are put in the public domain with no preferential access for the contributors and partners.
The organizers expect the scientific community, both academic and corporate, to support the effort and participate in order to improve scientific resources for everyone. If that sounds lofty and idealistic, it is.
But some companies see the cost of participating as well worth the unique scientific resources IGC will generate, particularly in light of the extra visibility they can gain through participation.
"It is something that from both an IT perspective and a biology perspective would be extremely costly to go it alone. I am not even sure it would be feasible," says Albert Wang, a manager with the drug discovery informatics organization within Bristol-Myers Squibb. But the benefits, he says, could be huge. "It is really going to help the science. It is going to make it easier for all of us to do better work."
Computing the Strategy
IGC by the Numbers
A new collaborative research effort called the International Genomics Consortium hopes to profile gene expression in more tumor samples than any study to date. The group's goals include:
| Time frame of the project: | To be completed during the next three years |
| Number of tumor samples to be studied: | 10,000 (about 50 times more than any previous study) |
| Number of disease-free samples to be studied: | 1,000 |
| Other items to be studied as part of the project: | 1,000 intraepithelial neoplasia samples |
IBM Corp., for example, hopes to be one of the major partners on the computing side of the effort. Though the company will not receive any direct financial compensation for its contribution, company officials acknowledge that participation in IGC will increase its visibility in the bioinformatics market, and may prove profitable in the long run.
"We see IGC as a leading-edge organization for the integration of clinical and genomic information to advance cancer research," says Pete Morrissey, the worldwide manager of biosciences for IBM Life Sciences. "So if IBM can be a significant player with IGC then that advances not only our business value, but certainly our mission of becoming a premier provider of informatics hardware infrastructure in the life sciences."
The computing systems required for IGC will be significant, says Baxevanis, who is currently the director of computational genomics at NHGRI and oversees all aspects of IGC that have to do with bioinformatics and information technology, including both hardware and software.
According to Baxevanis, the scale of the computing needs for IGC is "a completely different duck" than that of the Human Genome Project or the SNP Consortium database. In both of those cases, the data are simply a collection of single letters, albeit a big collection.
By contrast, each patient sample analyzed by IGC will generate a microarray consisting of 10,000 different spots, each carrying its own intensity value. Additionally, each patient sample will come with important clinical information that will need to be maintained so that researchers down the line can take it into account when they analyze the gene expression data. "It is not a lot of [clinical data]," says Baxevanis, "but enough to make the sample useful to the people who are going to use the data."
Data will accumulate quickly, and the storage system must be robust enough to handle it. Baxevanis expects each patient sample to generate gigabytes of data. Multiply that by the number of samples the group aims to collect over the next three years, and IGC will end up with a terabyte-sized database of information that needs to be efficiently stored and accessed.
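To put rough numbers on that multiplication, here is a back-of-the-envelope sketch in Python. The sample counts and the 10,000 spots per array come from the figures quoted above. The storage format (one 64-bit number per spot) and the 2 GB of raw data per sample are assumptions made purely for illustration, since Baxevanis says only "gigabytes."

```python
# Counts taken from the article.
TUMOR_SAMPLES = 10_000
CONTROL_SAMPLES = 1_000
NEOPLASIA_SAMPLES = 1_000
SPOTS_PER_ARRAY = 10_000

# Assumptions for illustration only; the article gives no exact figures.
BYTES_PER_INTENSITY = 8   # assumed: one 64-bit float per spot
RAW_GB_PER_SAMPLE = 2     # assumed: the article says only "gigabytes"


def total_samples() -> int:
    """All samples the consortium plans to profile."""
    return TUMOR_SAMPLES + CONTROL_SAMPLES + NEOPLASIA_SAMPLES


def intensity_matrix_gb() -> float:
    """Size of the bare intensity values, in decimal gigabytes."""
    return total_samples() * SPOTS_PER_ARRAY * BYTES_PER_INTENSITY / 1e9


def raw_archive_tb(gb_per_sample: float = RAW_GB_PER_SAMPLE) -> float:
    """Size of the raw per-sample files, in decimal terabytes."""
    return total_samples() * gb_per_sample / 1000


if __name__ == "__main__":
    print(f"samples: {total_samples()}")
    print(f"intensity values only: {intensity_matrix_gb():.2f} GB")
    print(f"raw archive: {raw_archive_tb():.0f} TB")
```

Under these assumptions the intensity values alone come to just under a gigabyte, while the raw per-sample files add up to 24 terabytes. The bulk of the storage problem, in other words, would lie in whatever raw output accompanies each array rather than in the final table of numbers.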
Baxevanis is not daunted by the task, however. "I don't think it is something that is technologically impossible," he says. "I just think no one to this point has tackled it head-on for a project of this size."
One of the most important aspects of the project the system designers need to address is patient privacy. While similar data protection issues have already been solved in the financial industry, Baxevanis says that IGC not only needs to come up with adequate computing solutions — many of which will be adapted from the financial sector — but also to convince regulators and institutional review boards from participating health-care centers of the solutions' adequacy.
Right now all of the hospitals and clinics partnered with IGC are U.S. institutions, but IGC plans to expand its effort internationally. Patient consent forms and privacy arrangements will conform to both the Health Insurance Portability and Accountability Act (HIPAA) and European Union regulations.
Patient identity will remain with the host institution, but IGC wants to build a system that allows the host institution to update patient information. That, say the organizers, will provide researchers with longitudinal data and increase the value of their work. With such updates researchers will be able to correlate gene expression patterns with treatment response or long-term survival data, even if those patient characteristics aren't known when the tumor sample is originally donated.
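One way to picture that arrangement is the minimal Python sketch below. It is not IGC's design, which the organizers have not published in detail. The class names, the random sample ID, and the shape of the follow-up entries are all hypothetical. The point is only that the link between a name and a sample can live at the hospital while the central record still accepts later updates.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Hypothetical central record: an opaque ID and follow-up entries, no names."""
    sample_id: str
    clinical_updates: list = field(default_factory=list)


class HostInstitution:
    """Hypothetical hospital-side registry; the only place identity is stored."""

    def __init__(self):
        self._id_by_patient = {}

    def register(self, patient_name: str) -> str:
        # Random token, not derived from the name, so it cannot be reversed.
        sample_id = secrets.token_hex(8)
        self._id_by_patient[patient_name] = sample_id
        return sample_id

    def sample_id_for(self, patient_name: str) -> str:
        return self._id_by_patient[patient_name]


class CentralDatabase:
    """Hypothetical consortium-side store, keyed by sample ID only."""

    def __init__(self):
        self._records = {}

    def add_sample(self, sample_id: str) -> None:
        self._records[sample_id] = SampleRecord(sample_id)

    def append_update(self, sample_id: str, update: dict) -> None:
        # Updates are appended, never overwritten, to keep a longitudinal history.
        self._records[sample_id].clinical_updates.append(update)

    def get(self, sample_id: str) -> SampleRecord:
        return self._records[sample_id]


if __name__ == "__main__":
    hospital = HostInstitution()
    database = CentralDatabase()

    sample_id = hospital.register("Jane Doe")  # invented patient
    database.add_sample(sample_id)

    # Months later, the hospital looks up the ID and sends a follow-up.
    database.append_update(
        hospital.sample_id_for("Jane Doe"),
        {"months_since_donation": 18, "status": "responded to treatment"},
    )
    print(database.get(sample_id))
```

In this sketch a researcher querying the central store sees the follow-up history attached to a sample but never a name, and only the host institution can route a new update to the right record.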
Why Size Matters
So why go to the effort of amassing all of this gene expression data in one place? What does it provide that a lot of smaller studies, such as those already under way in many academic labs, can't?
The answer is simple: Sample size equals power.
With gene expression patterns from a vast number of cancers, researchers will be able to detect subtle patterns that distinguish different types and subtypes of disease. With that information, they will in theory be able to develop more accurate diagnostic protocols and more effective therapies.
The largest microarray cancer study so far, in which researchers from the Dana-Farber Cancer Institute in Boston analyzed 218 samples from 14 cancer types, illustrates the point. Sridhar Ramaswamy and his colleagues analyzed the expression levels of more than 16,000 genes from the tumor samples and 90 normal tissue samples. When they compared expression profiles from tumors of unknown origin to their newly generated database, they were able to identify the tissue of origin with 90 percent accuracy, as reported in Proceedings of the National Academy of Sciences in 2001. Although the "diagnosis by database" wasn't perfect, the authors suspect that the accuracy would improve if more known samples were added to the database.
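The general idea of "diagnosis by database" can be illustrated with a deliberately simple Python sketch. It is not the method used in the Dana-Farber study. It swaps in a basic nearest-centroid rule: average the known profiles for each tissue, then assign an unknown tumor to the tissue whose average profile correlates best with it. The function names and the four-gene toy profiles are invented for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Note: undefined (division by zero) if either profile is constant.
    return cov / sqrt(var_a * var_b)


def centroids(reference):
    """Average the known profiles for each tissue, gene by gene."""
    result = {}
    for tissue, profiles in reference.items():
        count = len(profiles)
        result[tissue] = [sum(column) / count for column in zip(*profiles)]
    return result


def predict_tissue(profile, reference):
    """Return the tissue whose average profile correlates best with `profile`."""
    averaged = centroids(reference)
    return max(averaged, key=lambda tissue: pearson(profile, averaged[tissue]))


if __name__ == "__main__":
    # Invented toy data: two tissues, two known samples each, four genes.
    reference = {
        "breast": [[9.0, 1.0, 2.0, 8.0], [8.0, 2.0, 1.0, 9.0]],
        "colon": [[1.0, 9.0, 8.0, 2.0], [2.0, 8.0, 9.0, 1.0]],
    }
    unknown = [7.5, 1.5, 2.5, 8.5]
    print(predict_tissue(unknown, reference))
```

With only two known samples per tissue, each average is easily skewed by a single unusual tumor. Adding more known samples steadies those averages, which is the intuition behind the authors' expectation that a larger database would classify more accurately.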
"You actually need large numbers of samples to get highly accurate predictions," says Ramaswamy. With this in mind, Ramaswamy says he's excited about the IGC project. After all, imagine the accuracy and power one can achieve with a fiftyfold increase in sample size.
According to Ramaswamy and IGC officials, another complication in the field of gene expression profiling is that all-encompassing standards don't yet exist; data generated in one study can't be combined or effectively compared with that from another study.
IGC expects its effort will improve the current situation in two ways. First, its data will be generated in a uniform manner, with standardized protocols at every step of the process, from tissue handling to microarray hybridization to data storage. This means that samples collected and processed on day one will be comparable to those on day 22, 312, or 594 of the effort.
But more than that, the group expects that the size of its effort and its conformity with Microarray Gene Expression Data Group guidelines will compel other researchers to fall in line with the guidelines and thereby lead to more uniform, comparable data.
Putting the Plans in Motion
IGC organizers have been publicly talking about their plans for more than a year now, and slowly their ideas appear to be taking on a concrete form. The group completed a pilot project producing microarray data from approximately 100 samples, proving to potential funders that it can do what it says it can.
The group has lined up what it estimates to be more than enough health-care providers to collect patient samples. IGC approached local cancer clinics and regional cancer centers to gather a variety of early, relatively untreated samples and samples from late-stage, chemotherapy-resistant tumors.
IGC is still finalizing the necessary funding for the project, estimated to be around $42 million for three years. Although industry might not be clamoring at IGC's doors to make large donations, companies approached thus far do seem enthusiastic. Based on this response and the team's success in funding the pilot project, the group is optimistic that it will be able to raise the full amount.
One advantage of the IGC model is that once the tools are in place — including patient consent and computing protocols — scientists can focus on any genetically based disease. The project described here, called expO for expression oncology, is aimed at cancer, but individuals at IGC are looking at ways to use the infrastructure to study Alzheimer's disease.
IGC won't be doing the scientific research on the raw data it generates, but officials will be around informally to help others in their efforts. Baxevanis says that he expects IGC will become a resource to help scientists develop algorithms and protocols for mining vast quantities of data and will generally be available to share its expertise.
With terabytes of data available, the group knows it will need a high-speed computer system to sort through it all. IGC is working with IBM to build a supercomputing facility that will be dedicated to IGC data. Access to the supercomputer is the only situation in which IGC partners will get preferential treatment. Under the current arrangement, contributors will have access to the supercomputer facility for free. Nonpartners will also be able to get computer time, assures Baxevanis, but they might have to pay for it.
At the end of the day, IGC organizers are very clear about why they are devoting so much energy to getting the massive project off the ground: the patients. When Arizona Cancer Center's Von Hoff introduced the project at the American Association for Cancer Research in April 2000, he stressed that the problem with molecular approaches today is that too few patients have access to them or can benefit from them. The rich and well-informed patient can ask for Her2/Neu testing (used to identify viable candidates for an antibody therapy aimed at breast cancer) or the like, but the average patient doesn't have such luxuries, which might save his or her life, because of the high cost and the limited number of labs that perform such tests.
For this reason, Von Hoff emphasizes that he hopes patients who donate tumor samples will benefit directly from IGC's effort. And though he acknowledges that it is not likely in the early stages of the project, the group hopes someday to funnel individual patients' data back to their doctors so that their treatment can be optimized for them. "The guiding principle associated with this project is that we hope that anything we do will directly benefit patient contributors," says NHGRI's Trent.
Rabiya S. Tuma is a science journalist in Brooklyn, N.Y., specializing in genomics and cancer biology.