By W. Bryan Wadsworth
April 7, 2002 | In December 2000, a group of more than 60 North Carolina businesses, universities and nonprofits formed the North Carolina Genomics and Bioinformatics Consortium (NCGBC). Researchers working on the Human Genome Project were close to publishing their findings, and consortium members knew the information would revolutionize the field of genomics — opening up new areas of study and promoting advances in medicine, agriculture and environmental science. But the consortium soon realized that this boon to research would also precipitate an IT crisis.
Sidebar: The Five Watchwords of Grid Architecture. IBM's Steven Beckhardt and MCNC's Thom Dunning offer grid architects five words to live by.
The new genomics data being loaded into a single public database, the National Institutes of Health's GenBank, was expected to double every six months after the first year. The NCGBC estimated that computer systems capable of storing and analyzing the genomics data generated during the next five years would require 100,000 times more disk space and processing power than the average desktop computer. And, given the increasingly interdisciplinary nature of genomics research, the need to integrate and manage rapidly expanding and geographically scattered databases increased the anxiety of consortium members such as GlaxoSmithKline, Biogen Inc., and SAS Institute.
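To put the doubling rate in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes steady doubling every six months for the four years that follow the first year of a five-year window, and the function name is hypothetical rather than anything the NCGBC used.

```python
def growth_factor(years: float, doubling_period_years: float = 0.5) -> float:
    """Return the multiplicative growth after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)


# Assumption: four years of doubling every six months, i.e. 8 doublings.
factor = growth_factor(4)
print(f"Growth over 4 years: {factor:.0f}x")
```

Under those assumptions the database would end the period 256 times larger than it was at the end of the first year.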
Since few individual organizations could afford supercomputers, the consortium focused on distributed computing as the most obvious solution to the coming IT crisis. The question that remained, says Thom Dunning, vice president of high-performance computing and communications for consortium member MCNC, was: "How can you make the most effective use of a distributed computing system and minimize the computer knowledge researchers need to have [in order] to use that system?"
The answer was grid computing.
Assembling the Grid
MCNC — formerly Microelectronics Center of North Carolina — is a nonprofit organization that was founded in 1980 to help link computing resources at the state's universities. The infrastructure MCNC built to achieve this goal includes the North Carolina Research and Education Network (NCREN) and the North Carolina Supercomputing Center (NCSC). With MCNC's resources, the consortium had the network backbone and some of the processing power it would need for a strong, stable grid.
Grid computing has received serious attention in the life sciences since its origins in the early 1990s because it offers a way to deliver supercomputing power to users in multiple locations without the six-figure costs associated with a dedicated supercomputer. The concept arose from experiments that enabled scientists to access idle desktop PCs via the Internet, most famously in the data-intensive research project SETI@home, which combines the processing power of more than 500,000 PCs to analyze radio signals for signs of intelligent extraterrestrial life.
Photo caption: Troublesome security questions must still be resolved, says Thom Dunning, BioGrid project manager.
Unlike the related distributed computing strategy known as clustering, in which a group of networked computers share parallel processing responsibilities for a dedicated task, grid computing allows multiple jobs to run simultaneously. Grids often incorporate clusters at various sites.
Combined processing power is not the only benefit of grid computing. Even more important, says Dunning, is a grid's ability to offer users a unified view of disparate datasets. Users access what appears to be a single database, though the data may be physically distributed across several machines. This was a critical factor for the consortium in opting to implement a grid, Dunning says, because many biology researchers "are reluctant to learn all the arcana related to computer systems. The grid was a mechanism to provide computing capabilities in a form about as simple as one can be."
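The "single database" idea can be illustrated with a small Python sketch. It is not a description of how the BioGrid or IBM's DiscoveryLink works; the class names, site names and sequence records below are all hypothetical, and a real grid would query remote sites over the network rather than in-memory dictionaries.

```python
from typing import Dict, Iterable, List, Optional


class Site:
    """A hypothetical data site holding records keyed by accession ID."""

    def __init__(self, name: str, records: Dict[str, str]) -> None:
        self.name = name
        self._records = records

    def lookup(self, accession: str) -> Optional[str]:
        return self._records.get(accession)


class UnifiedView:
    """Presents several sites to the user as if they were one database."""

    def __init__(self, sites: Iterable[Site]) -> None:
        self._sites: List[Site] = list(sites)

    def lookup(self, accession: str) -> Optional[str]:
        # Ask each site in turn; the caller never sees which site answered.
        for site in self._sites:
            result = site.lookup(accession)
            if result is not None:
                return result
        return None


# Made-up sites and records, for illustration only.
plant_site = Site("plant-genomes", {"PLANT001": "ATGGCC"})
protein_site = Site("proteomics", {"PROT042": "MKTAYI"})
view = UnifiedView([plant_site, protein_site])

print(view.lookup("PLANT001"))
print(view.lookup("PROT042"))
```

The researcher calls one `lookup` method and does not need to know where a record is stored, which is the simplicity Dunning describes.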
The consortium's North Carolina Bioinformatics Grid — or BioGrid — has three goals, says Dunning. The BioGrid will offer a unified view of the genomic and proteomic data stored at sites connected via NCREN. It will provide the computing resources needed for analyzing that information (via NCSC or state universities). The BioGrid will also be constructed to take advantage of each university's research emphasis. For example, North Carolina State University, which focuses on forest and plant genomes, will maintain plant-related data for BioGrid users.
MCNC will act as the project manager in constructing the BioGrid. Dunning and his chief architect, Phillip Emer, will oversee the various working groups responsible for technical areas such as computing, data storage, networking and middleware. IBM has been invited to collaborate with MCNC in a number of these groups; as Dunning notes, the computer manufacturer has expertise not only in high-performance computing and data storage but also in distributed databases (through its acquisition of Lotus Notes developer Iris Associates) and grids. IBM has already built grids both internally and in Europe.
Steven Beckhardt, chief architect for IBM Life Sciences, will work closely with MCNC's Emer, who has the final word on infrastructure decisions. "I wanted to work in an area that was at the intersection of distributed computing and life sciences," Beckhardt says. "This is one of the premier projects at that intersection." Along with providing scientists and engineers, IBM expects to supply infrastructure components such as its Regatta eServer p690 for operations, Shark Enterprise Storage Server for storage, Tivoli Storage Manager software for data security, WebSphere software for data access, and DiscoveryLink technology for integrating heterogeneous data into a virtual database.
When the project is completed, sometime in the next three years, the BioGrid should deliver supercomputer-level performance of 1 teraflop (one trillion floating-point operations per second) and a storage area network with a capacity of 1 petabyte (equal to 1 billion 400-page books). Since the state's universities have stressed to Dunning the importance of protecting their investments in installed computers, the BioGrid will support all major operating systems, including Windows, Macintosh and Unix. BioGrid middleware is being developed as open-source software using the Globus Toolkit, the standard for early grid efforts.
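The petabyte-to-books comparison can be checked with simple arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and plain text at one byte per character; both are assumptions for illustration, not figures from MCNC.

```python
PETABYTE = 10**15          # assumption: decimal petabyte, in bytes
BOOKS = 10**9              # 1 billion books, as quoted in the article
PAGES_PER_BOOK = 400       # as quoted in the article

bytes_per_book = PETABYTE // BOOKS
bytes_per_page = bytes_per_book // PAGES_PER_BOOK

print(f"Bytes per book: {bytes_per_book:,}")
print(f"Bytes per page: {bytes_per_page:,}")
```

That works out to about 1 megabyte per book, or 2,500 characters per page, which is a plausible size for a page of plain text.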
Caution: Reality Ahead
The chief concerns most organizations have about grids are network bandwidth and data security. NCREN is currently undergoing an upgrade that will ensure a minimum data transfer rate of 155Mbps (OC-3). With Duke University, the University of North Carolina at Chapel Hill, and North Carolina State already connected by an OC-48 ring (nearly 2.5Gbps), BioGrid bandwidth is a nonissue for the time being.
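A rough calculation shows what those line rates mean for moving genomics data. The sketch assumes an ideal link running at the rates quoted above, with no protocol overhead or contention, and uses a 1-terabyte (10^12 bytes) dataset as a hypothetical example.

```python
def transfer_hours(size_bytes: float, rate_bits_per_sec: float) -> float:
    """Hours needed to move `size_bytes` over an ideal link at the given rate."""
    return size_bytes * 8 / rate_bits_per_sec / 3600


TERABYTE = 10**12   # hypothetical dataset size, in bytes
OC3 = 155e6         # 155Mbps, as quoted in the article
OC48 = 2.5e9        # roughly 2.5Gbps, as quoted in the article

for label, rate in [("OC-3", OC3), ("OC-48", OC48)]:
    print(f"{label}: {transfer_hours(TERABYTE, rate):.1f} hours per terabyte")
```

Under these assumptions a terabyte takes roughly 14.3 hours over OC-3 and a little under an hour over OC-48; real transfers would be slower.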
As for security, the BioGrid will be implemented over NCREN rather than the Internet, which eliminates some fears, but the fact remains that NCREN is a public network. Its existing public key infrastructure (in which matching "keys" are required to unlock encrypted data) will not satisfy everyone — particularly pharmaceutical firms that need to protect their intellectual property.
Dunning says that some companies have already expressed concern about which encryption schemes will be used. "Companies are very interested and understand that the grid will be a very valuable internal tool," he says, but until some key questions have been answered "they are holding back on exactly how much they will participate."
The BioGrid is in the early stages of its first construction phase, the testbed. This phase, expected to be complete by the end of 2002 or mid-2003, will help MCNC understand the universities' local hardware and software needs, and what types of middleware will need to be developed. It will also provide a better sense of the time and costs required for full implementation; Dunning estimates that 18 months of work on the testbed alone will require $1.5 million in funding, which will come out of MCNC's endowments. The nonprofit was in discussions to bring Sun Microsystems Inc. into the project along with IBM back in January, but Dunning says no other requests for proposals will go out until the testbed is finished.
Sidebar: The BioGrid at a Glance
Expected processing power: 1 teraflop per second
Expected data storage capacity: 1 petabyte
Expected data transfer rate: Between 155Mbps (OC-3) and 2.5Gbps (OC-48)
Estimated cost: Undetermined ($1.5 million for 18-month testbed phase)
Estimated timeframe: 12 to 18 months for testbed; another 18 months for full implementation
For progress reports and other information, go to www.ncbiogrid.org.
With network infrastructure and supercomputing resources already in place and centrally managed by MCNC, IBM's Beckhardt says the BioGrid will face only minimal political issues surrounding software installation. "That's why North Carolina is a beautiful place to do this," he says. If all goes well over the next year, Beckhardt and Dunning say they believe the BioGrid will earn the state international recognition as a beautiful place to conduct all types of genomics research.
W. Bryan Wadsworth is a writer based in Sagamore Beach, Mass. He can be reached at firstname.lastname@example.org.
PHOTO BY SIMON GRIFFITHS