By Salvatore Salamone
February 10, 2003 | Two major grid initiatives -- the TeraGrid and the Dynamic Data-Driven Application Systems (DDDAS) program -- will expand the role of grids from CPU cycle aggregators to infrastructures that provide access to distributed data and, in some cases, actually connect the instruments that generate the data itself.
Grids have typically been used to deliver high-performance computer processing power for scientific calculations. But both the TeraGrid, which is expected to be online this summer, and the National Science Foundation’s DDDAS program, which is just starting, plan to move beyond this simple computing paradigm. Much of the work these large-scale initiatives accomplish will have direct application to common problems found in many life science organizations.
“Grids are moving beyond aggregation of computing resources,” says Daniel Reed, principal investigator of the TeraGrid project. He says that when it comes to dealing with large databases, “the question becomes, ‘Do I replicate it or access it remotely?’ ”
Many life science companies spend a great deal of time curating and annotating data from public databases. Once that work is done, copies of the databases are distributed to various laboratories or sites within the company. The problem is that very large databases then require multiple large storage systems spread throughout the company. Moreover, frequently moving copies to multiple locations runs up communication fees. And there’s the additional problem of synchronization: If constant changes are being made, how does a scientist know that he or she has the most recent data?
Reed and others in the grid community argue that it might be easier to simply maintain one centralized database and give scientists access to the data using a grid. The National Science Foundation’s TeraGrid project, once operational this summer, will put this concept into practice.
“With the TeraGrid, we’re not talking about a distributed computer, but rather a distributed system that uses grid technologies,” Reed says. “[The TeraGrid] will enable and empower new science by providing remote access to distributed data archives and computers.” But that takes a lot of power. The TeraGrid will have more than 20 teraflops (a teraflop is 1 trillion floating point operations per second) of processing power and about 1 petabyte (1 quadrillion bytes) of storage capacity distributed over five sites.
This raw processing and storage capacity is significantly greater than was originally planned (see “The Big Grid: DOE Will Deliver 5 Teraflops,” Bio-IT World, May 2002, page 18), due to a recent NSF decision to combine the original TeraGrid project with a Pittsburgh Supercomputing Center terascale computing system project. Once completed, the combined effort will let researchers access computer resources and databases distributed throughout the TeraGrid.
Linux clusters are now being installed and tested at each of the five supercomputing centers that will make up the grid. The centers are the Pittsburgh Supercomputing Center at Carnegie Mellon University and the University of Pittsburgh, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the San Diego Supercomputer Center at the University of California at San Diego, Argonne National Laboratory, and the Center for Advanced Computing Research at the California Institute of Technology.
While the TeraGrid will be a huge scientific research system, some see similar principles and systems in private-sector life sciences. “What about using grids in the regulatory environment?” asked Howard Bilofsky, director of knowledge and information technologies and alliances in the R&D IT group at GlaxoSmithKline. Bilofsky’s comments came at the recent Marcus Evans Executive IT Life Sciences Forum.
“The FDA is increasingly interested in analyzing [clinical trial medical imaging] data itself,” Bilofsky said. “Should we send them CD-ROMs with a couple of selected images? Or, in the future, could a grid be used to give the FDA access?”
Bilofsky noted that these topics are being discussed within life science industry technical groups, including the Pharmaceutical R&D IS Managers Forum. But he also said that it is premature to talk about such things happening anytime soon. “One area that still needs to be explored is whether current grid security satisfies HIPAA [requirements],” he said.
Other possibilities for grids are being explored by DDDAS, an NSF initiative to develop next-generation distributed application software.
“Today, simulations are more or less static. This will not be the case in the future,” says Frederica Darema, a senior science and technology advisor at the NSF. “DDDAS next-generation software will include dynamic resources in a grid. And systems architectures built on DDDAS will incorporate interactive visualization systems and measurement systems.”
One practical application of the DDDAS approach would be to connect lab equipment to a grid. The grid could automatically capture results from experiments that need its processing power to analyze or visualize data. This concept of providing access to lab equipment is shared by the TeraGrid project and by other researchers.
“I see grids becoming multimodal in the future, where they’ll combine access to computational resources and data, as well as connect the instruments that collect the data,” says Greg Jones, director of scientific computing at the University of Utah. Jones says that all ranges of instruments could be connected to a grid, from common clinical trial diagnostic tools such as EKGs all the way up to high-end equipment such as devices that perform magnetoencephalography (MEG). “There are about 10 [high-end] machines in the U.S. that do MEGs,” Jones says. “Providing access to such a machine through a grid would be ideal.”