June 8, 2011 | Todd Scofield’s desire for a high-performance computing (HPC) consortium to tackle problems in life sciences came out of his family’s encounters with debilitating disease. His wife was diagnosed with multiple sclerosis (MS), and his father died of Alzheimer’s disease. Thanks to the benefits of hybrid computing and a consortium to harness its power, he may have the means to make a difference.
Scofield’s early career centered on commercial data warehousing, but his interest in HPC grew as he learned the capabilities of GPUs (graphics processing units) and FPGAs (field-programmable gate arrays). “When I discovered these devices, I wanted to [use them to] study these diseases. When I heard about them being 50X faster, I investigated and found they were a lot faster. I heard one customer comment, ‘It’s the first time a salesperson lied and I had a smile on my face!’ It wasn’t 50X faster but 350X faster.”
In 2006, Scofield and his collaborators established the Data Intensive Discovery Initiative (Di2, http://wings.buffalo.edu/dataintensivecomputing) at the University at Buffalo. The goal of the Di2 consortium, says Scofield, is “to enhance discovery through data intensive computing. For this, we have some of the best hardware resources (in academia) anywhere in the country.” He calls Di2 “a computational basis of applying different computational architectures together in one location to do advanced science in the biopharma space and health care.”
Scofield hopes to forge new collaborations that focus on engineering applications, and eventually to commercialize some of the results through his company, Big Data Fast, “taking what we learn at Di2 (and other areas) and bringing that into the commercial domain.”
The computing in question is a form of hybrid computing (see, “Life Scientists Get Their Game Faces On,” Bio•IT World, April 2008) in which massively parallel data-intensive supercomputers (DISCs) are married in a single platform with traditional HPC and GPU cluster supercomputers, delivering far greater processing speed than traditional clusters alone.
Scofield says he wants to “break down knowledge in different domains and share it.” Many algorithms used in physics, for example, have direct applicability in the biopharma space, he says. “We typically deal with very large datasets. Now we can parallelize the computing at the data, so the computing is being done at the data, rather than taking the data to the computing.”
The Di2 Difference
Partners in Di2 include the University at Buffalo, Howard University, Georgia Tech, Buffalo State University, Canisius College, Roswell Park Cancer Institute, the Hauptman-Woodward Institute, New Mexico State and Sandia. The initiative has received more than $5 million from the National Science Foundation and other sources. Some of the early development systems were placed at Howard University, Sandia Labs and Georgia Tech, but the flagship Buffalo entity has an 80-teraflop HPC cluster consisting of some 6,000 cores and a 128-teraflop GPU cluster for compute intensive applications. Racks of DISCs are currently supplied by IT infrastructure firms Netezza (IBM) and XtremeData. “Each platform has a good sweet spot,” says Scofield.
The DISC architecture allows each device to be an appliance, says Scofield. “So we don’t have to worry about SQL infrastructure associated with regular computers. We don’t need a separate database because it’s already built in.”
The operating model is similar to an NSF cooperative, says Scofield. “We have a model for the industry-academia research consortia at the Department of Pharmaceutical Sciences at the University at Buffalo that are funded by numerous top pharma and biotech companies. We’ll have partners on the basic side and projects that are proprietary.”
Vipin Chaudhary is the co-founder and director of Di2 and an associate professor of computer science at the University at Buffalo, SUNY. He is also CEO of an Indian company, Computational Research Laboratories.
Chaudhary is an expert in these new data warehousing technologies. “They’re like active disks, they allow computing to happen very close to the disk,” he explains. “Both have FPGAs that can be oriented to do computing close to the disk. If you’re trying to search for something in a large database with a normal architecture and typical SANs, all the data go into processors, then you search for it. The bottleneck is the network between the compute and storage.”
Using active disks, by contrast, data are distributed to various disks, enabling users to search in parallel at the disk. “The bottleneck is no longer in the network. You don’t need to get the whole database to the processors; you just need to send out what you’re looking for to these storage units. We can get speed ups greater than 1,000X for certain problems,” he says.
A single rack contains 100 compute nodes; each node has a disk drive, FPGA, memory and a CPU. In the first step, the FPGA performs a data reduction. The data are spread across the rack, such that each node works with just 1/100 of the data. The FPGA discards the 95% of data not relevant to the query, with only the key data entering memory to be processed by the CPU. “This is like Hadoop on steroids,” says Scofield. “These devices only exist together in one place in the world: Buffalo.”
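The flow Chaudhary and Scofield describe — shard the data across nodes, filter aggressively close to the disk, and only then involve the CPU — can be sketched in plain Python. This is a toy illustration of the pattern, not the actual Netezza or XtremeData software; the record format and query are invented for the example, and threads stand in for the rack’s 100 hardware nodes:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 100  # one disk/FPGA/CPU node per shard, as in the rack described


def node_scan(shard, query):
    """Stage 1 (the FPGA's role): filter the local shard so that only
    matching records ever reach memory and the CPU."""
    return [rec for rec in shard if query in rec]


def distributed_search(records, query):
    """Ship the query to the data: each 'node' scans its ~1/NUM_NODES
    slice in parallel, and only the survivors are merged centrally."""
    shards = [records[i::NUM_NODES] for i in range(NUM_NODES)]
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        per_node_hits = pool.map(lambda s: node_scan(s, query), shards)
    # Stage 2 (the CPU's role): work only on the drastically reduced data.
    return sorted(rec for hits in per_node_hits for rec in hits)
```

The point of the pattern is what never happens: the full record set is never funneled over a network to a central processor, so the compute-to-storage link stops being the bottleneck.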
Last summer, Scofield and colleagues published a benchmarking study in IEEE Scientific Computing on microarray analysis. According to the report, the architecture of the XtremeData system resulted in a 50-100X acceleration compared to a regular cluster.
The advantages of this architecture extend to routine applications, not merely data or compute intensive applications. For more intensive problems, Chaudhary has been using GPUs to do numerical or scientific computations. “We’ve acquired a large cluster of 1-teraflop Nvidia processors, put together with data-intensive and SSD [solid state disks] storage. We’ve created an environment to do very high-end, computationally intensive and data-intensive jobs. We’re integrating all of that as infrastructure, both in terms of hardware and software.”
There are some stiff challenges on the software side. Chaudhary admits that programming GPUs is not easy. Moreover, the active disks are typically programmed in a data warehousing manner, but scientific programmers don’t program in that manner. “We’re looking at non-SQL people trying to write scientific code using the SQL framework. These environments are very different. We need to create a new environment,” he says.
The co-director of Di2 is Murali Ramanathan, a State University of New York faculty member in pharmaceutical sciences. Scofield and Ramanathan were introduced in 2005. “His novel algorithms for gene-environment analysis work particularly well on these devices,” says Scofield.
Ramanathan studies gene-gene and gene-environment interactions in MS, particularly aspects of the environment—lifestyle factors, infectious agent exposures, pollution, food—that could impact a patient’s disease course. “Environmental factors are the dark matter,” he says. The “interaction problem,” as he calls it, is fundamentally combinatorial in nature. “It’s important to have good algorithms that exploit the best [compute] architectures to do gene-environment interactions on a very large scale.”
Ramanathan reports good early success in applying very fast algorithms to the study of gene-environment analysis, optimizing search strategies to a host of different datatypes. “We now know interaction analysis metrics that lend themselves to DISC architectures. We have all the key ingredients in place” to apply to MS and other complex diseases, he says.
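Ramanathan’s actual algorithms are not described here, but the combinatorial shape of the interaction problem is easy to see in a toy sketch (the scoring rule, factor names, and data below are invented for illustration): with n binary factors there are n(n-1)/2 pairs to test, which is why fast, data-local architectures matter at genome scale.

```python
from itertools import combinations


def pair_scan(factors, outcome, top_k=3):
    """Exhaustively score every pair of binary factors (genetic or
    environmental) against a binary outcome. A naive synergy score:
    the outcome rate among subjects exposed to both factors."""

    def score(a, b):
        hits = [o for x, y, o in zip(factors[a], factors[b], outcome) if x and y]
        return sum(hits) / len(hits) if hits else 0.0

    # n factors -> n*(n-1)/2 candidate pairs; this blows up combinatorially.
    pairs = list(combinations(factors, 2))
    return sorted(pairs, key=lambda p: score(*p), reverse=True)[:top_k]
```

For example, with three factors over four subjects, the scan surfaces the gene–environment pair whose joint carriers all have the outcome:

```python
factors = {
    "gene_A":    [1, 1, 0, 0],
    "env_smoke": [1, 1, 1, 0],
    "gene_B":    [0, 0, 1, 1],
}
outcome = [1, 1, 0, 0]
top = pair_scan(factors, outcome, top_k=1)  # -> [("gene_A", "env_smoke")]
```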
In time, Scofield hopes to find more scientific partners and non-profits to supply projects to Di2. “We want to draw from problems in different domains. Many algorithms can have unexpected uses in others,” he says. One notion is to develop large databases of curated data on neurological diseases, and then wrap around disease-specific databases. “That would allow researchers from anywhere to have the same capability as the big science centers. That would be really powerful,” says Scofield.
Another hope is that new software technologies that are being developed at Di2 could be utilized by other academic groups. “If we have all these hardware technologies and can use new software technologies, we could help accelerate discovery at other institutions where they may not know there’s a software technology that can help.”
In 2008, a major advance in supercomputing design was reported by D.E. Shaw Research with the design and deployment of Anton. Scofield says Anton uses special code and a massively parallel, custom application-specific integrated circuit (ASIC) architecture to perform molecular dynamics simulations. “Our researchers can analyze their simulation output data sets better than they can! DISC machines are more effective platforms to analyze those types of tera-scale data sets,” he claims.
It all comes down to fast computing on big data sets, says Scofield. “The driving factor is, can we have a positive social impact on many people?” •