FPGAs move from financial services to life sciences.
By Kevin Davies
September 15, 2009 | At about 3 cubic feet, the box sitting in a corner of Kumar Metlapalli’s modest office in Andover, Mass., doesn’t necessarily strike me as a “next-generation supercomputing platform.” But the box, or HANSA, might be the biggest thing to hit high-performance computing in a long time.
The founder of Kuberre Systems, Metlapalli is an avid proponent of FPGAs (field programmable gate arrays), chips that can be programmed to provide far greater specificity and efficiency than traditional CPUs, yet offer a more affordable solution than large grids of hybrid blades or supercomputers.
HANSA has a scalable architecture that can include from 4 to 64 FPGAs in a 9U cabinet, ranging in price from $50,000 to $500,000. It combines a new hardware design and a rich software stack for the HPC market, with memory that scales up to 256 GB. It delivers the equivalent of a 768-CPU server grid or a 1,536-core supercomputer, at one-third the cost, with 2% of the energy requirements and 1% of the floor space.
While Kuberre has carved a niche in the financial services sector, cracking the life sciences market is both a top priority and a tough challenge. “We’re getting there, but we need to build those relationships,” Metlapalli admits.
Passage from India
An electrical engineer by training, Metlapalli moved to the United States from India in 1991 and earned a Master’s degree in computer science from the University of South Carolina, specializing in image processing. He worked for XyVision before being recruited as a “quant” for Wall Street, predicting market trends from historical data.
The idea for HANSA, which means swan (think “Lufthansa”) but also stands for Hardware Accelerator for Numerical Systems and Analysis, came in 2006, while Kuberre was providing a unified platform for financial markets. Kuberre’s sister company in India had built an accelerator card for BLAST utilizing FPGAs. Metlapalli quickly saw how the benefits of FPGA technology could apply to financial services, but recognized the danger of becoming a black box. Given the programming flexibility of FPGA platforms, Metlapalli opted to design a software stack on top of the FPGAs, so that users can write algorithms in their native languages (see “Swan Structure”). “We can’t be building singleton solutions,” Metlapalli said. “The library must work across verticals and provide flexibility.”
The HANSA architecture makes use of the ScaLAPACK library, originally written for supercomputers and clusters, which breaks matrices into submatrices that can be computed in parallel and recombined.
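The block-decomposition idea behind ScaLAPACK can be sketched in a few lines of Python. This is a conceptual illustration of splitting a matrix product into independent submatrix products, not Kuberre’s implementation:

```python
import numpy as np

def block_matmul(A, B, bs):
    """Multiply A @ B by splitting both matrices into bs x bs
    submatrices. Each partial product is an independent unit of
    work -- the kind ScaLAPACK distributes across processors --
    and the results are summed back into the output blocks."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # Independent submatrix product, accumulated into C.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(block_matmul(A, B, 4), A @ B)
```

In a real distributed solver the inner products would run concurrently on separate processors (or, on HANSA, separate FPGA cores) rather than in a serial loop.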
One application Metlapalli is convinced will work on HANSA is GWAS (genome-wide association studies). GWAS calculations involve giant matrices with hundreds of thousands of values (SNPs). On HANSA, Metlapalli says, “one doesn’t have to take shortcuts. You have 256 GB memory to host the data you need, the compute power you need.”
FPGAs have been around since the mid-1980s, but have seen little application in life sciences. One exception is Scott Helgesen (see p. 34), who used them in the original software for the first 454 Life Sciences sequencer. The massive parallelism of FPGAs is finding particular use in military applications such as digital signal processing and Fourier transforms.
FPGAs are one increasingly popular flavor of hybrid computing, which complements CPUs with chips such as a GPU or FPGA. Metlapalli says he’s flipped the role of the CPU, so that the CPU becomes the co-processor, and the majority of operations are performed on the FPGAs.
Unlike a CPU, whose millions of gates are fixed when the die is cast, an FPGA has millions of gates controlled via software. “You’re programming the chip to perform what you want to perform, in the most optimal fashion. I take FPGA, put a software code on top of it, and everything runs on the hardware.” More logic in hardware translates to more acceleration. With FPGAs, “essentially you can transform one’s personality based on the application you’re solving.” A binary search might take 1,000 gates, but because the FPGA has 1 million gates, one can dedicate an optimal number of cores to the search while performing secondary searches in parallel.
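The gist of that last point is that many independent searches can run side by side when each one gets its own hardware. A minimal software analogue in Python, where threads stand in for dedicated blocks of gates:

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def binary_search(haystack, needle):
    """One search 'core'. On an FPGA this would be a small
    dedicated block of gates rather than a thread."""
    i = bisect_left(haystack, needle)
    return i if i < len(haystack) and haystack[i] == needle else -1

data = list(range(0, 2_000_000, 2))       # sorted list of even numbers
queries = [10, 11, 123456, 1_999_998]

# Software analogue of dedicating N gate blocks to N simultaneous searches.
with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    results = list(pool.map(lambda q: binary_search(data, q), queries))

print(results)  # [5, -1, 61728, 999999] -- -1 marks a miss
```

On the FPGA the parallelism is literal: each search occupies its own gates, so throughput scales with how many search blocks fit on the chip.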
Get a Life
Metlapalli doesn’t want to be constrained to a single industry. “To pick one vertical, we’d be doing a disfavor to the platform,” he says. “I want pharma to know this solution exists.”
One early prospect is an outsourcing vendor in India that works with most big pharmas. Kuberre is putting together a “business initiative document” under NDA. “They already have an idea of what they want to build,” says Metlapalli, indicating molecular comparisons using tools such as JCHEM. “Think of drug discovery as a funnel—the narrower you make the funnel, the faster the process. That requires more sophisticated computation.”
As for genome centers, Metlapalli says, “We strongly feel that the genome centers need a box like HANSA.” Metlapalli says he’s had encouraging discussions with Matthew Trunnell at the Broad Institute, but “the challenge has been allocating the research resources to look and see how the solutions will be built on this platform.”
With HANSA providing the equivalent of 2.5 racks of nodes, 80 inches tall, at your desk, Metlapalli is convinced that HANSA’s efficiencies can knock out clusters. The challenge is in “motivating these people and getting enough of their time to look at the box and build a solution on top.”
Despite all the hype over cloud computing, Metlapalli says HANSA offers a cost-effective alternative by providing the compute processing at the point of collection. “Take it, collect it, process it… It has the computational power to bypass cloud computing.”
“If you have a cluster or cloud, HANSA could be one of the nodes on that. If you need a departmental supercomputer, this is what you need.” It provides the equivalent of 1500 processors or 700 blades.
For example, Metlapalli claims HANSA offers a 1000-fold improvement in the BLAST search algorithm. Based on work for a previous client on a single board with 4 FPGAs, Metlapalli saw an 80X performance boost. “We have 16 boards in HANSA. So it’s 16x80, or 1280 or so.” If you take out the latencies between the boards, maybe 1000X. But life science customers “really don’t care” because BLAST makes up a small piece of their workflow. “They’d rather know how HANSA can solve their own workflow issues.”
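Metlapalli’s back-of-the-envelope scaling can be written out explicitly. The figures come from the article; the efficiency percentage is simply derived from them:

```python
per_board_speedup = 80   # measured BLAST speedup on one 4-FPGA board
boards = 16              # boards in a full HANSA

ideal = per_board_speedup * boards   # 1280x if all boards scaled perfectly
observed = 1000                      # Metlapalli's estimate after inter-board latencies
efficiency = observed / ideal

print(ideal, f"{efficiency:.0%}")    # 1280 78%
```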
As a privately funded company, Kuberre runs “a very lean and mean operation,” which is why Metlapalli is reluctant to build demo units. Instead, he challenges potentially interested researchers: “Give us a problem you’re not able to solve. We’ll do the legwork, build the prototype. Tell us that you’re going to buy it! That’s all we need. Just need an hour’s worth of time, saying what the problem is, give us the sample data, this is how the algorithm should work. Boom! We’ll do the rest.”
The HANSA architecture consists of four layers. On the bottom is the physical hardware—16 boards, each containing four FPGAs. (Each FPGA has 12 processors talking to one memory bank and 12 talking to the other.) The next layer consists of expandable firmware building blocks (for example, a binary search algorithm), so users do not have to deal with VHDL. Then comes a C/C++/Java API layer, so a single API call can draw on multiple building blocks beneath it to execute programs. The icing on the cake is the user’s own applications and custom algorithms.
“What we’ve done is provide the level of flexibility they need to build their own algorithms in their own native languages,” says Metlapalli. “No one has thought about building a supercomputer utilizing so many FPGAs together in a single box, or how to utilize it with a software stack to solve problems.”
Out of the 16 FPGA boards in the box, five could be doing Monte Carlo simulations and six doing intense numerical algorithms. The other boards might capture streaming data. “That’s what you can partition through the software. In one box, you’re dividing the personalities of HANSA into sub-personalities.” One might be numerical algorithms, another might be pattern matching.
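That partitioning scheme amounts to a board-assignment table. As a purely hypothetical sketch (the names and structure here are illustrative; Kuberre’s actual software interface is not described in the article), it might look like:

```python
# Hypothetical split of 16 boards into HANSA "sub-personalities".
partition = {
    "monte_carlo":       list(range(0, 5)),    # boards 0-4: simulations
    "numerical_kernels": list(range(5, 11)),   # boards 5-10: intense numerics
    "stream_capture":    list(range(11, 16)),  # boards 11-15: streaming data
}

# Sanity check: all 16 boards assigned exactly once, no overlap.
assigned = sorted(b for boards in partition.values() for b in boards)
assert assigned == list(range(16))
```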
HANSA contains programming capability for C/C++, MATLAB, R, and Java. “Imagine running 768 legacy C/C++ programs in parallel without having to make any changes to the legacy code, just do a recompile,” says Metlapalli. Users might want a core library such as BLAST, Smith-Waterman, etc. “We don’t want to build the entire conformation on the FPGA side. I want to provide a library so they can write their own algorithms.” Kuberre provides the ScaLAPACK library for use out of the box. “But if you want your own custom algorithms, we’ll build those for you.” K.D.
This article also appeared in the September-October 2009 issue of Bio-IT World Magazine.