By Salvatore Salamone
February 10, 2003 | Most life science companies size up clusters or grids in terms of the number of floating-point operations per second, or FLOPS, they deliver, but there is a growing trend to look instead at the number of floating-point operations per year, or FLOPYs.
“The progress of science and the quality of research are strongly related to throughput,” says Miron Livny, a professor in the Computer Sciences Department at the University of Wisconsin. “We should be looking at how much can we push through a system in a sustained way… It’s not how fast can I do [something], but how much can I do by a certain deadline.”
Other scientists echo his sentiments. “High-throughput computing is the key,” said Bruce Ling, Tularik Inc.’s director of bioinformatics, in a talk at the recent Executive IT Life Science Forum, held in New York City. “That means the effective management and exploitation of all available computing resources.”
Specifically, Ling, Livny, and others are pointing out that in silico scientific research requires a combination of high performance and high availability.
“There’s a good sports analogy to high-throughput computing that people should keep in mind,” says systems consultant Raymond Lopez. “You can have a team of superstars, but if some are always out with injuries, others get into fights on the court and are suspended, and others simply perform poorly because of personal problems, you don’t win as often as you should.”
Likewise with high-throughput computing (HTC). “If you have computers that are not available because they break down, jobs that cannot run because they lack the right number of software licenses, or computers that are simply not tuned for maximum performance, you run fewer computational jobs in a given period of time.”
Optimal HTC requires everything in a cluster or grid to be in good working order, meaning each time a scientist submits a job, the computers and software are 100 percent available.
But ensuring systems availability in a distributed computing environment is not trivial. Managers have typically turned to software to help keep their systems and the computational jobs running.
Many organizations use job management and scheduling software from companies like Platform Computing Inc. or TurboWorx Inc. to make sure that each computational task has the appropriate resources such as an adequate number of software licenses of analytical applications.
Additionally, companies like the consultancy BioTeam Inc. have advocated a unique approach to high availability. BioTeam puts clusters on a separate network so they are immune to normal company network traffic and operations. The idea is that a cluster should operate as a dedicated scientific computing system. Such dedicated systems would likely be better managed and have less downtime than a traditional, loosely coupled cluster that taps unused systems spread throughout a lab or company.
But even with better management tools and dedicated systems, other availability issues must be taken into account. One is software portability. “If you’ve optimized an application for one platform and you install a new platform and you can’t run it for two months, that’s a throughput problem,” Livny says. He notes if you can’t run a job for a few months, the in silico research does not get done.
Sustaining throughput gets even more complicated in grids. “The challenge for grids has been how to deal with scheduling and fault tolerance,” says Daniel Reed, director of the National Center for Supercomputing Applications.
Grids can pose a tougher availability challenge to managers. For instance, if a server in a cluster freezes up, someone in that lab can simply reboot that server and let the computation resume. But when problems arise in a geographically distributed grid, restoring the system is no easy task. “There’s no single reboot button,” Livny says.
Sidebar: Flipping FLOPYs
One way to measure a cluster’s rate of throughput is to calculate FLOPYs -- the number of floating-point operations a system can complete in one year.
FLOPYs are calculated by multiplying a system's FLOPS rating (floating-point operations per second) by the number of seconds in a year (31,536,000), assuming the cluster is running 24x7 at 100 percent CPU capacity. For example, a 100-gigaFLOPS cluster would be rated at 3.154 exaFLOPYs (3.154 quintillion floating-point operations per year). But note that, as with fuel economy, mileage may vary. Most systems run, on average, at about 20 percent CPU utilization, not the 100 percent this calculation assumes.
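The sidebar's arithmetic can be sketched in a few lines of Python. This is just an illustration of the calculation described above; the function name and the utilization parameter are our own additions, with the 20 percent figure taken from the article's caveat about real-world CPU utilization.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def flopys(flops_rating: float, utilization: float = 1.0) -> float:
    """Estimate floating-point operations per year (FLOPYs) for a system
    with the given sustained FLOPS rating and average CPU utilization
    (a fraction between 0.0 and 1.0)."""
    return flops_rating * SECONDS_PER_YEAR * utilization

# The sidebar's example: a 100-gigaFLOPS cluster running flat out all year
ideal = flopys(100e9)            # 3.1536e18, i.e. about 3.154 exaFLOPYs

# The same cluster at a more typical 20 percent average utilization
realistic = flopys(100e9, 0.20)  # roughly 6.3e17, a fivefold drop
```

The second call makes the article's point concrete: throughput over a year is dominated by sustained availability and utilization, not by the peak FLOPS number on the spec sheet.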