



By Michael Athanas


February 18, 2004 | If informatics computing on loosely coupled dedicated servers (clusters or compute farms) is such an attractive solution, why are life science IT shops still blowing big bucks on refrigerator-look-alike symmetric multiprocessor (SMP) machines?

The main attraction of cluster computing is low upfront cost: by some estimates, 4 to 50 times cheaper for the equivalent processing and storage capability of an SMP behemoth. But the true beauty of a cluster is its scalability. (Of course, scaling depends on making the correct choices of hardware and networking architecture, but more on this in a future column.)

Another important attraction is ease of management: A properly configured 100-node cluster requires about the same effort to manage as a single SMP refrigerator. It is also not as easy to build a bad cluster as it used to be. Commodity "*nix" operating systems such as Linux and OS X have evolved with clustering in mind. Likewise, the robustness of distributed load management software such as Sun Grid Engine or Platform LSF is improving to the point that it is becoming a commodity and can (or should) be considered part of the OS.

Making real scientific use of a configured cluster is a challenge. The saddest sight in a data center is a finely tuned cluster, using megawatts of power and tons of cooling, spending most of its time calculating zero. Idle clusters should be a punishable informatics crime. Part of the cause is the wide communication gulf that often separates scientists from IT staff.

Corporate politics aside, the most challenging problem in making effective use of a cluster architecture is application enabling: getting applications, pipelines, and workflows to take advantage of a boundlessly scalable cluster computing infrastructure. Enabling applications for cluster computing requires an understanding of the application's basic execution flow. Two categories of applications often fit well into a parallel, distributed environment: data-driven stream processing and parameter search algorithms.

Many sequence analysis applications fall into the data-driven stream-processing category; correlation matrix calculations for array analysis and NCBI's BLAST are two examples. In both cases, data from a reference source are streamed through the core algorithm to be compared and analyzed, and a non-parallelizable component of the execution flow assembles and processes the final results. Many commercial and noncommercial "accelerated" BLAST implementations are available.
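
A rough sketch of the pattern, in Python, might look like the following. (This is only an illustration: the scoring function, reference set, and query set are toy stand-ins, not BLAST itself. The point is the shape of the code, a parallel scan followed by a serial assembly step.)

    # Sketch: data-driven stream processing.
    # Each query is scored against the reference stream in parallel;
    # a serial step assembles the final ranked report.
    from concurrent.futures import ProcessPoolExecutor

    REFERENCE = ["ACGTACGT", "TTGACCA", "GGGCGTAA"]   # stand-in reference set

    def score(query, subject):
        """Toy similarity score -- a stand-in for the real core algorithm."""
        return sum(1 for a, b in zip(query, subject) if a == b)

    def scan_one(query):
        """Parallelizable unit of work: stream the reference past one query."""
        best = max(REFERENCE, key=lambda s: score(query, s))
        return query, best, score(query, best)

    def main():
        queries = ["ACGTTCGT", "TTGACGA"]             # stand-in query set
        with ProcessPoolExecutor() as pool:
            hits = list(pool.map(scan_one, queries))  # parallel component
        for query, best, s in sorted(hits, key=lambda h: -h[2]):
            print(f"{query} -> {best} (score {s})")   # serial assembly/report

    if __name__ == "__main__":
        main()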

Systems modeling is an example of a parameter search algorithm: a vector of model parameters is adjusted until the model response best matches actual data, and the individual model evaluations can be executed in parallel.
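
A minimal sketch of that idea follows; the model, data, and parameter grid are invented purely for illustration. Each candidate parameter vector is an independent parallel task, and only the final comparison is inherently serial.

    # Sketch: parameter search.  Each candidate parameter vector is an
    # independent job; only picking the winner is inherently serial.
    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    OBSERVED = [(0.0, 1.0), (1.0, 2.7), (2.0, 7.4)]   # invented (x, y) data

    def model(x, a, b):
        return a * (b ** x)                            # toy exponential model

    def misfit(params):
        """Parallelizable unit: sum of squared residuals for one vector."""
        a, b = params
        return params, sum((model(x, a, b) - y) ** 2 for x, y in OBSERVED)

    def main():
        grid = list(product([0.5, 1.0, 1.5], [2.0, 2.5, 3.0]))  # parameter grid
        with ProcessPoolExecutor() as pool:
            results = pool.map(misfit, grid)           # parallel evaluations
        best, err = min(results, key=lambda r: r[1])   # serial reduction
        print(f"best parameters a={best[0]}, b={best[1]}, misfit={err:.3f}")

    if __name__ == "__main__":
        main()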

To take advantage of cluster computing, informatics developers must learn to decompose complex, monolithic problems into manageable, well-defined pieces. For example, instead of loading all of GenBank into a single Perl hash table for sequential processing, it may be prudent to structure the implementation around the analysis of a single sequence. Doing so makes it feasible to analyze many sequences in parallel.
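
One way to picture that decomposition is a small worker that handles exactly one sequence named on its command line, so the load manager can launch as many copies as there are sequences. (The one-sequence-per-file layout and the GC-content "analysis" below are placeholders for whatever the real pipeline does.)

    # Sketch: per-sequence worker.  One invocation analyzes one sequence file,
    # so a scheduler can run thousands of these tasks side by side.
    import sys

    def read_sequence(path):
        """Read a single one-sequence FASTA-style file (placeholder parser)."""
        with open(path) as fh:
            lines = [line.strip() for line in fh if not line.startswith(">")]
        return "".join(lines)

    def analyze(seq):
        """Placeholder analysis: GC content of the sequence."""
        gc = sum(seq.count(base) for base in "GC")
        return gc / len(seq) if seq else 0.0

    if __name__ == "__main__":
        seq_file = sys.argv[1]                  # one sequence per task
        print(f"{seq_file}\tGC={analyze(read_sequence(seq_file)):.3f}")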

Countless tools and techniques are available for transforming a monolithic, nonthreaded application into a distributed solution, including such low-level tools as the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM). Unfortunately, these tools can also transform an application into a quagmire of complex, unmanageable, nonportable code, so take care when using them. Quite often, the parallel and nonparallel components can be separated cleanly enough that a load management system can perform the execution transparently, as if it were running on an SMP refrigerator look-alike.
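
When the parallel pieces are fully independent, a thin submission wrapper is often all the "parallel code" that is needed. The sketch below assumes an SGE-style qsub command on the PATH and a site-specific per-chunk worker script; both are stand-ins for whatever your load manager and pipeline actually use.

    # Sketch: hand the parallel pieces to the load manager and keep the
    # serial merge in ordinary code.  Assumes an SGE-style "qsub" command
    # and a per-chunk analysis script; both are site-specific stand-ins.
    import glob
    import subprocess

    def submit_chunks(chunk_paths, worker_script="analyze_chunk.sh"):
        """Submit one batch job per input chunk (hypothetical worker script)."""
        for path in chunk_paths:
            subprocess.run(["qsub", worker_script, path], check=True)

    def merge_results(result_glob="results/chunk_*.out", merged="results/all.out"):
        """Serial component: run after the scheduler reports all jobs done."""
        with open(merged, "w") as out:
            for path in sorted(glob.glob(result_glob)):
                with open(path) as part:
                    out.write(part.read())

    if __name__ == "__main__":
        submit_chunks(sorted(glob.glob("chunks/chunk_*.fa")))
        # merge_results() runs later, once the load manager reports that
        # all chunk jobs have completed.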

Don't divorce your rusty SMP behemoth just yet. Cluster computing infrastructures can't solve all scientific computing problems, but they can relieve your enterprise SMP machines of mundane processing and extend their lives. The extra development effort to enable applications for cluster processing pays off quickly as the gap between everyday workloads and boundless scientific computing narrows.



Michael Athanas is a founding partner and principal investigator at The BioTeam. E-mail: michael@bioteam.net. 


ILLUSTRATION BY TIMOTHY COOK
 

