July 14, 2004 | Anyone who's done it knows that microarray analysis can be computationally intensive. However, much of the microarray analysis software out there is proprietary, runs only on Windows, and executes as a single process on a single machine (not very scalable). The BioTeam was recently asked to implement a free, open-source, cross-platform, and cluster-enabled solution for high-performance microarray analysis. As part of the solution, we made use of R and several R packages from BioConductor. We thought we should share some of what we learned.
What is R?
Grossly oversimplified, R is what you use when your data analysis and plotting requirements exceed the functionality of Excel but you don't want to buy a commercial package. R is rapidly gaining widespread acceptance, use, and utility within the bioinformatics community due to its rich functionality, open-source licensing, ease of extensibility, and broad developer support from a worldwide community.
R is both an interpreted programming language and an application environment for statistical computing and graphics. It consists of a language, run-time environment, graphics engine, and debugger, and it can execute stored R scripts. R is available for free in source-code form under the terms of the Free Software Foundation's GNU General Public License, and it compiles and runs on a wide variety of Unix-based platforms (including FreeBSD, Linux, and Mac OS X) as well as on Windows.
The R language supports branching, looping, and modular programming using functions. Many of the functions that ship with R are themselves written in R; however, users can also call procedures written in C, C++, or Fortran when they need higher performance.
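As a rough illustration of those language features, here is a minimal sketch that combines a user-defined function, branching, and a loop. The function name classify_ratio, the cutoff of 1, and the gene names and values are all made up for this example and do not come from any package.

```r
# Hypothetical helper: label a log2 expression ratio as up, down, or unchanged.
# The default cutoff of 1 (a two-fold change) is an illustrative assumption.
classify_ratio <- function(log_ratio, cutoff = 1) {
  if (log_ratio >= cutoff) {
    "up"
  } else if (log_ratio <= -cutoff) {
    "down"
  } else {
    "unchanged"
  }
}

# Made-up log2 ratios for three made-up genes.
log_ratios <- c(geneA = 2.3, geneB = -0.4, geneC = -1.7)

# Explicit loop: fill in one call per gene.
calls <- character(length(log_ratios))
names(calls) <- names(log_ratios)
for (gene in names(log_ratios)) {
  calls[gene] <- classify_ratio(log_ratios[[gene]])
}
print(calls)

# The same result without an explicit loop, applying the function to each element.
calls_vec <- vapply(log_ratios, classify_ratio, character(1))
```

In everyday R code the second form is usually preferred, since applying a function over a whole vector tends to be shorter and less error-prone than managing the loop by hand.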
[Figure: R IS FOR RENDER. R incorporates functions to generate a variety of graphs and plots; the original figure showed a few examples of the package's data presentation capabilities.]
The core R distribution provides a wide variety of statistical functions, including linear and nonlinear modeling, time series analysis, parametric and nonparametric tests, clustering, and smoothing. Its graphics functions produce publication-quality plots and can write them out in many common graphics formats.
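To give a feel for a few of these capabilities, the sketch below runs on simulated data, not real microarray measurements. It fits a linear model, runs one parametric and one nonparametric test, clusters a small matrix, smooths a curve, and writes a plot to a PDF file. Every function used ships with a standard R installation; the variable names, sample sizes, and output file name are arbitrary choices for this example.

```r
set.seed(42)  # make the simulated data reproducible

# Linear modeling: simulate a noisy line and recover its slope.
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)
slope <- coef(fit)[["x"]]

# Parametric and nonparametric tests on two simulated groups.
control <- rnorm(10, mean = 0)
treated <- rnorm(10, mean = 3)
tt <- t.test(treated, control)        # Welch two-sample t-test
wt <- wilcox.test(treated, control)   # Wilcoxon rank-sum test

# Clustering: two well-separated groups of ten points each.
m <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
hc <- hclust(dist(m))                 # hierarchical clustering
groups <- cutree(hc, k = 2)           # cut the tree into two clusters

# Smoothing: locally weighted scatterplot smoother.
sm <- lowess(x, y)

# Graphics: write the scatterplot, fitted line, and smooth to a PDF file.
out <- file.path(tempdir(), "fit.pdf")
pdf(out)
plot(x, y, main = "Simulated data with linear fit")
abline(fit)
lines(sm, lty = 2)
dev.off()
```

Swapping pdf() for another graphics device such as png() or postscript() changes the output format without touching the plotting calls.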
In addition to the R core, R is easily extended through the network-based installation of additional modules (add-on packages) that are available for a variety of specific purposes.
R users have access to the Comprehensive R Archive Network (CRAN), which hosts several hundred user-contributed packages ranging from abind (combines multidimensional arrays) to zoo (methods for ordered indexed observations).
Two other large open-source R package repositories of note are Omegahat and BioConductor. The Omegahat packages extend R with intersystem interfaces to things like Java, Perl, Python, and CORBA. BioConductor provides nearly 100 packages that extend R for bioinformatics research ranging from rendering annotations of public genomic data to microarray analysis.
R's graphical user interface (for Linux, Mac OS X, and Windows) is minimal: a console within a window, plus menus for basic functions. To get the most from R, the user works either interactively at the built-in command interpreter or by running R scripts, written in any text editor, from the command line. Because scripts can be run from the command line, it is straightforward to submit and execute them across a scalable cluster using a distributed resource management system such as Sun Grid Engine.
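A script written for command-line execution might look something like the sketch below. It is an illustration under stated assumptions, not the workflow we actually deployed: the file name analyze.R, the tab-delimited input with an intensity column, and the summary it writes are all hypothetical. With no arguments, the script generates a tiny demo input so that it can be tried as-is.

```r
# analyze.R (hypothetical file name)
# One possible invocation with current R releases:
#   Rscript analyze.R chip01.txt chip01.summary.txt
# On a Grid Engine cluster the same command could be wrapped in a job
# submission, for example (exact options depend on the site's setup):
#   qsub -cwd -b y Rscript analyze.R chip01.txt chip01.summary.txt

args <- commandArgs(trailingOnly = TRUE)  # arguments given after the script name

if (length(args) >= 1) {
  infile <- args[1]
} else {
  # No input given: write a small made-up demo file to a temporary directory.
  infile <- file.path(tempdir(), "demo_chip.txt")
  demo <- data.frame(probe = paste0("p", 1:5),
                     intensity = c(120, 340, 80, 1500, 610))
  write.table(demo, infile, sep = "\t", row.names = FALSE, quote = FALSE)
}
outfile <- if (length(args) >= 2) args[2] else paste0(infile, ".summary.txt")

# Read the tab-delimited input and log-transform the intensities.
chip <- read.delim(infile)
chip$log2_intensity <- log2(chip$intensity)

# Write a one-line summary for this input file.
result <- data.frame(file = basename(infile),
                     n_probes = nrow(chip),
                     mean_log2 = mean(chip$log2_intensity))
write.table(result, outfile, sep = "\t", row.names = FALSE, quote = FALSE)
cat("Wrote", outfile, "\n")
```

Because each invocation reads one input file and writes one output file, many copies can run at the same time on different nodes, one per chip, without needing to communicate with each other.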
With all the components in place, the few remaining bits involve parallelizing the R code for high performance, which we'll leave as an exercise for the reader.
Bill Van Etten is a consultant for The BioTeam. E-mail: email@example.com.