Inside the Box - Bill Van Etten

July 14, 2004 | Anyone who's done it knows that microarray analysis can be computationally intensive. However, much of the microarray analysis software out there is proprietary, runs only on Windows, and executes as a single process on a single machine (not very scalable). The BioTeam was recently asked to implement a free, open-source, cross-platform, and cluster-enabled solution for high-performance microarray analysis. As part of the solution, we made use of R and several R packages from BioConductor. We thought we should share some of what we learned.


What is R? 
Grossly oversimplified, R is what you use when your data analysis and plotting requirements exceed the functionality of Excel but you don't want to buy a commercial package. R is rapidly gaining widespread acceptance, use, and utility within the bioinformatics community due to its rich functionality, open-source licensing, ease of extensibility, and broad developer support from a worldwide community.

R is both an interpreted programming language and an application environment for statistical computing and graphics. It consists of a language, run-time environment, graphics engine, and debugger, and it can execute stored R scripts. R is available for free in source-code form under the terms of the Free Software Foundation's GNU General Public License, and it compiles and runs on a wide variety of Unix-based platforms (including FreeBSD, Linux, and Mac OS X) as well as on Windows.

The R language supports branching, looping, and modular programming using functions. Most of the functions that users see in R are themselves written in R; for computationally intensive tasks, however, users can link in and call procedures written in C, C++, or Fortran for higher performance.


[Figure caption] R IS FOR RENDER: R incorporates functions to generate a variety of graphs and plots. The accompanying figure shows a few examples of the package's data presentation capabilities.

The core R distribution provides a wide variety of statistical functions, including linear and nonlinear modeling, time series analysis, parametric and nonparametric tests, clustering, and smoothing. Its graphics functions also provide a flexible environment for producing publication-quality plots in many common graphics formats.

Beyond the core distribution, R is easily extended by installing add-on packages over the network. Packages are available for a wide variety of specific purposes.

R users have access to the Comprehensive R Archive Network (CRAN), which hosts several hundred user-contributed packages ranging from abind (combines multidimensional arrays) to zoo (methods for ordered indexed observations).

Two other large open-source R package repositories of note are Omegahat and BioConductor. The Omegahat packages extend R with intersystem interfaces to things like Java, Perl, Python, and CORBA. BioConductor provides nearly 100 packages that extend R for bioinformatics research, with uses ranging from rendering annotations of public genomic data to microarray analysis.

The minimal graphical user interface to R (for Linux, Mac OS X, and Windows) provides a console within a window and menus for basic functions. However, to get the most from R, the user works either at R's built-in interactive command prompt or by running text-edited R scripts from the command line. The command-line interface makes it fairly simple to submit and execute R scripts across a scalable cluster architecture using a distributed resource management system, such as Sun Grid Engine.
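
As a rough illustration of that command-line workflow, the following Python sketch builds (but does not run) a submission command for a single R script. It assumes the cluster provides Sun Grid Engine's qsub with the commonly available -cwd, -b y, and -N options, and that R is on the path of the execution hosts. The script name normalize.R and the helper build_qsub_command are hypothetical, and the exact options will vary from site to site.

```python
import shlex


def build_qsub_command(r_script, job_name):
    """Return the argument list for submitting one R script as a batch job.

    Assumption: Sun Grid Engine's qsub is installed and accepts
    -cwd (run in the current directory), -b y (treat the command as a
    binary rather than a job script), and -N (job name).
    """
    log_file = r_script + "out"  # e.g. normalize.R -> normalize.Rout
    return [
        "qsub", "-cwd", "-b", "y", "-N", job_name,
        # R CMD BATCH runs the script non-interactively and logs to log_file.
        "R", "CMD", "BATCH", "--vanilla", r_script, log_file,
    ]


# Hypothetical script name, used only for illustration.
command = build_qsub_command("normalize.R", "normalize")
command_line = " ".join(shlex.quote(arg) for arg in command)
print(command_line)
```

Building the command as a list and quoting it only for display keeps file names with unusual characters from being split incorrectly if the list is later handed to a process launcher.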

With all the components in place, the few remaining bits involve parallelizing the R code for high performance, which we'll leave as an exercise for the reader.
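
One common starting point, offered here only as a sketch and not as the approach The BioTeam used, is data parallelism: divide the input into independent chunks and run the same R script on each chunk as a separate cluster job. The Python helper below shows only the partitioning step. It assumes each input file can be processed independently; the file names and the function split_into_chunks are hypothetical.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous lists whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical input files, one per microarray.
array_files = ["array%02d.cel" % i for i in range(1, 11)]
for job_id, chunk in enumerate(split_into_chunks(array_files, 3), start=1):
    print(job_id, chunk)
```

Each chunk could then be passed to its own cluster job. Steps that need every array at once, as some normalization methods do, would not split this cleanly and would have to run before or after the parallel stage.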



Bill Van Etten is a consultant for The BioTeam. E-mail: bill@bioteam.net. 


