Cloud Computing for Data Analysis: Toward the Plateau of Productivity

By Ben Langmead

July 29, 2010 | According to Gartner, the impact of a new technology follows a “hype cycle,” which begins with the “technology trigger,” proceeds through the “peak of inflated expectations,” then the “trough of disillusionment,” the “slope of enlightenment,” and finally the “plateau of productivity.” In the case of cloud computing for bioinformatics, the first few phases are increasingly a matter of record (see Bio•IT World, Nov 2009). But as the cloud/anti-cloud debate progresses, look for it to become more grounded in experience and measurement.

Why care about cloud computing in the first place? The short answer is throughput: sequencers are getting faster at a rate of about 5x per year, whereas computers get faster at a rate closer to 2x every two years. As both technologies evolve, one processor has no chance of keeping up with the data produced by one sequencer. But many processors do, especially if more processors are dedicated to each sequencer as sequencing gets faster. Using many processors to run one software program is called “parallel computing” and a program’s ability to make good use of additional processors is called “scalability.”
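To put rough numbers on that gap, here is a minimal Python sketch. The two growth rates come from the paragraph above; the function names are hypothetical, and the processor count assumes a perfectly scalable program, which real software rarely is.

```python
import math

# Growth rates quoted in the text; everything else here is illustrative.
SEQUENCER_GROWTH_PER_YEAR = 5.0        # "about 5x per year"
COMPUTER_GROWTH_PER_YEAR = 2.0 ** 0.5  # "2x every two years", roughly 1.41x per year


def throughput_gap(years: float) -> float:
    """Factor by which sequencer output outgrows one processor after `years`."""
    return (SEQUENCER_GROWTH_PER_YEAR / COMPUTER_GROWTH_PER_YEAR) ** years


def processors_needed(years: float, processors_today: int = 1) -> int:
    """Processors per sequencer needed to keep pace after `years`.

    Assumption: the analysis program scales perfectly, so doubling the
    processors doubles the throughput.
    """
    return math.ceil(processors_today * throughput_gap(years))


if __name__ == "__main__":
    for year in range(4):
        print(year, processors_needed(year))
```

Under these assumptions the gap compounds at about 3.5x per year, which is why adding processors, rather than waiting for faster ones, is the only way to keep pace.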

Parallel computing has come in many flavors over the decades (e.g. MPI, Condor), but unlike other paradigms, cloud computing assumes the processors and computers don’t belong to the user, but are rented over the Internet, usually for a fee. Computers and computer clusters rented in this way are owned and maintained by a cloud vendor, like Amazon Web Services or Microsoft.

Because cloud vendors run their services from massive data centers, and therefore benefit from vast economies of scale, they can set prices that are competitive with (and in some cases lower than) what users would pay to build and maintain their own computer clusters. But whether or not they save cash, users benefit from having shifted IT workload to the cloud vendor.

In addition to minimizing users’ hardware headaches, cloud computing can minimize software headaches by shifting the burden of software installation from the cloud user to the cloud developer. This is possible because the software that runs on each rented computer is “virtualized,” i.e. compartmentalized into a software “image” with the OS and all necessary software pre-installed. Moreover, because the pool of available cloud resources is so large, cloud users are free to scale their holdings up and down substantially as demand fluctuates (read: as that paper deadline approaches). Because the gap between sequencing throughput and computer speed is ever-increasing, this may be some groups’ only hope for conducting research without being forced to leave sequencers idle while analyses catch up.

Cloud Considerations

Sounds great, right? Alas, cloud computing also has drawbacks. First, cloud users must herd their datasets into the cloud before they can be operated on by cloud computers. Choreographing data movement can be vexing, and the sheer amount of data can overtax the user’s Internet uplink, especially if data from many sequencers are pushed over the same uplink on a regular basis. Potential users are advised to consider their data generation schedule and conduct real-world testing to determine whether their Internet connection can take the load.
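Before any real-world testing, a back-of-the-envelope check can show whether an uplink is even in the right ballpark. The Python sketch below is illustrative only: the function names, the decimal units, and the default `efficiency` factor (the fraction of nominal bandwidth actually achieved) are assumptions, not measurements.

```python
def upload_hours(data_gb: float, uplink_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate hours needed to push `data_gb` gigabytes over the uplink.

    Assumptions: decimal units (1 GB = 8000 megabits) and a sustained
    transfer rate of `efficiency` times the nominal `uplink_mbps`.
    """
    if uplink_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("uplink_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000
    seconds = megabits / (uplink_mbps * efficiency)
    return seconds / 3600


def uplink_keeps_up(daily_data_gb: float, uplink_mbps: float, efficiency: float = 0.8) -> bool:
    """True if one day's worth of data can be uploaded in under 24 hours."""
    return upload_hours(daily_data_gb, uplink_mbps, efficiency) <= 24.0
```

If `uplink_keeps_up` returns False for a lab's expected daily output, data will pile up faster than it can be shipped, and no amount of cloud capacity on the other end will help.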

Second, users may want to run software that doesn’t yet exist in a form that runs, or runs well, in the cloud. Cloud computer clusters are organized in a “loosely coupled” manner; the computers are fast, but the network over which they communicate and cooperate is slow. This is adequate for problems that can be split into big chunks that each computer solves on its own without coordinating much with other computers. Many comparative genomics problems fit this mold. But problems requiring intense and intricate communication could run slowly in the cloud. Users are advised to evaluate existing cloud software and determine what software (if any) must be developed from scratch for their research. Note that developing cloud software often requires more work and expertise than developing non-cloud software. Cloud software can also be harder to use than the alternatives, and can be quite difficult to debug when things go wrong.
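The “big chunks” pattern can be sketched in a few lines of Python. This is a toy illustration, not real cloud software: a local thread pool stands in for separate rented computers, and the task (computing the GC fraction of a set of reads) and all function names are hypothetical. The point is that each worker handles its chunk alone and only a tiny summary travels back over the slow network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def split_into_chunks(items: Sequence, n_chunks: int) -> List[list]:
    """Split `items` into at most `n_chunks` contiguous, roughly equal chunks."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size = -(-len(items) // n_chunks)  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def count_gc(reads: Sequence[str]) -> Tuple[int, int]:
    """Work done independently on one chunk: (G/C bases, total bases)."""
    gc = sum(read.count("G") + read.count("C") for read in reads)
    total = sum(len(read) for read in reads)
    return gc, total


def gc_fraction(reads: Sequence[str], n_workers: int = 4) -> float:
    """Compute the overall GC fraction by combining per-chunk summaries."""
    chunks = split_into_chunks(reads, n_workers)
    if not chunks:
        return 0.0
    # The thread pool is a stand-in for independent cloud machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_gc, chunks))
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0
```

A problem in which workers must exchange intermediate results at every step does not decompose this cleanly, and that is the kind of workload likely to suffer on a loosely coupled cluster.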

Last but not least, cloud computing raises questions of privacy and security. Users handling data subject to restrictions imposed by institutional review boards or government regulations must determine whether their cloud vendor guarantees compliance with the relevant regulations, and may also have to convince their review boards that this is truly the case. Note that both Amazon and Microsoft claim compliance with regulations such as SOX and HIPAA. Users should determine ahead of time whether their cloud vendor provides sufficient guarantees for their current and future data.

So while cloud computing is no panacea, it may still be a much-needed cure for the ever-growing throughput gap between sequencers and computers. With a vigorous, scientific debate, we may even have a shot at reaching the “plateau of productivity” before too long.

Ben Langmead is at the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health.

This article also appeared in the July-August 2010 issue of Bio-IT World Magazine.