Nov. 13, 2007
Buying CPUs by the hour is back. Marketing apparently decided that ‘On Demand’ sounds pushy and ‘Utility’ and ‘Grid’ are too rigid. This time it will be called a ‘Cloud’, and it will be running inside Amazon.com. Many groups have offered CPUs by the hour over the years, but the newest offerings address many of their predecessors’ weaknesses. In August 2006, Amazon introduced EC2 (Elastic Compute Cloud). In recent months, Microsoft began talking publicly about a “Cloud OS,” and a Google-IBM alliance announced an academic-only “Cloud” collaboration.
So why now? Multicore chips have arrived, but outside a few well-established niches, the multi-threaded and multi-process codes needed to use this hardware effectively will take a while to appear. At the same time, virtualization is maturing. Running virtual machines is a reasonable way to get a lot of use out of a multicore system. The back room and the data center have known this for years, and consumer products like Parallels and VMware have pushed that same technology onto the workstation. More users are getting used to the idea of creating machines on demand.
Creating virtual machines somewhere across the Internet is the next step, and that is what Amazon’s EC2 does. While there is still a learning curve, the adoption process is far simpler than what we’ve seen from others. Previous offerings required an upfront estimate of how many hours would be used, and offers of hundreds of free hours smelled of strings attached and a sales rep.
Amazon charges ten cents for each hour a virtual machine is running. I can cancel or pause at any time. Coupled with their storage network, it is possible to use precisely the amount of CPU and disk that I need, with a remarkably simple setup process. Two minutes after I request my first machine, I’m SSHed into a root shell. My first hour of kicking the tires cost 11 cents. This freedom to experiment makes an enormous difference.
The machine is a virtual instance with a 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth, for $0.10/machine/hour. For more horsepower, machines with 4x or 8x those specs are available for $0.40 and $0.80. Persistent storage, independent of the virtual machines, costs $0.15/GB/month. Transfers within the cloud are free; data moving between the cloud and the Internet costs $0.10/GB on the way in and $0.18/GB on the way out. With larger volumes, outbound transfers fall to $0.13/GB.
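Under those rates, a rough monthly bill is easy to estimate. Here is a minimal sketch in Python; the function name and the example workload are my own illustrations, not Amazon’s calculator, and it assumes the flat small-instance and low-volume transfer rates quoted above:

```python
# Rough EC2/S3 bill estimator using the Nov. 2007 rates quoted above.
# All names and the example workload are illustrative.

def estimate_monthly_cost(instance_hours, gb_stored, gb_in, gb_out):
    """Return an estimated monthly bill in dollars."""
    compute = instance_hours * 0.10          # small instance, $0.10/hour
    storage = gb_stored * 0.15               # persistent storage, $0.15/GB/month
    transfer = gb_in * 0.10 + gb_out * 0.18  # in $0.10/GB, out $0.18/GB
    return round(compute + storage + transfer, 2)

# One small instance running around the clock for a 30-day month,
# 50 GB stored, 20 GB transferred in, 10 GB transferred out:
print(estimate_monthly_cost(30 * 24, 50, 20, 10))
```

The point of the exercise: even a continuously running instance with modest storage and traffic stays well under $100/month at these rates.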
This pricing model has some interesting properties. A shared resource such as a bio-mirror.net site hosted in Amazon’s S3 storage would be fast and free to anyone inside EC2. More dramatically, consider a large parallel task with a small input and output, such as MrBayes. If the compute will take 10 machines 10 days, inside Amazon EC2 it costs $240. Alternatively, the user can request 100 machines, get the results back in a single day, and it will still cost $240. Rush jobs don’t cost extra. If you’d like to crunch the numbers, Amazon provides a calculator.
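The rush-job claim is simple arithmetic: the bill depends only on total machine-hours, so spreading the same work across more machines shortens the wall-clock time without changing the cost. A quick check, using the MrBayes figures above (the helper function is my own sketch):

```python
RATE = 0.10  # dollars per instance-hour, small instance

def job_cost(machines, days):
    """Cost of running `machines` instances for `days` days."""
    return machines * days * 24 * RATE

slow = job_cost(10, 10)   # 10 machines for 10 days
fast = job_cost(100, 1)   # 100 machines for 1 day
print(slow, fast)         # both come to 240.0
```

Either way the job consumes 2,400 instance-hours, so the invoice is identical; only the turnaround time differs.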
Amazon has also simplified the creation of custom virtual machines. The machine I boot can be one of several standard “base” boot images (Fedora, Debian, Windows Server 2003, ...). I’ve got root, so I can install software, create accounts, and make it suit my needs. After taking a snapshot of that machine, I can boot this custom system rather than the default one. This puts me into my customized environment in 2-5 minutes.
I know I should be making offsite backups of my personal files, but it was always too much of a hassle. Now I use Jungledisk.com and s3sync.net to back up my personal machines into the cloud. Services like Jungledisk are possible because Amazon handles the billing while allowing third-party developers to add services and surcharges. This promises to create an interesting new software service environment.
When I first saw the Watson and Venter genome sequence FASTA files sitting on the NCBI FTP server, I was excited. I booted a machine in the cloud and began the downloads. During the transfers I built a small BioPerl pipeline to run mpiBLAST and CLUSTALW. While making a Watson+Venter blastable database, I downloaded dbSNP. When that completed, I booted a few more machines and crunched the data for a few days. Afterwards I copied the results down to my local machine and turned off my little cluster. I’d spent less than $50.
Facilities that produce large volumes of raw data will continue to find the need for onsite compute resources. Network pipes to the laboratory will not keep pace with next generation sequencing technologies. Some facilities may incorporate a remote cloud into their schedulers. For the moment, many will find that no matter how compelling the service might be, they cannot bring themselves to ship their data offsite. These folks will continue to build out ever larger machine rooms. But if you don’t produce primary data, you don’t need a data center. Smaller shops working with data from the public domain or remote collaborators, including the next generation of biotechs, may find little reason to maintain onsite servers.
I suspect a few network admins will view these as storm clouds, but for science they appear to have a silver lining.
Michael Cariaso is the senior scientific consultant for the BioTeam. He can be reached at firstname.lastname@example.org.