Sunny Skies for Compute Cloud


By Michael Cariaso
Nov. 13, 2007 | Buying CPUs by the hour is back. Marketing apparently decided that ‘On Demand’ sounds pushy and ‘Utility’ and ‘Grid’ are too rigid. This time it will be called a ‘Cloud’, and it will be running inside Amazon.com. Many groups have offered CPUs by the hour over the years, but the newest offerings seem to address many of the weaknesses of their predecessors. In August 2006, Amazon introduced EC2 (Elastic Compute Cloud). In recent months, Microsoft began talking publicly about a “Cloud OS,” and a Google-IBM alliance announced an academic-only “Cloud” collaboration.

So why now? Multicore chips may have arrived, but with the exception of a few well-established niches, the multi-threaded and multi-process codes needed to use this hardware effectively are going to take a while to appear. At the same time, virtualization is maturing. Running virtual machines is a reasonable way to get a lot of use out of a multicore system. The back room and the data center have known this for years, and consumer products like Parallels and VMware have pushed that same technology onto the workstation. More users are getting used to the idea of creating machines on demand.

Creating virtual machines somewhere across the Internet is the next step, and that is what Amazon’s EC2 does. While there is still a learning curve, the adoption process is far simpler than what we’ve seen from others. Previous offerings have required an upfront estimate of how many hours would be used, and offers of hundreds of free hours smelled of strings and a sales rep.

Amazon charges ten cents for each hour a virtual machine is running. I can cancel or pause at any time. Coupled with their storage network, it is possible to use precisely the amount of CPU and disk that I need with a remarkably simple setup process. Two minutes after I request my first machine, I’m SSHed into a root shell. My first hour of kicking the tires cost 11 cents. This freedom to experiment makes an enormous difference.

The machine is a virtual instance with a 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, and 250 Mb/s of network bandwidth. This costs $0.10/machine/hour. For more horsepower, machines with 4x or 8x specs are available for $0.40 and $0.80 per hour. Persistent storage is independent of the virtual machines and costs $0.15/GB/month. Transfers within the cloud are free, while data moving between the cloud and the Internet costs $0.10/GB on the way in and $0.18/GB on the way out. With larger volumes, outbound transfers fall to $0.13/GB.
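As a rough illustration of how these rates combine, here is a minimal Python sketch of a bill estimate. The rates are the ones quoted above; the function name, the dictionary keys, and the sample usage figures are hypothetical, and the sketch deliberately ignores the volume discount on outbound transfers because the threshold is not stated here.

```python
from decimal import Decimal

# Rates quoted in the article (USD). Key names are illustrative only.
RATES = {
    "instance_hour": Decimal("0.10"),     # smallest instance, per machine-hour
    "storage_gb_month": Decimal("0.15"),  # persistent storage, per GB per month
    "transfer_in_gb": Decimal("0.10"),    # Internet -> cloud, per GB
    "transfer_out_gb": Decimal("0.18"),   # cloud -> Internet, per GB (no volume discount applied)
}


def estimate_bill(instance_hours, storage_gb_months, gb_in, gb_out):
    """Return an estimated charge in USD for the given usage (hypothetical helper)."""
    return (
        RATES["instance_hour"] * instance_hours
        + RATES["storage_gb_month"] * storage_gb_months
        + RATES["transfer_in_gb"] * gb_in
        + RATES["transfer_out_gb"] * gb_out
    )


# Made-up usage: 100 machine-hours, 50 GB stored for a month,
# 20 GB uploaded, 5 GB downloaded.
print(f"Estimated bill: ${estimate_bill(100, 50, 20, 5)}")
```

Decimal arithmetic is used so that cents add up exactly; transfers inside the cloud are simply left out because they are free.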

Clustered Pricing
This pricing model has some interesting properties. Shared resources such as a bio-mirror.net site inside Amazon’s S3 storage would be fast and free to anyone inside EC2. More dramatically, consider a large parallel task with a small input and output, such as MrBayes. If the compute will take 10 machines 10 days, inside Amazon EC2 this compute costs $240. Instead the user can request 100 machines, get the results back in a single day, and it will still cost $240. Rush jobs don’t cost extra. If you’d like to crunch the numbers, Amazon provides a calculator.
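The arithmetic behind the $240 figure can be checked with a short Python sketch. It assumes the $0.10 per machine-hour rate, machines running around the clock, and a job that scales perfectly across machines; the function name is hypothetical, and prices are kept in whole cents to avoid rounding surprises.

```python
HOURLY_RATE_CENTS = 10  # $0.10 per machine-hour, as quoted in the article


def compute_cost_cents(machines, days, rate_cents=HOURLY_RATE_CENTS):
    """Cost in cents of running `machines` instances 24 hours a day for `days` days."""
    machine_hours = machines * days * 24
    return machine_hours * rate_cents


slow = compute_cost_cents(machines=10, days=10)   # 2,400 machine-hours
fast = compute_cost_cents(machines=100, days=1)   # also 2,400 machine-hours

print(f"10 machines x 10 days: ${slow / 100:.2f}")
print(f"100 machines x 1 day:  ${fast / 100:.2f}")
```

Both runs consume the same number of machine-hours, which is why the bill does not change when the job is spread across ten times as many machines.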

Amazon has also simplified the creation of custom virtual machines. The machine I boot can be one of several standard “base” boot images (Fedora, Debian, Windows Server 2003, ...). I’ve got root, so I can install software, create accounts, and make it suit my needs. After taking a snapshot of that machine, I can boot this custom system rather than the default one. This puts me into my customized environment in 2-5 minutes.

I know I should be making offsite backups of my personal files, but it was always too much of a hassle. Now I use Jungledisk.com and s3sync.net to back up my personal machines into the cloud. Services like Jungledisk are possible because Amazon handles the billing while allowing third-party developers to add services and surcharges. This promises to create an interesting new software service environment.

When I first saw the Watson and Venter genome sequence FASTA files sitting on the NCBI ftp server, I was excited. I booted a machine in the cloud and began the downloads. During the transfers I built a small BioPerl pipeline to run mpiBLAST and CLUSTALW. While making a Watson+Venter blastable database, I downloaded dbSNP. When that completed I booted a few more machines and crunched the data for a few days. Afterwards I copied the results down to my local machine, and then turned off my little cluster. I’d spent less than $50.

Facilities that produce large volumes of raw data will continue to find the need for onsite compute resources. Network pipes to the laboratory will not keep pace with next generation sequencing technologies. Some facilities may incorporate a remote cloud into their schedulers. For the moment, many will find that no matter how compelling the service might be, they cannot bring themselves to ship their data offsite. These folks will continue to build out ever larger machine rooms. But if you don’t produce primary data, you don’t need a data center. Smaller shops working with data from the public domain or remote collaborators, including the next generation of biotechs, may find little reason to maintain onsite servers.

I suspect a few network admins will view these as storm clouds, but for science they appear to have a silver lining.

Michael Cariaso is the senior scientific consultant for the BioTeam. He can be reached at cariaso@bioteam.net.

----------------------
Subscribe to Bio-IT World  magazine.

 

For reprints and/or copyright permission, please contact The YGS Group, 1808 Colonial Village Lane, Lancaster, PA;

(717) 399-1900 ext. 125, or via email to Ashley.Zander@theYGSgroup.com.