Sunny Skies for Compute Cloud



By Michael Cariaso
Nov. 13, 2007 | Buying CPUs by the hour is back. Marketing apparently decided that ‘On Demand’ sounds pushy and ‘Utility’ and ‘Grid’ are too rigid. This time it will be called a ‘Cloud’, and it will be running inside Amazon.com. Many groups have offered CPUs by the hour over the years, but the newest efforts seem to address many of the weaknesses of previous efforts. In August 2006, Amazon introduced the EC2 (Elastic Compute Cloud). In recent months, Microsoft began talking publicly about a “Cloud OS,” and a Google-IBM alliance announced a “Cloud” academic-only collaboration.

So why now? Multicore chips may have arrived, but with the exception of a few well-established niches, multi-threaded and multi-process codes that use this hardware effectively are going to take a while to appear. At the same time, virtualization is maturing. Running virtual machines is a reasonable way to get a lot of use out of a multicore system. The back room and the data center have known this for years, and consumer products like Parallels and VMware have pushed that same technology onto the workstation. More users are getting used to the idea of creating machines on demand.

Creating virtual machines somewhere across the Internet is the next step, and that is what Amazon’s EC2 does. While there is still a learning curve, the adoption process is far simpler than what we’ve seen from others. Previous offerings have required an upfront estimate of how many hours would be used, and offers of hundreds of free hours smelled of strings and a sales rep.

Amazon charges ten cents for each hour a virtual machine is running. I can cancel or pause at any time. Coupled with their storage network, it is possible to use precisely the amount of CPU and disk that I need with a remarkably simple setup process. Two minutes after I request my first machine, I’m SSHed into a root shell. My first hour of kicking the tires cost 11 cents. This freedom to experiment makes an enormous difference.

The machine is a virtual instance with a 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, and 250 Mb/s of network bandwidth. This costs $0.10/machine/hour. For more horsepower, machines with 4x or 8x specs are available for $0.40 and $0.80 per hour. Persistent storage is independent of the virtual machines and costs $0.15/GB/month. Transfers within the cloud are free, while data moving between the cloud and the Internet costs $0.10/GB on the way in and $0.18/GB on the way out. With larger volumes, outbound transfers fall to $0.13/GB.
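To make these rates concrete, here is a minimal sketch of a cost estimator in Python. It uses only the prices quoted above. The function name, its parameters, and the size labels "small", "4x", and "8x" are my own illustrative choices, not Amazon's terminology, and the sketch ignores the cheaper outbound tier because the volume at which it begins is not given here.

```python
from decimal import Decimal

# Rates quoted in the article (USD, November 2007).
# The size labels are illustrative, not Amazon's instance names.
HOURLY_RATE = {
    "small": Decimal("0.10"),
    "4x": Decimal("0.40"),
    "8x": Decimal("0.80"),
}
STORAGE_PER_GB_MONTH = Decimal("0.15")
TRANSFER_IN_PER_GB = Decimal("0.10")
TRANSFER_OUT_PER_GB = Decimal("0.18")  # base tier only; volume discount not modeled


def estimate_cost(machines, hours_each, size="small",
                  stored_gb=0, months=0, gb_in=0, gb_out=0):
    """Rough cost in dollars for a job; pass whole numbers (or Decimals)."""
    compute = HOURLY_RATE[size] * machines * hours_each
    storage = STORAGE_PER_GB_MONTH * stored_gb * months
    transfer = TRANSFER_IN_PER_GB * gb_in + TRANSFER_OUT_PER_GB * gb_out
    return compute + storage + transfer


# Hypothetical job: 4 small machines for 72 hours each,
# 20 GB uploaded and 5 GB of results downloaded.
print(estimate_cost(4, 72, gb_in=20, gb_out=5))
```

Decimal is used instead of floating point so that sums of cents stay exact. Treat the result as a back-of-the-envelope figure rather than a bill.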

Clustered Pricing
This pricing model has some interesting properties. Shared resources such as a bio-mirror.net site inside Amazon’s S3 storage would be fast and free to anyone inside EC2. More dramatically, consider a large parallel task with a small input and output, such as MrBayes. If the compute will take 10 machines 10 days, inside Amazon EC2 this compute costs $240. Instead the user can request 100 machines, get the results back in a single day, and it will still cost $240. Rush jobs don’t cost extra. If you’d like to crunch the numbers, Amazon provides a calculator.
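The arithmetic behind that claim is short enough to sketch in Python. The function below is a hypothetical helper of mine, and it assumes the job scales perfectly across machines, which is reasonable only for tasks that split cleanly into independent pieces. It works in whole cents to avoid rounding surprises.

```python
def compute_cost_cents(machines, days, cents_per_hour=10):
    """Compute-only cost in cents, at the $0.10/machine/hour rate quoted above."""
    return machines * days * 24 * cents_per_hour


# The bill depends only on total machine-hours, not on how they are arranged.
slow = compute_cost_cents(machines=10, days=10)   # ten machines, ten days
fast = compute_cost_cents(machines=100, days=1)   # one hundred machines, one day

print(f"10 machines x 10 days: ${slow / 100:.2f}")
print(f"100 machines x 1 day:  ${fast / 100:.2f}")
```

Both configurations consume 2,400 machine-hours, which is why the price is identical.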

Amazon has also simplified the creation of custom virtual machines. The machine I boot can be one of several standard “base” boot images (Fedora, Debian, Windows Server 2003, ...). I’ve got root, so I can install software, create accounts, and make it suit my needs. After taking a snapshot of that machine, I can boot this custom system rather than the default one. This puts me into my customized environment in 2-5 minutes.

I know I should be making offsite backups of my personal files, but it was always too much of a hassle. Now I use Jungledisk.com and s3sync.net to back up my personal machines into the cloud. Services like Jungledisk are possible because Amazon handles the billing while allowing third-party developers to add services and surcharges. This promises to create an interesting new software service environment.

When I first saw the Watson and Venter genome sequence FASTA files sitting on the NCBI FTP server, I was excited. I booted a machine in the cloud and began the downloads. During the transfers I built a small BioPerl pipeline to run mpiBLAST and CLUSTALW. While making a Watson+Venter blastable database, I downloaded dbSNP. When that completed, I booted a few more machines and crunched the data for a few days. Afterwards I copied the results down to my local machine, and then turned off my little cluster. I’d spent less than $50.

Facilities that produce large volumes of raw data will continue to find the need for onsite compute resources. Network pipes to the laboratory will not keep pace with next generation sequencing technologies. Some facilities may incorporate a remote cloud into their schedulers. For the moment, many will find that no matter how compelling the service might be, they cannot bring themselves to ship their data offsite. These folks will continue to build out ever larger machine rooms. But if you don’t produce primary data, you don’t need a data center. Smaller shops working with data from the public domain or remote collaborators, including the next generation of biotechs, may find little reason to maintain onsite servers.

I suspect a few network admins will view these as storm clouds, but for science they appear to have a silver lining.

Michael Cariaso is the senior scientific consultant for the BioTeam. He can be reached at cariaso@bioteam.net.

----------------------
Subscribe to Bio-IT World  magazine.

 

