Sunny Skies for Compute Cloud



By Michael Cariaso
Nov. 13, 2007 | Buying CPUs by the hour is back. Marketing apparently decided that ‘On Demand’ sounds pushy and ‘Utility’ and ‘Grid’ are too rigid. This time it will be called a ‘Cloud’, and it will be running inside Amazon.com. Many groups have offered CPUs by the hour over the years, but the newest offerings seem to address many of the weaknesses of earlier ones. In August 2006, Amazon introduced EC2 (the Elastic Compute Cloud). In recent months, Microsoft began talking publicly about a “Cloud OS,” and a Google-IBM alliance announced an academic-only “Cloud” collaboration.

So why now? Multicore chips may have arrived, but with the exception of a few well-established niches, multi-threaded and multi-process codes that use this hardware effectively are going to take a while to appear. At the same time, virtualization is maturing. Running virtual machines is a reasonable way to get a lot of use out of a multicore system. The back room and the data center have known this for years, and consumer products like Parallels and VMware have pushed that same technology onto the workstation. More users are getting used to the idea of creating machines on demand.

Creating virtual machines somewhere across the Internet is the next step, and that is what Amazon’s EC2 does. While there is still a learning curve, the adoption process is far simpler than what we’ve seen from others. Previous offerings have required an upfront estimate of how many hours would be used, and offers of hundreds of free hours smelled of strings and a sales rep.

Amazon charges ten cents for each hour a virtual machine is running. I can cancel or pause at any time. Coupled with Amazon’s storage network, this makes it possible to use precisely the amount of CPU and disk that I need, with a remarkably simple setup process. Two minutes after I request my first machine, I’m SSHed into a root shell. My first hour of kicking the tires cost 11 cents. This freedom to experiment makes an enormous difference.

The machine is a virtual instance with a 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, and 250 Mb/s of network bandwidth. This costs $0.10/machine/hour. For more horsepower, machines with 4x or 8x specs are available for $0.40 and $0.80 per hour. Persistent storage is independent of the virtual machines and costs $0.15/GB/month. Transfers within the cloud are free, while data moving between the cloud and the Internet costs $0.10/GB on the way in and $0.18/GB on the way out. With larger volumes, outbound transfers fall to $0.13/GB.
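The rates above can be combined into a rough bill estimate. What follows is a minimal sketch that uses only the prices quoted in this column. The function name, its parameters, and the sample workload are illustrative assumptions, and the volume discount on outbound transfers is left out because the threshold is not stated here.

```python
# Rates quoted in the column (USD, late 2007). All names are illustrative.
INSTANCE_RATE_PER_HOUR = {"1x": 0.10, "4x": 0.40, "8x": 0.80}
STORAGE_RATE_PER_GB_MONTH = 0.15
TRANSFER_IN_PER_GB = 0.10
TRANSFER_OUT_PER_GB = 0.18  # base rate only; the volume discount is not modeled


def estimate_bill(machines, hours, size="1x",
                  stored_gb=0.0, months=0.0, gb_in=0.0, gb_out=0.0):
    """Return an estimated bill in USD for compute, storage, and transfer."""
    compute = machines * hours * INSTANCE_RATE_PER_HOUR[size]
    storage = stored_gb * months * STORAGE_RATE_PER_GB_MONTH
    transfer = gb_in * TRANSFER_IN_PER_GB + gb_out * TRANSFER_OUT_PER_GB
    return round(compute + storage + transfer, 2)


if __name__ == "__main__":
    # Hypothetical workload: four small machines for three days,
    # 20 GB stored for a month, 10 GB uploaded, 2 GB downloaded.
    print(estimate_bill(machines=4, hours=72, stored_gb=20, months=1,
                        gb_in=10, gb_out=2))
```

Compute dominates the total for a workload like this one, which is why the hourly rate gets most of the attention in what follows.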

Clustered Pricing
This pricing model has some interesting properties. Shared resources such as a bio-mirror.net site inside Amazon’s S3 storage would be fast and free to anyone inside EC2. More dramatically, consider a large parallel task with a small input and output, such as MrBayes. If the compute will take 10 machines 10 days, inside Amazon EC2 this compute costs $240. Instead the user can request 100 machines, get the results back in a single day, and it will still cost $240. Rush jobs don’t cost extra. If you’d like to crunch the numbers, Amazon provides a calculator.
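The arithmetic behind the $240 figure is easy to check. The sketch below assumes a perfectly parallel job whose total machine-hours stay constant however many machines share the work, and it ignores storage and transfer charges. The function name is an illustrative assumption.

```python
HOURLY_RATE = 0.10  # USD per machine-hour, as quoted in the column


def compute_cost(machines, days, hourly_rate=HOURLY_RATE):
    """Cost of running `machines` instances around the clock for `days` days."""
    machine_hours = machines * days * 24
    return round(machine_hours * hourly_rate, 2)


if __name__ == "__main__":
    slow = compute_cost(machines=10, days=10)   # 2,400 machine-hours
    fast = compute_cost(machines=100, days=1)   # also 2,400 machine-hours
    print(slow, fast)  # prints: 240.0 240.0
```

Because the bill depends only on machine-hours, spreading the same work across ten times as many machines shortens the wait without changing the cost.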

Amazon has also simplified the creation of custom virtual machines. The machine I boot can be one of several standard “base” boot images (Fedora, Debian, Windows Server 2003, ...). I’ve got root, so I can install software, create accounts, and make it suit my needs. After taking a snapshot of that machine, I can boot this custom system rather than the default one. This puts me into my customized environment in 2-5 minutes.

I know I should be making offsite backups of my personal files, but it was always too much of a hassle. Now I use Jungledisk.com and s3sync.net to back up my personal machines into the cloud. Services like Jungledisk are possible because Amazon handles the billing while allowing third-party developers to add services and surcharges. This promises to create an interesting new software service environment.

When I first saw the Watson and Venter genome sequence FASTA files sitting on the NCBI FTP server, I was excited. I booted a machine in the cloud and began the downloads. During the transfers I built a small BioPerl pipeline to run mpiBLAST and CLUSTALW. While making a Watson+Venter blastable database, I downloaded dbSNP. When that completed I booted a few more machines and crunched the data for a few days. Afterwards I copied the results down to my local machine, and then turned off my little cluster. I’d spent less than $50.

Facilities that produce large volumes of raw data will continue to find the need for onsite compute resources. Network pipes to the laboratory will not keep pace with next generation sequencing technologies. Some facilities may incorporate a remote cloud into their schedulers. For the moment, many will find that no matter how compelling the service might be, they cannot bring themselves to ship their data offsite. These folks will continue to build out ever larger machine rooms. But if you don’t produce primary data, you don’t need a data center. Smaller shops working with data from the public domain or remote collaborators, including the next generation of biotechs, may find little reason to maintain onsite servers.

I suspect a few network admins will view these as storm clouds, but for science they appear to have a silver lining.

Michael Cariaso is the senior scientific consultant for the BioTeam. He can be reached at cariaso@bioteam.net.

----------------------
Subscribe to Bio-IT World  magazine.

 


For reprints and/or copyright permission, please contact Tim McLucas, (781) 972-1342, tmclucas@healthtech.com.