By Kevin Davies
November 15, 2011 | In many ways, cloud computing has become an ever-present commodity since Bio•IT World published a special issue on the subject (Nov 2009). In September 2011, we held our first standalone conference on the topic. Experts—users and vendors alike—gathered for two days of sharing insights and progress. The takeaway was that more and more users were comfortable with the flexibility, cost, and even security afforded by the cloud. And while Amazon’s omnipresent cloud capabilities were a recurring theme, what was even more impressive was the growing ecosystem of commercial and open-source initiatives offering a host of cloud-based services and applications.
Amylin Pharmaceuticals’ Steve Philpott was one of the first biotech CIOs to enthusiastically embrace “the big switch” to the cloud. “Our IT cost infrastructure is 50% less than when we started [in 2008]. My CFO really likes us!” Philpott said. “The box has disappeared… Do you care where your last Google search came from? No.”
But the cloud still raises doubts and concerns. A major issue, discussed below, surrounds security and regulation. Another is whether the cloud can truly handle the masses of next-gen data that are being generated. “Can the cloud satisfy the requirement or can the requirement satisfy reality?” asks Eagle Genomics CEO Richard Holland. “Sometimes it’s not the cloud at all but the Internet—the whole concept of the network and trying to transfer that amount of data. Do you really need to be transferring that much data in the first place? I think people are generating so much data now, and they’re expecting to do with it what they always could, and that just might not be possible. You might have to rethink the whole paradigm of how this works.”
From A to Z
When it comes to the cloud, it boils down to infrastructure-as-a-service, and that means Amazon, “pretty much by default,” says Holland (see, “Eagle Eye on the Cloud”). That said, there are many other players, including Rackspace, Penguin, GoGrid, Nimbus, and Eucalyptus (open-source).
Johnson & Johnson’s John Bowles said that the Amazon environment was “an eye-opener in terms of infrastructure-as-a-service… Seldom mentioned in big pharma is the ‘opportunity cost.’ If it takes six months to get a machine through CapEx, there’s no cost for that time.” J&J’s tranSMART knowledge base (see, “Running tranSMART for the Drug Development Marathon,” Bio•IT World, Jan 2010) went live two years ago. “Without the cloud environment, we’d still be arguing!” said Bowles.
According to Amazon Web Services (AWS) senior evangelist, Jeff Barr, AWS is like electricity—a utility that you pay as you use. “On demand, run by experts,” he said. With its roots tracing back to the 1960s (commodity computing, mass-produced computers), Barr said: “We’re past the innovators and early adopters—we’re at the early majority point.”
The advantages of the cloud are well known by now: no capital expenditure, pay-as-you-go, elastic capacity, and (in principle) improved time to market. “You can iterate and cycle more quickly. People love this elasticity,” said Barr. Trying to predict demand using a terrestrial data center is notoriously tricky, and inevitably leads to either an “opportunity cost” (compute power lying idle) or an inability to serve customers (demand exceeds supply). “The cloud can scale up or down, so the infrastructure matches the actual demand,” he said.
The Amazon cloud is spread across six geographic regions: the US (East and West Coast), Singapore, Tokyo, Europe, and one reserved for the U.S. federal government. Users have full control over where their data are stored and processed. “If you have regulatory issues and your data must remain in Europe, that’s fine,” he said. In addition to paying on demand, users can buy spot price instances, the price changing minute to minute. “This enables you to bid on unused Amazon EC2 capacity,” said Barr. “You can use this to obtain capacity very economically.” A recently published example comes from Peter Tonellato’s group at Harvard Medical School, building a pipeline for clinical genome analysis in the cloud (see, “Genome Analysis in the Cloud”).
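The spot-market mechanics Barr describes can be sketched as a simple cost model: you set a maximum bid, pay the fluctuating market price for each hour it stays at or below your bid, and lose the instance for any hour the price exceeds it. All prices below are hypothetical illustrations, not actual EC2 rates.

```python
def spot_run_cost(price_history, bid):
    """Walk an hourly spot-price history and tally cost, hours run,
    and hours lost to interruption under a fixed maximum bid."""
    cost, hours_run, hours_lost = 0.0, 0, 0
    for price in price_history:
        if price <= bid:
            cost += price      # spot charges the market price, not your bid
            hours_run += 1
        else:
            hours_lost += 1    # out-bid: the instance is interrupted this hour
    return cost, hours_run, hours_lost

history = [0.12, 0.15, 0.40, 0.14]   # hypothetical $/hour spot prices
cost, run, lost = spot_run_cost(history, bid=0.20)
```

Because interruption is always possible, spot capacity suits retryable batch work (alignments, parameter sweeps) far better than long-lived services.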
Barr was excited about Amazon’s new relational database service, which allows users to launch a new database in a matter of seconds. Compare that to a MySQL or Oracle database, Barr said, which might take a year to get up and running. This could allow users to offload common administration tasks—OS and database upgrades, backups, etc.—to AWS. Other new initiatives include AWS CloudFormation and AWS Elastic Beanstalk (a simple way to deploy/manage applications). “We’ve moved from individual resources to entire apps, entire stacks, being programmable and scriptable,” said Barr.
A recent announcement by DNAnexus (see, "New Rules for Archon Genomics X Prize" in this issue) revealed that a mirror of the NIH Short Read Archive of DNA sequence data will be hosted on the Google cloud. BioTeam principal Chris Dagdigian says the platform differs from Amazon’s. “Google has a more integrated platform that you run on top of, while AWS offers infrastructure elements that can be assembled and combined in many different ways,” said Dagdigian. “If the Google/DNAnexus collaboration delivers an easy-to-use compute platform with integrated ‘big data’ support, it could be quite interesting.”
Dagdigian remains a fervent backer of Amazon’s cloud infrastructure. “VMware—that’s not a cloud,” he told the crowd in La Jolla. “If you don’t have an API, or self-service, or email to humans, it’s not a cloud. If you have a 50% failure rate, it’s a stupid cloud.”
Dagdigian sees “a whole new world” when it comes to moving high-performance computing (HPC) into the cloud. Instead of building a generic system accessible by a few groups, one can now stand up dedicated, individually optimized resources for each HPC use case. When it comes to hybrid clouds and “cloud bursting,” Dagdigian recommended a buy-don’t-build strategy. Data movement is a pain. “I’m a fan of open source, but if you’re doing it, buy rather than build,” he said. Companies such as Cycle Computing (see, “Cycle Time”) and Univa UD have happy customers, he said.
“You can’t rewrite everything,” said Dagdigian. “Life sciences informatics has hundreds of codes that will never be rewritten. They’ll never change and will be needed for years to come.” The future of Big Data, he said, lies with tools such as Hadoop and MapReduce. “Small groups will write such apps, publish, open-source, and we’ll all plagiarize from them,” he said.
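The MapReduce pattern Dagdigian points to is easy to see in miniature. The sketch below is a toy word count in plain Python, showing the three phases a framework like Hadoop distributes across a cluster; no Hadoop API is used here.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # "Map": emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # "Shuffle": group values by key; the framework does this between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # "Reduce": collapse each key's values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["cloud cloud data", "data in the cloud"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
```

The appeal for Big Data is that map and reduce are independent per key, so the framework can fan the work out over thousands of nodes and handle failures by re-running individual pieces.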
While many users want high availability and resiliency, Dagdigian said that “HPC nerds” want speed. “I’d pay Amazon extra if they’d guarantee servers in the same rack,” he said. “HPC is an edge case for obscene-scale IaaS clouds. We need to engineer around this. We have to know where the bottlenecks are.”
Dagdigian couldn’t stop raving about MIT’s StarCluster (“It’s magical,” he said), Opscode Chef, and GlusterFS (now part of Red Hat), particularly for scale-out NAS on the cloud. And he said CODONiS was a promising start-up for storage and security. As for future applications of cloud computing, Dagdigian insisted that Amazon itself was “no bottleneck—it’s always the server or the Internet.” “Direct-to-S3” file transfer solutions from Aspera (see, “Aspera’s fasp Track for High-Speed Data Delivery,” Bio•IT World, Nov 2010) also looked very promising.
Leading the Charge
With the need to maximize IT efficiency, Amylin’s Steve Philpott led the charge to rethink IT under considerable financial pressure (see, “Amylin, Amazon, and the Cloud,” Bio•IT World, Nov 2009). “We have access to tremendous capabilities without having to build capabilities—10,000 cores, new apps cheap,” said Amylin’s Todd Stewart. “Some tools—manufacturing, ERP, etc.—will always be in the data center. But let’s find processes that will work in the cloud.”
With both campus data centers full, Amylin turned to pilot projects, including more than a dozen software-as-a-service processes. Stewart noted that there was only one validated app in the cloud. “In general, that’s still something we’re struggling with,” he said. CRM and call center apps have moved over. And Amylin has used Nirvanix Cloud storage for two years—“a cubby hole on the Internet,” said Stewart. “We hope to go tapeless on backups shortly.”
“Chef is something we’ve had a look at,” says Eagle Genomics’ Holland. “As you scale up, it becomes less and less practical to use anything else, to be honest. You could write it yourself with a bunch of Python scripts, but someone else has done it, so why bother?!”
“We launched a 10,000-node cluster with one click [using Chef],” said Cycle Computing’s Andrew Kaczorek. Chris Brown, Opscode’s chief technical officer (and co-developer of Amazon EC2) said: “We’re software architects and system administrators. We’ve run Amazon.com and Xbox Live. [We’re good at] automating infrastructure at scale.” The cloud, Brown said, is not necessarily cheaper than standard hosting. “Do you have the money, time, experience? What are you willing to pay for?” he asked. “We take the experience and plug it in for you. You want to manage 1,000 machines instantly. Google, Amazon—they have 100 people. Where can you find a team?”
Chef is many things: a library for configuration management, a systems integration platform, and an API for the entire infrastructure. “Our mantra is to enable you to construct or reconstruct your business from nothing but a source code repository, an application data backup, and machines.”
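Real Chef recipes are written in a Ruby DSL, but the core idea behind configuration management is language-independent: declare the desired state of a resource, converge only when the actual state differs, and make re-runs harmless. Below is a minimal Python sketch of that idempotent-resource pattern, using a file as the managed resource; it is an illustration of the concept, not Chef's API.

```python
import os
import tempfile

def file_resource(path, content):
    """Converge a file to its declared content. Returns True if a change
    was made, False if the resource was already in the desired state."""
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False            # already converged; do nothing
    with open(path, "w") as f:          # create or correct the file
        f.write(content)
    return True

path = os.path.join(tempfile.mkdtemp(), "motd")
changed_first = file_resource(path, "welcome\n")    # first run converges
changed_again = file_resource(path, "welcome\n")    # second run is a no-op
```

Because every resource converges the same way from any starting point, a fresh machine and a drifted machine both end up identical—which is what makes “reconstruct your business from a source code repository” possible.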
“Big pharma is concerned about auditing: now we can show a snapshot of cookbooks running on a node. We can reproduce and launch a cluster that mirrors what it was at that date and time. It is very powerful,” said Kaczorek.
There are too many cloud applications and offerings to list. Appistry’s new Ayrris Bio offering is poised to make a big impact on life sciences organizations (see, “Appistry’s Fabric Computing”). Assay Depot hosts fully audited, private marketplaces for pharma clients (see, “Assay Depot’s Cloud Services”). Complete Genomics selected the Bionimbus open-source community cloud as a mirror for a major genome dataset. The University of Maryland’s CloVR is a portable virtual machine launched on a desktop that can manage additional resources on the cloud (EC2, academic clouds) for large-scale sequence analysis.
The San Diego Supercomputer Center is rolling out a cloud, “a private data storage cloud to enable the presentation and sharing of scientific data, with rental and owner pricing,” said SDSC’s Richard Moore. It will have an elastic design with an initial capacity of 2 petabytes, although the emphasis will be more on access and sharing. ChemAxon’s David Deng presented a collaboration with GlaxoSmithKline (GSK) on cloud computing, in which 13,500 potential anti-malarial drugs have been made freely available, hosted on EC2. Users access the data using ChemAxon’s Instant JChem database management tool, requiring no local software installation. Deng admitted that security was not a huge issue in this particular case, but added that his colleagues are eager to set up other collaborations.
Former Microsoft executive and BioIT Alliance founder Don Rule wrapped up the La Jolla conference. “Cloud computing is a very powerful enabler despite the hype,” said Rule. “It’s an important enabler for personalized medicine.” Ironically, Rule is experimenting with EC2 to run an encrypted HIPAA-compliant database. •
Eagle Eye on the Cloud
Richard Holland, CEO of the fast-growing UK consultancy Eagle Genomics, can attest to the growing use of cloud computing. From its strength working with pharma and biotech on pipelines and data analysis, the business is, Holland says, growing and diversifying into areas such as plant science and animal health.
“We’re open-source, very transparent. We use the best tool for the job. Many customers want pilots or proof-of-concepts, eager to give the cloud a go, let us demonstrate how it can work for them. When those succeed, which they usually do, we make them into production systems. Other customers are more forward-thinking—they’ve already made a commitment and have more detailed projects.”
Eagle has collaborated with Taverna, the open-source workflow tool (see, “Democratizing Informatics for the Long-Tail Scientist,” Bio•IT World, March 2011), in a partnership with the University of Manchester, handling genetic data for diagnostics. Another big customer specializes in microbial genome analysis, and is seeking to graduate from a “creaking cluster” to the cloud. Eagle is also working with the Pistoia Alliance, where Eagle will help run a competition to compress next-gen sequencing data.
Holland says the major stumbling block to wider adoption of cloud computing is not so much getting data up and down (“It’s not a solved problem, but there are some decent solutions.”) but “implementing additional security layers on top of Amazon to convince people it is secure. Amazon’s infrastructure is perfectly fine, but they don’t make it any easier to implement additional layers above the operating system [than anyone else].”
Eagle hasn’t had the bandwidth to tackle any clinical data projects thus far. “It’s straightforward to set up the IT system,” says Holland, “but it’s the paper trail that goes with it. We’re too small to handle that at this stage.” That would appear to be the province of the global consulting firms, “but they don’t have the cloud expertise.” Hopefully the two will come together at some point.
Holland said he was struck by how popular open-source software is in the cloud. Many commercial licensing models haven’t adapted, he said. “Open-source is the only thing that can cope at the moment.” K.D.
Genome Analysis in the Cloud
According to Harvard Medical School’s Dennis Wall, whole-genome analysis in the cloud is poised to have the same impact as the development of the autoanalyzer in the 1950s, a now ubiquitous device for blood analysis.
Earlier this year, Wall, Peter Tonellato, and colleagues built a model to estimate the cloud runtime based on the size and complexity of human genomes being compared, to pre-determine the optimal order of the jobs being submitted. For example, an experiment requiring nearly 250,000 genome-to-genome comparisons on Amazon EC2 required some 200 hours, costing $8,000—about 40% less than expected. This highly adaptable model is “potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure,” Wall blogged earlier this year.
Last summer, the same group published a report in PLoS Computational Biology outlining the steps required to perform a whole-genome analysis in the cloud. A full-scale alignment of a human genome took about 10 hours on 36 instances for a modest cost of $320. K.D.
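The arithmetic behind such cost estimates is simple enough to write down. The sketch below is a back-of-envelope model, not the group's published one; the hourly rate is inferred from the article's figures ($320 for 36 instances over about 10 hours implies roughly $0.89 per instance-hour) rather than stated anywhere.

```python
def run_cost(instances, hours, rate_per_instance_hour):
    """Back-of-envelope cloud cost: instances x wall-clock hours x hourly rate."""
    return instances * hours * rate_per_instance_hour

# The article's whole-genome alignment: 36 instances for ~10 hours at an
# assumed ~$0.89/instance-hour comes to about $320.
estimate = run_cost(36, 10, 0.89)
```

Models like Wall and Tonellato's extend this by predicting `hours` from genome size and complexity, which is what lets a lab order its job submissions before paying for a single instance.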
Cycle Time
Earlier this year, Cycle Computing spun up 10,000 cores for Genentech for $1,060/hour. The company trumped that with a 30,000-core cluster spanning three regions for $1,280/hr for a leading pharma customer. “Why use 30,000 cores? It’s no longer about utilization,” said CEO Jason Stowe. Take data reanalysis—the cloud makes it possible to test new algorithms on historical data in a way never possible before.
Cycle recently announced that it will partner with Pacific Biosciences to optimize the cloud-based version of PacBio’s SMRT Analysis software, supporting a workflow that includes sample preparation, sequencing and completed data analysis in less than one day. A beta version of the solution is expected by the end of 2011.
In terms of managing data, Stowe spoke enthusiastically about Opscode’s Chef software for automating the cloud. “BioTeam discovered Chef a couple of years ago, and we’re equally big fans. It’s very developer friendly,” said Stowe. “It would be much more difficult to run at our scales without the thought that’s been put into Chef from the start,” added Cycle’s Andrew Kaczorek. In keeping with the culinary theme, Cycle uses Grill as a monitoring solution for any Chef infrastructure in the cloud.
Stowe also announced plans to offer $10,000 in free computing time in a competition “to help researchers answer questions that will help humanity.” All the attention on how many cores could potentially be spun up in the cloud left Stowe a little puzzled. “I worried that in all this glitter, we would miss what is truly gold,” he said. “Researchers are in the long-term habit of sizing their questions to the compute cluster they have, rather than the other way around. This isn’t the way we should work. We should provision compute at the scale the questions need.”
Stowe wants “to wreck the status quo of HPC clusters and computational science. We will enable those crazy questions from the misfit geniuses, the ones so big that you would never even ask them… This is about to get truly exciting… because someone is going to take these clusters and cure cancer, or Alzheimer’s, or my personal affliction, type 1 diabetes. And hopefully cure them faster because they have better tools.” K.D.
Assay Depot’s Cloud Services
“Our goal is to empower the researcher. There’s no way for the average scientist to explore the world of services available to them.” So says Chris Petersen, co-founder and CIO of Assay Depot, which currently boasts more than 600 vendors and several thousand services on its website. The goal is to offer researchers a comprehensive list of services for pharma, such as animal models, antibodies, compound synthesis, and allow them to communicate and rate vendors.
Assay Depot also builds white-label or private versions of the marketplace for individual pharma clients. Pfizer and two other big pharmas have already signed up, with other pilot projects underway. “They can customize it and integrate with finance or sample management systems,” says Petersen. “And they can create private reviews, to learn from your colleagues’ past experiences.”
“It’s all about capturing institutional knowledge. Some researchers might know a lot about a particular vendor in China that other colleagues might not know. In massive companies, many groups don’t know what other groups are doing.” Vendors join Assay Depot for free, and can respond to requests for information. Assay Depot charges a small annual fee if they want to list their services, and pharmas pay a hosting fee if they want a private version.
Assay Depot’s pharma prospects were initially very suspicious about the cloud, says Petersen. “There was some pushback! We got them to use it by telling them we’d give them their own server in the cloud behind the firewall. We host [private] marketplaces on Amazon EC2, each hosted on its own set of servers in a security group. Often in the past, when we used to say EC2 they’d get worried; now they say, check, we’ve already audited them.”
Petersen enthusiastically endorsed Opscode’s Chef. “It is how we do the automated provisioning of our server. I can’t stress enough how important it’s been to automate the infrastructure… The cloud has blurred the line between operations and software development, which really helps a startup. Since Amazon has taken that hardware and turned it into software, that line has really blurred.” K.D.
Appistry's Fabric Computing
Appistry (the name derives from the phrase “Application Artistry”) offers a variety of cloud computing offerings in life sciences and bioinformatics. Michael Groner, co-founder and chief architect, said the company’s mission came from the belief that engineers “spend too much time building scale, management, and adaptability into our software, more than the value add we want to build in. That seemed wrong.”
Groner says his co-founder, Bob Lazzano, recognized the value of linking cheap computers via software back in 2000. “He had insight that predicted cloud computing seven years before it took off!” says Groner. He called it “fabric computing,” the idea being that Appistry hides hardware beneath the fabric so that it looks like a single computational entity. “We recognize that our favorite phrase lost,” says Groner, not sounding too worried.
That sounds a bit like Al Gore inventing the Internet, but Appistry offers a software tier that ties machines together. “It makes multiple machines act as one, and hides the failures/details from you, so you can have your machines adapt, scale, and modify to your environment, and you can focus on the value of your application,” says Groner.
Appistry is harnessing its experience in clouds, high-performance computing and analytics—FedEx is one of its biggest customers—with its new life sciences offering called Ayrris Bio. Appistry offers its own public service based on its cloud on a per usage basis, including exome, RNA-seq and whole-genome analysis.
“We have everything set up and ready to go right now,” says Groner. “Give us your files; we’re ready.” As for data delivery, Appistry is getting hooked up on Internet 2, but hard drive delivery works well. Groner says the company is offering “head-snapping prices” as low as $250/run and alignments for less than $1,000.
Groner ticks off several advantages of Appistry’s architecture. “We don’t rely on virtualization—our software is as close to the bare metal as possible (unlike Amazon and other public clouds).” But the one he stresses is the usage of computational storage, or as Groner puts it, “where we move work to the data.” Minimizing data movement is essential to solving Big Data analytics in less time, he says. “Our execution environment applies our HPC patterns onto existing, unmodified tools to execute subsets of the analytics on the ‘best’ machine possible,” which is often the machine holding the data to be processed.
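“Moving work to the data” can be sketched as a scheduling rule: given a map of which nodes hold which data block, prefer a node that already has the block, and only fall back to the least-loaded node when none does. This is a generic illustration of data-locality scheduling, not Appistry's actual implementation; all names below are hypothetical.

```python
def schedule(task_block, block_locations, node_load):
    """Pick a node for a task: prefer nodes already holding the task's
    data block ("move work to the data"); among those, take the least
    loaded. With no local copy, fall back to the least-loaded node."""
    candidates = block_locations.get(task_block, [])
    if candidates:
        return min(candidates, key=lambda node: node_load[node])
    return min(node_load, key=node_load.get)

block_locations = {"chunk-7": ["node-b", "node-c"]}   # which nodes hold which block
node_load = {"node-a": 1, "node-b": 4, "node-c": 2}   # running tasks per node

local = schedule("chunk-7", block_locations, node_load)    # has local copies
remote = schedule("chunk-9", block_locations, node_load)   # no local copy
```

The payoff is that for a local assignment the only bytes that move are the results, which is why this pattern shortens Big Data analyses far more than adding raw compute does.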
In addition to running data in its own environment, Appistry also resells the appliance (up to a single rack) and cloud architecture, so users with very high data volumes or security issues can set up the same system in their own environment. “We realized that for some solutions, instead of just taking a tool perspective, we could sell the complete solution,” says Groner. “We hired some bioinformaticians and put together this complete solution.” K.D.
This article also appeared in the November-December 2011 issue of Bio-IT World magazine. Subscribe today!