Two Guys and a Credit Card: Metrum’s Amazon Cloud Makeover
A case study in transferring a life sciences IT infrastructure fully into the cloud
By Kevin Davies
February 14, 2012 | Over the past few years, many life science organizations have dabbled in cloud computing and explored infrastructure-as-a-service, with varying degrees of enthusiasm and commitment. But one Connecticut company has decided to go for broke—transferring its entire IT infrastructure onto the Amazon cloud.
“We are among the first, if not the first in our field,” says Jeffrey Hane, CIO and COO of Metrum Research Group. Metrum is leveraging the capabilities of Amazon Web Services (AWS) for its core pharmacometrics technology platform, using the Elastic Compute Cloud (EC2), Simple Storage Service (S3), and Elastic Block Store (EBS) services. Highly secure Virtual Private Cloud (VPC) environments are under development and nearing deployment.
“This is natural progression,” says Adam Kraut of The BioTeam consultancy. “Most companies just want to dip their toes in, but Metrum was prepared to go all in.”
Metrum Research Group is a New England-based biotech/pharma R&D contract services firm established in 2004, specializing in pharmacometrics. Hane spent some 20 years in big pharma, with stints in discovery, development, quality assurance and regulatory compliance.
“When we formed the company, our president and CEO, Marc Gastonguay, brought scientific expertise and leadership in pharmacometrics, while I brought the compliance and IT expertise,” Hane told Bio-IT World. Much of Metrum’s business deals in quantitative pharmacology, otherwise known as pharmacometrics. “We assist pharma/biotech partners in making key decisions, such as proof of concept, dose selection, and clinical trial designs, through applied modeling and simulation,” says Hane, adding that Metrum has had more than 100 clients in total.
Metrum scientists integrate quantitative data from proprietary development programs and published sources to build mathematical models of disease and drug action. These models are then used to address key questions through simulation of plausible scenarios, generating probability distributions of expected outcomes. Such analyses help to improve the accuracy of decision-making and the probability of success in subsequent clinical trials. “The goal is to assist clinical development teams in making informed and effective decisions throughout the development process,” says Hane.
A year ago, Metrum relied on a traditional secure, co-located computational grid system, with 32 cores spread across six servers, to handle its modeling and simulation projects. Because the workload was unpredictable, Metrum staff frequently had to queue up to use the grid, while at other times it would lie untouched for a couple of weeks. “And 32 cores was never enough,” says Hane. “The machines were very well defined, very well managed, and very limiting.”
As Metrum hired new scientists, Hane and colleagues recognized the obvious need to expand their high-performance computing infrastructure. “Unlimited, flexible compute power was the driving force to go to the cloud. We also wanted to simplify, and we didn’t want a hybrid system. So we opted to move everything up.”
A further catalyst came when a disk failed in the co-located system. Metrum didn’t lose any data, but the incident reminded Hane that the grid wasn’t necessarily as robust as it might be.
In addition, Hane was keen to abolish the responsibility of owning and maintaining local computational hosts. “We only saw a downside to having physical machines [onsite],” says Hane. (While all the key Unix-based simulations were run on the co-located grid system, Metrum housed a couple of local servers, principally to run Windows-based software.)
To learn more about the adoption of cloud computing in pharmacometrics, Hane and colleagues enrolled in a training course run by Kraut and his BioTeam colleague, Chris Dagdigian. That led to a separate one-day brainstorming session between the groups, where they devised a plan to set up the entire infrastructure in the cloud. “We showed up with the credit card on the website just like everyone else. Any two guys and a credit card can do it,” says Kraut.
Kraut’s efforts began focusing on a key high-performance computing app that Metrum was attempting to adapt to cloud architecture. With some work, and the development of a set of specialized plug-ins, the application transfer to AWS went smoothly, Kraut recalls. “The group got so excited, they were so empowered, they’d been locked in a colo space with a few Apple servers, and no real support. They were afraid to change—they were just happy things were running with no lost data or functionality.”
Kraut adds: “Now, rather than seeing a few Apple servers locked in a cabinet, you see Bill’s cluster and Dan’s cluster and John’s cluster—they’re all managing their own clusters and resources, working independently.”
Where 32 cores once served the entire company, each Metrum scientist now has access to 100 or more multi-core instances, launched individually on demand from his or her workstation. “It’s great. With this computational power, we can provide quicker turnaround,” says Hane. “For our clients, in some cases, we can say we’ll get back to you after lunch rather than after a week.”
The quality of service for clients has greatly improved, says Kraut. “Now clients don’t have to wait in a queue… All projects get priority.”
But Metrum’s scientists quickly saw the possibilities in a bigger switch, not so much in cost savings or cloud bursting but in changing the culture. “We quickly realized we wanted to do something comprehensive and clean, and considered moving everything [up to AWS],” says Hane.
In all, Metrum’s cloud transfer took about six months and involved a novel architecture and the development of a handful of proprietary plug-ins. When Hane says everything, he means everything. “All our computational work is performed by accessing AWS, right from laptop workstations. All we need is an Internet connection, nothing else.” Hane says there was no need to consult Amazon during the process.
Kraut believes that this “self-service” capability is a new trend. Now, “You can be a developer and a system admin in a collapsed role—it fits very well. They can bring up servers and turn them off—they’re responsible for the services they use.”
Hane says his main concern was protecting data, but even during the highly publicized East Coast Amazon EC2 outage last year (which briefly took down several popular social networking sites), Metrum remained up and running. “This event really did not affect us. Our ability to script the launching of new replacement instances allowed us to continue work seamlessly, and there was no data loss whatsoever,” says Hane. “We have complete control of our AMIs, and can rebuild our systems in short order through a tightly controlled and documented configuration management system.”
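The article does not describe Metrum's scripts, so the following is only a minimal sketch of the general idea Hane mentions: scripting the launch of a replacement instance so that work can continue when one location is unavailable. Every name here (`launch_replacement`, `LaunchError`, `fake_launch`, the zone labels and the image ID) is hypothetical. The `launch` argument stands in for whatever cloud API call actually starts an instance from a machine image.

```python
from typing import Callable, Sequence, Tuple


class LaunchError(Exception):
    """Raised by a launcher when a zone cannot supply an instance."""


def launch_replacement(
    launch: Callable[[str, str], str],
    image_id: str,
    zones: Sequence[str],
) -> Tuple[str, str]:
    """Try each zone in order; return (zone, instance_id) for the first success.

    `launch` is a hypothetical callable taking (image_id, zone) and returning
    an instance ID. It is an assumption for illustration, not a real API.
    """
    failures = []
    for zone in zones:
        try:
            return zone, launch(image_id, zone)
        except LaunchError as exc:
            # Remember why this zone failed, then fall through to the next one.
            failures.append(f"{zone}: {exc}")
    raise LaunchError(
        f"no zone could launch {image_id} ({'; '.join(failures)})"
    )


def fake_launch(image_id: str, zone: str) -> str:
    """Stand-in launcher for demonstration: pretends 'zone-a' is down."""
    if zone == "zone-a":
        raise LaunchError("zone unavailable")
    return f"i-{image_id}-{zone}"


if __name__ == "__main__":
    # With zone-a simulated as unavailable, the script falls back to zone-b.
    print(launch_replacement(fake_launch, "ami-example", ["zone-a", "zone-b"]))
```

Because the launcher is passed in as a parameter, the fallback logic can be exercised without touching a real cloud account. In practice a team would swap `fake_launch` for a function that calls its provider's API.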
Now that the system has been running for a few months, Hane says the AWS infrastructure is much easier to manage than before, when his team had to manage the operating system, hardware, and software updates. “There was a lot to worry about,” he says. “Now, we have much less to manage: an Amazon Machine Image (AMI), compute software, and plug-ins. We can deploy this across many computational cores. Scientists from a laptop workstation can initiate runs using our custom AMI based on StarCluster. We only have to manage one image.” Maintaining a single image has important implications for regulatory compliance.
Asked about return on investment, Hane suggests waiting a year or so. “In general, the overall compute costs are comparable,” he says. “But we have enabled greater project throughput resulting in greater customer satisfaction. That, and the reduced cost of system administration seem to make it obvious that the ROI will be favorable.”
As for problems, the biggest one Hane cites is the issue of building redundancy across regions with AWS. “Moving machine images between regions is possible, but not so easy to do,” he says. He expects that Metrum’s AWS system will eventually use a fully redundant system in multiple regions, “so we can figuratively move to another region with a flick of the switch.” There is also a contingency plan to move to another cloud provider (say, Rackspace) if necessary.
Hane says the system is “a game changer” for Metrum, “because it allows us to imagine offering potential on-demand value-added services for some clients who may not have internal grids, and want to access one but don’t want the expense.” Metrum has explored this idea with several clients, sharing Metrum’s AMI and providing instructions on its use. Hypothetically, Metrum could rent clients a machine image and start web services for them.
Hane fully expects to be challenged on cloud-based practices relative to 21 CFR Part 11 compliance. “We have better control over our system compliance than ever, but we need to pay continued attention to this,” he says. “We also have to identify improved environments and tools to provide security and explain our current model. Our clients expect control and management of the resources we use, and our new system makes us confident we have achieved that.”
Editor’s Note: Jeffrey Hane and Adam Kraut will present their collaboration in the opening of the Cloud Computing track at the 2012 Bio-IT World Conference & Expo on Wednesday, April 25.