Google Launches Preemptible Virtual Machines, Cycle-Broad Case Study

By Allison Proffitt

September 8, 2015 | Google this morning announced the general availability of Google Cloud Platform’s Preemptible Virtual Machines, and Cycle Computing has detailed its first preemptible VM project: a cancer gene mapping project with the Broad Institute.

Google’s preemptible VMs were launched in beta last May. The compute class offers Google Cloud Platform computing, but with some access limitations: a preemptible VM can run for at most 24 hours, and Google can take that capacity back “in case of emergencies,” explained Paul Nash, senior project manager of Google Compute Engine. In return, preemptible VMs are offered at a rate that is about 70% less than regular Google VMs, with prices as low as $0.01 per core hour.

The product is clearly designed to compete with Amazon’s spot instances. Both are designed for distributed, fault-tolerant workloads that do not require continuous availability of any single instance. But whereas users can bid on AWS spot instances on an open market, pricing on Google’s preemptible VMs is fixed at a per-hour rate.

“It’s the best price we can offer, and we want it to be easy for customers to understand,” Nash said. The goal is to remove uncertainty for clients who may not be sure how much to bid, how long they’ll get to keep their machines, and how much capacity is available.

Google, on the other hand, may take down preemptible machines if it needs the capacity to guarantee regular VMs: for instance, Nash said, if there’s a power outage somewhere or unexpected demand for regular VMs.

“We’ve designed the system so that’s a very small number on a regular basis,” Nash said of the number of preemptible VMs Google might take down mid-run. Because preemptible VMs are limited to 24-hour runs, Nash explained, Google can better budget its capacity. “What we’ve been doing between beta [release] and [this general availability announcement] is optimizing that balance and seeing how it works in production, so we don’t have to automatically just yank one away from someone, so we can keep that rate of interruption very low.”

Nash wouldn’t give hard numbers on usage over the past few months, but he did say that Google was very pleased with the uptake in general. He did confirm that the beta period has seen “tens of thousands of jobs showing up, which we’re really happy about.”

First Run

For Cycle Computing, this was the first public run on the Google Cloud Platform.

“Cycle follows the customer,” Jason Stowe, Cycle’s CEO, explained. “It’s almost a religion around here.” The Broad Institute was the first customer that asked to run on the Google platform, Stowe said, and of course he complied.

“We turned around and built a connector between our cluster representation and the Google Compute Engine platform and Google Cloud Storage.”

The challenge presented by the Cancer Program at the Broad Institute was significant. Researchers wanted to use machine learning to infer relationships among several massive datasets (cancer cell line data, mutations present in each cell line, gene expression data, and small molecule interaction data), both on local servers and in an existing StarCluster framework.

“These machine learning algorithms require a lot of compute power. To build this map for only several hundred samples on a single CPU would have required decades of computing,” explained Stowe in his blog post this morning. “It was a sufficiently daunting amount that researchers found themselves holding back from running certain calculations, since prioritizing and scheduling such an effort would have required coordination across many groups.”

Cycle enabled CycleCloud, DataMan, and various other parts of the Cycle toolkit to work on the Google Cloud Platform. Setup took about an hour and a half, Stowe said. “We essentially ran this cluster template at a small scale within 90 minutes having the executables: the workload management system, the batch computing environment, the shared file system, all of those components were just building blocks that we have lying around.”

Once the run began, it took about six hours to complete 30 years’ worth of computing, Stowe said.

The cluster itself ran using Ubuntu images, with a shared file system, and the Univa Grid Engine scheduler. Grid Engine served up the 340,891 jobs to the 3,200 instances in this cluster without issue. The computation ran on n1-highmem and n1-standard instances across zones in a single region.
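To put those figures in proportion, the back-of-the-envelope arithmetic can be sketched in a few lines of Python. This is only an illustration built on assumptions the article does not confirm: that "30 years" means single-core compute time, that a year is 365 days, that the work was spread evenly over the six hours, and that every core was billed at the $0.01 per core hour floor price quoted above. The function name `run_estimates` and its parameters are made up for this sketch; the actual core count and bill for the run are not given here and were likely different.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day years


def run_estimates(cpu_years, wall_hours, instances, jobs, price_per_core_hour):
    """Rough averages for a batch run; assumes perfectly even utilization."""
    core_hours = cpu_years * HOURS_PER_YEAR
    avg_cores = core_hours / wall_hours
    return {
        "core_hours": core_hours,
        "avg_concurrent_cores": avg_cores,
        "avg_cores_per_instance": avg_cores / instances,
        "avg_jobs_per_instance": jobs / instances,
        # Lower bound only: uses the quoted floor price for every core hour.
        "cost_floor_usd": core_hours * price_per_core_hour,
    }


# Figures from the article: 30 years of computing, about 6 hours,
# 3,200 instances, 340,891 jobs, $0.01 per core hour at the low end.
estimates = run_estimates(30, 6, 3200, 340891, 0.01)
for name, value in estimates.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions, the run works out to 262,800 core hours, an average of roughly 43,800 cores busy at once, a little over 100 jobs per instance, and a floor cost of about $2,600.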

Late in the run, Google did take back a portion of the active machines, giving the standard 30-second warning. Stowe said the 30-second warning was more than enough time to move workload. “One of the things that CycleCloud does a really good job of is dealing with missing information and error handling,” he said.
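The article does not describe how CycleCloud reacts to a preemption notice, so the following is only a minimal sketch of one way a worker could cope with it, not Cycle's method. It polls the Compute Engine metadata server's `instance/preempted` value, which Google documents as reading `TRUE` once the instance has been preempted, and stops picking up new jobs so the remainder can be requeued elsewhere. The names `run_jobs`, `run_one`, and `fetch_flag` are hypothetical, and the sketch assumes each job is short enough to finish inside the warning window.

```python
import urllib.request

# Documented Compute Engine metadata endpoint; it only resolves from
# inside a Google Compute Engine instance.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted_flag(timeout=2.0):
    """Return the metadata server's answer, "TRUE" or "FALSE"."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def run_jobs(jobs, run_one, fetch_flag=fetch_preempted_flag):
    """Run jobs in order; on preemption, stop and return the unfinished ones.

    Hypothetical worker loop: it checks the flag before each job, so it
    reacts only between jobs, not in the middle of one.
    """
    pending = list(jobs)
    done = []
    while pending:
        if fetch_flag() == "TRUE":
            break  # hand the remaining jobs back to the scheduler
        job = pending.pop(0)
        run_one(job)
        done.append(job)
    return done, pending
```

A scheduler would then resubmit whatever comes back in `pending` to another instance. Passing a stand-in for `fetch_flag` lets the loop be exercised off the cloud, for example with a function that answers `FALSE` twice and then `TRUE`.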

Stowe wasn’t recommending Google’s platform over the competition (that’s the customer’s decision!), but he was pleased with the run and the collaboration. “We had a great experience working with preemptible VMs,” he said. “The Google Cloud Platform is a production-quality platform. It works!”

Speaking for Google, Nash agreed that working with Cycle and the Broad was a great experience. “I’m super excited,” Nash said. “In some small way, my product is helping to fight cancer, which is great!”