June 8, 2011 | For the past couple of years, Connecticut-based Cycle Computing has been helping academic and industrial organizations run high-performance computing projects—from internal management software to research algorithms and simulations—in the Amazon EC2 cloud, and enjoying “hockey stick” growth in the process. “Researchers can sign in on EC2 really quickly. We’ve made a big push to make that process as simple as possible,” says founder and CEO Jason Stowe.
That effort recently scaled new heights with the debut of Tanuki, a 10,000-core supercomputer in the cloud. (Tanuki is named after the Japanese god of virility and gluttony.)
Last year, Cycle began spinning up several clusters in the 1,000-2,000-core range. Stowe started blogging about their success, including one large GPU-based cluster. But last January, he decided to push the envelope a bit further. “What if we could add a zero to that?” he wondered. “Rather than spin up 1,000 cores, let’s try a 10,000-core cluster. Would anyone be interested in that scale?”
Stowe had been in discussions with Genentech, which was running a series of protein-binding analyses, prompting the Genentech informatics team to start investigating cloud computing applications last fall. “As part of a proof-of-concept with Cycle, we ran a two-hour, 4,096-core test,” says associate scientist Jacob Corn. “Since that went smoothly, we decided to get some work done on a real problem, and push 10,000 cores for eight hours on a scientific problem.”
Corn and colleagues study the prediction and design of protein interactions, and use computational methods to evaluate hundreds of thousands of possibilities. “Then I cherry-pick the most promising-looking outcomes to bring into the real world and test in the lab,” he says.
While Stowe briefed Genentech on what would be required to run 80,000 core-hours, he also made sure his engineering team could handle the underlying infrastructure. The team stress-tested the file system and the configuration-management layer (built on Chef). Stowe also elected to use the Condor scheduling software, which he says scales fairly straightforwardly.
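A run of this shape is typically described to Condor in a submit file that fans a single executable out across thousands of queued jobs. The following is a minimal, hypothetical sketch; the binary name and file layout are illustrative, not Genentech’s actual setup:

```
# Hypothetical Condor submit description file: one executable,
# fanned out across 10,000 independent jobs.
universe   = vanilla
executable = protein_binding_sim            # illustrative binary name
arguments  = --input design_$(Process).pdb  # $(Process) = 0..9999
output     = out/job_$(Process).out
error      = out/job_$(Process).err
log        = cluster.log
queue 10000
```

Submitting this one file creates all 10,000 jobs at once—the “push the button” step described later in the article.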
Another key issue was load testing. “When you request that many nodes from a cloud provider, you find many cases where the server won’t recognize the disk,” Stowe explains. “We built software to do all the error handling—rather than that being a problem, where a single node on the grid sucks in thousands of jobs and tells you they failed because it can’t read its own hard disk, we handle that automatically. It’s completely transparent to the user.”
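Cycle’s error-handling software itself is proprietary, but the idea Stowe describes—quarantining a node that cannot read its own disk before it starts pulling jobs—can be sketched as a simple preflight check. This is an assumed illustration, not Cycle’s actual code:

```python
import tempfile


def disk_is_healthy(scratch_dir="/tmp"):
    """Write a scratch file and read it back.

    Sketch of the preflight check Stowe describes: a node whose disk
    cannot be written and read should be excluded from the pool rather
    than accepting thousands of jobs and failing them all.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=scratch_dir) as f:
            f.write(b"healthcheck")
            f.flush()
            f.seek(0)
            return f.read() == b"healthcheck"
    except OSError:
        return False


def partition_nodes(node_checks):
    """Split {node_name: check_result} into (healthy, quarantined) lists."""
    healthy = [n for n, ok in node_checks.items() if ok]
    quarantined = [n for n, ok in node_checks.items() if not ok]
    return healthy, quarantined
```

In a real provisioning pipeline this check would run before the node registers with the scheduler, so a bad disk never becomes visible to the user—the transparency Stowe mentions.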
Stowe called Amazon to ask if there were any recommended times for the experiment. “Amazon’s a great partner, but they had next to no involvement,” says Stowe. “I mostly just wanted to tell them [we were doing this]!”
On March 1, Stowe looked on as Genentech pushed the button, thereby submitting 10,000 jobs to their queue. That simple action effectively harnessed more cores than the 115th-ranked computer on the Top500 supercomputing list at the time. Stowe wanted to be there in person because “it was such a crazy, momentous idea. I was expecting to have to do stuff. But it was actually a pretty boring exercise. Nothing crazy going on—it was pretty awesome.”
The job ran for eight hours, utilizing 10,000 cores, 1,250 servers, and approximately 8.75 terabytes (TB) of RAM aggregated across all machines. “That’s a lot of infrastructure, but the entire cluster was up in 30-45 minutes,” says Stowe. And yet the end-user effort was just a few seconds. The cost of the exercise was $1,060/hour, including all infrastructure costs as well as Cycle’s fees.
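The article’s figures imply straightforward unit economics, worth spelling out:

```python
# Back-of-the-envelope math from the figures quoted in the article.
cores = 10_000
hours = 8
rate_per_hour = 1_060  # dollars/hour, all-in (infrastructure + Cycle)

core_hours = cores * hours               # the 80,000 core-hours mentioned earlier
total_cost = rate_per_hour * hours       # full cost of the eight-hour run
per_core_hour = rate_per_hour / cores    # effective price per core-hour

print(core_hours, total_cost, round(per_core_hour, 3))  # 80000 8480 0.106
```

So the entire run cost roughly $8,480, or about 10.6 cents per core-hour—the kind of arithmetic that made on-demand clusters attractive compared with owning the 115th-fastest machine on the Top500.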
Having run the same code on Genentech’s internal clusters, Corn says the quality of the results from Tanuki was essentially identical. But the time savings were considerable. “I had been running these simulations internally for a few weeks, and estimated that I probably had 1-2 more weeks remaining,” says Corn. “With the 10,000-core [cloud] cluster, that time was slashed down to eight hours.” Stowe expects to spin up a second customer’s 10,000-core environment shortly. “We’re interested in going up to 12,000, 20,000, 25,000 cores.
As we have users for it, we’ll do it. We’ve shown the 10,000 [cores] is canonically easy to just push a button and do it.” •