Big Data Meets Big Compute: Cycle Challenge Winner Logs a “Compute Century” on Amazon Cloud

October 2, 2012

By Kevin Davies  

October 2, 2012 | The winner of the Big Science Challenge, a contest convened last year by Cycle Computing to provide $10,000 in cloud computing resources for groundbreaking biomedical research, has successfully completed the first phase of its project while logging more than 115 compute years on the Amazon Cloud.  

“The goal of the Big Science Challenge was to help people think bigger than they normally would, to do things that would be impossible on a local cluster,” Cycle Computing CEO Jason Stowe told Bio-IT World.   

Victor Ruotti and colleagues from the Morgridge Institute for Research at the University of Wisconsin claimed top prize in the CycleCloud Big Science Challenge, winning $10,000 in utility supercomputing time on Amazon Web Services (AWS). Ruotti’s group is building a database of stem cell RNA signatures to facilitate the identification of new cell types on the basis of their gene expression patterns.   

“We did over 1 million hours of compute -- more than a compute century of calculation!” says Stowe. “The word ‘computer’ was first used in 1897 to mean a device that processes data. If you’d started a single core the instant that word was first used, it would have finished this computation right about now.”  

Over the past year or two, Cycle Computing has successfully spun up a series of impressively large supercomputers for clients including Schrodinger and Genentech. This project was a little different, however. “In the past, we’ve pushed on scale, but in this case the push was more on efficiency,” says Stowe. “We were able to get more compute per dollar than in any of the other runs we’ve done. From an engineering perspective, that’s a very powerful metric.”   

One million hours might not sound like much compared to, say, the Open Science Grid, the taxpayer-funded distributed computing resource that can perform 2 million compute hours per day. “But for most customers on the commercial side, earmarking 1 million hours of compute time internally would take far longer than a week to grab that amount of capacity,” says Stowe.  

The Big Run  

The intense computing for Ruotti’s experiment -- a pairwise comparison of RNA-seq signatures for 124 stem cell lines -- was performed over a week using very high-memory instances, with 8 gigabytes (GB) of memory per core. About 1.6 million jobs were scheduled using Condor, although Stowe says other schedulers such as GridEngine could also be used. Spot availability varied over time, peaking at 8,000 concurrent cores with an average of about 5,000 cores running.  
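
For readers unfamiliar with Condor-style batch submission, here is a minimal sketch of how a pairwise comparison matrix like this might be queued. The sample names and the worker script are hypothetical, and the team’s actual run split the work far more finely, into roughly 1.6 million jobs.

```python
# Illustrative only, not the team's pipeline: enumerate all pairwise
# comparisons among 124 samples and write one HTCondor submit file that
# queues a job per pair. Sample names and compare_pair.sh are hypothetical.
from itertools import combinations

samples = [f"cell_line_{i:03d}" for i in range(1, 125)]   # 124 stem cell lines

with open("pairwise.sub", "w") as sub:
    sub.write(
        "universe       = vanilla\n"
        "executable     = compare_pair.sh\n"   # hypothetical worker script
        "request_memory = 8 GB\n"              # ~8 GB per core, as in the run
        "output = logs/$(Cluster).$(Process).out\n"
        "error  = logs/$(Cluster).$(Process).err\n"
        "log    = logs/pairwise.log\n"
    )
    for a, b in combinations(samples, 2):      # 124 choose 2 = 7,626 pairs
        sub.write(f"arguments = {a} {b}\nqueue\n")
```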

In the end, 7-8 terabytes (TB) of BAM files were transferred to Ruotti’s group in Wisconsin.   

For Cycle senior software engineer Ian Alderman, who led the compute work, “the interesting challenge was to get as much of the science done as possible given the budget we had.”  

The original prize budget included $2,500 donated by AWS to supplement Cycle’s $10,000 pledge. When Stowe told AWS Technology Evangelist Matt Wood (a competition judge) that Amazon’s deeply discounted spot instance pricing was helping provide more compute power and that Ruotti’s entire experiment could be finished for another $7,000, Wood agreed to kick in the extra funds, bringing AWS’s total contribution to $9,500.  

By bidding for AWS spot instances, Ruotti and Alderman obtained an average price about one-twelfth of the standard on-demand cost for the same instances. The danger of relying on spot pricing is that if someone bids higher, the user loses those instances. “So we kept bidding for machines at this low cost until we couldn’t get any more at that cost,” says Alderman, meaning others had bid higher.   
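
To give a flavor of the mechanics, a rough sketch of that bidding pattern using today’s boto3 API might look like the following. The bid price, instance count, and AMI are hypothetical; in practice Cycle’s own tooling managed the spot requests.

```python
# Illustrative sketch of spot bidding, not Cycle's actual tooling: check the
# current spot price for a high-memory instance type, then request instances
# at a fixed low bid. If the market price rises above the bid, AWS reclaims
# the instances -- the risk described above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_TYPE = "m2.4xlarge"   # a high-memory instance type of that era
BID = "0.08"                   # hypothetical bid, far below the on-demand price

history = ec2.describe_spot_price_history(
    InstanceTypes=[INSTANCE_TYPE],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=1,
)
print("current spot price:", history["SpotPriceHistory"][0]["SpotPrice"])

response = ec2.request_spot_instances(
    SpotPrice=BID,
    InstanceCount=100,         # keep asking until requests stop being filled
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",     # placeholder AMI
        "InstanceType": INSTANCE_TYPE,
    },
)
```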

Because the team could lose the spot instances at any time, they deliberately limited each job’s run time to about one hour. The workflow was engineered to tolerate sudden disruptions, which inevitably occurred. There was a silver lining, however: “If you’re in the middle of a compute hour, you get that portion of the compute hour for free. So you could benefit from having nodes taken away,” says Alderman.   
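
The article does not show the team’s code, but the idea of an interruption-tolerant, roughly one-hour job can be sketched as follows; the bucket name, key layout, and worker script are hypothetical.

```python
# Sketch of interruption tolerance: each job writes its result to a known S3
# key, so completed pairs are skipped and pairs lost to a reclaimed spot
# instance are simply recomputed when re-queued. All names are hypothetical.
import subprocess
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "stemcell-rnaseq-results"   # hypothetical bucket

def already_done(sample_a, sample_b):
    key = f"pairwise/{sample_a}_{sample_b}.bam"
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True                   # output already exists; skip this pair
    except ClientError:
        return False

def run_pair(sample_a, sample_b):
    if already_done(sample_a, sample_b):
        return
    # Cap the worker at roughly one hour so a reclaimed spot instance never
    # throws away much work. compare_pair.sh is a hypothetical aligner wrapper.
    subprocess.run(["./compare_pair.sh", sample_a, sample_b],
                   timeout=3600, check=True)
```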

Managing the Data  

Some of the biggest challenges concerned managing the data. The input data was about 250 GB per sample, while the final results output was about 8 TB. In between, however, the intermediate data resulting from the matrix of pairwise sample comparisons amounted to nearly 80 TB.   

“We didn’t want to store [the data] too long, so we stored the intermediate results and processed to the final results in a timely manner,” Alderman explains. “That’s a lot of data to be flinging around. How to move it and keep things running was an interesting challenge.”   
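
The article does not describe the group’s exact workflow, but the “store intermediates briefly, reduce, then discard” pattern Alderman describes might be sketched like this; the bucket name and key prefixes are hypothetical.

```python
# Rough sketch: once a pair's intermediate objects have been reduced into its
# final result, delete them so intermediate data (~80 TB in total over the
# run) never accumulates in S3. Names are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "stemcell-rnaseq-work"   # hypothetical working bucket

def reduce_and_clean(pair_prefix, final_key):
    # list every intermediate object produced for this pair
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=BUCKET, Prefix=pair_prefix):
        keys += [obj["Key"] for obj in page.get("Contents", [])]

    # ... reduce the intermediates into the final per-pair result here,
    # then upload it, e.g. s3.upload_file("final.bam", BUCKET, final_key) ...

    # delete the intermediates in batches of up to 1,000 keys (the S3 limit)
    for i in range(0, len(keys), 1000):
        s3.delete_objects(
            Bucket=BUCKET,
            Delete={"Objects": [{"Key": k} for k in keys[i:i + 1000]]},
        )
```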

Stowe says he hopes that more groups will follow Ruotti’s lead -- as indeed does Ruotti: “Many other centers will realize they have these NGS RNA samples; at some point they’ll realize they can launch these big alignments.”   

The value of the project, according to Morgridge Institute bioinformatician Ron Stewart, is the potential it holds for helping other biologists. “We’re looking at these data at the read level -- not the gene or isoform level. That’s interesting, because there’s a lot of non-coding information that may be valuable for identifying cell types.”   

Ruotti says he was pleasantly surprised by the plasticity of the Cloud and the relative ease -- with Cycle Computing’s assistance -- of getting the project off the ground. He also credits Bill Taylor (University of Wisconsin Computer Sciences): the latest, “crucial part of the project,” he says, was done largely with university resources. But Stewart cautions that the technology is not yet widely accessible to the bioinformatics community. “It’s not where anyone can do these runs,” he says. “You need assistance from Cycle or a Condor expert.”  

The full analysis of the Amazon compute run will take some time. Says Ruotti, “Once the BAM files are downloaded, we’ll start parsing them, building an object-oriented approach to parse information from these files as well as our [latest] phase 3 analysis.” The results will be deposited into the Galaxy framework to share with the team’s scientists; Ruotti says he is eager to hear their feedback to ensure his team is on the right path.   
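
Ruotti does not detail his parsing code, but a minimal read-level pass over one of the downloaded BAM files, of the kind he describes, could be sketched with pysam; the file name and the per-reference tally are purely illustrative.

```python
# Minimal, illustrative BAM parsing with pysam: count mapped reads per
# reference sequence in one downloaded file. The file name is hypothetical.
import pysam

counts = {}
with pysam.AlignmentFile("pairwise_result.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped:
            continue
        ref = read.reference_name
        counts[ref] = counts.get(ref, 0) + 1

# print the ten references with the most aligned reads
for ref, n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
    print(ref, n)
```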

In all, Ruotti’s team has performed computation on 124 samples, mostly human ES cells, as well as some other cell types. Discussions are ongoing with the Open Science Grid for compute time to analyze other samples as well as data from the Short Read Archive, which could eventually total more than 1,000 samples.