Microsoft Announces General Availability Of Azure CycleCloud

August 30, 2018

By Allison Proffitt

August 30, 2018 | Yesterday Microsoft announced the general availability of Azure CycleCloud, a tool for creating, managing, operating, and optimizing HPC clusters of any scale in Azure, born of Microsoft’s acquisition of Cycle Computing last August.

“With Azure CycleCloud, we are making it even easier for everyone to deploy, use, and optimize HPC burst, hybrid, or cloud-only clusters. For users running traditional HPC clusters, using schedulers including SLURM, PBS Pro, Grid Engine, LSF, HPC Pack, or HTCondor, this will be the easiest way to get clusters up and running in the cloud, and manage the compute/data workflows, user access, and costs for their HPC workloads over time,” wrote Brett Tanzer, Partner PM Manager, Azure Specialized Compute, on the Microsoft blog.

Cycle Computing has been attacking cloud challenges since Jason Stowe, his wife Rachel, and two others started the company in 2005. Back then, he was helping companies, including JP Morgan Chase and Lockheed Martin, do high-performance computing with the open-source Condor scheduler. Cycle’s first life sciences client was Varian, a manufacturer of mass spectrometers, in late 2007. Cycle reduced an internal simulation from six weeks to under a day on Amazon’s EC2 just by “spinning it up.” Stowe took the call on a Tuesday, he said, and the calculation was done by Thursday: a classic demonstration of the virtues of Condor, the cloud, and Cycle.

In the years since, Cycle Computing repeatedly made news (and won awards) for doing big computing really quickly in the cloud. In 2012, Cycle Computing collaborated with Schrödinger and Nimbus Discovery to spin up a 50,000-core supercomputer in AWS. In 2014, again with Schrödinger, Cycle moved to 156,000 cloud cores.

All of Cycle’s earliest work was on Amazon’s cloud infrastructure, but as early as 2009 Stowe told Bio-IT World he was keeping an eye on other companies that might one day provide comparable or better infrastructure. He believes he has found one.

“Since we started CycleCloud over 10 years ago, the amount of compute power in a server has continued to exponentially increase,” Stowe told Bio-IT World in an email. “This is in part thanks to FPGAs and GPUs, and Azure has the broadest fleet of these accelerators.”

When Microsoft announced the Cycle acquisition in August 2017, Business Insider called the move “a brilliant acquisition in the cloud wars against Amazon and Google.” Stowe, in a blog announcing the news on Cycle’s website, said, “In short, we’re psyched to be joining the Azure team precisely because they share our vision of bringing Big Compute to the world: to solve our customers’, and frequently humanity’s, most challenging problems through the use of cloud HPC.”

Stowe told Bio-IT World in an email that Azure CycleCloud is a “major release in a few ways: Accessibility, Scale, Templates & Control.” Accessibility improvements come from meeting Microsoft's bar for accessibility and enterprise security requirements, he said. The new release also adds more workload templates and makes cluster scaling easier.

“We have native templates for every major HPC scheduler, parallel filesystems like GlusterFS and BeeGFS, and application components like Kafka and Redis, that can be started in minutes,” Stowe wrote. “We've made it very easy for these clusters to scale from 64 to 64,000 or more cores cost-effectively in Azure, including support for low-priority VMs, cost alerts for workload clusters, multiple VM family/size workloads, and efficient auto-scaling for MPI jobs across Infiniband infrastructure.”
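The scaling Stowe describes, from 64 cores up to 64,000 or more, comes down to an autoscaler repeatedly sizing the execute-node array to match the work the scheduler has queued. Below is a minimal illustrative sketch of that kind of decision in Python; the node size, limits, and function name are assumptions for illustration, not CycleCloud's actual interface.

import math

CORES_PER_NODE = 16        # hypothetical VM size; real clusters can mix VM families and sizes
MIN_CORES = 64             # the lower end of the range Stowe cites
MAX_CORES = 64_000         # the upper end

def target_node_count(queued_cores: int, running_cores: int) -> int:
    """Size the execute-node array to the demanded cores, within the cluster's limits."""
    demand = max(MIN_CORES, min(queued_cores + running_cores, MAX_CORES))
    return math.ceil(demand / CORES_PER_NODE)

# Example: 10,000 cores of queued MPI work on top of 512 cores already running
print(target_node_count(10_000, 512))   # -> 657 nodes

In practice, an autoscaler for tightly coupled MPI work also has to keep nodes on the same InfiniBand fabric and tolerate low-priority VMs being reclaimed, which is the harder part of what Stowe is describing.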

Azure Use Case

Stowe says he’s always been excited by what customers do with Cycle’s capabilities, and the new power in Azure CycleCloud is driving “amazing workloads from customers, like the really innovative team Woody Sherman has at Silicon Therapeutics. SiliconTx uses an impressive mixture of simulation, quantum mechanics, molecular dynamics, machine learning, and GPUs to actually find new protein 'hotspots' that might be good targets for drug candidates in the fight against disease,” Stowe explains.

Tanzer outlined the case study on the Microsoft blog:

Silicon Therapeutics has created a unique quantum physics simulation technology to identify targets and design drugs to fight diseases that have been considered difficult for traditional approaches. These challenging protein targets typically involve large changes in their shape (“conformational changes”) associated with their biological function.

The company’s proprietary platform couples biological data with the dynamic nature of proteins to identify new disease targets. The integration of experimental data with physics-based simulations and machine learning can be performed at the genome scale, which is extremely computationally demanding, but tractable in the modern era of computing. Once targets have been identified, the platform is used to study thousands of molecules at the atomic level to gain insights that are used to guide the design of new, better drug candidates, which they synthesize and test in the lab.

Here, Silicon Therapeutics ran molecular dynamics simulations on thousands of targets—both to explore “flexibility” and to identify potential “hotspots” for designing new medicines. The simulations entailed millions of steps computing interactions between tens of thousands of atoms, which they ran on thousands of proteins.

The computations consumed five years of GPU compute time but ran in only 20 hours on 2048 NCv1 GPU instances in Azure. The auto-scaling capabilities of Azure CycleCloud created a Slurm cluster using Azure’s NCv1 VMs with full-performance NVIDIA K80 GPUs, and a BeeGFS file system. This environment mirrored their internal cluster, so their on-premises jobs could run seamlessly in Azure without any bottlenecks. This search for potential protein “hotspots,” where drug candidates might be able to fight disease, generated over 50 TB of data. At peak, the 2048 K80 GPUs used over 25 GB/second of bandwidth between the BeeGFS filesystem and the compute nodes.

Using CycleCloud, Silicon Therapeutics could run the same platform they ran in-house, and simply scale a Slurm HPC cluster with low-priority GPU execute nodes and an 80TB BeeGFS parallel filesystem to execute the molecular dynamics simulations and machine learning workloads to search for potential new drug candidates.
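The headline numbers in that case study hang together: five years of GPU time spread across 2048 GPUs works out to roughly a day of wall-clock time. A quick back-of-the-envelope check follows; the per-GPU bandwidth figure is derived here rather than stated in the blog post.

GPU_YEARS = 5
GPUS = 2048

gpu_hours = GPU_YEARS * 365 * 24               # about 43,800 GPU-hours
wall_clock_hours = gpu_hours / GPUS            # about 21.4 hours, i.e. "only 20 hours"

aggregate_bandwidth_gb_s = 25                  # peak traffic between BeeGFS and the compute nodes
per_gpu_mb_s = aggregate_bandwidth_gb_s * 1024 / GPUS   # about 12.5 MB/s per K80 at peak

print(round(wall_clock_hours, 1), round(per_gpu_mb_s, 1))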

“In our work, where simulations are central to our decisions, time-to-solution is critical. Even with our significant internal compute resources, the Microsoft Azure cloud offers the opportunity to scale up resources with minimal effort. Running thousands of GPUs, as in this work, was a smooth process, and the Azure support team was excellent,” Woody Sherman, CSO at Silicon Therapeutics, said in a statement.

Stowe is just as pleased. “This kind of innovation is what it’s all about, it propels humanity forward, and frankly, it makes the Azure HPC group an exciting place to come to work every day.”