By Kevin Davies
February 6, 2013 | The new Chief Technology Officer at Cycle Computing, James Cuff, spent the past seven years as Director of Research Computing and Chief Technology Architect for Harvard University’s Faculty of Arts and Sciences. His team worked “at the interface of science and advanced computing technologies,” providing a breadth of high-performance computing, storage and software expertise, all the while striving to manage a monstrous surge in data. Cuff previously led the construction of the Ensembl project at the Wellcome Trust Sanger Institute, before moving to the U.S., where he managed production systems at the Broad Institute, while his wife, fellow Brit Michelle Clamp, joined the lab of Broad director Eric Lander.
In his new position, Cuff aims to apply some of his insights and ideas to an even bigger canvas. Cycle has made headlines over the past 2-3 years by spinning up virtual supercomputers for academic and industry clients, as well as creating the Big Science Challenge, donating more than $10,000 in cloud compute time. CEO Jason Stowe says Cuff brings a wealth of knowledge and contacts, and could bring some managerial discipline to Cycle’s patent portfolio. He adds that Cuff will remain in the Boston/Cambridge area, which could impact Cycle’s local presence down the road. (Meanwhile Clamp, who moved to Harvard from the BioTeam last year, will fill Cuff’s shoes on an interim basis while the search for his replacement continues.)
Cuff spoke to Bio-IT World editor Kevin Davies and shared his views about big data, cloud computing, and the future of research computing.
Bio-IT World: James, before we talk about your new gig, what were your chief responsibilities during your tenure at Harvard?
Cuff: It started in the life sciences, due to the complexity of genomics data, but rapidly expanded to include earth and planetary sciences, particle physics, astrophysics, even economics and financial modeling. Simulation is an exploding field in every domain, so we had to be agile enough to help with all of them. We learned about throughput and performance in the life sciences, and were able to apply that to other areas.
What is your visceral reaction to the phrase “big data”? Did you encounter that in all areas?
It’s everywhere you look at this point. From an historical perspective, when I started at Harvard in 2006, we had 200 CPUs and a state-of-the-art, 30-Terabyte (TB) local NAS [network-attached storage] array. As I’m leaving, we’re at 25,000 processors and 10 Petabytes (PB). And that’s just a small, university-wide research computing offering.
Meanwhile, those data are exploding in all areas, even places like the museums of comparative zoology. People are taking high-quality, high-resolution images, particularly of things like the Giza Archives, and there are forces at play where the digital records may unfortunately be the only artifacts we have left from some of these areas. Everyone is collecting “big data,” but this collection phase is a prelude to a second phase: once the data are collected, working out what we ought to do with them so that history informs the future. “Big data” is a very hyped term, but it’s real.
The bigger question I think is one of data provenance: not only creating the data but being able to find it. The data retention policies of the National Science Foundation and others—it’s a headache… We’ve seen this in the ENCODE Project, where Ewan Birney and his team even encapsulated their virtual machine along with their data. We’re going to see more of this—to be able to have that frozen report of the science done, as it was.
Many people think the data storage problem per se has been solved. Is that fair?
I’m inclined to agree. The component parts of storage and processing are very much solved problems from a cost-per-capability perspective. The number of Terabytes, or the amount of CPU horsepower, I can buy per given spend is now a trivial calculation. The complexity is that we’re doing this at much larger orders of magnitude and scale.
The difficulty for smart, motivated researchers and organizations is in orchestrating those component parts. If you look at a top-tier storage array, the price per Terabyte isn’t where the complexity is; it’s orchestrating Petabytes of this into a single namespace, or making it perform at a given level. I can build a 4-TB storage array that will perform very differently from a 4-TB disk drive.
Are you looking at data management solutions such as open-source iRODS or commercial equivalents?
To keep religion out of the conversation here, the art of finding and annotating metadata at massive scale is currently unsolved. One of the technology challenges I see ahead is accelerating metadata analytics to the same caliber as the robust parallel file systems that Lustre, WhamCloud, and now Intel can build. Those file systems are also non-trivial and not necessarily a solved problem either.
Being able to find your data, or more importantly, to know what happened to it and how it got to that state, is a bigger issue. We’re starting to see that, in order to publish, you have to have publishable provenance. What did my grad student do to these data, and how reproducible are these scientific results? That’s going to be a big headache going forward.
I trust you didn’t leave Harvard because your wife just arrived. What did Cycle’s CEO Jason Stowe do to lure you over?
They were seven really exciting years. We basically built a start-up organization within a very well respected, well regarded traditional IT environment. We started listening to our customers at Harvard—the faculty, grad students, and researchers—and built what they needed.
I started having conversations with Jason about his technology a few years ago. Then the phone rang one day and he explained they were growing because they had too many customers and wanted to help those customers more and more. That rang a real bell with me, because in a bootstrapped company, the customers drive what the real business is. I started to talk with his engineering talent—he had me at ‘hello,’ basically…
I actually see this as a natural progression. I used to run Silicon Graphics clusters back at Oxford, doing it by myself. When a patch to the SGI came out, I would have to put that on myself. Later on, at the Wellcome Trust Sanger Institute and the Broad Institute, I was the guy between the researcher and the compute. Even more so at Harvard, in many different domains—we were the guys in between. To me, it’s the logical progression—Cycle is that layer between the massive complexity needed to orchestrate hundreds of thousands of unique computer instances, to be able to deliver on our scientific promises.
For me, Cycle is like a light bulb: If I’m a scientist walking into a lab, I want to turn a light bulb on to do my research, my chemistry, etc. I don’t care how the energy is generated and distributed. I just want to throw the switch, do my science, turn the switch off and walk away. I want utility supercomputing to get to that point—to drive both supercomputing and storage to be consumable items as line items on NSF and NIH awards. Computing should no longer be a capital item. It should be an on-demand, as-you-need-it platform.
What do you intend to bring to Cycle? Will you work more on the technical side or closer to the clients?
Cycle has amazing engineering talent—that is where they were founded in terms of building customer solutions. I want to engage our customers more deeply in science outreach and understand their grand-challenge science problems. I want to bring to bear many years’ experience as an interface to a brilliant faculty at Harvard, turning their dreams into viable computing assets.
I was talking to the [Cycle] engineers here this morning, who were showing me the deep, dark corners of the Cycle server platform. I don’t profess to understand the thousands of hours these guys have put into this. I want to help set strategy and work on gaps where we can be more competitive, which means improving our customers’ experience to the point where everybody gets more work done.
How much more can be done to push research into the cloud? Is cloud computing still underutilized?
It’s still like the early days of the electric company. Just because we had power didn’t mean the distribution to light up your house was simple; a lot of energy in the early days went into handling fuse boxes and complexity. If I think of myself as a lone grad student in a lab with a credit card and a happy PI willing to let me spend it, I’m not sure I’ll be all that effective with ‘the cloud’. What is it? It’s a cloud provider’s console, but I’ve got to build an operating system, I’ve got to get my algorithms ported, I’ve got to work out what interconnects are…
If you look at the QIIME source code from Rob Knight’s lab, there are thousands upon thousands of dependencies. If you look at cloud adoption, the tinkerers are currently tinkering around the edges, but Cycle has been tinkering for seven years now. We can get them onto these resources so they can think of them as a utility, from the electric-company perspective.
It’s the same reason Beowulf clusters were on a slow ramp, but once we started to get cluster administration tools and reliable batch schedulers and Linux stopped forking every two weeks and things calmed down a bit, the top-tier providers—Dell, HP, IBM in particular—embraced cluster computing at a rate we weren’t expecting. We’re a few years away from that [in cloud], but not that far, and Cycle is definitely positioned for the next logical step.
Some experts—notably The BioTeam’s Chris Dagdigian—have said Amazon has a large if not insurmountable lead in cloud computing. Do you agree?
I love Dag dearly, but I’m not necessarily inclined to agree with him. We go where the customers are. Today, the bulk of our customers are within AWS [Amazon Web Services]. To discount any player in this space is a dangerous game. As long as we keep following our customers and the science, I think everyone will be successful. As to the crystal ball—who wins that race? I don’t want to bet on that one!
I think of researchers like folks picking holiday destinations. They go where the weather’s warm, right? Even if the cost to get down to Florida is a bit high, it’s better than staying in Massachusetts in the winter.
Where do you see growth opportunities in your offerings for the life sciences?
Some of our recent press has been about building credibility that you can leverage and use these resources at high scale. This week, we proved we have the technical capability to run on 10,600 processors. But that’s not the business we’re in; we can show massive scale. I had similar conversations at Harvard—the astrophysicists would happily consume tens of millions of CPUs if they could get their paws on them. Museum collections had data challenges but didn’t need 1 million CPU-hours.
The typical compute challenge is of the order of 1,000 to 40,000 processors, which we now glibly consider small clusters. We’ll have difficult portability issues, security issues, compliance issues. There’s a set of things we want to do to help new customers get that work done. In financial services, there are a lot of ‘just-in-time’ computing challenges of the order of the size of Sequoia or Titan or the National Center for Computational Sciences. Those big systems will always be available at a national level. There’s no way a university should be building a 20- or 30-Megawatt machine in a local computing facility to solve its computing challenges.
What new technologies will most impact your services in the near future?
Not to pick any particular technology, but the ability to do high-performance parallel file systems with the ability to retain some control of your metadata in remote computing environments is of considerable interest to me.
I’m also aware of the challenges of the ‘last mile’—you can build national high-speed 100-200-Gigabit/sec networking infrastructure, but if your last mile is a much slower connection, you have to be clever about dealing with the type of technology you need on premises to be able to get in and out of these amazing resources. So other than a teaser to “watch this space,” I’ve been dealing with the last mile challenge for a while—how to get people’s computing off the desktop. That’s what we’ve been doing in a university setting for a long time and I want to apply some of those lessons learned in anger here, with an amazing engineering team who can actually turn some of my dreams into reality.
How critical are technologies that facilitate the transport of big data and how do you interact with them?
The Aspera technology is amazing and those protocols work incredibly well at the national centers—if you’re Ewan Birney or the head of the NCBI and you can license those technologies centrally, where it’s one to many, where many is millions, there’s great benefit.
In terms of on-wire capability—back to the Florida analogy—we go where the weather is warm and our customers are. We’re all going to have to be smarter about how we move data around. The cheapest way is never to move it in the first place. There are techniques and ideas I have in terms of where repositories actually need to be. Does your ultimate repository need to be local? We’re going to have lots of fun there.