November 10, 2009 | With a sequencing output approaching 500 gigabases per week and electricity consumption of a whopping 3 megawatts, the Wellcome Trust Sanger Institute (WTSI) is a prime contender for cloud computing. Guy Coates, group leader of informatics systems at WTSI, heads a quasi-internal consulting team that sits between the hardcore systems teams and the research scientists. Kevin Davies caught up with Coates at Bio-IT World Europe to discuss the institute’s early experiences with cloud computing.
Bio•IT World: What is the attraction of cloud computing for the Wellcome Trust Sanger Institute?
Coates: We have these very ‘spiky,’ very agile, very diverse workloads. We get ambushed when data arrives with very little notice. So one of the things we’ve been looking at is, can we use the cloud as emergency compute? Also, the cloud as a remote data store is interesting—we’ve got so much important data now that we need to think about data recovery and continuity. Traditionally we stuck things on tape and sent them to a warehouse or bunker somewhere. But the economics of doing that on tape now don’t stack up. If it goes into the cloud, we can do really cheap long-term data storage because of economies of scale.
Third is data sharing. We’re a net data exporter—we make all our data publicly available. But getting to it on a remote network is quite difficult. We collaborate with BGI-Shenzhen on the 1000 Genomes Project data, but moving those data backwards and forwards without resorting to a truck full of hard disk drives is difficult. If we need to make data available, do we park it in Amazon and use them as a content distribution network, because they have data centers and we can ask them to replicate data? Rather than people having to get to us, maybe they just need to have fast data paths to their local Amazon data center, which might be a more economical way of doing things…
We started doing this on a small scale. Amazon has a program to make public datasets available. If you want to compute across the Ensembl genome dataset, that’s all there in Amazon, so when you spin up your virtual machines, you don’t have to worry about downloading a copy of Ensembl and uploading it back into Amazon—it’s already there. That model is quite appealing.
What’s the biggest IT challenge given the explosive rate at which you’re churning out data?
Everyone has the standard exponential graph, which tracks sequencing output, disk storage, compute cores. The biggest IT headache is access to compute cycles and storage together. You’ve got the data pipelines, set up so that we don’t keep data in the early parts of the pipeline longer than it takes to process. But the disk pool is constant, and I might need all 5000 compute cores for a meeting tomorrow, with hundreds of other users all queued up. Trying to cope with that is hard. We’ve been looking at some feasibility studies: can we take the Illumina pipeline, take the raw data set, chuck it on Amazon, compute it, and get the same answers as we get internally? Can we get it to run in a sensible amount of time? Can we do the downstream analysis? Can we align the data against a reference genome?
Is your use of ‘the cloud’ synonymous with Amazon Web Services?
Yes, Amazon primarily, but we’ve been looking at other providers as well. We’ve been using Amazon’s hub in Dublin because it’s reasonably close to us in Cambridge, U.K. We go across the public Internet—JANET, the U.K. academic network. In theory we have a 2-Gigabit dedicated link onto JANET, which has very fast links between the various network hubs. Trying to get to Dublin, you end up peering out through Telehouse, London, and onto the public Internet, many different network providers. We found the big limiting factor is getting data in and out of the cloud. We realized 5-10% of our theoretical bandwidth. It’s really hard to trace down what’s going wrong.
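To put the 5-10% figure in perspective, here is a rough back-of-envelope sketch in Python. The 2 Gbit/s link speed and the 5-10% range come from the interview; the 1 TB dataset size is purely an illustrative assumption, and the function name `transfer_hours` is hypothetical.

```python
def transfer_hours(dataset_bytes: float, link_bits_per_s: float, efficiency: float) -> float:
    """Hours needed to move dataset_bytes over a link running at a fraction of its nominal speed."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    effective_bits_per_s = link_bits_per_s * efficiency
    seconds = dataset_bytes * 8 / effective_bits_per_s
    return seconds / 3600


LINK = 2e9       # 2 Gbit/s nominal link (figure from the interview)
DATASET = 1e12   # 1 TB dataset, an illustrative assumption only

# Compare the nominal link speed with the 10% and 5% cases.
for eff in (1.0, 0.10, 0.05):
    print(f"{eff:>4.0%} of nominal: {transfer_hours(DATASET, LINK, eff):5.1f} hours")
```

Under these assumptions, a transfer that would take a little over an hour at the nominal rate stretches to roughly 11 hours at 10% and roughly 22 hours at 5%.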
The first step is to use data distribution tools that know about wide area networks. The standard tools (scp/rsync) really don’t work well across the wider Internet. There’s a lot of work coming out of the grid communities on more intelligent software. There are also commercial companies who will sell you software to solve these problems. GridFTP is better, but we’re still not seeing the performance we should. It’s a complicated software stack as well… Finding easy ways to move data to other institutions that haven’t got dedicated IT staff is going to be really challenging.
We’re chasing down this problem, but how do you fix the Internet?! We have control until it goes outside our borders, and Amazon has control of its space, but in the middle it’s really hard to track down who is responsible for that piece of network infrastructure.
What have you done in the cloud so far?
We’ve taken an Illumina dataset, run the analysis pipeline and done the alignments. We got the actual software running inside the cloud eventually, but it wasn’t trouble-free. We had a very good collaboration with Amazon and their technical people… The big problem, unsurprisingly, was that the same I/O and disk problems we have inside sequencing data centers, you basically have in the cloud. The storage infrastructure behind Amazon is quite different from what you might find in a data center. We have these big Unix file systems, whereas in Amazon, you can create these Unix file systems, but they don’t go particularly quickly or scale well. We have problems once you go above 8 nodes: trying to do traditional NFS client-server operation really doesn’t work.
Amazon’s response is: ‘Don’t do that, then.’ They provide a different storage model called S3, which is a web interface where you pull blocks of data. You don’t have a Unix file system. It’s much more scalable, and you get good performance across lots of nodes. It’s not traditionally how HPC applications have behaved. We spent time taking all the code that expects to read and write to a file system and getting it to talk to this S3 storage layer instead. That way, we were able to get the [desired] performance. There are all sorts of tricks to make that migration easier, if not as efficient. It’s a trade-off between time to rewrite your code and time to just run the stuff in a slightly inefficient manner.
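The shift Coates describes, from reading a file system to pulling blocks of data by request, can be illustrated with a minimal sketch. This is not the Sanger code and not Amazon’s API: `ObjectStore`, `get_range`, and `iter_chunks` are hypothetical names, the key and payload are made up, and an in-memory dictionary stands in for the remote service. The only S3 behavior assumed here is that objects are addressed by key and can be fetched in byte ranges.

```python
from typing import Dict, Iterator


class ObjectStore:
    """In-memory stand-in for an S3-style store: whole objects addressed by key."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Objects are written whole; there is no seek-and-overwrite as on a file system.
        self._objects[key] = data

    def get_range(self, key: str, start: int, end: int) -> bytes:
        # Return bytes start..end inclusive, in the style of an HTTP Range request.
        return self._objects[key][start:end + 1]

    def size(self, key: str) -> int:
        return len(self._objects[key])


def iter_chunks(store: ObjectStore, key: str, chunk_size: int) -> Iterator[bytes]:
    """Yield an object in fixed-size chunks, each fetched by an independent request."""
    total = store.size(key)
    for start in range(0, total, chunk_size):
        end = min(start + chunk_size, total) - 1
        yield store.get_range(key, start, end)


store = ObjectStore()
store.put("run42/lane1.seq", b"ACGT" * 10)  # hypothetical key and payload
chunks = list(iter_chunks(store, "run42/lane1.seq", chunk_size=16))
print(len(chunks), b"".join(chunks) == b"ACGT" * 10)
```

Because each chunk is an independent request, different nodes can fetch different ranges of the same object without sharing a file-system mount, which is one plausible reading of why this model scales across many nodes where NFS did not.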
Are there any security concerns with your data in the cloud?
As far as data security goes, Amazon will give you the earth: sets of legal t’s and c’s saying what they’ll do to the data and who it’s invisible to. You can request to lock down sets of machines so that they can’t talk to each other and are firewalled off. If you need something tighter than what they offer by default, because of extra regulatory targets say, they’re open to discussion. Most of the things we’re thinking about will be publicly available anyhow, so putting them in the cloud is an easy jump for us.
Fair to say you’ll be doing more work in the cloud in the future?
We’re still feeling it out. Penguin Computing has Penguin On-Demand, which is a slightly different model. Instead of having virtual machines, you get time on a big cluster with a big cluster file system behind it… If you really want fast dedicated networking, they have one or two datacenters [in the U.S.], but for the moment we just ship them hard disks. The nearer you are, the more likely you are to get a more dedicated link. If we wanted to pay for a dedicated link into Dublin, we could do that. But those things are not cheap, because you’d have to lease fiber from a telecom company. The great thing about cloud and computing as a service is that, if cloud provider A is suddenly cheaper than cloud provider B, then you can just migrate everything across. But if you’re tied to the physical infrastructure, you’ve changed from an on-demand model to a long-term partnership. The physics of communications may be that you just get forced down that road anyhow. If the data is just too difficult to move, do we just spin up more compute services to allow people to run VMs to compute against our silos of data? That’s an interesting question.
This article also appeared in the November-December 2009 issue of Bio-IT World Magazine.