New Big Data Solutions at Sanger Institute

by Aaron Krol

(Corrected 10/9 - see bottom.)

October 7, 2013 | DataDirect Networks (DDN) today announced an expanded partnership with the Wellcome Trust Sanger Institute to deploy more than 10 petabytes of high-performance storage through DDN’s SFA 12K storage engine. The Sanger Institute has also adopted DDN’s EXAScaler parallel file system appliance, leveraging DDN’s expertise to integrate parallel file systems with the newly available storage arrays.

The Sanger Institute operates 30 DNA sequencers on a daily basis, each of which generates roughly a terabyte of data every day. “It’s just a question of scale,” Tim Cutts, Acting Head of Scientific Computing at the Sanger Institute, told Bio-IT World. “As the price of sequencing has dropped so dramatically since the advent of next-generation sequencing, the scientists have come up with all sorts of absolutely fantastic ideas, which involve generating mountains of data very quickly.”

Currently the Sanger Institute has a total capacity of 22 petabytes of storage from DDN and other sources*, a figure that Cutts projects will double in the next three to five years – if not sooner. Sitting on a huge and growing volume of data, the Institute decided the time had come to dramatically upgrade its storage capacity. DDN, which had already been providing a portion of the Sanger Institute’s storage with its previous-generation SFA 10K engine, was a natural choice, thanks to the company’s extensive experience working with life science research organizations.
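To put those figures in perspective, the back-of-the-envelope arithmetic can be sketched as follows. This is a rough illustration only, not a calculation from Sanger or DDN: it assumes decimal units (1 petabyte = 1,000 terabytes), that all 30 sequencers run every day of the year, and that storage grows at a steady compound rate. The function names are hypothetical, and raw sequencer output is not the same as data retained on disk.

```python
# Figures quoted in the article.
SEQUENCERS = 30                  # DNA sequencers operated daily
TB_PER_SEQUENCER_PER_DAY = 1.0   # "roughly a terabyte" each, per day
CURRENT_CAPACITY_PB = 22.0       # total storage capacity today

# Assumption (not from the article): decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000.0


def raw_output_pb_per_year(sequencers, tb_per_day, days=365):
    """Raw sequencer output per year in petabytes, assuming no downtime."""
    return sequencers * tb_per_day * days / TB_PER_PB


def annual_growth_for_doubling(years):
    """Compound annual growth rate that doubles capacity in `years` years."""
    return 2.0 ** (1.0 / years) - 1.0


if __name__ == "__main__":
    yearly = raw_output_pb_per_year(SEQUENCERS, TB_PER_SEQUENCER_PER_DAY)
    print(f"Raw output: about {yearly:.2f} PB per year")
    for years in (3, 5):
        rate = annual_growth_for_doubling(years)
        print(f"Doubling {CURRENT_CAPACITY_PB:.0f} PB in {years} years "
              f"implies about {rate:.1%} growth per year")
```

Under these assumptions, the sequencers alone produce on the order of 11 petabytes of raw data a year, and doubling capacity in three to five years corresponds to roughly 15 to 26 percent compound growth annually.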

At the same time, the Institute wanted to simplify its treatment of the data it generates. Previously, engineers at Sanger had implemented their own Lustre file system to organize the Institute’s data, bringing in storage and servers from separate vendors and writing their own software. This was a time-intensive process that kept the engineers from working on projects directly supporting the Institute’s research. With the EXAScaler appliance, DDN’s storage arrays are already integrated with a Lustre parallel file system.

“High-performance storage these days is not just hardware,” says Cutts. “There’s a lot of software involved, and the entire stack needs to be understood, preferably by a single organization so that you get a unified support model… That’s the really compelling thing that DDN have offered to us.” The EXAScaler file system will support roughly half of the Sanger Institute’s data; for the other half, which uses object storage, the Institute will continue using an in-house system, although it is contemplating a transition to another DDN product, the WOS object storage platform.

The Sanger Institute has always been at the forefront of big data in the life sciences, having been founded as a crucial player in the Human Genome Project at a time when the data required to capture just one genome was at the very edge of computer storage capabilities. In recent years, whole genome sequencing has become tens of thousands of times more cost-effective – and the Institute’s projects have scaled up accordingly. Fast, cheap genome sequencing contributes to ambitious missions like the Cancer Genome Project, which identifies somatic mutations carried by cancer cells in the hopes of understanding the genetic causes of cancer, and the Pathogen Genomics Program, which not only produces reference genomes for deadly microorganisms, but also uses population genomics to trace the spread and origin of individual outbreaks.

Sanger’s databases are a central part of the genetic research ecosystem worldwide. As a non-profit organization, the Institute makes almost all of its data publicly available on platforms like the Ensembl Genome Browser, where researchers around the world search the genomes of model organisms. With the expanded storage offered by DDN, and the option to smoothly scale up in the future, those public databases will continue to thrive.

DDN is used to the massive data demands of the genomics industry. “We work on the commercial genomics side and on the academic research side, and we’re widely adopted for genomics,” says Laura Shepard, Director of Marketing for HPC at DDN. Sequencing is precisely the type of complex and information-intensive task that DDN specializes in. “This used to be a very disjointed workflow… Being able to [perform every step in the sequencing process] all on one platform, which is able to grow when you encounter these wonderful generational transitions, is something that’s very desirable from a storage partner for these institutions. That is why DDN is not only very well established here but also keeps growing and growing.”

Tim Cutts agrees: “The important thing for us is to have a storage vendor that takes the care to understand the problem that the customer is trying to solve.”

* CORRECTION: The original version of this article cited a figure of 10 petabytes as the amount of storage Sanger is currently using, and 22 petabytes as the amount DDN is supplying. In fact, DDN is supplying 10 petabytes of a 22-petabyte total capacity, and Cutts expects Sanger’s storage demand to double within three to five years. The article has since been corrected.