September 28, 2010 | “There are no vendors that have their finger on the pulse of the problem. I can tell you in no uncertain terms that that is the case.”
David Dooling, assistant director of informatics at The Genome Center at Washington University, is not mincing words. But he’s not totally throwing storage vendors under the bus either. “The issue is that it’s not a single problem. It’s not just transferring data off the instrument. It’s not just doing analysis. It’s not just storing data long term. It’s not just archiving it off to some lower cost solution after a certain amount of time. It’s not just high performance. It’s not just reliability. It’s all of those things.”
It’s a long list, and Dooling certainly throws down the gauntlet for storage vendors. As sequencing gets faster and cheaper, researchers’ demands on storage systems are going to mushroom to levels deemed almost unimaginable just a few years ago.
Analysis not Acquisition
“Most people are talking about how to handle the instrument load,” says Aaron Gardner, senior IT director of cyberinfrastructure at the Interdisciplinary Center for Biotechnology Research at the University of Florida. The bigger issue, however, may well be how much storage will be consumed when researchers start performing meaningful analysis on all those data. “As we tackle larger and larger projects with second-gen, third-gen... the lion’s share of the data is going to come from analysis and not just acquisition.”
Guy Coates, in the Informatics System Group at The Wellcome Trust Sanger Institute, agrees. “Data management is the real issue. It’s easy to hook up some storage, hook up some sequencers, and then fill up your storage full of sequencing data. But then you need informaticians to analyze it, maybe you want to archive the data, share it with collaborators, and that’s where we’re seeing the real pain points.”
Coates estimates that the Sanger Institute has about 8 petabytes (PB) of storage now, and has been doubling its storage footprint every year. With collaborations like the International Cancer Genome Consortium (ICGC) and the 1000 Genomes Project, he says researchers will soon need to share a petabyte of data with scientists in China or the U.S. “Moving even terabytes of data across the internet is really hard, and so finding ways to share data effectively is a real challenge,” he says.
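Coates’ pain point is easy to quantify: even on a fully saturated link with no protocol overhead, moving a petabyte takes months at gigabit speeds. A back-of-envelope sketch (the function and figures here are illustrative, not from the Sanger Institute):

```python
def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Days to move size_bytes over a link, assuming ideal sustained throughput."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / 86_400  # seconds per day

PB = 10 ** 15  # one petabyte, in bytes

# A petabyte over a saturated 1 Gbit/s internet link: roughly three months.
print(round(transfer_days(PB, 1e9), 1))

# Even at a sustained 10 Gbit/s, it is still over a week.
print(round(transfer_days(PB, 10e9), 1))
```

In practice, shared links, TCP behavior over long distances, and checksumming push the real numbers well past these ideal ones, which is why petabyte-scale collaborations so often fall back on shipping disks.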
But it’s not just external collaborations that slow things down. “Every Ph.D. student or investigator has a totally different way they want to analyze data,” says Coates. “And that’s where you see this big explosion of data, because every Ph.D. student has a different copy of the canonical dataset, because they’re not quite sure where the original dataset lives. All of that data hangs around chewing up real estate for a long time.”
Guillaume Bourque, senior group leader at the Genome Institute of Singapore, also sees applications as the sticking point. “Our first hurdle was the amount of data,” he says. “But now the applications are challenging: submitting, sharing, retrieving, and processing data, and allowing access to shared data.”
Gardner says his colleagues increasingly want to do more than just throw in a sample and retrieve some data on the other side. “They actually want to design broader, more comprehensive experiments,” says Gardner. “When that thinking comes into play, the volume of what we’ll call transient or incremental data, the intermediate data required to hammer out an analysis, will drive storage more than maintaining instruments and throughput.”
Bjorn Andersson, director of product marketing and HPC at BlueArc, a storage vendor, says his company recently tested its storage with Illumina to better understand researchers’ needs on the storage side. One of the key lessons was that life scientists live in a more mixed environment in terms of file types, file sizes, and operations. “If you think about having sort of shared storage, where you have many different researchers running experiments and doing analyses, by definition, the workload on that storage is extremely random and very unpredictable.”
The solutions, Gardner says, must address networking and not just capacity. “We’re definitely seeing latency, throughput, and interconnects starting to matter now that we’re moving so much data and trying to ramp up.”
At the University of Florida, Gardner will be not only scaling capacity but also “making sure we upgrade all front end interconnects to [10 Gigabit Ethernet] as well as trunking multiple 10Gig-E links where needed. We’re looking at increasing capacity, upgrading interconnects, network backbone, and also trying out different protocols and making small incremental improvements that may help with some of the latency issues.”
New Gen, New Problems
In some ways, advancing software has been the saving grace for storage thus far. “As the software has improved, people are less and less adamant about retaining extra bits of information about that base call,” says Dooling. Bourque at GIS is not keeping any images at all. Gardner says that his team is no longer keeping images, instead relying more on automation to handle the storage side of sequencing, which frees them up to spend more time on quality control and quality assurance.
But Dooling warns, “The clock is going to reset as you get to these single-molecule sequencing machines... It’s a whole new data type. You’ve got a new error model that dominates on that data type.” Historically, Dooling says the dominant error mode in Sanger sequencing was base substitutions, the likelihood of which could be represented for any base with a single quality model. “Whereas now the dominant error mode is going to be insertions and deletions. And tying a single quality model to represent an insertion or deletion before that base, an insertion or deletion after that base, is much more challenging.”
Trusting results has a direct impact on one’s storage needs. “If you get a base call and you believe it, you can store it in a very compact way: two bits. Right now, the best that we’re doing is a byte [8 bits] per base,” Dooling says.
Coates agrees that future iterations will require changes in the way data are stored, but he thinks that lessons learned now will help in the future. For the past 18 months, Coates and colleagues have been examining which data Sanger scientists really need to look at, as well as which data they say they want to look at but never have.
“We have reduced the amount of intermediate data we keep, but it has taken a long time to get a consensus as to what data is important,” he says. “When we get to next-next generation, I think people are going to be much more aggressive about [throwing away data]. It’ll be hard to wait around to decide what to keep.”
Coates also expects data options to change as new sequencers pour out data. “There is going to be so much data coming off that even doing the pre-processing is going to require that what you get off the end is effectively aligned or quality-scored sequence. The raw data, whatever that may be, you’ll never see it and you’ll never get your hands on it. You’ll just have to live with that.”
The Paying Game
As the cost of sequence generation sinks to the magical $1,000 mark, the cost of sequence analysis and storage, although falling fast, will make up more of the total cost of genome interpretation. Cheap runs might mean cheap reruns, but Diamond Lauffin, VP of sales at FileTek, maker of the hardware-agnostic StorHouse data lifecycle solution, points out that cost per run isn’t the only consideration. “We’re not going to have the resources or the time to rerun those experiments,” he says. “We run so many of them now. We’ve increased our workflow!”
“IT is definitely going to become the dominant part of the economics of doing this stuff,” says Coates. And as the cost scales tip toward storage, that will require “a big change in mindset.”
“A lot of people aren’t budgeting for [storage] properly,” agrees Will McGrath, software alliance manager at Quantum, maker of StorNext. “They’re spending all their money on sequencers and then they’re getting 2, 4 terabytes (TB) a week and they’re thinking, ‘Wow, I didn’t plan to budget this much for storage.’”
Singapore’s GIS found itself in that very position. “We were playing catch up for a while,” says Bourque. The Institute increased the size of its data center last year and is eyeing further expansion as it considers the Pacific Biosciences platform. GIS is using StorNext for its “main pool” of data, Bourque says.
“Part of the cost of the $1,000 genome is the IT component,” agrees Chris Blessington, senior director of marketing and communications at Isilon, one of the publicly traded storage vendors whose stock price has surged in 2010. “It’s not just sequencing. It’s building up the IT environment to be able to support that sequencing activity. The extent to which you can do that easily and maintain that easily, and not have to spend a lot of money growing and maintaining your storage, helps you achieve that price point.”
Dooling asserts that just finding cheap storage is not the answer. “You can spend $200 per terabyte, but you’re a fool if you do.” The goal is to find the right combination of stability and performance, and that will vary by situation. The Genome Center has several storage administrators maintaining a heterogeneous environment, but Dooling thinks that smaller shops are “probably better off making a compromise on performance” in favor of stability and less complexity.
“You get what you pay for,” he says. “If you’re willing to pay $20,000 to $30,000 per terabyte, you’ll probably get a solution that really works for you. So that’s the dilemma you end up in. Either you make compromises, or...” He pauses. “Basically you have to make compromises.”
The National Human Genome Research Institute (NHGRI) currently stores over 400 TB of genomics data, and turned to Coraid to find the right balance of performance and cost. Coraid’s EtherDrive storage eliminates complex multipathing and controller bottlenecks, explains Carl Wright, the firm’s executive VP of sales. NHGRI chose Coraid’s EtherDrive SR2461 with 10 Gigabit Ethernet connections running between SAN storage and the HPC cluster. The solution is a “good balance between economy and performance,” says NHGRI investigator Elliott Margulies, adding that it provides valuable flexibility in storage configuration.
Flexibility is a buzzword. Larry Jones, VP of marketing at Panasas, sees users at international databanks, research institutions, and pharma all needing flexibility in how they store genomic data. “Not only do they have the data that’s collected from the sequencing machines to be stored, but they also have [their own] ongoing research,” Jones says. “What they want is a single pool of storage that keeps them next to the data, but also segregates them into their own experiments.” In other words, one storage pool dedicated to the research and another pool dedicated to the results. The ability to segregate storage based on different requirements in terms of cost and performance makes Panasas an attractive option, Jones says.
Jeff Denworth, VP of marketing at DataDirect Networks, says that the timing is perfect for DDN. “We’re highly optimized not just for the genome sequencing centers of today, but as these organizations have to grow the amount of data they store by 10X as well as the amount of processing that’s required to assemble and align genomes.”
Denworth says that automated tiered storage systems are becoming essential to leverage media in a timely and purposeful way. For example, DDN’s GRIDScaler product is based on GPFS, an advanced file system technology that provides “much more intelligent storage access and better utilization of network performance to feed compute nodes and HPC clusters as they become larger and these nodes become faster,” he says.
At 500 TB of raw storage across multiple tiers, Gardner at the University of Florida uses a homegrown storage cluster made up of Dell hardware and open source tools. He recently began working with ScaleMP, which offers a virtualization-for-aggregation technology. Its software aggregates all of the storage cluster systems together so that they act as one large system. Managing and customizing the storage happens in the Linux kernel.
Gardner concedes that things are getting more complicated as his group increasingly emphasizes data analysis. He predicts a change in vendors’ key salable features as well, evolving from an emphasis on capacity, backup, and dedupe capability toward issues such as: “How do we quickly and securely deliver the right data to a web service or partnering research center with a minimal amount of overhead? How do we, with the lowest latency, immediately start processing large volumes of data across a distributed network of resources?” Even Gardner, a firm believer in open source, admits, “At that point vendor solutions and partnerships for storage may become more important [for us].”
For the most part, storage solutions are keeping pace with the surge in sequence data being generated, but that doesn’t stop users from sharing a wistful feature list. Coates dreams of “really good metadata capabilities inside the storage.” He envisions a hybrid of a high-performance file system with a database. “Some combination system where one could have fast file access, with metadata or database-like semantics on top. One where you could say, ‘Find me all the files related to this experimental study and that experiment.’ That would be great.”
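Nothing like Coates’ wish shipped in 2010’s storage stacks, but the idea can be approximated with a sidecar index: files stay on the fast file system, while a small database holds the metadata and answers the “find me all the files” queries. A hypothetical sketch (the schema, paths, and field names are invented for illustration, not drawn from any Sanger system):

```python
import sqlite3

# Sidecar metadata index: file contents live on the file system,
# searchable attributes live in a small SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE files (
           path TEXT PRIMARY KEY,
           study TEXT,          -- experimental study the file belongs to
           experiment TEXT,     -- specific experiment within the study
           kind TEXT            -- e.g. 'raw', 'aligned', 'intermediate'
       )"""
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("/data/icgc/run1/reads.bam", "ICGC", "run1", "aligned"),
        ("/data/icgc/run1/tmp.sam", "ICGC", "run1", "intermediate"),
        ("/data/1000g/run7/reads.bam", "1000G", "run7", "aligned"),
    ],
)

def files_for(study: str, experiment: str) -> list[str]:
    """'Find me all the files related to this study and that experiment.'"""
    rows = conn.execute(
        "SELECT path FROM files WHERE study = ? AND experiment = ? ORDER BY path",
        (study, experiment),
    )
    return [path for (path,) in rows]
```

The catch, then as now, is keeping the index and the file system in sync, which is why Coates wanted the metadata capability inside the storage rather than bolted on beside it.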
Dooling’s list sounds a little more challenging. “I’d like to see more openness and interchangeability between the platforms and solutions,” he says. “I’d like to see a disconnect between hardware and software. If one company has a file system I like and one company has a clustering solution that I like and one company has hardware that I like—meaning that it’s reliable and cheap—I’d like to put that all together.”
And the kicker? “I’d like them not, when there are problems, to point fingers at each other.” •
This article also appeared in the September-October 2010 issue of Bio-IT World Magazine.