How can we say we need to keep all these bases when our past behavior suggests otherwise?
September 28, 2010 | Inside the Box | The genomics field has gone through enormous change in the last ten years. Despite grumbling from some quarters about genomics “failing to deliver,” we have seen huge progress. As next-generation sequencing (NGS) matures, the next ten years will produce even more changes. For us to maximize the benefit of this fast-moving technology, however, we need to devise some coping strategies.
The most obvious obstacle is going to be the sheer scale of data output. Ten years ago, the first human chromosome was published. At the time it was the longest contiguous piece of DNA sequenced—all 40 megabases of it! Now this seems like a test run. Previously the bottleneck in genome sequencing was the actual sequencing stage. The analysis portion, although by no means fast, was of less concern. Today that is no longer the case. A mere two lanes of an Illumina run can produce a good mammalian assembly. And if sequencing a mammalian genome in about a week wasn’t impressive enough, the future promises at least an order-of-magnitude increase in data production. To spice things up, the third generation of sequencers will reduce the price and the footprint even further. This will open up sequencing to many more people. No longer will the large sequencing centers have a near monopoly.
So what does this mean for us computational folk? A big increase in data production means we first have to store it. With data sizes in the tens of terabytes per day, this is not a question of buying a few more hard disks. We have to consider network bandwidth, access speeds, power, cooling, and even the distance between sequencer and storage. Effective backups—something users seldom think about—are vital.
One solution is simple: just don’t store so much data. Although this may seem a drastic step, consider the raw data produced by the sequencers and the initial analysis stage of processing image data into base calls and quality scores. Images always take up large amounts of storage space—the storage needed can exceed 100 times that for the final bases/qualities. Deleting image data as soon as they are processed (or for a short time afterward for quality control purposes) significantly lowers the storage burden. A frequent argument against discarding these data is that we might have to revisit them some day. Algorithms may improve, bugs may be found, and everything will need to be reanalyzed. But keeping data around is not free. Disks cost money, both to buy and to run. Backups need to happen (and be reliable). It is not so far-fetched to say that two years from now, resequencing most DNA will be quicker, cheaper, and more accurate.
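The scale of that saving is easy to sketch. A minimal back-of-the-envelope calculation, using the article's roughly 100:1 image-to-bases ratio (the 0.5 TB/day of base calls is an illustrative assumption, not a figure from any specific instrument):

```python
# Back-of-the-envelope saving from deleting image data.
# image_ratio ~100x comes from the text; bases_tb_per_day is an
# illustrative assumption, not a figure for any real sequencer.

def daily_storage_tb(bases_tb_per_day: float, image_ratio: float = 100.0):
    """Return (with_images, bases_only) daily storage in terabytes."""
    images_tb = bases_tb_per_day * image_ratio
    return images_tb + bases_tb_per_day, bases_tb_per_day

with_images, bases_only = daily_storage_tb(0.5)
print(with_images)  # 50.5 TB/day if images are kept
print(bases_only)   # 0.5 TB/day once images are discarded
```

Even with modest base-call volumes, keeping the images multiplies the daily storage burden a hundredfold.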
Once we’ve discarded those cumbersome image files, what more can we do? Let’s step back from focusing on single bases and take a broader look. In many cases the sequences are a means to an end. We might want to find SNPs/indels in a population or a cell’s transcriptome or which genomic regions bind transcription factors. If we can distill a sequencing run down to some summary data, along with details of the results, why would we still need the individual reads? Why not just store the coordinates and coverage for a transcriptome, or the details about each SNP rather than each read?
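As a toy illustration of that distillation, here is a hypothetical sketch that reduces a set of aligned reads to a per-position coverage table, the kind of compact summary one might store in place of the reads themselves (the `(start, length)` read model and the `coverage` helper are my own simplifications, not any real pipeline's format):

```python
from collections import Counter

# Hypothetical sketch: collapse aligned reads into per-position depth.
# Each read is modelled as (start, length) on a single reference; real
# data would come from an alignment format, but the principle is the same.

def coverage(reads):
    """Return {position: depth} for a list of (start, length) alignments."""
    depth = Counter()
    for start, length in reads:
        for pos in range(start, start + length):
            depth[pos] += 1
    return dict(depth)

reads = [(0, 4), (2, 4), (3, 2)]
print(coverage(reads))  # position 3 is covered by all three reads
```

Three reads stored as coordinates and a depth table take a fraction of the space of the reads plus their per-base qualities, which is the whole point of summarizing.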
It’s worth remembering that many researchers already do this. For instance, once a genome is assembled almost nobody goes back to the reads. We’re usually happy with the consensus sequence, quality scores, and SNP calls. And, to be honest, few of us even use the quality scores. Whether this is the correct use of the data is another question, but the point remains: how can we say we need to keep all these bases (at great expense) when our past behavior suggests otherwise? When we’re talking about millions of dollars being spent per year on storage, a ‘just in case’ argument needs a little more substance.
Of course this approach won’t be suitable in all cases. Throwing away all upstream data produced in the Neanderthal sequencing project, for instance, would be crazy. If there is any question that the DNA is in short supply, obviously we need to preserve data from each step. But how often does this happen? An informal poll of scientists I conducted recently suggests 5-10% of samples in use are in short supply. If so, then our storage problem looks a little more tractable.
This “data deluge” is only the first step. Once we have decided on a strategy for data storage, we still have a huge way to go. The end data will still need analyzing; even stripped down, there will be much more than we’re used to. We’ll need to devise new strategies for automatic analysis—much less will be done by hand. But however hard the next step is, it will be much more interesting than buying disk drives.
Michele Clamp is a senior scientific consultant with the BioTeam. She can be reached at firstname.lastname@example.org.