June 13, 2007
It is all but passé to mention the flood of primary laboratory data that has shaped bioinformatics over the past several decades. First automated sequencing, then microarrays, and now next-generation sequencers have pushed the capacities of both storage and computational systems. Exponential growth has characterized the public sequence repositories for so long that I rarely use my tired “sequences stored in NCBI over time” slide anymore. Everybody already knows the story.
It is with some hesitation, therefore, that I wade into another column about yet another flood of data. This one is interesting, however, because it will not come from the large sequencing centers, but from a large number of smaller labs as they start to use next-generation sequencing machines. Over the past few months, I’ve spent time working with some early adopters of the technology — Timothy Read’s group at the Naval Medical Research Center in Rockville, Maryland. What follows are some observations based on that experience.
For the past three decades, the vast majority of DNA sequences were read using variants of Sanger sequencing. Clever engineering and chemistry pushed the data generated by a single run of a sequencing machine from hundreds of bases to around 1,500. Sequencing even the smallest genomes incurred a huge burden of lab work: scientists had to prepare thousands of samples and analyze thousands of independent result files.
Next-generation sequencing machines (see “The Drive for the $1000 Genome,” Bio•IT World, May 2007) use completely different biochemical technologies and do not appear to suffer the same limitation. A single instrument run produces information on hundreds of millions of base pairs. For many prokaryotes, this can entirely remove the requirement of cloning out fragment libraries and performing a final assembly: a single machine run can read every base pair of the genome many times over. For larger genomes, multiple runs are still necessary, but perhaps 100,000-fold fewer. As always, there are interesting exceptions. That said, with an initial investment of several hundred thousand dollars in a sequencing machine, plus reagent costs of $5,000 to $10,000 per run, a small lab can produce unprecedented amounts of data.
Read’s group plans to sequence several strains of pathogenic microbes per week, every week, for the next several years. They will do this with a lab staff of fewer than 10 people on a pair of FLX machines from Roche (formerly 454). A single sample run on these devices requires approximately seven hours and produces between 20 and 30 GB of primary image data, which reduces to about 100 megabases (Mb) of finished sequence per run. (I have not worked with the Illumina 1G Analyzer yet, but it produces about 1 TB of primary image data over a three-day run, yielding several hundred million base pairs of sequence.)
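Those figures are easy to sanity-check with a bit of arithmetic. The sketch below multiplies out the per-run numbers cited above; the five-runs-per-week rate is an illustrative assumption on my part, not the lab’s actual schedule.

```python
# Back-of-envelope sizing from the figures cited in the text.
# runs_per_week is a hypothetical rate; the per-run sizes are the
# article's own numbers for the Roche/454 FLX.

runs_per_week = 5            # assumption: one sample run per weekday
image_per_run_gb = 25        # midpoint of the 20-30 GB of primary image data
sequence_per_run_mb = 100    # ~100 Mb (megabases) of finished sequence

weekly_image_gb = runs_per_week * image_per_run_gb
monthly_image_tb = weekly_image_gb * 4 / 1000   # ~4 weeks per month

print(f"Image data: {weekly_image_gb} GB/week, ~{monthly_image_tb:.1f} TB/month")
print(f"Finished sequence: {runs_per_week * sequence_per_run_mb} Mb/week")
```

At that pace the raw image data alone comes to roughly half a terabyte a month, dwarfing the finished sequence by two to three orders of magnitude.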
Those familiar with chromatograms and microarray images will recognize the split between “primary instrument data” and “finished sequence.” The large, primary data will rapidly become unmanageable, relative to the refined sequence data. Industry groups are already proposing data format and experimental “metadata” standards, which may or may not be applicable as researchers begin using these devices in a production capacity. But whether it is 1 TB per day or every couple of weeks, these devices are going to push the limits of data handling for smaller labs.
Shifting Buckets, Catching Leaks
Read’s lab is planning to capture 0.5 TB of raw image data per month over the next 12 months. This isn’t unreasonable with their current data storage infrastructure, though it’s certainly beyond the “FireWire disks on shelves” solution I’ve seen elsewhere. Initially, we plan to keep both raw data and finished sequence on disk, then re-assess what must be kept and what can be archived to tape. Larger genome centers don’t have this option; they are starting off with the unmanageable today.
We are also considering how to manage the data workflow from sample preparation through comparative genomic presentation. Questions that formerly belonged to the big sequencing centers are now pressing problems in smaller labs. Keeping an organized lab notebook, on paper, is insufficient to support a production pipeline of this sort. We’re developing a web-based “dashboard” to allow everyone to understand the progress and provenance of data as it arrives at their door. This raises the question of the pipeline itself, be it scripts, “smart folders,” or some glitzier workflow solution. At the very least, all data motion and standard processing tasks must be automated to prevent the newly capable lab from burying the informatics staff under a rising tide of trivial data motion tasks.
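The “scripts” end of that spectrum can be surprisingly small. Here is a minimal sketch of automated data motion: sweep an instrument’s drop directory for completed runs, checksum every file for later integrity checks, and move the run into archive storage. The paths and the “RUN_COMPLETE” sentinel file are hypothetical conventions of my own, not anything the FLX software actually writes.

```python
# Sketch of an automated data-motion sweep. Assumes a (hypothetical)
# convention in which the instrument drops each run into its own folder
# under INCOMING and touches a RUN_COMPLETE file when it finishes writing.
import hashlib
import shutil
from pathlib import Path

INCOMING = Path("/data/incoming")   # hypothetical instrument drop point
ARCHIVE = Path("/data/archive")     # hypothetical on-disk archive

def checksum(path: Path) -> str:
    """MD5 of a file, read in 1 MB chunks so large image files fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def sweep() -> None:
    """Move each finished run folder to the archive, leaving a checksum manifest."""
    for run in INCOMING.iterdir():
        if not (run / "RUN_COMPLETE").exists():
            continue  # instrument may still be writing; skip for now
        manifest = [f"{checksum(p)}  {p.name}"
                    for p in sorted(run.glob("*")) if p.is_file()]
        (run / "MANIFEST.md5").write_text("\n".join(manifest) + "\n")
        shutil.move(str(run), ARCHIVE / run.name)
```

A cron job calling `sweep()` every few minutes is enough to keep people out of the business of dragging folders around by hand, and the manifest lets anyone verify a run after it has crossed a network or a tape.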
The term “Sneakernet” was coined decades ago for the network of athletic-shoe clad staff carrying data on physical media from building to building and coast to coast. This practice is now being adopted on research campuses, which suddenly need to move a terabyte here, a terabyte there, out of the lab. As the old saying goes: “Never underestimate the bandwidth of a station wagon filled with disk drives.”
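The old saw holds up under arithmetic. The sketch below compares the effective bandwidth of shipping a wagon-load of disks overnight against pushing the same data through a network link; the drive count, link speed, and transit time are all illustrative assumptions.

```python
# Quantifying the station-wagon saying. All inputs are illustrative:
# 20 drives of 500 GB each, a 24-hour courier, and a 100 Mbps uplink.

drives = 20
drive_tb = 0.5
transit_hours = 24          # overnight courier, door to door

bits = drives * drive_tb * 1e12 * 8
sneakernet_mbps = bits / (transit_hours * 3600) / 1e6

link_mbps = 100             # assumed lab uplink
network_hours = bits / (link_mbps * 1e6) / 3600

print(f"Courier: ~{sneakernet_mbps:.0f} Mbps effective")
print(f"{link_mbps} Mbps link: ~{network_hours:.0f} hours for the same payload")
```

Under those assumptions the courier delivers nearly a gigabit per second of effective bandwidth, while the network link would be saturated for more than a week moving the same 10 TB.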
The core problem is not that disks are too small or that network pipes are too narrow. Slapping a technological fix (expensive switches and a few petabytes of high-end storage in the central facility, for example) on this problem would miss the point entirely. In order for a system to grow in a sustainable manner, users must build locally sustainable islands of functionality. This means bringing the storage and the computing capacity to the lab, rather than shuttling it off to central repositories. This will probably entail either a small cluster or a multi-core machine dedicated to each sequencer, as well as tens of terabytes of disk storage closely paired with the lab equipment.
This shift to local storage and processing will require changes in organizational roles at many levels. The standard institutional answers won’t suffice for a lab that produces a terabyte of data per day, especially if the problem expands to every lab on campus. Researchers will need to take an active role in communicating their needs to central IT groups and in listening to the constraints on available IT solutions. Both sides should be compassionate through this process, since it is about to become very easy to demand the impossible.
Major sequencing centers are examining the cost tradeoffs involved in analyzing and then discarding the primary image data, rather than trying to archive it all in a constantly accessible form. It may well be both cheaper and more effective to re-sequence samples of interest than to invest in petabyte-scale network storage. My apologies to any experimentalists in shock at the thought of discarding any of their precious primary data! Caution is certainly necessary, as is a consideration of all the options. It’s all part of the brave new world of next-generation sequencing.
And so we await with interest the reaction of the traditional sequence data repositories to the challenges of next-generation sequencers. For now, they have some breathing room, until those express-mail packets of disk begin to arrive at their door.
A final note: the prefixes that follow “Tera” are “Peta” and then “Exa.” Start getting used to them. These machines are not the pinnacle of a technology, but the first iteration of the next generation. Those that follow will be cheaper, faster, and more accurate. We’ll be talking about data woes for many years to come.
Chris Dwan can be reached at firstname.lastname@example.org.
Photo by Mark Gabrenya