Of Sequences and Sneakernets


By Chris Dwan
June 13, 2007 | It is all but passé to mention the flood of primary laboratory data that has shaped bioinformatics over the past several decades. First automated sequencing, then microarrays, and now next-generation sequencers have pushed the capacities of both storage and computational systems. Exponential growth has characterized the public sequence repositories for so long that I rarely use my tired “sequences stored in NCBI over time” slide anymore. Everybody already knows the story.

It is with some hesitation, therefore, that I wade into another column about yet another flood of data. This one is interesting, however, because it will not come from the large sequencing centers, but from a large number of smaller labs as they start to use next-generation sequencing machines. Over the past few months, I’ve spent time working with some early adopters of the technology: Timothy Read’s group at the Naval Medical Research Center in Rockville, Maryland. What follows are some observations based on that experience.

For the past three decades, the vast majority of DNA sequences were read using variants of Sanger sequencing. Clever engineering and chemistry pushed the data generated by a single run of a sequencing machine from hundreds of bases to around 1,500 bases. Even so, sequencing even the smallest genomes incurred a huge burden of lab work: scientists had to prepare thousands of samples and analyze thousands of independent result files.

Next-generation sequencing machines (see “The Drive for the $1000 Genome,” Bio•IT World, May 2007) use completely different biochemical technologies and do not appear to suffer the same limitation. A single instrument run produces information on hundreds of millions of base pairs. For many prokaryotes, this can remove entirely the requirement of cloning out fragment libraries and performing a final assembly: a single machine run can read every base pair of the genome many times over. For larger genomes, multiple runs are still necessary, but perhaps 100,000-fold fewer. As always, there are certain interesting exceptions. That said, with an initial investment of several hundred thousand dollars in a sequencing machine, plus costs of between $5,000 and $10,000 in reagents per run, a small lab could produce unprecedented amounts of data.

Read’s group plans to sequence several strains of pathogenic microbes per week, every week, for the next several years. They will do this with a lab staff of fewer than 10 people on a pair of FLX machines from Roche (formerly 454). A single sample run on these devices requires approximately 7 hours, and produces between 20 and 30 GB of primary image data. This renders into about 100 megabases (Mb) of sequence per run. (I have not worked with the Illumina 1G Analyzer yet, but it produces about 1 TB of primary image data over a 3-day run, yielding several hundred million base pairs of sequence.)
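
To put the FLX figures above in proportion, here is a back-of-the-envelope sketch of how many bytes of primary image data stand behind each base of sequence. It assumes decimal units (1 GB = 10^9 bytes) and reads “Mb” as megabases; the function name is mine, for illustration only.

```python
# Back-of-the-envelope: bytes of primary image data per base of sequence.
# Assumptions (not from any vendor specification): decimal units,
# 1 GB = 10**9 bytes, and "100 Mb" means 100 million bases.

GB = 10**9


def image_bytes_per_base(image_gb, bases):
    """Return bytes of raw image data produced per base of sequence."""
    return image_gb * GB / bases


if __name__ == "__main__":
    bases_per_run = 100_000_000  # about 100 Mb of sequence per FLX run
    for image_gb in (20, 30):    # 20 to 30 GB of image data per run
        ratio = image_bytes_per_base(image_gb, bases_per_run)
        print(f"{image_gb} GB per run: about {ratio:.0f} bytes of image per base")
```

By this rough estimate, each base carries some 200 to 300 bytes of image behind it, which is why the primary data, rather than the sequence itself, drives the storage planning.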

Those familiar with chromatograms and microarray images will recognize the split between “primary instrument data” and “finished sequence.” The primary data, far larger than the refined sequence derived from it, will rapidly become unmanageable. Industry groups are already proposing data format and experimental “metadata” standards, which may or may not be applicable as researchers begin using these devices in a production capacity. But whether it is 1 TB per day or every couple of weeks, these devices are going to push the limits of data handling for smaller labs.

Shifting Buckets, Catching Leaks
Read’s lab is planning to capture 0.5 TB of raw image data per month over the next 12 months. This isn’t unreasonable with their current data storage infrastructure, though it’s certainly beyond the “FireWire disks on shelves” solution I’ve seen elsewhere. Initially, we plan to keep both raw data and finished sequence on disk. We will then re-assess what must be kept and what can be archived to tape. Larger genome centers don’t have this option, as they are starting off with unmanageable volumes today.
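
A small sketch of the arithmetic behind that plan follows. The monthly capture rate and the per-run image sizes come from the figures above; the runs-per-month number is my own inference from combining them, not something the lab has stated, and decimal units (1 TB = 1000 GB) are assumed.

```python
# Rough storage projection for a lab capturing a fixed volume each month.
# Assumption: decimal units, 1 TB = 1000 GB. Function names are illustrative.

GB_PER_TB = 1000


def cumulative_tb(tb_per_month, months):
    """Return the running total of stored data, in TB, at the end of each month."""
    return [tb_per_month * month for month in range(1, months + 1)]


def runs_per_month(tb_per_month, gb_per_run):
    """Return how many instrument runs a monthly data volume corresponds to."""
    return tb_per_month * GB_PER_TB / gb_per_run


if __name__ == "__main__":
    totals = cumulative_tb(0.5, 12)
    print(f"Raw image data on disk after 12 months: {totals[-1]:.1f} TB")
    for gb_per_run in (20, 30):
        runs = runs_per_month(0.5, gb_per_run)
        print(f"At {gb_per_run} GB per run: about {runs:.0f} runs per month")
```

The point of writing it down is less the answer than the habit: once the rate changes, the projection can be rerun in seconds.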

We are also considering how to manage the data workflow from sample preparation through comparative genomic presentation. Questions formerly belonging to the big sequencing centers are now pressing problems in smaller labs. Keeping an organized lab notebook, on paper, is insufficient to support a production pipeline of this sort. We’re developing a web-based “dashboard” to allow everyone to understand the progress and provenance of data as it arrives at their door. This raises the question of the pipeline itself, be it scripts, “smart folders,” or some more glitzy workflow solution. At the very least, all data motion and standard processing tasks must be automated to prevent the newly capable lab from burying the informatics staff under a rising tide of trivial chores.
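
As an illustration of what automating data motion might look like at the “scripts” end of that spectrum, here is a minimal sketch that copies a run directory and verifies every file by checksum. It is not the pipeline we are building; the function names and the idea of returning a checksum manifest as a crude provenance record are my own assumptions, and only the Python standard library is used.

```python
# Minimal sketch: copy one instrument run to storage and verify each file.
# All names here are hypothetical; this is not the lab's actual pipeline.
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_run(source_dir, dest_dir):
    """Copy every file under source_dir into dest_dir, checking each copy.

    Returns a dict mapping relative file path to SHA-256 digest, which can
    be saved alongside the data as a simple record of what was moved.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    manifest = {}
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = dest_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies contents and file timestamps
        checksum = sha256_of(src)
        if sha256_of(dst) != checksum:
            raise IOError(f"checksum mismatch after copying {rel}")
        manifest[str(rel)] = checksum
    return manifest
```

A call such as `copy_run("/instrument/run_001", "/archive/run_001")` (both paths hypothetical) could be scheduled to fire when a run completes, so that nobody has to remember to drag folders around.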

The term “Sneakernet” was coined decades ago for the network of athletic-shoe-clad staff carrying data on physical media from building to building and coast to coast. This practice is now being adopted on research campuses, which suddenly need to move a terabyte here, a terabyte there, out of the lab. As the old saying goes: “Never underestimate the bandwidth of a station wagon filled with disk drives.”
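
The saying holds up to arithmetic. The sketch below compares the effective throughput of physically carrying data with the time needed to push the same data over a network link. The 24-hour transit time and the 100 Mbit/s link are hypothetical values chosen for illustration, units are decimal (1 TB = 10^12 bytes), and the time spent copying data on and off the disks is ignored.

```python
# Sneakernet versus network, as a rough comparison.
# Assumptions: decimal units (1 TB = 10**12 bytes); the transit time and
# link speed used below are illustrative, not measurements.

TB = 10**12  # bytes


def sneakernet_mbit_per_s(terabytes, transit_hours):
    """Effective throughput, in megabits per second, of carrying data by hand."""
    bits = terabytes * TB * 8
    return bits / (transit_hours * 3600) / 1e6


def network_hours(terabytes, link_mbit_per_s):
    """Hours to send the same data over a link running flat out at that rate."""
    bits = terabytes * TB * 8
    return bits / (link_mbit_per_s * 1e6) / 3600


if __name__ == "__main__":
    print(f"1 TB shipped in 24 hours: {sneakernet_mbit_per_s(1, 24):.1f} Mbit/s")
    print(f"1 TB over a 100 Mbit/s link: {network_hours(1, 100):.1f} hours")
```

Under these assumptions, a single terabyte shipped overnight works out to roughly 93 Mbit/s sustained, and the figure scales with every additional disk in the box, which a shared campus network link does not.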

Sustainable Growth
The core problem is not that disks are too small or that network pipes are too narrow. Slapping a technological fix (expensive switches and a few petabytes of high-end storage in the central facility, for example) on this problem would miss the point entirely. In order for a system to grow in a sustainable manner, users must build locally sustainable islands of functionality. This means bringing the storage and the computing capacity to the lab, rather than shuttling the data off to central repositories. This will probably entail either a small cluster or a multi-core machine dedicated to each sequencer, as well as tens of terabytes of disk storage closely paired with the lab equipment.

This shift to local storage and processing will require changes in organizational roles at many levels. The standard institutional answers won’t suffice for a lab that produces a terabyte of data per day, especially if the problem expands to every lab on campus. Researchers will need to take an active role in communicating their needs to central IT groups, and in listening to the constraints on available IT solutions. Both sides should be compassionate through this process, since it is about to become very easy to demand the impossible.

Major sequencing centers are examining the cost tradeoffs involved in analyzing and then discarding the primary image data, rather than trying to archive it all in a constantly accessible form. It may well be both cheaper and more effective to re-sequence samples of interest, rather than invest in petabyte-scale network storage. My apologies to any experimentalists in shock at the thought of discarding any of their precious primary data! Caution is certainly necessary, as is consideration of all the options. It’s all part of the brave new world of next-generation sequencing.

And so we await with interest the reaction of the traditional sequence data repositories to the challenges of next-generation sequencers. For now, they have some breathing room, until those express-mail packages of disks begin to arrive at their door.

A final note: The prefixes that follow “Tera” are “Peta,” then “Exa.” Start getting used to them. These machines are not the pinnacle of a technology, but the first iteration of the next generation. Those that follow will be cheaper, faster, and more accurate. We’ll be talking about data woes for many years to come.

Chris Dwan can be reached at cdwan@bioteam.net.

Photo by Mark Gabrenya
