Of Sequences and Sneakernets



By Chris Dwan
June 13, 2007 | It is all but passé to mention the flood of primary laboratory data that has shaped bioinformatics over the past several decades. First automated sequencing, then microarrays, and now next-generation sequencers have pushed the capacities of both storage and computational systems. Exponential growth has characterized the public sequence repositories for so long that I rarely use my tired “sequences stored in NCBI over time” slide anymore. Everybody already knows the story.

It is with some hesitation, therefore, that I wade into another column about yet another flood of data. This one is interesting, however, because it will not come from the large sequencing centers, but from a large number of smaller labs as they start to use next-generation sequencing machines. Over the past few months, I’ve spent time working with some early adopters of the technology — Timothy Read’s group at the Navy Medical Research Center in Rockville, Maryland. What follows are some observations based on that experience.

For the past three decades, the vast majority of DNA sequences were read using variants of Sanger sequencing. Clever engineering and chemistry pushed the data generated by a single run of a sequencing machine from hundreds of bases to around 1500 bases. Sequencing even the smallest genomes incurred a huge burden of lab work. Moreover, scientists had to prepare thousands of samples and analyze thousands of independent result files.

Next-generation sequencing machines (see “The Drive for the $1000 Genome,” Bio•IT World, May 2007) use completely different biochemical technologies and do not appear to suffer the same limitation. A single instrument run produces information on hundreds of millions of base pairs. For many prokaryotes, this can remove entirely the requirement of cloning out fragment libraries and performing a final assembly: a single machine run can read every base pair of the genome many times over. For larger genomes, multiple runs are still necessary, but perhaps 100,000-fold fewer. As always, there are certain interesting exceptions. That said, with an initial investment of several hundred thousand dollars in a sequencing machine, plus costs of between $5,000 and $10,000 in reagents per run, a small lab could produce unprecedented amounts of data.

Read’s group plans to sequence several strains of pathogenic microbes per week, every week, for the next several years. They will do this with a lab staff of fewer than 10 people on a pair of FLX machines from Roche (formerly 454). A single sample run on these devices requires approximately 7 hours and produces between 20 and 30 GB of primary image data, which is processed down to about 100 megabases (Mb) of sequence per run. (I have not worked with the Illumina 1G Analyzer yet, but it produces about 1 TB of primary image data over a 3-day run, yielding several hundred million base pairs of sequence.)
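
To put the per-run FLX figures in proportion, here is a rough sketch of the arithmetic in Python. The run time, image volume, and sequence yield are the numbers quoted above; the one-byte-per-base storage cost is an assumption made here for illustration, and the function names are hypothetical.

```python
# Figures quoted in the column for a single Roche FLX run.
RUN_HOURS = 7
IMAGE_GB_LOW, IMAGE_GB_HIGH = 20, 30
SEQUENCE_MEGABASES = 100

# Assumption (not from the column): finished sequence is stored as
# plain text at one byte per base, with no quality scores or headers.
BYTES_PER_BASE = 1


def reduction_ratio(image_gb, megabases, bytes_per_base=BYTES_PER_BASE):
    """Return how many times larger the image data is than the sequence."""
    image_bytes = image_gb * 1e9
    sequence_bytes = megabases * 1e6 * bytes_per_base
    return image_bytes / sequence_bytes


def sustained_mb_per_second(image_gb, hours):
    """Average rate at which image data accumulates during a run, in MB/s."""
    return image_gb * 1e3 / (hours * 3600)


if __name__ == "__main__":
    for image_gb in (IMAGE_GB_LOW, IMAGE_GB_HIGH):
        ratio = reduction_ratio(image_gb, SEQUENCE_MEGABASES)
        rate = sustained_mb_per_second(image_gb, RUN_HOURS)
        print(f"{image_gb} GB of images: {ratio:.0f}x the sequence, {rate:.2f} MB/s")
```

Under that assumption the image data is a few hundred times the size of the sequence it yields, while the average write rate during a run is modest. The burden is the accumulated volume rather than the instantaneous throughput.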

Those familiar with chromatograms and microarray images will recognize the split between “primary instrument data” and “finished sequence.” The large, primary data will rapidly become unmanageable, relative to the refined sequence data. Industry groups are already proposing data format and experimental “metadata” standards, which may or may not be applicable as researchers begin using these devices in a production capacity. But whether it is 1 TB per day or every couple of weeks, these devices are going to push the limits of data handling for smaller labs.

Shifting Buckets, Catching Leaks
Read’s lab is planning to capture 0.5 TB of raw image data per month over the next 12 months. This isn’t unreasonable with their current data storage infrastructure, though it’s certainly beyond the “FireWire disks on shelves” solution I’ve seen elsewhere. Initially, we plan to keep both raw data and finished sequence on disk. We will then re-assess what must be kept and what can be archived to tape. Larger genome centers don’t have this option, as they are starting off with the unmanageable today.

We are also considering how to manage the data workflow from sample preparation through comparative genomic presentation. Questions formerly belonging to the big sequencing centers are now pressing problems in smaller labs. Keeping an organized lab notebook, on paper, is insufficient to support a production pipeline of this sort. We’re developing a web-based “dashboard” to allow everyone to understand the progress and provenance of data as it arrives at their door. This raises the question of the pipeline itself, be it scripts, “smart folders,” or some glitzier workflow solution. At the very least, all data motion and standard processing tasks must be automated to prevent the newly capable lab from burying the informatics staff under a rising tide of trivial data motion tasks.
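
As one illustration of what automating a trivial data motion task might look like, the sketch below copies the files from a single instrument run into an archive directory and verifies each copy by checksum. It is not the pipeline used in Read’s lab. The directory layout and the function names (`sha256_of`, `archive_run`) are hypothetical, and it uses only the Python standard library.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large image files never sit fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_run(run_dir, archive_root):
    """Copy every file of one run into archive_root and verify each copy.

    Returns a manifest mapping relative file paths to SHA-256 digests.
    Raises IOError if any copy does not match its source.
    """
    run_dir = Path(run_dir)
    target_dir = Path(archive_root) / run_dir.name
    manifest = {}
    for source in sorted(p for p in run_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(run_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy contents and timestamps
        source_hash = sha256_of(source)
        if sha256_of(target) != source_hash:
            raise IOError(f"checksum mismatch for {relative}")
        manifest[relative.as_posix()] = source_hash
    return manifest
```

The returned manifest could be written alongside the archived run, giving a dashboard something concrete to report about provenance. A production version would also need to handle partially written runs, retries, and logging, which are left out here.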

The term “Sneakernet” was coined decades ago for the network of athletic-shoe clad staff carrying data on physical media from building to building and coast to coast. This practice is now being adopted on research campuses, which suddenly need to move a terabyte here, a terabyte there, out of the lab. As the old saying goes: “Never underestimate the bandwidth of a station wagon filled with disk drives.”
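
The old saying is easy to check with a little arithmetic. The sketch below compares sending a terabyte over a network link with carrying it on disk. The 100 Mbit/s link speed and the 24-hour courier time used in the example are assumptions chosen for illustration, not measurements, and the function names are hypothetical.

```python
def network_hours(terabytes, megabits_per_second):
    """Hours needed to send the data over a link at a sustained rate."""
    bits = terabytes * 1e12 * 8
    return bits / (megabits_per_second * 1e6) / 3600


def sneakernet_megabits_per_second(terabytes, transit_hours):
    """Effective bandwidth of physically carrying the data, in Mbit/s."""
    bits = terabytes * 1e12 * 8
    return bits / (transit_hours * 3600) / 1e6


if __name__ == "__main__":
    # Assumed values: a 100 Mbit/s link and an overnight (24-hour) courier.
    print(f"1 TB over the wire: {network_hours(1, 100):.1f} hours")
    print(f"1 TB by courier: {sneakernet_megabits_per_second(1, 24):.1f} Mbit/s")
```

With those assumed values, one terabyte takes roughly 22 hours on a fully saturated link, and an overnight package delivers about the same effective rate. The package scales to many disks at no extra transit time, although the sketch ignores the time needed to copy data onto and off the drives.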

Sustainable Growth
The core problem is not that disks are too small or that network pipes are too narrow. Slapping a technological fix (expensive switches and a few petabytes of high-end storage in the central facility, for example) on this problem would miss the point entirely. In order for a system to grow in a sustainable manner, users must build locally sustainable islands of functionality. This means bringing the storage and the computing capacity to the lab, rather than shuttling the data off to central repositories. This will probably entail either a small cluster or a multi-core machine dedicated to each sequencer, as well as tens of terabytes of disk storage closely paired with the lab equipment.

This shift to local storage and processing will require changes in organizational roles at many levels. The standard institutional answers won’t suffice for a lab that produces a terabyte of data per day, especially if the problem expands to every lab on campus. Researchers will need to take an active role in communicating their needs to central IT groups, and in listening to the constraints on available IT solutions. Both sides should be compassionate through this process, since it is about to become very easy to demand the impossible.

Major sequencing centers are examining the cost tradeoffs involved in analyzing and then discarding the primary image data, rather than trying to archive it all in a constantly accessible form. It may well be both cheaper and more effective to re-sequence samples of interest, rather than invest in petabyte-scale network storage. My apologies to any experimentalists in shock at the thought of discarding any of their precious primary data! Caution is certainly necessary, as is a consideration of all the options. It’s all part of the brave new world of next-generation sequencing.

And so we await with interest the reaction of the traditional sequence data repositories to the challenges of next-generation sequencers. For now, they have some breathing room, until those express-mail packets of disk begin to arrive at their door.

A final note: The prefixes that follow “Tera” are “Peta,” then “Exa.” Start getting used to them. These machines are not the pinnacle of a technology, but the first iteration of the next generation. Those that follow will be cheaper, faster, and more accurate. We’ll be talking about data woes for many years to come.

Chris Dwan can be reached at cdwan@bioteam.net.

Photo by Mark Gabrenya
