Of Sequences and Sneakernets



By Chris Dwan
June 13, 2007 | It is all but passé to mention the flood of primary laboratory data that has shaped bioinformatics over the past several decades. First automated sequencing, then microarrays, and now next-generation sequencers have pushed the capacities of both storage and computational systems. Exponential growth has characterized the public sequence repositories for so long that I rarely use my tired “sequences stored in NCBI over time” slide anymore. Everybody already knows the story.

It is with some hesitation, therefore, that I wade into another column about yet another flood of data. This one is interesting, however, because it will not come from the large sequencing centers, but from a large number of smaller labs as they start to use next-generation sequencing machines. Over the past few months, I’ve spent time working with some early adopters of the technology: Timothy Read’s group at the Naval Medical Research Center in Rockville, Maryland. What follows are some observations based on that experience.

For the past three decades, the vast majority of DNA sequences were read using variants of Sanger sequencing. Clever engineering and chemistry pushed the data generated by a single run of a sequencing machine from hundreds of bases to around 1500 bases. Sequencing even the smallest genomes incurred a huge burden of lab work. Moreover, scientists had to prepare thousands of samples and analyze thousands of independent result files.

Next-generation sequencing machines (see “The Drive for the $1000 Genome,” Bio•IT World, May 2007) use completely different biochemical technologies and do not appear to suffer the same limitation. A single instrument run produces information on hundreds of millions of base pairs. For many prokaryotes, this can entirely remove the requirement of cloning out fragment libraries and performing a final assembly: a single machine run can read every base pair of the genome many times over. For larger genomes, multiple runs are still necessary, but perhaps 100,000-fold fewer. As always, there are certain interesting exceptions. That said, with an initial investment of several hundred thousand dollars in a sequencing machine, plus costs of between $5,000 and $10,000 in reagents per run, a small lab could produce unprecedented amounts of data.

Read’s group plans to sequence several strains of pathogenic microbes per week, every week, for the next several years. They will do this with a lab staff of fewer than 10 people on a pair of FLX machines from Roche (formerly 454). A single sample run on these devices requires approximately 7 hours, and produces between 20 and 30 GB of primary image data. This renders into about 100 megabases (Mb) of sequence per run. (I have not worked with the Illumina 1G Analyzer yet, but it produces about 1 TB of primary image data over a 3-day run, yielding several hundred million base pairs of sequence.)
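The FLX figures above imply a rough ratio of raw image data to finished sequence. The following is a minimal Python sketch of that arithmetic, not a measurement. It assumes decimal units (1 GB = 10^9 bytes) and reads “100 Mb” as 100 million bases; the function name is illustrative only.

```python
GB = 10**9  # decimal gigabyte (assumption; the column does not say which unit)


def raw_bytes_per_base(image_bytes, bases):
    """Bytes of primary image data captured per base of finished sequence."""
    return image_bytes / bases


# Figures quoted in the text for one FLX run.
flx_bases = 100 * 10**6  # about 100 million bases per run
low = raw_bytes_per_base(20 * GB, flx_bases)
high = raw_bytes_per_base(30 * GB, flx_bases)

print(f"Raw image bytes per finished base: {low:.0f} to {high:.0f}")
```

By this estimate, each base of finished sequence is backed by roughly 200 to 300 bytes of image data, which is why the primary data dominates any storage plan.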

Those familiar with chromatograms and microarray images will recognize the split between “primary instrument data” and “finished sequence.” The large, primary data will rapidly become unmanageable, relative to the refined sequence data. Industry groups are already proposing data format and experimental “metadata” standards, which may or may not be applicable as researchers begin using these devices in a production capacity. But whether it is 1 TB per day or every couple of weeks, these devices are going to push the limits of data handling for smaller labs.

Shifting Buckets, Catching Leaks
Read’s lab is planning to capture 0.5 TB of raw image data per month over the next 12 months. This isn’t unreasonable with their current data storage infrastructure, though it’s certainly beyond the “FireWire disks on shelves” solution I’ve seen elsewhere. Initially, we plan to keep both raw data and finished sequence on disk. We will then re-assess what must be kept and what can be archived to tape. Larger genome centers don’t have this option, as they are starting off with the unmanageable today.
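As a back-of-envelope check, the monthly budget can be turned into a yearly total and an implied number of instrument runs. This sketch reuses the 20 to 30 GB per-run figure quoted earlier for the FLX, assumes decimal units, and uses hypothetical names throughout.

```python
GB = 10**9   # decimal units (assumption)
TB = 10**12

MONTHLY_RAW_BYTES = 500 * GB  # 0.5 TB of raw image data per month
MONTHS = 12


def yearly_total(monthly_bytes, months):
    """Raw data accumulated if nothing is deleted or moved to tape."""
    return monthly_bytes * months


def runs_per_month(monthly_bytes, bytes_per_run):
    """How many instrument runs a monthly storage budget can absorb."""
    return monthly_bytes / bytes_per_run


total = yearly_total(MONTHLY_RAW_BYTES, MONTHS)
fewest = runs_per_month(MONTHLY_RAW_BYTES, 30 * GB)
most = runs_per_month(MONTHLY_RAW_BYTES, 20 * GB)

print(f"Raw data after {MONTHS} months: {total / TB:.1f} TB")
print(f"Implied runs per month: {fewest:.1f} to {most:.1f}")
```

Under these assumptions the plan amounts to 6 TB of raw images over the year, or somewhere between about 17 and 25 runs per month.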

We are also considering how to manage the data workflow from sample preparation through comparative genomic presentation. Questions formerly belonging to the big sequencing centers are now pressing problems in smaller labs. Keeping an organized lab notebook, on paper, is insufficient to support a production pipeline of this sort. We’re developing a web-based “dashboard” to allow everyone to understand the progress and provenance of data as it arrives at their door. This raises the question of the pipeline itself, be it scripts, “smart folders,” or some more glitzy workflow solution. At the very least, all data motion and standard processing tasks must be automated to prevent the newly capable lab from burying the informatics staff under a rising tide of trivial data motion tasks.
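One way to picture an automated data motion step is a script that copies a finished run into an archive area and verifies every file by checksum, returning a manifest that a dashboard could record as provenance. The sketch below is an illustration only and not the pipeline described here; the directory layout and the names `sha256_of` and `archive_run` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large image files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_run(run_dir, archive_root):
    """Copy one run directory into the archive and verify each copied file.

    Returns a {relative_path: sha256} manifest. Raises IOError on a checksum
    mismatch; the source directory is never modified, so a failed copy can
    simply be retried.
    """
    run_dir = Path(run_dir)
    dest_dir = Path(archive_root) / run_dir.name
    manifest = {}
    for src in sorted(p for p in run_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(run_dir)
        dst = dest_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies contents and timestamps
        src_hash = sha256_of(src)
        if sha256_of(dst) != src_hash:
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = src_hash
    return manifest
```

A call such as `archive_run("/data/incoming/run_001", "/data/archive")` (paths invented for illustration) could be scheduled after each instrument run, with the returned manifest stored alongside the sample record.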

The term “Sneakernet” was coined decades ago for the network of athletic-shoe clad staff carrying data on physical media from building to building and coast to coast. This practice is now being adopted on research campuses, which suddenly need to move a terabyte here, a terabyte there, out of the lab. As the old saying goes: “Never underestimate the bandwidth of a station wagon filled with disk drives.”
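The station wagon comparison is easy to put in numbers. The sketch below compares a terabyte carried on disk with the same terabyte sent over a network link. The 24-hour courier time and the fully available 100 Mbit/s link are assumptions chosen for illustration, not figures from any particular campus.

```python
TB = 10**12  # decimal terabyte (assumption)


def effective_mbps(payload_bytes, transit_hours):
    """Effective bandwidth, in Mbit/s, of physically carrying a payload."""
    return payload_bytes * 8 / (transit_hours * 3600) / 1e6


def transfer_hours(payload_bytes, link_mbps):
    """Hours needed to push a payload through a link at its full rated speed."""
    return payload_bytes * 8 / (link_mbps * 1e6) / 3600


# Hypothetical scenario: one terabyte, overnight courier vs. a 100 Mbit/s link.
courier = effective_mbps(1 * TB, transit_hours=24)
network = transfer_hours(1 * TB, link_mbps=100)

print(f"Courier, 1 TB in 24 h: about {courier:.1f} Mbit/s effective")
print(f"100 Mbit/s link, 1 TB: about {network:.1f} hours at full speed")
```

Under these assumptions a single disk shipped overnight roughly matches a dedicated 100 Mbit/s link running flat out for most of a day, and a box of several disks beats it outright.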

Sustainable Growth
The core problem is not that disks are too small or that network pipes are too narrow. Slapping a technological fix (expensive switches and a few petabytes of high-end storage in the central facility, for example) on this problem would miss the point entirely. In order for a system to grow in a sustainable manner, users must build locally sustainable islands of functionality. This means bringing the storage and the computing capacity to the lab, rather than shuttling the data off to central repositories. This will probably entail either a small cluster or a multi-core machine dedicated to each sequencer, as well as tens of terabytes of disk storage closely paired with the lab equipment.

This shift to local storage and processing will require changes in organizational roles at many levels. The standard institutional answers won’t suffice for a lab that produces terabytes of data per day, especially if the problem expands to every lab on campus. Researchers will need to take an active role in communicating their needs to central IT groups, and listening to the constraints on available IT solutions. Both sides should be compassionate through this process, since it is about to become very easy to demand the impossible.

Major sequencing centers are examining the cost tradeoffs involved in analyzing and then discarding the primary image data, rather than trying to archive it all in a constantly accessible form. It may well be both cheaper and more effective to re-sequence samples of interest, rather than invest in petabyte-scale network storage. My apologies to any experimentalists in shock at the thought of discarding any of their precious primary data! Caution is certainly necessary, as is consideration of all the options. It’s all part of the brave new world of next-generation sequencing.

And so we await with interest the reaction of the traditional sequence data repositories to the challenges of next-generation sequencers. For now, they have some breathing room, until those express-mail packets of disk begin to arrive at their door.

A final note: The prefixes that follow “Tera” are “Peta,” then “Exa.” Start getting used to them. These machines are not the pinnacle of a technology, but the first iteration of the next generation. Those that follow will be cheaper, faster, and more accurate. We’ll be talking about data woes for many years to come.

Chris Dwan can be reached at cdwan@bioteam.net.

Photo by Mark Gabrenya
