Researchers are enthralled by the astonishing throughput and possibilities of next-generation DNA sequencing platforms. But how are they going to manage the torrents of data these prodigious instruments produce?
By Kevin Davies
April 1, 2008 | On March 6, the employees of Expression Analysis (EA) in Durham, North Carolina, took delivery of the first-ever commercial single-molecule DNA sequencing instrument, developed by Helicos BioSciences. Understandably, there was a lot of excitement when "that baby rolled through the door," says EA director of marketing Karen Michaelo.
The HeliScope weighs a little under 2000 pounds, and comes with a separate 32-CPU Dell PowerEdge server stack and a whopping 24 Terabytes (TB) of storage, which will be installed at EA by a pair of Helicos employees. "We're going to skin our knees for a while," says Michaelo, who anticipates offering sequencing services from the platform this summer.
During a tour of the Helicos production floor in Cambridge, Mass., senior VP of product research and development Bill Efcavitch stressed the platform's processing prowess. Customers are going to need it. Today, the HeliScope produces 25 Mb of sequence per hour, but Efcavitch predicts a greater than tenfold increase within the next two years, putting the $1,000 genome firmly within reach.
The challenge facing next-generation platform users is how to manage the data glut pouring forth from 454, Illumina, and Applied Biosystems (ABI) instruments and the like. "Fundamentally what the community suffers from is, there's no best practices guide for setting up a lab with this equipment," says Geospiza founder Todd Smith.
For example, Smith estimates an Illumina Genome Analyzer produces 1500 times more data per run than an ABI 3730 capillary instrument. Even the major genome centers are scrambling. At the Broad Institute, a warehouse of 100 ABI 3730s produced 60 Gigabases (billion bases) of sequence last year, estimates director of informatics for sequencing platforms Toby Bloom. The institute's 20 Illumina instruments currently generate 20 Gigabases per week, and this could double in the near future.
"There's not even enough local storage on the data collection machine to complete a single paired end run, in the way we're using them, which means we have to be much more aggressive about moving data through the pipeline. So we've shifted to a pull model rather than a push model for moving the data along," says Bloom's colleague Matthew Trunnell, group leader of application and production systems. (See, "A Broad View")
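The pull model Trunnell describes can be sketched roughly as follows: instead of each instrument pushing files downstream, a central mover polls the instrument's staging area and pulls off completed runs, reclaiming local disk as it goes. This is an illustrative sketch, not the Broad's pipeline; the `run.done` sentinel marker is a hypothetical convention.

```python
import shutil
from pathlib import Path

def pull_completed_runs(instrument_dir: Path, central_dir: Path) -> list:
    """Pull model: the central pipeline polls the instrument's staging
    area and copies off any run marked complete, then deletes the local
    copy to free the instrument's limited storage. A run directory is
    considered complete when it contains a sentinel file (here
    'run.done' -- the marker name is hypothetical)."""
    moved = []
    for run in sorted(instrument_dir.iterdir()):
        if run.is_dir() and (run / "run.done").exists():
            dest = central_dir / run.name
            if not dest.exists():
                shutil.copytree(run, dest)   # pull to central storage
                shutil.rmtree(run)           # reclaim instrument disk
                moved.append(run.name)
    return moved
```

Run periodically (e.g. from cron), the poller keeps the instrument's disk clear without the instrument ever needing to know where the data goes next.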
A run on ABI's SOLiD instrument produces about two Gigabases, says ABI's Michael Rhodes, sounding somewhat awestruck. "I don't think I've really seen anyone, on their first time with [SOLiD], who has not been somewhat overwhelmed by that amount of data. They may have rationally dealt with it in their head, but there really are challenges in just moving that amount of data around." Rhodes' team sometimes resorts to copying data onto hard drives to ship between sites.
"A lot of smaller [organizations] might have only a few next-generation instruments, but they think about the science primarily," says BlueArc research markets director, James Reaney. "They have no ability to handle data of this size easily."
SGI's global higher education and research solutions manager, Deepak Thakkar, hears scientists complain about having to offload terabytes of data before they can start a new run. Says Thakkar: "Even if I don't keep 90% of the data, I still need to do the right thing before I decide to trash it. I don't need to keep every 5 or 10 TB, but I need to know what to keep!"
"The data glut is a huge problem," agrees Steven Salzberg, a bioinformatician at the University of Maryland. Salzberg says his lab (using Illumina instruments) "keeps the sequences and quality values, and throws away pretty much everything else almost right away. It's cheaper to re-sequence than to store the raw data for the long term."
With hundreds of next-gen instruments deployed in the past one to two years, and more platforms on the horizon, the deluge is just beginning.
Illumina CIO Scott Kahn says the Genome Analyzer ships with "an instrument control PC that has adequate storage to collect all the images, so at the end of the run, they can be transferred and then processed with an offline pipeline that runs across a Linux cluster."
Kahn characterizes three broad groups of users. The most proficient customers, including genome centers such as the Broad, Wellcome Trust Sanger Institute, and Washington University, transfer data in real time as the runs are proceeding to gauge run quality and determine whether additional cycles are desirable. A second group of "quite sophisticated" users "will use mechanisms that we've provided to transfer data off the machine after the run has completed." The third camp doesn't have dedicated bioinformatics resources, and puts a premium on ease of use.
ABI offers a 12-core server with its SOLiD instrument providing 9 TB of storage, which holds the images during the analysis and some of the results. "When you do the next run, you get rid of the pictures," says Rhodes. Problem solved!
Helicos' Efcavitch is realistic about the data handling challenges. "The amount of image data we're collecting is staggering," he admits. During the data acquisition phase, lasting a week or more, the HeliScope Analysis Engine cluster processes the image data in real time down to a sequence file. "To store that data would be just cost prohibitive. So we take the image data, we process it so all we're saving is 1 percent of the image data for diagnostic purposes and a sequence file."
The server stack can hold data from two separate runs. "We put enough room to save one run, start it up, transfer the data out, now you're ready for a new run," says Efcavitch. With some platforms, "you can't transfer off all the data, so they save all the data on a hard disk. You have to transfer that off [and] do that image processing."
Efcavitch says Helicos intends to make its software open source. Meanwhile, his colleague Kristen Stoops, director of informatics business development, is building a "Helicos IT pipeline" for customers. "It does require some effort on [the users'] part to think about who is going to use the data, and how it needs to be moved through various levels of access and storage," she says. (See, "Helicos' IT Pipeline")
Helping users do that is a growing number of software firms and consultancies, which see an ever-broadening niche to exploit.
Todd Smith's company, Geospiza, which builds IT infrastructures for core laboratories that provide centralized genomics and proteomics services, releases its new platform for next-gen labs this month and hopes to bring its first customers online shortly thereafter.
Core labs have to recognize that data management is as much for their users as for themselves, says Smith. "Traditionally, core labs would send data or the researchers would download the sequence data, and use desktop software to analyze the data," says Smith. "Next-gen changes that. The first thing the core lab experiences is: How are we going to get the data to our researchers? How will they access the analysis tools and CPU clusters needed to consume the data we give them?"
And then there's the metadata. "Information about images, redundancy in datasets scattered in directories," says Smith. "So there's a lot of complexity within the data that people are trying to sort out."
Managing next-gen data is about collection and distribution, Smith reckons. "You have to be able to correlate the runs with different samples. One researcher might have one sample, another four samples." A next-gen instrument can "spit out a bunch of files. I need to link those files to the run and to the sample, and then make the data available to my end researcher."
Geospiza's FinchLab, "Next-Gen Edition," will be delivered as a hosted service (SaaS) or as a "drop-in" operational bundle that includes a LIMS for collecting the data, a Dell server, and a 7 TB Isilon clustered storage (scalable to 1500 TB) to make the data accessible. Seven TB may suffice for a year or so. "Once [users] get good at things and start getting creative, they start looking at 100 TB," says Smith.
The LIMS maps every step and order to a specific workflow. "You can't just park data on the instruments," says Smith. "Those files need to be moved to a central server, because the instruments have 10-13 TB storage, but when you think of all the image files to be processed, that storage is used up," says Smith. "The computers are only for the data processing, they're not part of the data management."
Smith advises moving the data to a data management system once, granting customers access through a web interface to download data if necessary. "Once the data is in the data management system, cloud computing can play a big role in moving the algorithms to the data. We believe this will be far more practical and cost effective for researchers - that's our goal," says Smith.
For Canada's GenoLogics, there is an opportunity to find new users for its Geneus lab and data management system, which is primarily deployed in gene expression and genotyping settings. James DeGreef, VP market strategy, says that many of his customers have purchased or are purchasing next-gen systems. "We took our core system and built it out to handle all the LIMS aspects of next-gen sequencing capabilities," says DeGreef. The goal is also to support the bioinformatics side of next-gen sequencing.
GenoLogics is currently developing its resource with the University of Pittsburgh core lab and the University of British Columbia Genome Sciences Center. DeGreef anticipates a full release later this year.
On a Quest
Geospiza recently joined ABI's next-generation software consortium. "We want to enable researchers to do a lot of downstream analysis," says ABI's Roger Canales, a senior program manager. "We provide [vendors] with data file formats and information about how to handle and process the data, to facilitate the development of these software tools."
Another member of the consortium is GenomeQuest. "Current IT architectures and components cannot be easily adapted to process the volume and scale of data," says president and CEO, Ron Ranauro. "The consequence is that next-generation sequencing is [otherwise] limited to a few leading organizations with the most advanced bioinformatics staffs."
Ranauro's company (see 2007 story) has spent several years developing a platform for biopharma customers to take advantage of the next-gen revolution. "Just as Google did for indexing the web, GenomeQuest assessed the existing IT components available for indexing and searching sequence data. The only way to achieve scalability and performance was to start from scratch," Ranauro says.
The GenomeQuest system manages the flow of data across large computer networks so that any one computer only operates on a small part of the search, while the central system coordinates the flow of algorithms and data and collates results. "The benefits are infinite, linear scaling of computation and fully managed data resources," says Ranauro. "We shorten the time it takes to turn next-gen sequence data into biological meaning."
The two main classes of workflows are reference-guided assemblies (or variant analysis) and all-against-all (for metagenomics and transcriptomics applications), which can be adapted to specific customer needs. "The system is open to allow scripted access or web linking," says Ranauro. "Either way, access to large scale computational and data management resources are completely virtualized. An all-against-all, metagenomics workflow takes about ten lines of code in our system."
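The scatter-gather pattern Ranauro describes, in which any one computer operates on only a small part of the search while a coordinator collates the results, can be sketched as below. This is a toy illustration, not GenomeQuest's implementation: the "search" is a simple substring test standing in for a real alignment algorithm, and a thread pool stands in for a cluster of networked machines.

```python
from concurrent.futures import ThreadPoolExecutor

def search_chunk(args):
    """Worker: match one chunk of reads against the reference.
    An exact-substring test stands in for a real aligner."""
    chunk, reference = args
    return [read for read in chunk if read in reference]

def distributed_search(reads, reference, workers=4):
    """Coordinator: scatter the reads across workers, gather the hits.
    Because the chunks are independent, computation scales roughly
    linearly with the number of workers."""
    size = max(1, len(reads) // workers)
    chunks = [reads[i:i + size] for i in range(0, len(reads), size)]
    hits = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(search_chunk,
                                [(c, reference) for c in chunks]):
            hits.extend(partial)
    return hits
```

In a production system the workers would be separate machines and the coordinator would also manage data placement, but the division of labor is the same.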
The BioTeam's managing partner, Stan Gloss, sees the next-gen field becoming 50% of his business in the next six months. "The marketplace is moving that quickly," he says. (See, "Next-Generation Sequencing Solutions," Bio•IT World, October 2007). BioTeam senior consultant Michael Cariaso says next-gen instruments provide basic software processes culminating in a directory of files, but "that's pretty much where every vendor is going to leave you. The more they do, the less likely it is to fit the way your lab works with data. Two next-gen machines sitting side-by-side have no knowledge of each other, and the vendor software does little to improve that."
And so BioTeam has developed a Wiki Laboratory Information Management System (see p. 12), which resembles a mediaWiki installation. Explains Cariaso: "Every time a [sequencing] run finishes, the same way it might write to a log file, it also logs the results into the Wiki. As we do that for multiple platforms, it becomes a one-stop shop for all devices as well as the pipelines you have in house. It quickly becomes an umbrella over every other workflow process and data source within the facility."
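The post-run hook Cariaso describes, where an instrument logs its results into the wiki the way it might write to a log file, might look something like this sketch, which renders a finished run's summary as MediaWiki table markup ready to be appended to a run page. The field names (`run_id`, `sample`, `mean_quality`, and so on) are illustrative, not WikiLIMS's actual schema.

```python
def run_to_wikitable(run):
    """Render one sequencing run's summary as MediaWiki table markup,
    suitable for logging to the instrument's wiki page after each run.
    Expects a dict with 'run_id', 'platform', and a 'results' list of
    per-sample dicts (hypothetical field names for illustration)."""
    rows = "\n".join(
        f"|-\n| {r['sample']} || {r['reads']:,} || {r['mean_quality']:.1f}"
        for r in run["results"]
    )
    return (
        f"== Run {run['run_id']} ({run['platform']}) ==\n"
        '{| class="wikitable"\n'
        "! Sample !! Reads !! Mean quality\n"
        f"{rows}\n"
        "|}"
    )
```

Doing the same for each platform in the lab is what turns the wiki into the "one-stop shop" Cariaso describes.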
BioTeam has installed WikiLIMS with groups including Dick McCombie's team at Cold Spring Harbor Laboratory and Tim Read at the Naval Medical Research Center in Silver Spring, Maryland.
A researcher working on a particular project "hits one button on the wiki, [and] it says, here are all the microarrays and next-gen runs you've done for this. It will build a nice table, with quality scores, matrices, graphs... It can look up projects and look across timelines. New data arrives automatically as it is generated. This provides [a] centralized and up-to-date view of what's really happening in your lab."
The WikiLIMS also provides the ability to integrate CGI scripting into the wiki. For example, Cariaso says the 454 software for base calling is good, but assembling multiple runs into a consensus requires someone to work at the UNIX command line. "We can come instead and say, 'Here's your project page and a list of all the runs you've done. Check which runs you want, hit the big red button, and launch the assembly, store the results back in the Wiki.' We can figure out the workflows in that environment and make them a single-button press."
Minding the Store
The rise of next-gen sequencing applications is fueling rabid demand for new storage solutions. (See, "Isilon Insights") Illumina's Kahn says the choice "is typically up to the specific environment, how coupled it is to Linux or Windows, the price point, the amount of storage needed, and the retention policies."
BlueArc's James Reaney sees the three key issues as storage bandwidth, computational analysis, and data retention/migration policies. "The architecture must be of robust enough design to solve all three bottlenecks," says Reaney.
As the Wellcome Trust Sanger Institute and Washington University in St. Louis can attest, BlueArc's Titan network storage platform provides workflow parallelization and easy upgrades. The new Titan 3200 platform doubles the storage of its predecessor (see p. 49), with 20 Gbps of fully bi-directional, non-blocking storage bandwidth. Reaney says the bladed modular architecture uses tiered storage, which can be configured to suit the user, offering a range of solid-state, fibre channel, and SATA disk, plus tape backup.
"The Titan 3200 is easily the performance leader," says Reaney. "Combined with a modular, upgradable architecture and now a maximum capacity of four usable petabytes, the Titan platform is also the most future-proof storage platform one could have."
According to SGI's Thakkar, most of SGI's deployment is with individual sequencing labs and pharma companies, rather than genome centers. Thakkar explains: "[Pharma] buys something for a particular program or application. Extra capacity is not something they usually like to keep on hand. They don't want to incur extra overheads."
SGI works with customers in three major areas. The first is a bioinformatics appliance that accelerates genomics and proteomics applications. The second is high-performance compute power: hybrid computing with shared-memory systems linked to clusters, where SGI can take sequencing information offline and use its large memory systems to compare against known databases. Thakkar says platform vendors would benefit if there was a "faster, more streamlined" way to take the data offline for high-throughput computational analysis.
But SGI's "mainstay is raw storage," says Thakkar. This could be three-tiered storage or storage to fit the needs of the data being produced. "Analysis is key to what you end up re-running: Do I go back and rerun the experiment, or do I just go ahead and save it and run another experiment? Having intuitive and efficient analysis systems is key," says Thakkar.
Thakkar anticipates more SOLiD installations this year, especially as users contemplate replacing some of their capillary instruments. He adds that "454 has done an excellent job of figuring out the entire workflow for the customer, maybe because they've been doing this for a bit longer. They've got many of the kinks figured out, especially on the storage end."
Asked about her biggest headaches, the Broad's Toby Bloom cites fault tolerance as the biggest problem. "We've always had a notion of a) storing our data forever and b) having multiple places in our pipeline we could fall back to if we lost something. If something went down, we could queue up data behind it until the component was fixed. If something got corrupted, there was always a fallback going to the previous step. We can't do that anymore... We can't even afford to keep everything backed up on tape! All of that is the scariest part of all of this."
"As fast as Moore's Law is working to support us, we're still eclipsing it in our ability to generate sequence data," says Helicos' Stoops. Efcavitch, her colleague, hopes it stays that way. "With a simple improvement to our efficiency and error rate, we'll be at 90 Mb/hr, with the same hardware, simply changing the chemistry," he says. "And by increasing the density of molecules on the surface, we'll be at 360 Mb/hr, with the same imaging hardware." All within two years.
There could be an awful lot of drenched scientists and skinned knees by then.
This article appeared in Bio-IT World Magazine.