April 1, 2008 | Kristen Stoops, Helicos' director of informatics business development, is building a federation of IT vendors to identify best practices that will help users manage the torrents of data they will be generating. Here she offers a glimpse of how those plans are progressing.
Bio•IT World: Why does the HeliScope require such an impressive computer server/Analysis Engine?
Stoops: Number one, the HeliScope Single Molecule Sequencer is very image-intensive in its technology for generating sequences. We'll produce about 5 terabytes of image data per day, which presents a daunting challenge both for storage and for moving the data from the instrument to external storage. So one of the things we're doing is limiting the amount of data our customers themselves have to store...
[Two,] the HeliScope Analysis Engine is a very high-performance image analysis platform and server that does the image analysis on the fly (and will delete those images as part of the process), and stores a digital representation of those images in an object table, which is roughly 1/10 the size of the full image stack. Every object in every image is represented in the object table... We strip out all the background; we're not saving that, but we are saving every object that we detect. Then, when we do our base calling, we apply some smart algorithms to figure out whether something is an artifact or a real base. From there, we go ahead and form actual sequence data...
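The idea behind the object table can be sketched as follows. This is a hypothetical illustration only — the field names, threshold, and structure are not Helicos' actual schema — but it shows how keeping detected objects instead of raw pixels shrinks the data while preserving what the base caller needs:

```python
# Hypothetical sketch: store detected objects, not raw images.
# The background is discarded; only objects survive into the table.
# All field names and values here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DetectedObject:
    field_of_view: int   # which camera field the object was seen in
    x: float             # object position within the image
    y: float
    channel: str         # imaging channel (e.g. one nucleotide's dye)
    intensity: float     # measured signal, used later by the base caller

# One multi-megapixel image becomes a short list of objects:
objects = [
    DetectedObject(field_of_view=1, x=102.5, y=88.0, channel="A", intensity=950.0),
    DetectedObject(field_of_view=1, x=340.2, y=17.6, channel="A", intensity=120.0),
]

# Downstream, base calling decides which objects are real incorporations
# and which are artifacts (this threshold is purely illustrative):
real = [o for o in objects if o.intensity > 300]
print(len(real))  # 1
```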
We also store 1% diagnostic images that we will save and make available to customers. Those diagnostic images represent random sets of fields of view for each of our channels on our flow cells. So if one of our customers wanted to do their own image analysis, base calling, and strand formation, they could test their own algorithms against those diagnostic images, because it does represent a full stack of images...
The 5 TB of data per day is a full image stack, so for a HeliScope Sequencer run of 7 days, that's 35 TB of data. We'll save about 1 percent, or 350 GB of data, which is far more tractable from a storage and management standpoint... It will also allow us to troubleshoot anything that went on during the run that we didn't expect.
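The arithmetic above can be checked with a quick sketch. The figures are taken from the interview; the variable names and decimal units (1 TB = 1000 GB) are illustrative assumptions:

```python
# Data volumes quoted in the interview (illustrative calculation only)
GB_PER_TB = 1000  # assuming decimal units

image_data_per_day_tb = 5    # full image stack generated per day
run_length_days = 7          # one HeliScope run

full_stack_tb = image_data_per_day_tb * run_length_days     # 35 TB per run
diagnostic_fraction = 0.01                                  # 1% kept as diagnostics
diagnostic_gb = full_stack_tb * GB_PER_TB * diagnostic_fraction  # 350 GB retained

# The object table is roughly 1/10 the size of the full image stack
object_table_tb = full_stack_tb / 10

print(full_stack_tb, diagnostic_gb, object_table_tb)  # 35 350.0 3.5
```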
How much data can the Analysis Engine store?
[There] is enough data storage for a minimum of two full runs' worth of data. That's full runs minus the full image stack: the diagnostic images (the 1 percent), the object table, the sequence data, and log files... The data from that first run gets stored on the HeliScope Sequencer and completes processing while you start another run and begin saving that data onto the data store on the Analysis Engine. Before you start the third run, you take the data from the first run off the instrument and delete that data from the Analysis Engine, thereby freeing up enough space to start the third run. So it makes it possible to run the instrument almost continuously...
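The rolling two-run rotation described above can be sketched as a small state machine. The class and method names are hypothetical, not a real Helicos API; the sketch just captures the constraint that the Analysis Engine holds at most two runs of processed data at a time:

```python
# Illustrative sketch of the rolling two-run storage scheme.
# Names and capacity handling are assumptions for illustration only.
class AnalysisEngineStore:
    CAPACITY_RUNS = 2  # minimum capacity: two full runs of processed data

    def __init__(self):
        self.runs = []  # run IDs currently held on the engine's data store

    def start_run(self, run_id):
        if len(self.runs) >= self.CAPACITY_RUNS:
            raise RuntimeError("offload the oldest run before starting a new one")
        self.runs.append(run_id)

    def offload_oldest(self):
        # Customer moves the oldest run to external storage, then deletes
        # it here, freeing space so the instrument keeps running.
        return self.runs.pop(0)

store = AnalysisEngineStore()
store.start_run("run-1")
store.start_run("run-2")   # run-1 finishes processing during run-2
store.offload_oldest()     # move run-1 off before the third run
store.start_run("run-3")
print(store.runs)  # ['run-2', 'run-3']
```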
How will you help customers manage all this data?
We're working with companies like BioTeam and GenomeQuest and a host of other vendors whose technologies can help mitigate the challenges of managing these data... This includes hardware vendors, networking technology providers, high-performance computing vendors, systems integrators, bioinformatics developers, basic IT infrastructure companies, and LIMS vendors, whose tools, working in an integrated way, can support the pipeline needed to handle and analyze these large amounts of data...
Is your goal to name preferred solutions for these IT and computational needs?
This isn't about naming exclusive types of partners for our system. It's about validating this system... There are so many different factors in choosing the right storage environment. Today, one of the big drivers in choosing hardware is energy costs. You really have to understand the data lifecycle - how many people need to access data? How frequently? How quickly can the data be moved into persistent, less expensive storage, and then into archival storage mode?
What kind of data comes off the Analysis Engine?
We've adopted the Short-Read Format (SRF) as the HeliScope output. This format was developed by a workgroup started by people in the research and genome center community, including Sanger, as well as vendors of the short-read platforms, and is gaining adoption. It's been adopted as the standard for the 1000 Genomes Project, and NCBI has adopted it as the submission format for the Short Read Archive. It all goes back to our belief in openness, our open source strategy, and an open format where nothing is hidden from the user.
Are potential customers ill prepared for the next-gen data deluge?
Speed at every step of the way in dealing with this is going to be key to mitigating the data management challenges. Speed with efficacy: storing data, getting data off the instrument into a storage system, and doing it in a manner that is lossless, without sacrificing speed or interrupting the data. Speed with which you can access the data. If you're a user, do you have the right data structures to support pulling out the data you need, and just the data you need? Do you have the right tools to analyze the data and get the biology out of these sequences? And understanding just how much of the sequence data you need to accomplish your research goals without overwhelming the whole data management pipeline. --K.D.
This article appeared in Bio-IT World Magazine.