March 12, 2008 | Last week, Helicos BioSciences announced that it had shipped its first $1.35 million single-molecule DNA sequencing instrument to a company called Expression Analysis in North Carolina. Kristen Stoops, Helicos' director of bioinformatics, is building a federation of IT vendors to identify best practices that will help users manage the inevitable torrents of data they will be generating. She offered Bio-IT World’s Kevin Davies a glimpse of how those plans are progressing.
Bio-IT World: The HeliScope comes with an impressive computer server or Analysis Engine. Why did you decide to go that route?
Stoops: There were a couple of motivations behind providing such a powerful box for image analysis and primary analysis of the data. Number one, the HeliScope Single Molecule Sequencer is very image-intensive in its technology for generating sequences. We'll produce about 5 terabytes (TB) of image data per day, which presents a daunting challenge from a storage perspective, and the perspective of moving the data from the instrument to external storage. So one of the things we're doing is to limit the amount of data our customers themselves have to store…
[Second,] the HeliScope Analysis Engine is a very high-performance image analysis platform and server that does the image analysis on the fly (and will delete those images as part of the process), and stores a digital representation of those images in an object table, which is roughly 1/10 the size of the full image pack.
Every object in every image is represented in the object table. If you wanted to, you could recreate the images from the object table... We strip out all the background, we're not saving that, but we are saving every object that we detect, then when we do our base calling, we apply some smart algorithms to figure out if something is an artifact or a real base. From there, we go ahead and form actual sequence data...
We also store one percent of diagnostic images that we will save and make available to customers. Those diagnostic images represent random sets of fields of view for each of our channels on our flow cells. So if one of our customers wanted to do their own image analysis, base calling and strand formation, they could test their own algorithms against those diagnostic images, because it does represent a full stack of images from between 10 and 20 fields of view per channel.
The 5 TB of data per day is a full image stack. So for a HeliScope Sequencer run that lasts seven days, that's 35 TB of data. So we'll save about one percent, or 350 GB of data, far more tractable from a storage and management standpoint than something that is two orders of magnitude higher... It will also allow us to troubleshoot anything that went on during the run that we didn't expect.
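The arithmetic behind these figures can be checked with a short sketch. The inputs come from the interview (about 5 TB of images per day, a seven-day run, one percent kept as diagnostic images, an object table roughly one tenth the size of the image stack). The function name and the decimal unit convention (1 TB = 1,000 GB) are assumptions made for illustration, not anything Helicos published.

```python
def run_storage_estimate(tb_per_day=5.0, run_days=7,
                         diagnostic_fraction=0.01,
                         object_table_fraction=0.10):
    """Estimate per-run data volumes from the figures quoted in the interview."""
    # Full image stack for the whole run, which is analyzed and then deleted.
    full_stack_tb = tb_per_day * run_days
    return {
        "full_image_stack_tb": full_stack_tb,
        # About one percent of the images is kept for diagnostics (assuming 1 TB = 1,000 GB).
        "diagnostic_images_gb": full_stack_tb * diagnostic_fraction * 1000,
        # The object table is described as roughly 1/10 the size of the image pack.
        "object_table_tb": full_stack_tb * object_table_fraction,
    }


estimate = run_storage_estimate()
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

With the default inputs this reproduces the quoted 35 TB and 350 GB, and suggests an object table of roughly 3.5 TB per run, if the one-tenth ratio holds for a full run.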
How much data can the Analysis Engine store?
Stoops: Another efficiency we've tried to build into the Analysis Engine is enough data storage for a minimum of two full runs' worth of data. That's full runs, minus the full image stack -- [i.e.] the diagnostic images (the one percent), the object table, the sequence data, and log files. That's the information that actually gets stored on the Analysis Engine on a run-by-run basis that customers have access to...
We want people to be able to run the instrument with end-to-end runs, one right after another. So if you start a run, the data from that first run gets stored on the HeliScope Sequencer, and is completing processing while you start another run, and start saving that data onto the data store on the Analysis Engine. Before you start the third run, you take the data from the first run off the instrument, and delete that data from the Analysis Engine, thereby freeing up enough space to start the third run. So it makes it possible to run the instrument almost continuously...
There are a few hours of strand formation after the end of a run... Then you've got to pull the data off the HeliScope Instrument. That takes on the order of hours -- it is a lot of data.
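The run rotation described above can be pictured as a store with two slots. The following is a toy model only: the class name, method names, and run IDs are hypothetical, and it does not model real data sizes, transfer times, or any Helicos software interface.

```python
class TwoRunStore:
    """Toy model of a data store sized for two runs' worth of output."""

    def __init__(self, capacity_runs=2):
        self.capacity_runs = capacity_runs
        self.stored_runs = []  # run IDs whose data currently occupies the store

    def start_run(self, run_id):
        # A new run needs a free slot before it can begin writing data.
        if len(self.stored_runs) >= self.capacity_runs:
            raise RuntimeError(f"store full; offload a run before starting {run_id}")
        self.stored_runs.append(run_id)

    def offload_oldest(self):
        """Free the oldest run's slot (the copy to external storage is not modeled)."""
        return self.stored_runs.pop(0)


store = TwoRunStore()
store.start_run("run-1")
store.start_run("run-2")  # run-1 finishes processing while run-2 starts

try:
    store.start_run("run-3")  # no free slot yet
    blocked = False
except RuntimeError:
    blocked = True

freed = store.offload_oldest()  # take run-1 off the instrument and delete it
store.start_run("run-3")        # now the third run has space
```

The point of the sketch is the ordering constraint: the third run can only start once the first run's data has been moved off and deleted.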
How will you be helping customers manage all this data?
Stoops: We believe there are a lot of resources in the commercial and academic community that, if brought together effectively and in a tailored way for our system, can provide a lot of value. So we're working with companies like BioTeam and GenomeQuest and a host of other vendors whose technologies can help mitigate the challenges of managing these data.
One of my challenges is to bring together a partner program of vendors whose technologies will help mitigate these challenges. This includes hardware vendors, networking technology, high-performance computing vendors, systems integrators, bioinformatics developers, basic IT infrastructure companies, and LIMS vendors, whose tools, in an integrated way, can support the pipeline needed to deal with and analyze these large amounts of data...
I believe that to go out and just partner ad hoc and present those individual solutions is going to be a lot less helpful than working with a core group of best-of-breed partners, who integrate their technologies together to form a pipeline, which can then be validated at Helicos in the context of our own research collaborations. Then we can take that solution -- a full HeliScope IT pipeline -- from these various vendors out to the community as a model for what we know works from our own internal research...
It isn't intended to be an exclusive kind of partner program at all. We'd welcome other vendors coming in, and would work with them to develop other models that might suit different types of customers. It's really one step beyond the solution-based partnerships that one sees in the life sciences/IT space. But one that is a solution around an entire instrument... If a customer had to do this on their own, it would be very time-consuming, very expensive, and probably at the expense of running the instrument to do their own research. It makes much more sense for us to do this and to be able to make those recommendations to customers.
How are those vendor partner discussions going?
Stoops: Those conversations have been around understanding the capabilities of their technology and how well suited they are to what we want to do. [I'm] presenting this idea to them, assessing their interest... It's pretty self-evident that the market is there and there is a business opportunity associated with next-gen platforms. So that part of partnering has already taken care of itself. But [I'm] gauging their buy-in and trying to understand which partners would work best together... I've talked to a number of partners in all these different areas, and have a pretty good list in mind. Once we have instruments in house running research collaborations, we can really put this to work.
Is your goal to name preferred partners or solutions for these IT and computational needs?
Stoops: This isn't about naming exclusive types of partners for our system. It's about validating this system, and there may be others. There will always be advances in technology that we'll want to keep abreast of. There are so many different factors in choosing the right storage environment. Today, one of the big drivers in choosing hardware is energy costs. You really have to understand the data lifecycle -- how many people need to access data? How frequently? How quickly can the data be moved into persistent, less-expensive storage, and then into archival storage mode?
I've recently talked to our first client about their storage needs and it does require some effort on their part to think about who is going to use the data, and how it needs to be moved through various levels of access and storage. This isn't anything new; people have been thinking about these data lifecycle systems for a while, and there are products to help. But I don't know how much thought has been given to it in the life sciences, and next-gen sequencing is new. So there's an opportunity to apply a lot of things from technology in other industries to this problem.
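The data lifecycle questions raised here (who accesses the data, how often, and when it can move to cheaper storage) can be expressed as a simple tiering rule. This is a minimal sketch under stated assumptions: the tier names, the 30-day and 180-day thresholds, and the `DatasetUsage` fields are all hypothetical, chosen only to illustrate the idea rather than to describe any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class DatasetUsage:
    days_since_last_access: int
    active_users: int


def choose_tier(usage, warm_after_days=30, archive_after_days=180):
    """Pick a storage tier from access patterns (thresholds are illustrative)."""
    # Recently used data with active users stays on fast storage.
    if usage.active_users > 0 and usage.days_since_last_access < warm_after_days:
        return "fast"
    # Data not yet old enough to archive moves to persistent, less-expensive storage.
    if usage.days_since_last_access < archive_after_days:
        return "persistent"
    # Everything else goes to archival storage.
    return "archive"


tiers = [
    choose_tier(DatasetUsage(days_since_last_access=2, active_users=5)),
    choose_tier(DatasetUsage(days_since_last_access=45, active_users=1)),
    choose_tier(DatasetUsage(days_since_last_access=400, active_users=0)),
]
```

A real policy would also weigh factors mentioned in the interview, such as energy costs and how quickly data must be retrievable, which this sketch leaves out.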
What kind of data come off the Analysis Engine?
Stoops: In the interests of supporting standard formats, we've adopted the Short-Read Format (SRF) as the HeliScope output. This format was developed by a workgroup started by people in the research and genome center community, including Sanger, as well as vendors of the short-read platforms, and is gaining adoption. It's been adopted as the standard for the 1000 Genomes Project. NCBI has adopted it as the submission format for the Short Read Archive. It all goes back to our belief in openness, and our open-source strategy, and open format where nothing is hidden from the user. Adopting the SRF standard fits very nicely with our strategy for openness.
Do you think potential customers have under-appreciated or are ill-prepared for the data deluge associated with next-gen platforms?
Stoops: [We've] heard cautionary notes from people who have experience with next-generation sequencers, effectively saying, “If you're not ready to accept the full challenge associated with analyzing the data, maybe you’d better think about having a genetic analysis service organization generate and do primary analysis of the data for you, because there really is a lot to take on.”
I don't mean to sound flippant here, but speed every step of the way in dealing with this is going to be key in mitigating the data management challenges. Speed with efficacy -- storing data, getting data off the instrument into a storage system, doing it in a manner that is lossless and not sacrificing speed for interruptions in data. Speed with which you can access the data. If you're a user, do you have the right data structures to support pulling out the data you need and just the data you need? Do you have the right tools to analyze the data and get the biology out of these sequences? Understanding just how much of the sequence data you need to accomplish your research goals but not overwhelm the whole data management pipeline.
We can look back to microarrays as an example of data analysis and data management becoming a bottleneck. That's something that is an even greater challenge with next-gen technologies because of the amount of data. And as fast as Moore's Law is working to support us, we're still eclipsing it in our ability to generate sequence data.