Helicos' Kristen Stoops on Next-Generation Data Management



March 12, 2008 | Last week, Helicos BioSciences announced that it had shipped its first $1.35 million single-molecule DNA sequencing instrument to a company called Expression Analysis in North Carolina. Kristen Stoops, Helicos' director of bioinformatics, is building a federation of IT vendors to identify best practices that will help users manage the inevitable torrents of data they will be generating. She offered Bio-IT World’s Kevin Davies a glimpse of how those plans are progressing.

Bio-IT World: The HeliScope comes with an impressive computer server or Analysis Engine. Why did you decide to go that route?

Stoops: There were a couple of motivations behind providing such a powerful box for image analysis and primary analysis of the data. Number one, the HeliScope Single Molecule Sequencer is very image-intensive in its technology for generating sequences. We'll produce about 5 terabytes (TB) of image data per day, which presents a daunting challenge from a storage perspective, and the perspective of moving the data from the instrument to external storage. So one of the things we're doing is to limit the amount of data our customers themselves have to store…

[Second,] the HeliScope Analysis Engine is a very high-performance image analysis platform and server that does the image analysis on the fly (and deletes those images as part of the process), and stores a digital representation of those images in an object table, which is roughly 1/10 the size of the full image pack.

Every object in every image is represented in the object table. If you wanted to, you could recreate the images from the object table... We strip out all the background, we're not saving that, but we are saving every object that we detect, then when we do our base calling, we apply some smart algorithms to figure out if something is an artifact or a real base. From there, we go ahead and form actual sequence data...

We also store one percent of diagnostic images that we will save and make available to customers. Those diagnostic images represent random sets of fields of view for each of our channels on our flow cells. So if one of our customers wanted to do their own image analysis, base calling and strand formation, they could test their own algorithms against those diagnostic images, because it does represent a full stack of images from between 10 and 20 fields of view per channel.

The 5 TB of data is a full image stack. So for a HeliScope Sequencer run that lasts seven days, that's 35 TB of data. So we'll save about one percent, or 350 GB of data, far more tractable from a storage and management standpoint than something that is two orders of magnitude higher... It will also allow us to troubleshoot anything that went on during the run that we didn't expect.
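The figures Stoops quotes can be checked with a short back-of-envelope sketch. It assumes decimal units (1 TB = 1,000 GB) and treats "roughly 1/10" for the object table as exactly one-tenth; the variable names are illustrative and are not Helicos terminology.

```python
# Back-of-envelope check of the storage figures quoted in the interview.
# Assumptions (not from Helicos): decimal units (1 TB = 1000 GB), and the
# object table taken as exactly 1/10 of the image data ("roughly" in the text).

IMAGE_TB_PER_DAY = 5        # full image data produced per day
RUN_DAYS = 7                # length of one sequencer run
DIAGNOSTIC_PERCENT = 1      # share of images kept as diagnostic images
OBJECT_TABLE_DIVISOR = 10   # object table is about 1/10 of the image data
GB_PER_TB = 1000

full_image_stack_tb = IMAGE_TB_PER_DAY * RUN_DAYS
diagnostic_images_gb = full_image_stack_tb * GB_PER_TB * DIAGNOSTIC_PERCENT / 100
object_table_tb = full_image_stack_tb / OBJECT_TABLE_DIVISOR

print(f"Full image stack per run: {full_image_stack_tb} TB")
print(f"Diagnostic images kept:   {diagnostic_images_gb:.0f} GB")
print(f"Object table (approx.):   {object_table_tb} TB")
```

The first two values reproduce the 35 TB and 350 GB mentioned above. The object table estimate is only as good as the one-tenth assumption.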

How much data can the Analysis Engine store?

Stoops: Another efficiency we've tried to build into the Analysis Engine is enough data storage for a minimum of two full runs' worth of data. That's full runs, minus the full image stack -- [i.e.] the diagnostic images, the one percent, the object table, the sequence data, and log files. That's the information that actually gets stored on the Analysis Engine on a run-by-run basis that customers have access to...

We want people to be able to run the instrument with end-to-end runs, one right after another. So if you start a run, the data from that first run gets stored on the HeliScope Sequencer, and is completing processing while you start another run, and start saving that data onto the data store on the Analysis Engine. Before you start the third run, you take the data from the first run off the instrument, and delete that data from the Analysis Engine, thereby freeing up enough space to start the third run. So it makes it possible to run the instrument almost continuously...

There are a few hours of strand formation after the end of a run... Then you've got to pull the data off the HeliScope Instrument. That takes on the order of hours -- it is a lot of data.
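The run-rotation scheme described in this answer can be illustrated with a toy model. This is a minimal sketch, not Helicos software: the class name, method names, and run labels are hypothetical, and the capacity is fixed at exactly two runs even though the interview says "a minimum of two."

```python
# Toy model of the rotation described above: the data store holds two runs,
# so the oldest run must be offloaded before a third can start.
# All names here are hypothetical; this is not the Analysis Engine's API.

class AnalysisEngineStore:
    """Data store with room for a fixed number of runs."""

    def __init__(self, capacity_runs=2):
        self.capacity_runs = capacity_runs
        self.stored_runs = []  # oldest run first

    def start_run(self, run_id):
        """Reserve space for a new run, or fail if the store is full."""
        if len(self.stored_runs) >= self.capacity_runs:
            raise RuntimeError(f"store full: offload a run before starting {run_id}")
        self.stored_runs.append(run_id)

    def offload_oldest(self):
        """Stand-in for copying the oldest run elsewhere, then deleting it."""
        return self.stored_runs.pop(0)


store = AnalysisEngineStore()
store.start_run("run-1")
store.start_run("run-2")            # run-1 finishes processing meanwhile
offloaded = store.offload_oldest()  # pull run-1 off and free its space
store.start_run("run-3")            # now there is room for the third run
print(offloaded, store.stored_runs)
```

Trying to start a further run without offloading first raises an error in this model, which mirrors the constraint that space must be freed before the next run begins.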

How will you be helping customers manage all this data?

Stoops: We believe there are a lot of resources in the commercial and academic community that, if brought together effectively and in a tailored way for our system, can provide a lot of value. So we're working with companies like BioTeam and GenomeQuest and a host of other vendors whose technologies can help mitigate the challenges of managing these data.

One of my challenges is to bring together a partner program of vendors whose technologies will help mitigate these challenges. This includes hardware vendors, networking technology vendors, high-performance computing vendors, systems integrators, bioinformatics developers, basic IT infrastructure companies, and LIMS vendors, whose tools, used in an integrated way, can support the pipeline needed to deal with and analyze these large amounts of data...

I believe that to go out and just partner ad hoc and present those individual solutions is going to be a lot less helpful than working with a core group of best-of-breed partners, who integrate their technologies together to form a pipeline, which can then be validated at Helicos in the context of our own research collaborations. Then we can take that solution -- a full HeliScope IT pipeline -- from these various vendors out to the community as a model for what we know works from our own internal research...

It isn't intended to be an exclusive kind of partner program at all. We'd welcome other vendors coming in, and would work with them to develop other models that might suit different types of customers. It's really one step beyond the solution-based partnerships that one sees in the life sciences/IT space. But one that is a solution around an entire instrument... If a customer had to do this on their own, it would be very time-consuming, very expensive, and probably at the expense of running the instrument to do their own research. It makes much more sense for us to do this and to be able to make those recommendations to customers.

How are those vendor partner discussions going?

Stoops: Those conversations have been around understanding the capabilities of their technology and how well suited they are to what we want to do. [I'm] presenting this idea to them, assessing their interest... It's pretty self-evident that the market is there and there is a business opportunity associated with next-gen platforms. So that part of partnering has already taken care of itself. But [I'm] gauging their buy-in and trying to understand which partners would work best together... I've talked to a number of partners in all these different areas, and have a pretty good list in mind. Once we have instruments in house running research collaborations, we can really put this to work.

Is your goal to name preferred partners or solutions for these IT and computational needs?

Stoops: This isn't about naming exclusive types of partners for our system. It's about validating this system, and there may be others. There will always be advances in technology that we'll want to keep abreast of. There are so many different factors in choosing the right storage environment. Today, one of the big drivers in choosing hardware is energy costs. You really have to understand the data lifecycle -- how many people need to access data? How frequently? How quickly can the data be moved into persistent, less-expensive storage, and then into archival storage mode?

I've recently talked to our first client about their storage needs and it does require some effort on their part to think about who is going to use the data, and how it needs to be moved through various levels of access and storage. This isn't anything new; people have been thinking about these data lifecycle systems for a while, and there are products to help. But I don't know how much thought has been given to it in the life sciences, and next-gen sequencing is new. So there's an opportunity to apply a lot of things from technology in other industries to this problem.
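One way to picture the data lifecycle questions raised here is a simple age-based tiering rule. The sketch below is purely illustrative: the tier names, the 30-day and 365-day thresholds, and the function name are assumptions, not anything Helicos or its partners recommend.

```python
# Illustrative age-based storage tiering. The thresholds and tier names are
# hypothetical; a real policy would also weigh how many people need the data
# and how often, as the interview points out.
from datetime import date


def choose_tier(last_access: date, today: date,
                active_days: int = 30, persistent_days: int = 365) -> str:
    """Pick a storage tier from the number of days since the last access."""
    age_days = (today - last_access).days
    if age_days <= active_days:
        return "fast"        # recently used: keep on high-performance storage
    if age_days <= persistent_days:
        return "persistent"  # cooler data: less-expensive persistent storage
    return "archive"         # rarely touched: archival storage


today = date(2008, 3, 12)
for last_access in (date(2008, 3, 1), date(2007, 12, 1), date(2006, 1, 1)):
    print(last_access, "->", choose_tier(last_access, today))
```

A rule this simple only captures access recency. It is a starting point for the conversation about who uses the data and how quickly it can move to cheaper storage.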

What kind of data come off the Analysis Engine?

Stoops: In the interests of supporting standard formats, we've adopted the Short-Read Format (SRF) as the HeliScope output. This format was developed by a workgroup started by people in the research and genome center community, including Sanger, as well as vendors of the short-read platforms, and is gaining adoption. It's been adopted as the standard for the 1000 Genomes Project. NCBI has adopted it as the submission format for the Short Read Archive. It all goes back to our belief in openness, and our open-source strategy, and open format where nothing is hidden from the user. Adopting the SRF standard fits very nicely with our strategy for openness.

Do you think potential customers have under-appreciated or are ill-prepared for the data deluge associated with next-gen platforms?

Stoops: [We've] heard cautionary notes from people who have experience with next-generation sequencers, effectively saying, “If you're not ready to accept the full challenge associated with analyzing the data, maybe you’d better think about having a genetic analysis service organization generate and do primary analysis of the data for you, because there really is a lot to take on.”

I don't mean to sound flippant here, but speed every step of the way in dealing with this is going to be key in mitigating the data management challenges. Speed with efficacy -- storing data, getting data off the instrument into a storage system, doing it in a manner that is lossless and not sacrificing speed for interruptions in data. Speed with which you can access the data. If you're a user, do you have the right data structures to support pulling out the data you need and just the data you need? Do you have the right tools to analyze the data and get the biology out of these sequences? Understanding just how much of the sequence data you need to accomplish your research goals but not overwhelm the whole data management pipeline.

We can look back to microarrays as an example of data analysis and data management becoming a bottleneck. That's something that is an even greater challenge with next-gen technologies because of the amount of data. And as fast as Moore's Law is working to support us, we're still eclipsing it in our ability to generate sequence data.







For reprints and/or copyright permission, please contact  Tim McLucas, (781) 972-1342, tmclucas@healthtech.com .