Helicos' Kristen Stoops on Next-Generation Data Management




March 12, 2008 | Last week, Helicos BioSciences announced that it had shipped its first $1.35 million single-molecule DNA sequencing instrument to a company called Expression Analysis in North Carolina. Kristen Stoops, Helicos' director of bioinformatics, is building a federation of IT vendors to identify best practices that will help users manage the inevitable torrents of data they will be generating. She offered Bio-IT World’s Kevin Davies a glimpse of how those plans are progressing.

Bio-IT World: The HeliScope comes with an impressive computer server or Analysis Engine. Why did you decide to go that route?

Stoops: There were a couple of motivations behind providing such a powerful box for image analysis and primary analysis of the data. Number one, the HeliScope Single Molecule Sequencer is very image-intensive in its technology for generating sequences. We'll produce about 5 terabytes (TB) of image data per day, which presents a daunting challenge from a storage perspective, and the perspective of moving the data from the instrument to external storage. So one of the things we're doing is to limit the amount of data our customers themselves have to store…

[Second,] the HeliScope Analysis Engine is a very high-performance image analysis platform and server that does the image analysis on the fly (deleting those images as part of the process) and stores a digital representation of those images in an object table, which is roughly 1/10 the size of the full image pack.

Every object in every image is represented in the object table. If you wanted to, you could recreate the images from the object table... We strip out all the background, we're not saving that, but we are saving every object that we detect, then when we do our base calling, we apply some smart algorithms to figure out if something is an artifact or a real base. From there, we go ahead and form actual sequence data...

We also save one percent of the images as diagnostic images and make them available to customers. Those diagnostic images represent random sets of fields of view for each of our channels on our flow cells. So if one of our customers wanted to do their own image analysis, base calling and strand formation, they could test their own algorithms against those diagnostic images, because they represent a full stack of images from between 10 and 20 fields of view per channel.

The 5 TB of data per day is the full image stack. So for a HeliScope Sequencer run that lasts seven days, that's 35 TB of data. We'll save about one percent, or 350 GB, which is far more tractable from a storage and management standpoint than something two orders of magnitude larger... It will also allow us to troubleshoot anything that went on during the run that we didn't expect.
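
As a back-of-the-envelope check on the figures quoted above, the sketch below reproduces the arithmetic in Python. The function name and parameters are illustrative, not part of any Helicos software. It assumes decimal units (1 TB = 1,000 GB), as the quoted numbers imply, and it assumes the "roughly 1/10" object-table ratio applies to the whole run's image stack, which the interview does not state explicitly.

```python
def run_footprint_gb(tb_per_day=5, run_days=7,
                     diagnostic_percent=1, object_table_percent=10):
    """Estimate per-run data volumes in GB from the figures in the interview.

    Assumptions (not confirmed by the source): decimal units, and the
    object-table ratio applied to the entire run's image stack.
    """
    full_stack_gb = tb_per_day * 1000 * run_days
    return {
        # Full image stack, deleted on the fly by the Analysis Engine
        "full_image_stack_gb": full_stack_gb,
        # The one percent of images kept for diagnostics
        "diagnostic_images_gb": full_stack_gb * diagnostic_percent / 100,
        # Digital representation of detected objects (rough estimate)
        "object_table_gb": full_stack_gb * object_table_percent / 100,
    }


footprint = run_footprint_gb()
for name, gigabytes in footprint.items():
    print(f"{name}: {gigabytes:,.0f} GB")
```

With the defaults, the full stack comes to 35,000 GB (35 TB) and the diagnostic subset to 350 GB, matching the numbers Stoops cites. The object-table estimate is only as good as the "roughly 1/10" ratio.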

How much data can the Analysis Engine store?

Stoops: Another efficiency we've tried to build into the Analysis Engine is enough data storage for a minimum of two full runs' worth of data. That's full runs, minus the full image stack -- [i.e.] the diagnostic images (the one percent), the object table, the sequence data, and log files. That's the information that actually gets stored on the Analysis Engine on a run-by-run basis that customers have access to...

We want people to be able to run the instrument with end-to-end runs, one right after another. So if you start a run, the data from that first run gets stored on the HeliScope Sequencer, and is completing processing while you start another run, and start saving that data onto the data store on the Analysis Engine. Before you start the third run, you take the data from the first run off the instrument, and delete that data from the Analysis Engine, thereby freeing up enough space to start the third run. So it makes it possible to run the instrument almost continuously...
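
The rotation described above can be sketched as a small simulation. This is a minimal illustration, assuming the Analysis Engine holds exactly two runs' worth of data and that the oldest run is always the one offloaded. The function name and event strings are hypothetical and do not reflect any actual HeliScope software.

```python
from collections import deque


def schedule_runs(num_runs, capacity_runs=2):
    """List the order of operations for back-to-back sequencing runs.

    capacity_runs is the number of runs the data store can hold at once
    (two, per the interview). When the store is full, the oldest run must
    be copied off and deleted before the next run can start.
    """
    stored = deque()  # runs currently held on the data store, oldest first
    events = []
    for run in range(1, num_runs + 1):
        if len(stored) == capacity_runs:
            oldest = stored.popleft()
            events.append(f"offload and delete run {oldest}")
        stored.append(run)
        events.append(f"start run {run}")
    return events


for event in schedule_runs(4):
    print(event)
```

For four runs, the schedule starts runs 1 and 2, offloads run 1 before starting run 3, then offloads run 2 before starting run 4. The sketch ignores timing, such as the hours needed for strand formation and data transfer.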

There are a few hours of strand formation after the end of a run... Then you've got to pull the data off the HeliScope Instrument. That takes on the order of hours -- it is a lot of data.

How will you be helping customers manage all this data?

Stoops: We believe there are a lot of resources in the commercial and academic community that, if brought together effectively and in a tailored way for our system, can provide a lot of value. So we're working with companies like BioTeam and GenomeQuest and a host of other vendors whose technologies can help mitigate the challenges of managing these data.

One of my challenges is to bring together a partner program of vendors whose technologies will help mitigate these challenges. This includes hardware vendors, networking technology and high-performance computing vendors, systems integrators, bioinformatics developers, basic IT infrastructure companies, and LIMS vendors, whose tools, used in an integrated way, can support the pipeline needed to deal with and analyze these large amounts of data...

I believe that to go out and just partner ad hoc and present those individual solutions is going to be a lot less helpful than working with a core group of best-of-breed partners, who integrate their technologies together to form a pipeline, which can then be validated at Helicos in the context of our own research collaborations. Then we can take that solution -- a full HeliScope IT pipeline -- from these various vendors out to the community as a model for what we know works from our own internal research...

It isn't intended to be an exclusive kind of partner program at all. We'd welcome other vendors coming in, and would work with them to develop other models that might suit different types of customers. It's really one step beyond the solution-based partnerships that one sees in the life sciences/IT space. But one that is a solution around an entire instrument... If a customer had to do this on their own, it would be very time-consuming, very expensive, and probably at the expense of running the instrument to do their own research. It makes much more sense for us to do this and to be able to make those recommendations to customers.

How are those vendor partner discussions going?

Stoops: Those conversations have been around understanding the capabilities of their technology and how well suited they are to what we want to do. [I'm] presenting this idea to them, assessing their interest... It's pretty self-evident that the market is there and there is a business opportunity associated with next-gen platforms. So that part of partnering has already taken care of itself. But [I'm] gauging their buy-in and trying to understand which partners would work best together... I've talked to a number of partners in all these different areas, and have a pretty good list in mind. Once we have instruments in house running research collaborations, we can really put this to work.

Is your goal to name preferred partners or solutions for these IT and computational needs?

Stoops: This isn't about naming exclusive types of partners for our system. It's about validating this system, and there may be others. There will always be advances in technology that we'll want to keep abreast of. There are so many different factors in choosing the right storage environment. Today, one of the big drivers in choosing hardware is energy costs. You really have to understand the data lifecycle -- how many people need to access data? How frequently? How quickly can the data be moved into persistent, less-expensive storage, and then into archival storage mode?

I've recently talked to our first client about their storage needs and it does require some effort on their part to think about who is going to use the data, and how it needs to be moved through various levels of access and storage. This isn't anything new; people have been thinking about these data lifecycle systems for a while, and there are products to help. But I don't know how much thought has been given to it in the life sciences, and next-gen sequencing is new. So there's an opportunity to apply a lot of things from technology in other industries to this problem.
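
The data lifecycle questions raised above (who uses the data, how often, and when it can move to cheaper storage) can be turned into a simple tiering rule. The sketch below is purely illustrative: the tier names, the `Dataset` fields, and the 30-day and 180-day thresholds are all assumptions chosen for the example, not recommendations from Helicos or any vendor.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    days_since_last_access: int
    active_users: int  # people who currently need the data


def choose_tier(dataset, persistent_after_days=30, archive_after_days=180):
    """Pick a storage tier from access recency and demand.

    Thresholds are hypothetical; a real policy would be set per site.
    """
    if dataset.active_users > 0 and dataset.days_since_last_access < persistent_after_days:
        return "fast"        # frequently accessed working storage
    if dataset.days_since_last_access < archive_after_days:
        return "persistent"  # less-expensive storage, still online
    return "archive"         # long-term archival storage


datasets = [
    Dataset("current_run", days_since_last_access=2, active_users=5),
    Dataset("last_quarter", days_since_last_access=60, active_users=0),
    Dataset("old_project", days_since_last_access=400, active_users=0),
]
for ds in datasets:
    print(ds.name, "->", choose_tier(ds))
```

A rule this simple is only a starting point. In practice the policy would also weigh factors such as energy and hardware costs, which Stoops names as drivers in the choice of storage.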

What kind of data come off the Analysis Engine?

Stoops: In the interests of supporting standard formats, we've adopted the Short-Read Format (SRF) as the HeliScope output. This format was developed by a workgroup started by people in the research and genome center community, including Sanger, as well as vendors of the short-read platforms, and is gaining adoption. It's been adopted as the standard for the 1000 Genomes Project. NCBI has adopted it as the submission format for the Short Read Archive. It all goes back to our belief in openness, our open-source strategy, and an open format where nothing is hidden from the user. Adopting the SRF standard fits very nicely with our strategy for openness.

Do you think potential customers have under-appreciated or are ill-prepared for the data deluge associated with next-gen platforms?

Stoops: [We've] heard cautionary notes from people who have experience with next-generation sequencers, effectively saying, “If you're not ready to accept the full challenge associated with analyzing the data, maybe you’d better think about having a genetic analysis service organization generate and do primary analysis of the data for you, because there really is a lot to take on.”

I don't mean to sound flippant here, but speed every step of the way in dealing with this is going to be key in mitigating the data management challenges. Speed with efficacy -- storing data, getting data off the instrument into a storage system, doing it in a manner that is lossless and not sacrificing speed for interruptions in data. Speed with which you can access the data. If you're a user, do you have the right data structures to support pulling out the data you need and just the data you need? Do you have the right tools to analyze the data and get the biology out of these sequences? Understanding just how much of the sequence data you need to accomplish your research goals but not overwhelm the whole data management pipeline.

We can look back to microarrays as an example of data analysis and data management becoming a bottleneck. That's something that is an even greater challenge with next-gen technologies because of the amount of data. And as fast as Moore's Law is working to support us, we're still eclipsing it in our ability to generate sequence data.
