Helicos' IT Pipeline


April 1, 2008 | Kristen Stoops, Helicos' director of informatics business development, is building a federation of IT vendors to identify best practices that will help users manage the torrents of data they will be generating. Here she offers a glimpse of how those plans are progressing.

Bio•IT World: Why does the HeliScope require such an impressive computer server/Analysis Engine?
Stoops:
Number one, the HeliScope Single Molecule Sequencer is very image intensive in its technology for generating sequences. We'll produce about 5 terabytes of image data per day, which presents a daunting challenge from a storage perspective, and from the perspective of moving the data from the instrument to external storage. So one of the things we're doing is to limit the amount of data our customers themselves have to store...

[Two,] the HeliScope Analysis Engine is a very high-performance image analysis platform and server that does the image analysis on the fly (and will delete those images as part of the process), and stores a digital representation of those images in an object table, which is roughly 1/10 the size of the full image pack. Every object in every image is represented in the object table... We strip out all the background; we're not saving that. But we are saving every object that we detect. Then, when we do our base calling, we apply some smart algorithms to figure out if something is an artifact or a real base. From there, we go ahead and form actual sequence data...

We also save 1 percent of the images as diagnostic images and make them available to customers. Those diagnostic images represent random sets of fields of view for each of our channels on our flow cells. So if one of our customers wanted to do their own image analysis, base calling, and strand formation, they could test their own algorithms against those diagnostic images, because they do represent a full stack of images...

The 5 TB of data per day is a full image stack. So for a HeliScope Sequencer run that lasts 7 days, that's 35 TB of data. So we'll save about 1 percent, or 350 GB of data, which is far more tractable from a storage and management standpoint... It will also allow us to troubleshoot anything that went on during the run that we didn't expect.
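
The arithmetic behind these figures can be sketched in a few lines of Python. The constants are the numbers quoted in the interview; the decimal units (1 TB = 1,000 GB), the function name, and the dictionary keys are assumptions made for illustration, and the object-table figure simply applies the "roughly 1/10" ratio to a full run.

```python
# Back-of-envelope storage estimate using the figures quoted in the interview.
# Assumption: decimal units (1 TB = 1000 GB); all names here are illustrative.

TB_PER_DAY = 5             # full image stack produced per day
RUN_DAYS = 7               # length of one sequencing run
DIAGNOSTIC_PERCENT = 1     # share of images kept as diagnostic images
OBJECT_TABLE_DIVISOR = 10  # object table is roughly 1/10 of the image data


def run_storage_gb(tb_per_day=TB_PER_DAY, run_days=RUN_DAYS):
    """Return estimated sizes in GB for one run."""
    full_stack_gb = tb_per_day * run_days * 1000
    return {
        "full_image_stack_gb": full_stack_gb,
        "diagnostic_images_gb": full_stack_gb * DIAGNOSTIC_PERCENT // 100,
        "object_table_gb": full_stack_gb // OBJECT_TABLE_DIVISOR,
    }


if __name__ == "__main__":
    for name, size in run_storage_gb().items():
        print(f"{name}: {size:,} GB")
```

Under these assumptions a 7-day run yields 35,000 GB of raw images, of which 350 GB would be kept as diagnostic images.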

How much data can the Analysis Engine store? 
[There] is enough data storage for a minimum of two full runs' worth of data. That's full runs, minus the full image stack: [i.e.] the diagnostic images (the 1 percent), the object table, the sequence data, and log files.... The data from that first run gets stored on the HeliScope Sequencer, and is completing processing while you start another run, and start saving that data onto the data store on the Analysis Engine. Before you start the third run, you take the data from the first run off the instrument, and delete that data from the Analysis Engine, thereby freeing up enough space to start the third run. So it makes it possible to run the instrument almost continuously...
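
The two-run rotation described in this answer can be modeled as a small fixed-capacity buffer. This is a minimal sketch, not Helicos software: the class name, method names, and run identifiers are hypothetical, and the capacity of exactly two runs is taken from the "minimum of two full runs" figure.

```python
# Illustrative model of the two-run rotation described in the interview.
# Assumption: the store holds at most two runs; all names are hypothetical.
from collections import deque


class AnalysisEngineStore:
    def __init__(self, capacity_runs=2):
        self.capacity_runs = capacity_runs
        self.runs = deque()  # oldest run sits on the left
        self.exported = []   # runs already moved to external storage

    def start_run(self, run_id):
        """Begin saving a new run; refuse if no slot is free."""
        if len(self.runs) >= self.capacity_runs:
            raise RuntimeError("store full: offload the oldest run first")
        self.runs.append(run_id)

    def offload_oldest(self):
        """Move the oldest run to external storage and delete it locally."""
        run_id = self.runs.popleft()
        self.exported.append(run_id)
        return run_id


store = AnalysisEngineStore()
store.start_run("run-1")
store.start_run("run-2")  # run-1 finishes processing while run-2 is acquired
store.offload_oldest()    # required before a third run can start
store.start_run("run-3")
print(list(store.runs), store.exported)
```

Calling `start_run` a third time without first calling `offload_oldest` raises an error, which mirrors the requirement to take the first run's data off before starting the third.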

How will you help customers manage all this data?
We're working with companies like BioTeam and GenomeQuest and a host of other vendors whose technologies can help mitigate the challenges of managing these data... This includes hardware vendors, networking technology vendors, high-performance computing vendors, systems integrators, bioinformatics developers, basic IT infrastructure companies, and LIMS vendors, whose tools, in an integrated way, can support the pipeline needed to deal with and analyze these large amounts of data...

Is your goal to name preferred solutions for these IT and computational needs?
This isn't about naming exclusive types of partners for our system. It's about validating this system... There are so many different factors in choosing the right storage environment. Today, one of the big drivers in choosing hardware is energy costs. You really have to understand the data lifecycle: How many people need to access data? How frequently? How quickly can the data be moved into persistent, less expensive storage, and then into archival storage mode?

What kind of data come off the Analysis Engine?
We've adopted the Short-Read Format as the HeliScope output. This format was developed by a workgroup started by people in the research and genome center community, including Sanger, as well as vendors of the short-read platforms, and is gaining adoption. It's been adopted as the standard for the 1000 Genomes Project. NCBI has adopted it as the submission format for the Short Read Archive. It all goes back to our belief in openness, and our open source strategy, and open format where nothing is hidden from the user.

Are potential customers ill prepared for the next-gen data deluge?
Speed every step of the way in dealing with this is going to be key in mitigating the data management challenges. Speed with efficacy: storing data, getting data off the instrument into a storage system, doing it in a manner that is lossless and not sacrificing speed for interruptions in data. Speed with which you can access the data. If you're a user, do you have the right data structures to support pulling out the data you need and just the data you need? Do you have the right tools to analyze the data and get the biology out of these sequences? It means understanding just how much of the sequence data you need to accomplish your research goals without overwhelming the whole data management pipeline.  --K.D.


 This article appeared in Bio-IT World Magazine.


