Next-Generation Sequencing Problems and Solutions



By William Van Etten

"All that you are is either protein or the result of protein action.”  — J. Craig Venter

Oct. 10, 2007 | I’m not sure whether Venter said this first, but I heard it first from him, and it’s still one of my favorite quotes. The deeper meaning is this: the uniqueness of an individual can be reduced to their unique set of proteins, which is encoded within their genome and located within the estimated 0.1 to 0.5% of sequence that differs between any two genomes. The Human Genome Project identified most of these differences, but it leaves me wanting to know “my” differences, my “personal genome.” And I’m not alone in this view. This year, we’ve witnessed the personal genome sequencing of both Venter and James Watson, and the emergence of startups that hope to capitalize on the business of personal genomics, such as Navigenics (see First Base) and 23andMe.

Sequencing the first human genome required ten years at a cost of around $1 billion. Much of this time and money was devoted to preliminary science that doesn’t need to be repeated in order to sequence subsequent human genomes. Hence sequencing another individual human genome using conventional Sanger sequencers might require 10,000 instrument days (30 instruments for 1 year) for around $10 million. Venter’s personal genome was sequenced in this way. He sequenced his diploid genome (requiring double the effort), but borrowed some of his sequence from the Celera genome project, so this probably required something less than 60 instrument years and $20 million. In contrast, Jim Watson’s genome was sequenced by 454 in two months for $2 million. The cost of a personal genome should drop below $100,000 within a year.
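The arithmetic behind these estimates is easy to check. The sketch below is only a back-of-the-envelope calculation using the figures quoted above; the variable and function names are illustrative, and the $10 million figure is treated as a haploid-equivalent cost because the text doubles it for a diploid genome.

```python
# Figures quoted in the text; the names are illustrative, not from any tool.
SANGER_INSTRUMENTS = 30          # instruments running in parallel
SANGER_DAYS = 365                # one year
SANGER_COST_USD = 10_000_000     # one genome by Sanger (haploid-equivalent, assumed)
WATSON_COST_USD = 2_000_000      # Watson's genome, sequenced by 454
TARGET_COST_USD = 100_000        # projected personal-genome cost


def instrument_days(instruments: int, days: int) -> int:
    """Total instrument time, in instrument-days."""
    return instruments * days


def fold_reduction(old_cost: float, new_cost: float) -> float:
    """How many times cheaper new_cost is than old_cost."""
    return old_cost / new_cost


haploid_days = instrument_days(SANGER_INSTRUMENTS, SANGER_DAYS)
# A diploid genome doubles the effort, giving the upper bounds in the text.
diploid_instrument_years = 2 * SANGER_INSTRUMENTS
diploid_cost_usd = 2 * SANGER_COST_USD

print(f"Sanger effort: {haploid_days:,} instrument-days")
print(f"Diploid upper bound: {diploid_instrument_years} instrument-years, "
      f"${diploid_cost_usd:,}")
print(f"454 vs. Sanger: "
      f"{fold_reduction(SANGER_COST_USD, WATSON_COST_USD):.0f}x cheaper")
print(f"$100,000 target vs. Sanger: "
      f"{fold_reduction(SANGER_COST_USD, TARGET_COST_USD):.0f}x cheaper")
```

Thirty instruments for a year comes to 10,950 instrument-days, which the estimate above rounds to 10,000.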

Each next-generation sequencing vendor has its niche (e.g., longer reads, greater accuracy, greater data density, etc.), but they all offer 100- to 1000-fold improvements in throughput and cost over Sanger sequencing. How will the ability to generate so much more sequence so cheaply affect the sequencing market, and what are the IT implications for analyzing all this data?

Overall, the market trend is a lot more sequencing being performed by many more, smaller research groups. Assays previously performed using mass specs and microarrays are now being performed using next-gen sequencing. Beyond individual reference genomes, we’re resequencing intra-species variants for phylogenetic, association, and population studies. Clinical researchers are thinking about how to include a personal genome within a patient’s electronic medical record, permitting a physician’s diagnosis and annotation.

IT Implications
The BioTeam has had the opportunity to interact in varying degrees with each of the next-gen sequencing vendors, and with several researchers utilizing the vendors’ instruments. The storage and computing requirements for each vendor’s instrument are unique, but can be generalized to 0.2-6 terabytes (TB) of primary image data generated daily and a minimum of 4-8 CPUs in continuous operation to perform image analysis, base calling, and sequence assembly in order to keep up with data production.
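To put the daily data volumes in perspective, it helps to convert them to a sustained write rate. The following is a rough sketch, assuming decimal terabytes (10^12 bytes) and an instrument that writes evenly around the clock; neither assumption comes from the vendors, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12   # decimal terabyte (assumption)
BYTES_PER_MB = 10**6    # decimal megabyte (assumption)


def sustained_mb_per_s(tb_per_day: float) -> float:
    """Average write rate, in MB/s, needed to absorb tb_per_day of image data.

    Assumes data arrives evenly over 24 hours; real instruments are
    likely to be burstier than this.
    """
    bytes_per_second = tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY
    return bytes_per_second / BYTES_PER_MB


# The low and high ends of the range quoted in the text.
for tb_per_day in (0.2, 6):
    rate = sustained_mb_per_s(tb_per_day)
    print(f"{tb_per_day} TB/day is about {rate:.1f} MB/s sustained")
```

At the high end this is roughly 69 MB/s around the clock, before any of the data is read back for analysis.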

The vendors’ solution to this problem is to recommend (or include with their product) a Windows or Linux workstation, with a large local disk to collect the image data, to drive the instrument. These data are subsequently copied to a shared network file server for analysis by a small (~4 node) UNIX cluster. The advantage of a small cluster with large network storage is that it is a well-established architecture for providing many processors working together to analyze a common set of data. The disadvantages are that this particular problem is more IO-intensive than CPU-intensive (a bottleneck for clusters) and that, even though clustering has become easier over the years, configuring and maintaining the dependent shared network services is still beyond the abilities or duties of biologists and hospital technicians.

We have participated in the deployment of hundreds of UNIX clusters for many different compute-intensive research problems, helping to make clustering easier for researchers who don’t know how (or don’t want to know). But is a small cluster and shared network file server the most suitable solution for this particular problem? Rather than simply throw the solution we know at the problem, we re-examine the requirements and propose a better-suited alternative.

User: The average biologist or hospital technician knows their research and knows computing from the perspective of desktop productivity tools, but does not have IT experience. Small research labs probably don’t have the support of dedicated IT staff.

Processing: Image analysis, base calling and sequence assembly require between 4 and 8 of today’s modern processors in continuous operation to keep up with data production.

Storage: Moving terabytes of image data from one disk to another over a gigabit TCP/IP network using a modest network file server takes hours, even days. Moving that much data through a direct connection on the same computer takes minutes to hours.

Data Archival: Although disks are cheap, DNA sequencing is cheaper. It might cost less to sequence a genome again than to archive the primary data from sequencing it the first time.

Data Management: The results generated by next-gen sequencers are massive. In order to become useful information, these results must be related and re-related by researchers to answer whatever questions they posed with the experiment. This is clearly the most challenging and least addressed issue.
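The Storage requirement above can be made concrete with a rough transfer-time estimate. This is only a sketch under stated assumptions: 125 MB/s is the line rate of gigabit Ethernet, about 400 MB/s is the nominal payload rate of one 4-Gb Fibre Channel link, and the 30% efficiency for a modest network file server is a made-up illustrative figure, not a measurement.

```python
BYTES_PER_TB = 10**12   # decimal terabyte (assumption)
BYTES_PER_MB = 10**6

GIGABIT_ETHERNET_MB_S = 125   # 1 Gbit/s line rate, before protocol overhead
FIBRE_CHANNEL_4G_MB_S = 400   # nominal payload rate of one 4-Gb link


def transfer_hours(terabytes: float, link_mb_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move `terabytes` over a link.

    `efficiency` is the fraction of the nominal link rate actually
    achieved (a hypothetical tuning knob for this sketch).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    effective_bytes_per_s = link_mb_per_s * BYTES_PER_MB * efficiency
    return terabytes * BYTES_PER_TB / effective_bytes_per_s / 3600


# Best case for gigabit Ethernet: 1 TB at full line rate.
print(f"1 TB, gigabit at line rate: "
      f"{transfer_hours(1, GIGABIT_ETHERNET_MB_S):.1f} h")
# A modest file server rarely sustains line rate; 30% is an assumed figure.
print(f"6 TB, gigabit at 30%: "
      f"{transfer_hours(6, GIGABIT_ETHERNET_MB_S, 0.3):.1f} h")
# Direct-attached disk over a single 4-Gb Fibre Channel link.
print(f"6 TB, one 4-Gb FC link: "
      f"{transfer_hours(6, FIBRE_CHANNEL_4G_MB_S):.1f} h")
```

Even at full line rate, a single terabyte ties up a gigabit link for more than two hours, which is why avoiding the copy altogether is attractive.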

Alternative Solution
The ideal compute and storage engine to support a next-gen sequencer would: 1) be easy enough for a biologist or hospital technician to operate and manage; 2) provide 4 to 8 processors; 3) provide 10-20 TB of fast disk; and 4) present that disk directly to both the instrument-driving workstation and the compute engine, without copying the data.

So we thought... let’s replace the 4-node cluster with a single 8-core system. This provides the necessary compute power, but gets rid of all that clustering business. Then replace the shared network file server with a direct-attached Fibre Channel disk array. This provides the same storage capacity but much faster IO performance. Finally, move the independent Windows/Linux instrument-driving system to a virtual machine (VM) running on the same system. This permits the instrument to write to the same disk that the compute engine reads from, without copying the data. The disadvantages are that we can’t easily add more processing power if we want to, and that the CPU and IO load of data analysis might hinder data gathering in the VM layer.

For our validation experiment, we used an Apple Mac Pro with dual 3-GHz quad-core Intel Xeon processors and 8 GB of DDR2 RAM, running Mac OS X 10.4, attached by quad-channel 4-Gb Fibre Channel to a 10.5-TB Apple Xserve RAID configured as RAID level 5. With this hardware configuration, we used Illumina’s data analysis pipeline (version 0.2.2.5) to analyze a small run: a sequenced bacterial artificial chromosome comprising 68,000 images. Under these conditions, the analysis took around 4 hours, the same as a small cluster with network disk. We also captured a disk image of Illumina’s Galaxie workstation and demonstrated that we could run it within a Parallels VM layer. (We did not attempt to generate new sequence while analyzing prior sequence.)

Over the next few years, a lot more sequencing will be done by smaller groups that won’t have dedicated IT staff to manage clusters and file servers. A single 8-core system with fast, direct disk may be adequate and perhaps better suited to the problem.

Email William Van Etten at bill@bioteam.net.
