Next-Generation Sequencing Problems and Solutions



By William Van Etten

“All that you are is either protein or the result of protein action.” - J. Craig Venter

Oct. 10, 2007 |  I’m not sure whether Venter said this first, but I heard it first from him, and it’s still one of my favorite quotes. The deeper meaning is this: the uniqueness of an individual can be reduced to their unique set of proteins, encoded within their genome, and lies within the estimated 0.1 to 0.5% difference between any two genomes. The Human Genome Project identified most of these differences, but it leaves me wanting to know “my” differences, my “personal genome.” And I’m not alone in this view. This year, we’ve witnessed the personal genome sequencing of both Venter and James Watson, and startups that hope to capitalize on the business of personal genomics, such as Navigenics (see First Base) and 23andMe.

Sequencing the first human genome required ten years at a cost of around $1 billion. Much of this time and money was devoted to preliminary science that doesn’t need to be repeated in order to sequence subsequent human genomes. Hence sequencing another individual human genome using conventional Sanger sequencers might require 10,000 instrument days (30 instruments for 1 year) for around $10 million. Venter’s personal genome was sequenced in this way. He sequenced his diploid genome (requiring double the effort), but borrowed some of his sequence from the Celera genome project, so this probably required something less than 60 instrument years and $20 million. In contrast, Jim Watson’s genome was sequenced by 454 in two months for $2 million. The cost of a personal genome should drop below $100,000 within a year.

Each next-generation sequencing vendor has its niche (e.g., longer reads, greater accuracy, greater data density), but they all offer 100- to 1000-fold improvements in throughput and cost over Sanger sequencing. How will the ability to generate so much more sequence so cheaply affect the sequencing market, and what are the IT implications for analyzing all this data?

Overall, the market trend is a lot more sequencing being performed by many more, smaller research groups. Assays previously performed using mass specs and microarrays are now being performed using next-gen sequencing. Beyond individual reference genomes, we’re resequencing intra-species variants for phylogenetic, association, and population studies. Clinical researchers are thinking about how to include a personal genome within a patient’s electronic medical record, permitting a physician’s diagnosis and annotation.

IT Implications
The BioTeam has had the opportunity to interact in varying degrees with each of the next-gen sequencing vendors, and with several researchers utilizing the vendors’ instruments. The storage and computing requirements for each vendor’s instrument are unique, but can be generalized to 0.2-6 terabytes (TB) of primary image data generated daily and a minimum of 4-8 CPUs in continuous operation to perform image analysis, base calling, and sequence assembly in order to keep up with data production.

The vendors’ solution to this problem is to recommend (or include with their product) a Windows or Linux workstation to drive the instrument, with a large local disk to collect the image data. These data are subsequently copied to a shared network file server for analysis by a small (~4 node) UNIX cluster. The advantage of a small cluster with large network storage is that it is a well-established architecture for providing many processors working together to analyze a common set of data. The disadvantages are that this particular problem is more IO-intensive than CPU-intensive (a bottleneck for clusters) and that even though clustering has become easier over the years, configuring and maintaining the dependent shared network services is still beyond the abilities or duties of biologists and hospital technicians.

We have participated in the deployment of hundreds of UNIX clusters for many different compute-intensive research problems, helping to make clustering easier for researchers who don’t know how (or don’t want to know). But is a small cluster and shared network file server the most suitable solution for this particular problem? Rather than simply throw the solution we know at the problem, we re-examine the requirements and propose a better-suited alternative.

User: The average biologist or hospital technician knows their research and knows computing from the perspective of desktop productivity tools, but does not have IT experience. Small research labs probably don’t have the support of dedicated IT staff.

Processing: Image analysis, base calling, and sequence assembly require between 4 and 8 of today’s processors in continuous operation to keep up with data production.

Storage: Moving terabytes of image data from one disk to another over a gigabit TCP/IP network using a modest network file server takes hours, even days. Moving that much data through a direct connection on the same computer takes minutes to hours.
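As a rough check on these figures, the sketch below estimates transfer times from nominal link rates. The 50% efficiency factor for the network path and the 16 Gb/s direct-attached rate are illustrative assumptions, not measurements; real throughput also depends on disk speed and protocol overhead.

```python
# Back-of-the-envelope transfer times. All rates are nominal, and the
# efficiency factor is an assumed, illustrative value, not a measurement.

def transfer_hours(terabytes, link_gbps, efficiency=1.0):
    """Hours needed to move `terabytes` (decimal TB) over a link rated at
    `link_gbps` gigabits per second, derated by `efficiency` in (0, 1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# Daily image volumes quoted in the article: 0.2 to 6 TB.
for tb in (0.2, 1.0, 6.0):
    network = transfer_hours(tb, link_gbps=1.0, efficiency=0.5)  # assumed 50% of gigabit
    direct = transfer_hours(tb, link_gbps=16.0)                  # assumed direct-attached rate
    print(f"{tb:>4} TB: network ~{network:.1f} h, direct ~{direct * 60:.0f} min")
```

Under these assumptions, a 6 TB day of images needs roughly 27 hours over the network, which is longer than the day it took to produce, and under an hour over the direct connection.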

Data Archival: Although disks are cheap, DNA sequencing is cheaper. It might cost less to sequence a genome again than to archive the primary data from sequencing it the first time.

Data Management: The results generated by next-gen sequencers are massive. To become useful information, these results must be related and re-related by researchers to answer whatever questions the experiment posed. This is clearly the most challenging and least addressed issue.

Alternative Solution
The ideal compute and storage engine to support a next-gen sequencer: 1) is easy enough for a biologist or hospital technician to operate and manage; 2) provides 4 to 8 processors; 3) provides 10-20 TB of fast disk; and 4) presents that disk directly to both the instrument-driving workstation and the compute engine, without copying the data.
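To see how the 10-20 TB target relates to the 0.2-6 TB of image data generated daily, here is a minimal sketch of how long such a disk lasts before it fills. It assumes nothing is deleted or archived in the meantime, and the function name is purely illustrative.

```python
# How many days of primary image data fit on the disk, assuming no
# deletion or archiving in the meantime (a simplifying assumption).

def days_until_full(capacity_tb, daily_tb):
    """Days of acquisition that fit in `capacity_tb` at `daily_tb` per day."""
    if daily_tb <= 0:
        raise ValueError("daily_tb must be positive")
    return capacity_tb / daily_tb

# Capacity and daily-volume ranges taken from the article.
for capacity in (10, 20):
    for daily in (0.2, 6):
        days = days_until_full(capacity, daily)
        print(f"{capacity} TB at {daily} TB/day: ~{days:.1f} days")
```

At the high end of the data rate, even 20 TB fills in a little over three days, so primary images have to be discarded or moved off soon after analysis.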

So we thought... let’s replace the 4-node cluster with a single 8-core system. This provides the necessary compute power, but gets rid of all that clustering business. Then replace the shared network file server with a direct-attached Fiber Channel disk array. This provides the same storage capacity but much faster IO performance. Finally, move the independent Windows/Linux instrument-driving system to a virtual machine running on the same system. This permits the instrument to write to the same disk that the compute engine reads from, without copying the data. The disadvantages are that we can’t easily add more processing power if we want to, and the CPU and IO load of data analysis might hinder data gathering in the VM layer.

For our validation experiment, we used an Apple Mac Pro with dual 3-GHz quad-core Intel Xeon processors and 8 GB of DDR2 RAM, running Mac OS X 10.4, attached by quad-channel 4-Gb Fiber Channel to a 10.5-TB Apple Xserve RAID configured as RAID level 5. With this hardware configuration, we used Illumina’s data analysis pipeline (version 0.2.2.5) to analyze a small run consisting of a sequenced bacterial artificial chromosome, comprising 68,000 images. Under these conditions, the analysis execution time was around 4 hours, the same as a small cluster with network disk. We captured a disk image of Illumina’s Galaxie workstation and demonstrated that we could run it within a Parallels VM layer. (We did not attempt generating new sequence while analyzing prior sequence.)
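For comparison with an instrument's image production rate, the validation figures can be turned into an approximate analysis throughput. This is only a sketch: the 4-hour run time is approximate, and dividing evenly across all 8 cores is an assumption the experiment did not measure.

```python
# Approximate analysis throughput from the validation run described above.
# The even split across 8 cores is an assumption for illustration only.

def images_per_second(images, hours):
    """Average number of images analyzed per second over the run."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return images / (hours * 3600)

IMAGES = 68_000   # images in the validation run
HOURS = 4         # approximate analysis time
CORES = 8         # dual quad-core system

rate = images_per_second(IMAGES, HOURS)
print(f"overall: ~{rate:.2f} images/s")
print(f"per core (assumed even split): ~{rate / CORES:.2f} images/s")
```

If an instrument produces images faster than the overall rate, analysis falls behind acquisition.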

Over the next few years, a lot more sequencing will be done by smaller groups that won’t have dedicated IT staff to manage clusters and file servers. A single 8-core system with fast, direct disk may be adequate and perhaps better suited to the problem.

Email William Van Etten at bill@bioteam.net.

Subscribe to Bio-IT World  magazine.
