"All that you are is either protein or the result of protein action.” — J. Craig Venter
Oct. 10, 2007 | I’m not sure whether Venter said this first, but I heard it first from him, and it’s still one of my favorite quotes. The deeper meaning is this: the uniqueness of an individual can be reduced to their unique set of proteins, which is encoded within their genome and lies within the estimated 0.1 to 0.5% difference between any two genomes. The Human Genome Project identified most of these differences, but it leaves me wanting to know “my” differences, my “personal genome.” And I’m not alone in this view. This year, we’ve witnessed the personal genome sequencing of both Venter and James Watson, and startups that hope to capitalize on the business of personal genomics, such as Navigenics (see First Base) and 23andMe.
Sequencing the first human genome required ten years at a cost of around $1 billion. Much of this time and money was devoted to preliminary science that doesn’t need to be repeated in order to sequence subsequent human genomes. Hence sequencing another individual human genome using conventional Sanger sequencers might require 10,000 instrument days (30 instruments for 1 year) for around $10 million. Venter’s personal genome was sequenced in this way. He sequenced his diploid genome (requiring double the effort), but borrowed some of his sequence from the Celera genome project, so this probably required something less than 60 instrument years and $20 million. In contrast, Jim Watson’s genome was sequenced by 454 in two months for $2 million. The cost of a personal genome should drop below $100,000 within a year.
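The figures in the paragraph above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the instrument counts, durations, and dollar amounts come from the text, while the function name and the 365-day year are illustrative assumptions.

```python
# Back-of-envelope check of the Sanger and 454 figures quoted above.
DAYS_PER_YEAR = 365  # assumption: instruments run every day of the year


def instrument_days(instruments: int, years: int) -> int:
    """Total instrument-days consumed by a fleet running continuously."""
    return instruments * years * DAYS_PER_YEAR


# 30 Sanger instruments for 1 year, quoted as roughly 10,000 instrument days.
haploid_days = instrument_days(30, 1)
haploid_cost = 10_000_000  # dollars, from the text

# A diploid genome doubles the effort: the upper bound quoted for Venter.
diploid_instrument_years = 2 * 30
diploid_cost = 2 * haploid_cost

# Watson's genome on 454: two months for $2 million (from the text).
watson_cost = 2_000_000
cost_ratio = haploid_cost / watson_cost

print(haploid_days, diploid_instrument_years, diploid_cost, cost_ratio)
```

The exact product is 10,950 instrument days, which the text rounds to 10,000, and the quoted 454 price is one fifth of the quoted Sanger price for a single genome.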
Each next-generation sequencing vendor has its niche (e.g., longer reads, greater accuracy, greater data density, etc.), but they all offer 100- to 1000-fold improvements in throughput and cost over Sanger sequencing. How will the ability to generate so much more sequence so cheaply affect the sequencing market, and what are the IT implications for analyzing all this data?
Overall, the market trend is a lot more sequencing being performed by many more, smaller research groups. Assays previously performed using mass specs and microarrays are now being performed using next-gen sequencing. Beyond individual reference genomes, we’re resequencing intra-species variants for phylogenetic, association, and population studies. Clinical researchers are thinking about how to include a personal genome within a patient’s electronic medical record, permitting a physician’s diagnosis and annotation.
The BioTeam has had the opportunity to interact in varying degrees with each of the next-gen sequencing vendors, and with several researchers utilizing the vendors’ instruments. The storage and computing requirements for each vendor’s instrument are unique, but can be generalized to 0.2-6 terabytes (TB) of primary image data generated daily and a minimum of 4-8 CPUs in continuous operation to perform image analysis, base calling, and sequence assembly in order to keep up with data production.
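To put the daily data volumes in perspective, the sketch below converts them into the sustained write rate the storage would have to absorb. It assumes decimal terabytes and an instrument that writes evenly around the clock; neither assumption comes from the vendors, and real instruments may write in bursts.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 10**12  # assumption: decimal terabytes
BYTES_PER_MB = 10**6


def sustained_mb_per_s(tb_per_day: float) -> float:
    """Average write rate in MB/s needed to absorb tb_per_day of image data."""
    return tb_per_day * BYTES_PER_TB / BYTES_PER_MB / SECONDS_PER_DAY


low = sustained_mb_per_s(0.2)  # low end of the range quoted above
high = sustained_mb_per_s(6)   # high end of the range quoted above

print(f"{low:.1f} MB/s to {high:.1f} MB/s")
```

Under these assumptions the range works out to roughly 2 to 70 MB/s of continuous writing, before any analysis reads the same data back.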
The vendors’ solution to this problem is to recommend (or include with their product) a Windows or Linux workstation to drive the instrument, with a large local disk to collect the image data. These data are subsequently copied to a shared network file server for analysis by a small (~4 node) UNIX cluster. The advantage of a small cluster with large network storage is that it is a well-established architecture for providing many processors working together to analyze a common set of data. The disadvantages are that this particular problem is more IO-intensive than CPU-intensive (a bottleneck for clusters) and that, even though clustering has become easier over the years, configuring and maintaining the dependent shared network services is still beyond the abilities or duties of biologists and hospital technicians.
We have participated in the deployment of hundreds of UNIX clusters for many different compute-intensive research problems, helping to make clustering easier for researchers who don’t know how (or don’t want to know). But is a small cluster and shared network file server the most suitable solution for this particular problem? Rather than simply throw the solution we know at the problem, we re-examine the requirements and propose a better-suited alternative.
User: The average biologist or hospital technician knows their research and knows computing from the perspective of desktop productivity tools, but does not have IT experience. Small research labs probably don’t have the support of dedicated IT staff.
Processing: Image analysis, base calling and sequence assembly require between 4 and 8 of today’s modern processors in continuous operation to keep up with data production.
Storage: Moving terabytes of image data from one disk to another over a gigabit TCP/IP network using a modest network file server takes hours, even days. Moving that much data through a direct connection on the same computer takes minutes to hours.
Data Archival: Although disks are cheap, DNA sequencing is cheaper. It might cost less to sequence a genome again than to archive the primary data from sequencing it the first time.
Data Management: The results generated by next-gen sequencers are massive. In order to become useful information, these results must be related and re-related by researchers to answer whatever questions they posed with the experiment. This is clearly the most challenging and least addressed issue.
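The Storage requirement above can be illustrated with a rough transfer-time estimate. This is a sketch under stated assumptions: the function name, the 4 Gbit/s direct-attached link speed, and both efficiency factors are hypothetical values chosen for illustration, not measurements.

```python
def transfer_hours(terabytes: float, link_gbit_per_s: float, efficiency: float) -> float:
    """Hours to move `terabytes` over a link running at a fraction of line rate."""
    bytes_total = terabytes * 10**12                        # decimal terabytes
    bytes_per_s = link_gbit_per_s * 10**9 / 8 * efficiency  # usable bytes per second
    return bytes_total / bytes_per_s / 3600


# Gigabit network through a modest file server; 50% efficiency is assumed.
network_hours = transfer_hours(6, 1, 0.5)

# Direct-attached 4 Gbit/s link; the speed and 80% efficiency are assumed.
direct_hours = transfer_hours(6, 4, 0.8)

print(f"network: {network_hours:.1f} h, direct: {direct_hours:.1f} h")
```

With these assumed numbers, one day of data at the high end (6 TB) takes more than a day to copy over the network but only about four hours over the direct link, so a network copy could not keep pace with production.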
The ideal compute and storage engine to support a next-gen sequencer: 1) is easy enough for a biologist or hospital technician to operate and manage; 2) provides 4 to 8 processors; 3) provides 10-20 TB of fast disk; and 4) presents that disk directly to both the instrument-driving workstation and the compute engine, so the data never have to be copied.
So we thought... let’s replace the 4-node cluster with a single 8-core system. This provides the necessary compute power, but gets rid of all that clustering business. Then replace the shared network file server with a direct-attached fiber channel disk array. This provides the same storage capacity but much faster IO performance. Finally, move the independent Windows/Linux instrument driving system to a virtual machine operating on the same system. This permits the instrument to write to the same disk that the compute engine reads from without copying the data. The disadvantages are that we can’t easily add more processing power if we want to, and the CPU/IO of data analysis might hinder data gathering in the VM layer.
For our validation experiment, we used an Apple Mac Pro with dual 3-GHz quad-core Intel Xeon processors and 8 GB of DDR2 RAM, running Mac OS X 10.4, attached by quad-channel 4-Gb Fiber Channel to a 10.5-TB Apple Xserve RAID configured at RAID level 5 (striping with parity). With this hardware configuration, we used Illumina’s data analysis pipeline (version 0.2.2.5) to analyze a small run, a sequenced bacterial artificial chromosome comprising 68,000 images. Under these conditions, the analysis took around 4 hours, the same as a small cluster with network disk. We captured a disk image of Illumina’s Galaxie workstation and demonstrated that we could run it within a Parallels VM layer. (We did not attempt generating new sequence while analyzing prior sequence.)
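For a sense of scale, the validation run can be turned into an approximate image-processing rate. The image count, run time, and core count are from the paragraph above. The even split across all eight cores is an assumption made only for illustration, since we did not measure how the pipeline distributed its work.

```python
images = 68_000  # images in the validation run
hours = 4        # approximate analysis time
cores = 8        # dual quad-core Xeon

# Overall rate for the whole pipeline (image analysis, base calling, assembly).
images_per_s = images / (hours * 3600)

# Per-core rate, assuming (hypothetically) an even split across all cores.
images_per_s_per_core = images_per_s / cores

print(f"{images_per_s:.2f} images/s overall, "
      f"{images_per_s_per_core:.2f} images/s per core")
```

That works out to a little under five images per second for the whole system, a figure that only has meaning for this pipeline version and this particular run.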
Over the next few years, a lot more sequencing will be done by smaller groups that won’t have dedicated IT staff to manage clusters and file servers. A single 8-core system with fast, direct disk may be adequate and perhaps better suited to the problem.
Email William Van Etten at firstname.lastname@example.org.
Subscribe to Bio-IT World magazine.