Next-generation phred software squeezes more data from Illumina runs.
By Kevin Davies
November 10, 2009 | BARCELONA—In one of the highlights of an excellent inaugural European conference on next-generation sequencing in October*, veteran bioinformatician Philip Green presented benchmark results on a software package that extracts significantly more reads and data per run than the current Illumina tools.
Green, a Howard Hughes Medical Institute investigator at the University of Washington in Seattle, is one of the unsung heroes of the Human Genome Project for his programs phred and phrap, which provided essential quality measures for DNA sequence base calls from traditional Sanger sequencing. For the past year, Green has been devising new methods for next-generation sequence data analysis. While his pilot project focuses on Illumina’s popular GA II platform, Green says his methods should be portable to other platforms.
Green described three primary goals for the work: First, to obtain more reads per instrument run; second, to lower the error rate; and third, to obtain higher quality base-calling values. By squeezing out more usable data per instrument run, Green aims to lower the costs of genome sequencing and simultaneously increase the amount of unique sequence extracted in each experiment.
The next_phred base caller is in fact a pipeline consisting of several different programs. The first, tiff2intense, takes the raw (.tif) flow-cell images from the CCD camera, corrects intensities for spectral overlap, and produces one read per pixel (potentially 3.5 billion reads/flow cell).
Phaster is a fast sequence alignment program that aligns reads to the reference genome and builds a decision tree to provide base calls and error probabilities. Finally, phast_lane aligns the cluster reads to identify single-nucleotide polymorphisms (SNPs).
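The three stages described above can be sketched in miniature. Everything here is an illustrative stand-in, not Green's implementation: the crosstalk constant, the brightest-channel base call, and the mismatch-based SNP report are all assumptions for the sake of a runnable toy.

```python
def tiff2intense(raw_cycles, k=0.1):
    """Correct per-cycle (A, C, G, T) intensities for spectral overlap.

    k is an assumed fixed crosstalk fraction between spectrally adjacent
    channels; the real tool would derive corrections from the images.
    """
    return [(a - k * c, c - k * a, g - k * t, t - k * g)
            for a, c, g, t in raw_cycles]

def call_bases(corrected):
    """Call the brightest corrected channel at each cycle (a toy stand-in
    for phaster's decision-tree base caller)."""
    return "".join("ACGT"[max(range(4), key=cyc.__getitem__)]
                   for cyc in corrected)

def find_snps(read, reference, pos):
    """Report mismatches between an aligned read and the reference
    (a toy stand-in for phast_lane's SNP identification)."""
    aligned_ref = reference[pos:pos + len(read)]
    return [(pos + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(aligned_ref, read))
            if ref_base != read_base]

# One pixel's raw intensities over four cycles -> one read.
raw = [(9, 1, 0, 0), (0, 8, 1, 0), (1, 0, 9, 0), (0, 0, 1, 9)]
read = call_bases(tiff2intense(raw))          # "ACGT"
snps = find_snps(read, "TTAGGTTT", 2)         # read differs from ref at one base
```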
Head to Head
Green presented the preliminary results of the bake-off between Illumina’s Pipeline software and next_phred. Against the current Pipeline version 1.4, Green reported anywhere from 40% to 250% improvement in alignable reads. Using the Q20 metric as a measure of base-calling quality, he found improvements of 46% to 220%. Green cautioned against simply discarding valuable data at some arbitrary quality cut-off. “We still want [the data]—just account for the quality,” he said. What is needed instead are analysis methods that handle the quality values intelligently.
Unlike most fast aligners, phaster is not based on the Burrows-Wheeler transform but on a more traditional approach: a word index of the reference genome. Against Bowtie, phaster showed a slight speed advantage in an alignment against the human genome. A more striking difference was seen in sensitivity: of a total of 22 million reads, the two programs mapped most reads equally well, but Bowtie mapped just 18,000 uniquely, whereas phaster mapped more than 1 million uniquely, yielding 6% more mapped reads in total than Bowtie.
Green closed by recommending a change of heart with regard to data preservation. “Virtually all genome centers throw away image files,” he said. “Our philosophy is, you should save the [data files].” Green has a utility that reduces the size of images by 75%. Tape storage for the images of a 36-cycle Illumina run amounts to just $12, said Green. “The cost of storage is so much less than running the flow cell. You might want to go back to it someday and get more information. It’s a lot more expensive to run another flow cell than to retrieve the images and get more data out of them.”
Green said he is working on a couple of improvements, but is willing to provide next_phred “to serious beta testers.” He anticipates wider release of the package in January 2010. “Please ask Illumina to keep providing access to the images,” Green implored the audience. “The whole project would collapse if that [policy change] was to be the case.”
More than 150 delegates attended the 3-day conference. Speakers included Rick Wilson (The Genome Center at Washington University) on cancer genome sequencing; Carlos Bustamante (Cornell University) on principal component analysis of human populations; Gabor Marth (Boston College) on bioinformatics tools for next-gen data analysis; and Gil McVean (University of Oxford) on the 1000 Genomes Project.
The meeting included many talks on applications beyond human genomes, including metagenomics and plant genomics. Discussing work on epigenetics, Stephan Beck (University of London) played a video simulation of nanopore sequencing as practiced by Oxford Nanopore Technologies, which has the benefit of being able to detect methylated cytosines, the so-called “fifth base.” Beck noted that reports of a “sixth base” earlier this year make the field even more challenging.
The conference was superbly organized by Miguel Perez-Enciso (UAB, Universitat Autonoma de Barcelona) and colleagues. A decision has yet to be made on whether the meeting will be held in 2010, despite the glaring need for more first-rate forums for the booming European next-gen sequencing community.
*NGS2009: Conference on Next Generation Sequencing: Challenges and Opportunities. Barcelona, Spain; October 1-3, 2009.
This article also appeared in the November-December 2009 issue of Bio-IT World Magazine.