May 15, 2007 | J. Craig Venter recently made his Comedy Central debut on The Colbert Report. Asked by host Stephen Colbert “What makes you think you can do a better job with life and genetics than God?” Venter shot back: “We have computers!” rendering Colbert momentarily (and uncharacteristically) speechless.
If computers were key seven years ago during the heady days of Celera Genomics and the Human Genome Project, they will be taxed even more now. From cancer mutation catalogues to whole-genome association studies, from evolutionary genomics to Venter’s own global ocean sampling expedition, genome analysis has never been broader or more breathtaking. But until recently, most of these efforts have relied on tried-and-true Sanger sequencing.
With the introduction in 2005 of the first next-generation sequencing platform, 454 Life Sciences pulled into pole position for the start of a true genomics Grand Prix. In the past 12 months, 454, Solexa, and Agencourt Personal Genomics (APG) have been snapped up by Roche, Illumina, and Applied Biosystems, respectively. Along with other contenders including Helicos BioSciences, VisiGen Biotechnologies, and GE, they are eyeing a massive potential market put at $1 billion and up, while the cost of human genome sequencing has plunged, and continues to plunge, to about $1 million. Within a year or two, we could see genome sequences for $10,000 or less.
In March, Roche paid $155 million for 454, the CuraGen subsidiary that launched its debut instrument on the heels of a landmark 2005 Nature paper. 454 has amassed dozens of high-profile publications since then, including groundbreaking work on the Neanderthal genome. In common with some competitors, the system binds a DNA fragment to an individual bead, then produces millions of copies of the fragment per bead using a process called emulsion PCR. After the beads are deposited into a picotiter plate, each nucleotide (A, C, T, G) is washed over the plate in turn, enabling sequencing-by-synthesis. A CCD camera records each added base as it happens, across 100,000 beads at a time, via a pyrosequencing reaction.
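As a rough illustration of how such flow-based reads are decoded, consider the toy sketch below. The flow order and the simple rounding rule are simplifications for illustration, not 454’s production base caller:

```python
# Toy sketch of pyrosequencing flowgram decoding. Each flow washes one
# nucleotide over the plate; the light signal is roughly proportional to
# how many copies of that base were incorporated, so a homopolymer run
# shows up as a multiple of the unit signal.

FLOW_ORDER = "TACG"  # assumed cyclic flow order for this sketch

def decode_flowgram(intensities):
    """Round each flow signal to an integer incorporation count and
    emit that many copies of the flowed base."""
    bases = []
    for i, signal in enumerate(intensities):
        count = int(round(signal))
        bases.append(FLOW_ORDER[i % len(FLOW_ORDER)] * count)
    return "".join(bases)

# Signals near 0 mean no incorporation; ~2.0 means a two-base homopolymer.
print(decode_flowgram([1.02, 0.03, 1.98, 0.97]))  # → "TCCG"
```

The rounding step also hints at the platform’s best-known weak spot: as homopolymers get longer, the signal increments get harder to distinguish, which is where most 454 errors arise.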
454’s second-generation machine — the Genome Sequencer FLX — features several major upgrades over its predecessor, including much improved read-length — up from an average of 100 bases to 240 bases. (Roche and 454 staff at a recent CHI conference hawked T-shirts proclaiming “Length Matters.”) Moreover, 454 has doubled the number of reads per run from 200,000 to just over 400,000. Most importantly, the system has greatly improved accuracy.
“When the system first launched, 454 was very open about its error rate — about 4 percent,” says Tim Harkins, marketing manager at Roche Applied Science. “Now we’re less than 0.5 percent per single 250 base pair read. Unfortunately, the market is still holding us to the original [Nature] publication.” Harkins says that customers have an easier time accepting the longer read lengths than the improved error rate.
Although the FLX costs about $500,000, 454 is upgrading its existing customers as fast as possible, essentially improving the microfluidics while retaining the original CCD camera. At CHI’s inaugural Next-Generation Sequencing conference in San Diego, VP of R&D Michael Egholm declared that 454 had completed the sequencing of the first named human genome using the new GS FLX platform. Fittingly, the subject is James Watson, the co-discoverer of the double helix. Although the sequence is preliminary (just 3.5-fold coverage), Egholm said that Watson’s sequence included variants in 50 known disease-associated genes. To which Watson reportedly said, “Only 50?” (The full analysis will be reported this month at a conference at Cold Spring Harbor Laboratory, where Watson is Chancellor.)
Roche’s acquisition of 454 looks like a bargain compared to the $615 million that Illumina shelled out for Solexa last year. According to Tony Smith, VP of R&D in the U.K., Illumina had received orders for more than 40 Genome Analyzer instruments within 30 days of the completion of the product’s early access period. “What Illumina is now able to offer after its acquisition of Solexa,” says Smith, “is all the various tools that people need to do genome-wide analysis.” That means whole-genome association studies, expression analysis, and sequencing.
Although the shorter individual read lengths of the Illumina 1G Analyzer point to applications in resequencing (where there is an established reference available) rather than de novo sequencing, Smith is delighted with the breadth of initial applications he’s seeing. They include “people sequencing individual genomes, microbes, people looking at individual parts of the genome chasing down disease genes, people studying where protein factors that regulate gene expression actually bind.”
Moreover, groups are looking at ways of deploying Illumina’s Analyzer “in a de novo mode.” Bioinformaticians are developing algorithms and approaches to do de novo assemblies despite the short reads. Smith says the fact that Illumina can generate sequence so inexpensively and copiously has “many informaticians looking to use short sequence reads with paired reads derived from the same platform to enable de novo assembly.” (“Paired ends” provide greater certainty in assembling genomes when each read is fairly short.)
A key issue is “the importance of the quantity/quality equation,” says Smith. Many companies — Illumina included — like to brag about how many gigabases you can generate. “It’s not called the 1G Analyzer for nothing!” he jokes. “But the thing that’s really important is the quantity of data you can generate of high quality... We can get very large amounts of data, but also get over 90% of high quality data. That means you’re not then losing a lot of that data in filtering out.”
Smith continues: “The beauty of our data is that for each base call, there are four fluorescent intensity measurements. That means that the data structure is very similar to what you get from a [Sanger] capillary sequencer.” Smith says quality metrics for Illumina’s data output correlate well with the Phred scores used to grade Sanger data. “So people can use developed tools they have familiarity with, and apply to our data quite easily.” He goes on: “If you want to use sequence from different platforms, it’s easier to actually do that if you can relate the data structures together and be able to equate high quality bases from a capillary instrument with high quality bases from an Illumina instrument.”
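The Phred scale Smith refers to is a simple logarithmic convention. A minimal sketch of the conversion (this is the standard published formula, not Illumina-specific code):

```python
import math

# A base call with estimated error probability p gets quality
# Q = -10 * log10(p): Q20 means a 1-in-100 chance of a miscall,
# Q30 means 1-in-1,000.

def phred(p_error):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Convert a Phred quality score back to an error probability."""
    return 10 ** (-q / 10)

print(phred(0.01))     # → 20.0
print(error_prob(30))  # → 0.001
```

On this scale, 454’s quoted 0.5 percent per-read error rate works out to roughly Q23, while its originally published 4 percent figure was about Q14.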
Meanwhile, this summer Applied Biosystems is preparing to test drive its expensive new vehicle. At the CHI sequencing conference, applications manager Michael Rhodes acknowledged that Applied Biosystems would be a late starter, but declared its SOLiD system would perform like “a Formula One car.” Applied Bio has enjoyed a near monopoly on automated sequencing for the past two decades, ever since company co-founder Leroy Hood devised the first automated fluorescent instrument in the mid-1980s.
Andy Watson, a founding scientist at the Wellcome Trust Sanger Institute, U.K., joined Applied Bio last year as senior director of market development and strategic alliances to chart Applied Bio’s next-generation sequencing strategy. Says Watson: “We looked at over 40 different companies and technologies, and from that made the decision that the APG technology was the one we wanted to build our next business around.” APG was bought from Beckman Coulter in July 2006 for $120 million.
The APG acquisition was guided by the evolution from de novo sequencing to resequencing. “We wanted a technology that was comparing genomes, looking at genomes where there are references,” says Watson. “That pushes you to something that has very high accuracy, very high volume of data being generated, and very much relaxes the need for very long read lengths.”
Watson adds: “We were looking for a technology that could detect larger scale genomic rearrangements, copy number variants (CNVs), translocations, and so on. To do that, you need a technology with this paired-end method. We tried to figure out where the world was going, and then looked through all these technologies and identified a subset of those 40 companies that could meet those needs, could be marketed, and could be integrated well within Applied Bio.”
Two Better Than One
Kevin McKernan, formerly at the Whitehead Institute Genome Center, co-founded Agencourt in 2000. He is now Applied Bio’s senior director of scientific operations, based in Beverly, Mass. He explains that the Supported Oligonucleotide Ligation and Detection (SOLiD) technology is derived from a handful of technologies.
McKernan escorts visitors into a production area where about eight SOLiD instruments are in various stages of assembly and testing. Next to each instrument is a plaque affectionately engraved with the name of a famous female pioneer. There’s Amelia (Earhart), Barbara (McClintock), Rosalind (Franklin), and Florence (Nightingale) among others.
SOLiD sequencing differs from Sanger sequencing and other sequence-by-synthesis platforms both in chemistry and its readout. It uses ligases rather than polymerases to read the sequence. “This allows us to put in an error correction system that is very important for resequencing applications, particularly in cancer,” says McKernan. “When you’re trying to find mutations at a very low frequency, you need to have very high raw accuracy. What we’re finding is the system has tenfold higher accuracy than any other next-generation sequencing system out there.”
There are other advantages. “The density we can achieve with beads is far higher than folks are getting off the other systems,” says McKernan. Denser features produce more information. In fact, the output is expected to hit or even exceed 500 megabases per run.
The SOLiD system, like others, uses four fluorescent tags but, uniquely, adds an ingenious two-base readout system: each ligation step interrogates a pair of adjacent nucleotides. By staggering the ligation steps, each base is interrogated twice (see “Sequence Sudoku,” p. 26).
The color-coding differentiates genetic variants such as SNPs from random sequencing errors. Indeed, Applied Bio thinks the elegant simplicity of the format could change the way sequence is analyzed. Says Rhodes: “We just finished collaboration with [Joint Genome Institute] doing a resequencing project,” in which both Sanger and SOLiD data were generated. “And they converted the Sanger data to color space to take advantage of the rules we have for detecting the miscalls in the SOLiD data.”
McKernan adds: “The [individual] read length on the system is probably the longest out there for any system that generates more than 100 million bases of sequence [per run]. This is primarily an artifact of us delivering people paired-end sequencing.” Every company is developing paired-end methodologies to orient assembly in regions rich in repetitive DNA. According to McKernan, sequencing individual 25mers in humans could only assemble 80 percent of the human genome. “You can get [assembly] up to 93-95% if you have it anchored to another 25mer a couple of thousand bases away.” If an individual 25mer falls within repetitive region of DNA, its anchored mate pair will typically allow its placement within the repeat.
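The mate-pair logic McKernan describes can be illustrated with a toy example; all positions, sizes, and the function name here are hypothetical:

```python
# Toy sketch of mate-pair placement: a 25mer that aligns equally well
# to several copies of a repeat can still be placed if its mate maps
# uniquely a known distance away.

def place_by_mate(candidates, mate_pos, insert_size, tolerance):
    """Keep only candidate positions consistent with the mate's unique
    position, given the library's expected insert size."""
    return [pos for pos in candidates
            if abs(abs(pos - mate_pos) - insert_size) <= tolerance]

# The ambiguous read hits three repeat copies; its mate sits uniquely
# at position 12,000. With a ~2,000-base insert (+/- 300), only one
# of the three candidate placements survives.
hits = [3_500, 14_100, 61_200]
print(place_by_mate(hits, mate_pos=12_000, insert_size=2_000, tolerance=300))
# → [14100]
```

Real assemblers score placements probabilistically rather than with a hard cutoff, but the principle — the anchored mate disambiguates the repeat — is the same.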
A potential downside of the SOLiD setup is the premium it puts on compute power and storage. The complete SOLiD system, including compute and workflow pieces, could push the price above $600,000.
“There’s a computer node that sits under the instrument that handles terabyte image storage, and a handful of other parts for automating the workflow,” explains Rhodes. In fact, it’s a 10-box dual-core CPU Linux cluster (supplied by Dell in the preliminary models) with 15 TB of storage. “We do everything on the machine,” explains Rhodes. “The second the images are taken, we start processing them. Within a couple of hours at the end of the run, we’ve done all the alignment to the reference sequence and it’s ready for you to download off the Ethernet card at the back.” (Five years ago, he says, the cluster would have made the Top 500 supercomputer list.)
“No-one stores the [Sanger] images when they’re small, so who’s going to store them when they’re large? So we want to get you past the images and into analyzing the data,” says Rhodes.
By contrast, a typical 454 FLX run produces a paltry 13 GB of raw image data. “After data extraction, namely base calling, we’re at just less than 20 GB in total. That’s actually quite manageable, especially nowadays with 500-GB hard drives,” says Harkins. “We’re looking to compress that down so potentially you could burn a DVD for one run. You could store an instrument run for a few dollars.”
Harkins notes that other next-generation sequencing platforms are talking about terabytes per run. “We’re talking about pushing the science, but these other companies have a dilemma. It could cost more in computer hardware than reagents for an instrument run,” says Harkins.
While Illumina’s Smith agrees that, “The really big data is in the images,” Illumina offers customers the opportunity to store all of their images, “because there will be people who want to do that. The issue is you get into hundreds of GB or even 1 TB [per run].” And that will only increase in the future. “The customer may decide to store a subset of the images for quality control purposes, or store images for a particularly important run and archive them to a tape backup.”
The question for the market, Harkins reckons, is: Do you want to save your raw data? 454 allows users to re-evaluate their raw data. “We had one customer who re-processed his raw data using the updated GS FLX software and is seeing improvements,” says Harkins. “When you’re talking about 1-2% error down to 0.5%, that leads to tangible improvements for downstream analysis.”
But Rhodes dismisses such criticism. With an instrument potentially pumping out 4 megabases each second over three days, “People don’t need the images, they need the data. What you really want is the result,” says Rhodes.
During a panel discussion at CHI’s Next Generation Sequencing conference, Rhodes said: “Back of the envelope calculations say that if you wanted to store the raw image data, it’s 6 TB a week... that could require you to spend $1 million on storage, backup, and stuff. So unless you think you’re going to want to go back to every image, it’s cheaper to do the experiment again.” Rhodes can see certain situations for storing images, say for a precious cDNA clone. “But as a routine workaday measure, no.”
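Rhodes’ back-of-the-envelope arithmetic holds up under a rough 2007-era storage cost assumption; the dollars-per-terabyte figure below is our assumption, not a number he quoted:

```python
# Back-of-envelope check on the cost of archiving raw image data at
# the rate Rhodes cites (6 TB per week, running year-round).

tb_per_week = 6
weeks_per_year = 52
cost_per_tb = 3_000  # assumed 2007-era $/TB for managed, backed-up storage

annual_tb = tb_per_week * weeks_per_year   # terabytes accumulated per year
annual_cost = annual_tb * cost_per_tb      # dollars per year

print(annual_tb, annual_cost)  # → 312 936000
```

Roughly 300 TB and close to $1 million a year, which is why “it’s cheaper to do the experiment again” than to keep every image.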
“Once you’ve got to that stage, you still have a large dataset — if you’re going to generate 1 billion bases per run, you’ve got to have quite a lot of bytes as well as bases,” says Smith. “But you’re no longer in the TB of data, you’re back down in the 100 GB or so. So you can reduce the data quantity by not storing the images.” Smith says many customers already have the necessary compute infrastructure. “But for a lab looking to adapt this technology, we’re recommending that they acquire a compute cluster that has say 8 nodes and 10 TB of disc capacity.” That’s the sort of compute need that companies such as SGI are eagerly looking to feed (see “Green Solutions,” p. 30). Others, such as DNAStar, see opportunities in sequence analysis software (see “Sequence Gazing”).
But Harkins says the market hasn’t come to terms with the dilemma of paying $10,000-20,000 to save a single instrument run’s data. “That’s going to put the market into a bind,” he says. “Throwing raw data away is a paradigm shift I don’t think people are ready for yet.”
Start Your Engines
Applied Bio may fancy its chances of lapping the field when the SOLiD instrument debuts next year, but there will be other competition. Helicos BioSciences, based in Cambridge, Mass., recently filed for its initial public offering (and thus declined to comment for this story). CEO Stanley Lapidus has said the company’s HeliScope instrument, scheduled to debut late in 2007, will be able to sequence a full human genome for as little as $5,000.
Helicos was founded in 2003 by a group including Lapidus, Eric Lander, and Stephen Quake, an HHMI/Stanford biophysicist. Helicos’ “true single-molecule sequencing (tSMS)” platform is based on technology published by Quake and coworkers in 2003. Requiring no amplification steps, it begins with about 1.2 billion single-stranded templates, a few hundred bases in length, which are attached to a glass slide. Each newly incorporated base is detected by a fluorescent pulse of light. A computer analyzes the images to determine the sequence of each short strand.
Helicos sees a plethora of applications in whole tumor resequencing, transcriptional profiling, methylation, and candidate region resequencing. The lack of PCR amplification reduces errors, as well as dephasing issues. The HeliScope will produce ~10⁹ bases per day, and could reach ~10⁹ bases per hour with subsequent enhancements. Earlier this year, Helicos said it had developed novel base-polymerase formulations to aid in the sequencing of highly repetitive DNA tracts. Lapidus called it “an important accomplishment in Helicos’ endeavor to deliver a high-performance single molecule sequencing platform.”
In Houston, VisiGen’s Susan Hardin is working on a “third or even fourth generation” technology, which won’t yield a commercial instrument for at least a few years. “Our technology involves modifying the DNA polymerase with a donor fluorophore, and putting an acceptor fluorophore on the gamma-phosphate group. Then we monitor what the polymerase does in real time,” says Hardin. But her sights are set on delivering very affordable human genome sequence. “We’re the [future] $1000-genome-in-a-day technology, and nobody can offer that yet,” she says.
The Finish Line
It may yet take a few years, but most if not all of the new platform companies will likely be contenders for the Archon Genomics X Prize. This $10 million bounty will be awarded to the first group that can sequence 100 human genomes over ten days for $10,000 apiece. (The prize money was put up by Canadian geologist Stewart Blusson, who discovered the world’s third-largest diamond mine.)
But experts and contenders alike are confused by the position of the finish line. Harkins says there’s a conundrum: “If you look at what’s in the public domain, there’s a double standard. The public domain reference sequence would not be passable for the X Prize’s current definition of the $10,000 genome. You want a full genome by today’s public reference genome, but today’s [sequence] is not complete. However, we do appreciate that the X Prize sets the bar high — not just for a SNP survey, but insists on characterization of all variation, such as structural variations.”
McKernan agrees: “The goal for sequencing the human genome exceeds the existing quality of the human genome as it is currently entered into GenBank; there’s no way to know if anyone’s really won. Until that gets clarified in the next few years, it’s not really worth thinking about.”
“What you’re looking at right now is something in the $0.5-1 million genome range,” says Harkins. “We have so much more biology to uncover before we get to that $1000 level.”
Watson says Applied Bio wants to “win in the different market segments where we want to win, where we believe we should win.” That does not include de novo sequencing. “If you need 500-base reads to do a good de novo metagenomic study, that’s clearly somewhere we’re not going to be pushing our technology to win,” he says.
Applied Bio is lining up early access partners this summer, but won’t formally introduce SOLiD until early 2008. Rhodes waves off criticism of the lag: “We bought APG in July 2006, and we’ve already got a box ready to go out. It’s just unbelievable!”
Watson also dismisses the suggestion that Applied Bio is playing catch up. “We’re going strong with this technology, there’s an enormous effort behind it... This time next year, you won’t be saying we’re so far behind.”
Subscribe to Bio-IT World magazine.