By Kevin Davies
February 6, 2009 | MARCO ISLAND, Florida – Four months after stunning the next-generation sequencing community with its ambitious plans to launch a cut-price genome sequencing service, Complete Genomics CEO Clifford Reid presented details of its first human genome assembly on Thursday evening at the Advances in Genome Biology and Technology conference.
Reid’s much anticipated talk focused on the company’s first human genome assembly – a project that was disclosed last October, even though the firm’s scientific advisory board had yet to review it. In total, Reid said Complete had sequenced the genome of an anonymous Caucasian male (one of the HapMap samples) to 90-fold coverage, generating a total of 630 gigabases (Gb) of sequence, of which about 250 Gb have been mapped to the reference genome sequence. “Not an interesting individual, but one where we had some good HapMap data so we could at least measure ground truth,” Reid told Bio-IT World in a briefing prior to his talk. The proportion of mapped data would have been larger but for a power cut that temporarily shut down the firm’s sequencing facility last Christmas.
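As a back-of-envelope check on those figures, the mapped fraction and the implied fold coverage can be worked out directly. The sketch below is illustrative only: the article does not say what denominator Complete used for its 90-fold figure, so the 2.85 Gb genome size is an assumption (roughly the non-gap portion of the reference), and the function names are hypothetical.

```python
# Rough arithmetic on the sequencing yield quoted in the article.
# ASSUMED_GENOME_GB is an assumption, not a figure from Complete Genomics.

TOTAL_GB = 630.0          # total sequence generated (from the article)
MAPPED_GB = 250.0         # sequence mapped to the reference (from the article)
ASSUMED_GENOME_GB = 2.85  # assumed size of the mappable reference, in Gb


def mapped_fraction(mapped_gb: float, total_gb: float) -> float:
    """Share of generated sequence that mapped to the reference."""
    return mapped_gb / total_gb


def fold_coverage(mapped_gb: float, genome_size_gb: float) -> float:
    """Average number of times each reference base is covered by mapped sequence."""
    return mapped_gb / genome_size_gb


if __name__ == "__main__":
    print(f"Mapped fraction: {mapped_fraction(MAPPED_GB, TOTAL_GB):.1%}")
    print(f"Implied coverage: {fold_coverage(MAPPED_GB, ASSUMED_GENOME_GB):.0f}-fold")
```

Under that assumption, about 40% of the raw sequence mapped, and the mapped data alone works out to roughly 88-fold coverage, close to the 90-fold figure Reid quoted.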
Reid said the company submitted a disc drive with the raw data to the National Center for Biotechnology Information (NCBI) earlier this week, and invited fellow researchers to perform their own genome assemblies. If they can’t download the full file (20 days on a T1 line!), Complete Genomics will FedEx a disc.
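For a sense of scale, the "20 days on a T1 line" remark can be turned into an approximate file size. The sketch below assumes the nominal T1 line rate of 1.544 Mbit/s and ignores protocol overhead; the article does not state the actual size of the data set, so the result is only what the quip implies, and the function names are hypothetical.

```python
# How much data moves over a T1 line in a given number of days?
# Assumes the nominal T1 rate and no protocol overhead (both simplifications).

T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_transferred(days: float, bits_per_second: int = T1_BITS_PER_SECOND) -> float:
    """Bytes that can be moved in `days` at a sustained line rate."""
    return days * SECONDS_PER_DAY * bits_per_second / 8


def days_to_transfer(num_bytes: float, bits_per_second: int = T1_BITS_PER_SECOND) -> float:
    """Days needed to move `num_bytes` at a sustained line rate."""
    return num_bytes * 8 / bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    implied = bytes_transferred(20)
    print(f"20 days on a T1 line is about {implied / 1e9:.0f} GB")
```

On those assumptions, 20 days of sustained transfer corresponds to roughly a third of a terabyte, which helps explain the offer to ship a disc instead.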
The sequence covered 92% of the genome. “The 8% we didn’t sequence is the 8% you’d expect – the long repeats, the telomeres and centromeres, the places short-read sequencing doesn’t reach into very well,” said Reid. The single-read accuracy was “about 0.34% discordant.” That is, Complete sees a base at odds with the reference genome about 1 in 300. “That’s pretty good for raw read accuracy,” he said. The assembled discordance is tougher to measure, said Reid, but after selecting a high-quality subset of the HapMap database, they found 170 discordances, which were confirmed by Sanger sequencing. Reid said the assembly concordance exceeds 99.99%. “It’s obviously a very high quality genome,” he said.
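The accuracy figures above are easier to compare when put on a common scale. The sketch below converts a discordance rate to a "1 in N" figure and to a Phred-style score. Treating discordance with the reference as an error rate is an assumption made here for illustration: some discordant bases are genuine variants, so the true error rate would be somewhat lower. The function names are hypothetical.

```python
import math

# Convert discordance rates from the article into "1 in N" and Phred-style scores.
# Assumption: discordance is treated as if it were an error rate.

RAW_READ_DISCORDANCE = 0.0034   # "about 0.34% discordant" (from the article)
ASSEMBLY_DISCORDANCE = 0.0001   # implied by "concordance exceeds 99.99%"


def one_in_n(rate: float) -> float:
    """Express a rate as 'one event per N bases'."""
    return 1.0 / rate


def phred_equivalent(rate: float) -> float:
    """Phred-scaled score, Q = -10 * log10(rate)."""
    return -10.0 * math.log10(rate)


if __name__ == "__main__":
    for label, rate in [("raw reads", RAW_READ_DISCORDANCE),
                        ("assembly", ASSEMBLY_DISCORDANCE)]:
        print(f"{label}: 1 in {one_in_n(rate):.0f}, about Q{phred_equivalent(rate):.1f}")
```

The raw-read figure comes out at about 1 in 294, consistent with Reid's "about 1 in 300," while the assembly figure corresponds to at most 1 discordance in 10,000 bases.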
Co-founder and chief science officer Rade Drmanac told Bio-IT World the quality of the assembly compares favorably to other published genomes. “The 92% [coverage] is exactly what can be expected,” he said. “The [missing] 8% is long repeats and segmental duplications.” Drmanac added that his group is improving the long-fragment read technology, simplifying the sample preparation, and initiating automation. “We’re excited about the progress with library preparation, how much more efficient things are.”
The analysis revealed 3.3 million single nucleotide polymorphisms (SNPs), including 350,000 novel variants. “This one has the exact same number of SNPs as Watson’s genome – 3.3 million – unbelievable!” said Drmanac. He added, “This is either the fifth or sixth genome with published reads in the NCBI. I think in addition to demonstrating our technology, it’s a great contribution.” A manuscript is in preparation.
Boys of Summer
Reid insists that Complete remains on course to meet the initial goals outlined last October: to sequence 1,000 human genomes in 2009 and 20,000 in 2010, although he admits the schedule will be tough. “Our plan is still to release genomes at 40X coverage,” Reid said. “Our pricing remains the same. We plan to ship genomes at $5000 each.” However, Reid said the exact price schedule would not be announced until June. “We’ll figure out exactly what’s included, the volume discounts, that sort of thing,” said Reid. “We know the target price is $5000. Our materials cost of the last genome was just under $4000. The materials cost is coming down nicely as we automate and wring out the waste in the system.” The company has just started to sequence samples in collaboration with Lee Hood, and Reid also announced an agreement with the Broad Institute to sequence five genomes.
“There’s this big Gantt chart on the wall, a lot of things have to happen. [It will take] a combination of more instruments and further speed improvements. The first commercial generation of instruments we ship should be able to do 1000 genomes this year… The jeopardy would be we don’t have floor space to put them in!” To that end, Complete Genomics just leased an additional 32,000 square feet of lab and office space to house the genome center, and aims to have the center operational in August. It will include a new generation of faster sequencing instruments. Said Reid: “The current instruments we’re running are R&D boxes. They’re not as fast as the commercial systems. We’ve got two prototypes of commercial systems coming up.” The R&D instruments are running at about 70 Gb/run. “For a $5000 genome, these boxes need to get faster still,” said Reid.
“Once we have this genome center operational, we remain committed to the plan of building additional genome centers around the world,” said Reid. “We expect to put those all over the place. That’s how we’ll expand the organization, rather than trying to build one monolithic genome center in Mountain View, California – a reasonably dumb place to put a genome center given the costs down here.”
Complete identified some 400,000 short indels [insertions/deletions] using its own proprietary software, but Reid admits there is room for improvement. “The assembly software does not today call large structural variations,” he acknowledged. “That’s one of our next high priority projects – to tease out of the datasets major structural rearrangements, inversions, translocations etc.” Reid calls it “a strategic commitment to write the assembly software that spans the spectrum of variance detection from SNPs to assembling a cancer genome.”
From a current throughput of 70 Gb/run, Reid said in his talk that he is aiming for 200 Gb/run by this June and 600 Gb/run by the end of the year. Sequencing costs would come down with scale; the largest line item was currently computing, but imaging would eventually overtake it. Reid showed a movie of the DNA nanoballs dropping into the gridded wells like tiny roulette balls. The final capacity will be 1 billion wells. By 2010, the data center would contain 60,000 cores and 30 petabytes of disk storage. Reid said it would be half the size of the largest computing center in the world.
In a statement, Complete Genomics advisor and Harvard Medical School professor George Church said the genome assembly was “a major achievement” that “surpassed expectations.” Having reviewed the data, Church said his team had “confirmed that it falls in line with what is expected of an individual genome. It is highly concordant with previously published work on this genome and with data from public variation repositories.”
Reid concluded his presentation by stating his goal was to offer “a turnkey solution for the scientific community.” The focus would be entirely on human genomes – not mouse or any other model organism. “Send us your samples,” he told the audience. “We’ll sequence them, we’ll assemble them, we’ll generate the variants list, and we’ll send it back to you quickly. When we’re doing your assembly, 60,000 processors are going to light up!”
“We’re a wholesaler of complete human genomes to the scientific community. We have no intention of writing NIH grants,” said Reid, adding he planned to partner with genome centers such as the Broad, research centers such as the Institute for Systems Biology, and the direct-to-consumer companies. The five-year mission was to build ten genome centers around the world that would sequence 1 million genomes in that period.
“We’re trying to make sequencing completely ubiquitous.”