
DNA Nanoballs, Petabytes of Data Mark Complete Genomics Platform


By Kevin Davies

Oct. 6, 2008 | With the unveiling today of the next phase of the next-generation sequencing era by Complete Genomics (see accompanying Bio-IT World story) comes intense scrutiny of the sequencing-by-hybridization (SBH) strategy that the company says will deliver one million genome sequences in the next five years. The technology hinges on exquisite precision in manufacturing and arraying “nanoballs” of DNA as well as the ability to manage gargantuan quantities of data.

Last week, Bio-IT World spoke to the two men directing the sequencing and data handling aspects of Complete Genomics’ new sequencing service.

Rade Drmanac, co-founder and chief science officer of Complete Genomics, has pioneered SBH approaches to genome analysis for two decades, most notably co-founding HySeq (now part of Nuvelo). A decade ago the approach was premature, but “I think now is the right time for this technology,” he says, adding that his team has kept faith in its advantages while working to remove its limitations.

[Image: DNA nanoball array]
Drmanac says the new SBH platform is “a real simple high-throughput technology” that measures millions of spots of DNA in parallel. A second advantage is that the individual base reads are gathered independently rather than by building a chain. “Every probe is read, washed and removed, so we don’t have any chaining,” a process that tends to accumulate errors as the fragment grows.

A Billion Nanoballs
Like the sequencing technologies used by Applied Biosystems and the Polonator, Complete Genomics uses a ligation strategy rather than DNA polymerase. But Drmanac says the method, dubbed cPAL (combinatorial probe-anchor ligation), is based entirely on his group’s research. “It’s all [intellectual property] my team has evolved or developed in the last two years. We didn’t license any other IP,” he says.

The first step is to prepare a gridded array of up to a billion DNA nanoballs, or DNBs. These DNBs are concatemers of 80-base-pair (bp) mate-paired fragments of genomic DNA, punctuated with synthetic DNA adapters. The 80-bp fragments are derived from a pair of roughly 40-bp fragments that reside a known distance apart in the genome (say 500 bases or 10,000 bases). “We insert an adapter to break the 40 bases into 20 bases or 25 bases,” which acts like “a zip code or address into the DNA,” says Drmanac.
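For orientation, here is a purely schematic Python sketch of how one DNB monomer might be laid out based on that description; the adapter length and exact split points are our illustrative assumptions, not published specifications.

```python
# Purely schematic layout of one DNB monomer, as we read the description above.
# Adapter length and split points are illustrative assumptions; only the two
# ~40-base genomic arms and the four inserted adapters come from the article.
def monomer_layout(arm_len=40, split=20, adapter_len=25):
    g, a = "G", "a"  # G = genomic base, a = synthetic adapter base (placeholders)
    # Each ~40-base arm is broken by an inserted adapter ("a zip code or address")
    arm = g * split + a * adapter_len + g * (arm_len - split)
    # Two mate-paired arms preceded by adapters; in the concatemer, the next
    # monomer's leading adapter follows, giving four adapters per repeat
    return a * adapter_len + arm + a * adapter_len + arm

layout = monomer_layout()
print(layout.count("G"))        # 80 genomic bases per monomer
print(layout.count("a") // 25)  # 4 adapters per monomer
```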

The sample preparation amplifies the DNA templates in solution rather than in an emulsion or on a platform. It produces about 10 billion DNBs – each about 290 nm in diameter – in just 1 ml of solution. “We spent lots of energy to make them small and sticky to the surface,” says Drmanac. The DNBs are spread onto a surface gridded with 300-nm-diameter wells (prepared using photolithography) spaced just 1 micron apart. The DNBs settle into the wells like so many balls dropping into the pockets of a roulette wheel.

“The DNA nanoballs are negatively charged, so when the first DNA nanoball gets to the surface (usually positively charged) and sticks to the surface, it repels all other DNBs that come to that spot,” explains Drmanac. “We have that exclusion principle working: without that, 33 percent of the spots will have single DNBs, 33 percent will be empty, and the other third will be doublets or triplets. In our case, we have 90-95 percent with a single DNB per spot.”
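The “33 percent” figures Drmanac cites are roughly what Poisson statistics predict for random loading at an average of one DNB per well; a quick back-of-envelope check (our arithmetic, not the company’s) bears that out.

```python
# Random (Poisson) loading at an average of one DNB per well, for comparison
# with the exclusion-effect figures quoted above; our back-of-envelope check.
from math import exp, factorial

mean_per_well = 1.0
p = lambda k: exp(-mean_per_well) * mean_per_well ** k / factorial(k)

print(f"empty:    {p(0):.0%}")             # ~37% of wells empty
print(f"single:   {p(1):.0%}")             # ~37% with exactly one DNB
print(f"multiple: {1 - p(0) - p(1):.0%}")  # ~26% doublets or more
```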

In most other next-gen technologies, Drmanac says, “the spots are randomly distributed on the surface. Focusing on the cost, we knew that’s not good enough for medical applications. You lose a factor of five if you do that. The imaging is dramatically more expensive if you don’t have a gridded array.”

The latest Complete Genomics array features one billion spots at 1-micron spacing on a standard microscope slide, for a theoretical sequencing capacity of 70 gigabases per slide (a run takes about a week). The initial human genome that Complete Genomics says it sequenced last summer used slides containing 350 million spots.

Typically, sequencing methods using DNA ligase manage individual reads of just six bases from the ligation site. The cPAL method extends this range to 10 contiguous bases. The four adapters inserted into each concatemerized DNB each have two ligation sites, affording reads of multiple adjacent 10-base DNA segments. Pools of probes correspond to each queried base position; the identity of each queried base is read out by the matching fluorescent tag at the end of the probe.
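Those numbers hang together: four adapters with two ligation sites each, at up to 10 bases per site, is enough to cover the 80 bp of genomic DNA in each DNB, and roughly 70 bases actually read per spot matches the 70-gigabase slide capacity quoted above. A quick sanity check (our arithmetic, not the company’s):

```python
# Our back-of-envelope arithmetic tying the quoted figures together.
adapters = 4
ligation_sites_per_adapter = 2
max_bases_per_site = 10

max_bases_per_dnb = adapters * ligation_sites_per_adapter * max_bases_per_site
print(max_bases_per_dnb)  # 80 -- enough to cover the 80 bp of genomic DNA per DNB

spots_per_slide = 1_000_000_000
bases_read_per_dnb = 70   # consistent with the paired ~35-base reads cited later
print(spots_per_slide * bases_read_per_dnb / 1e9)  # ~70 Gb per slide, as quoted
```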

The cPAL protocol works something like this: after an anchor is hybridized to the first DNB adapter, 10 pools of probes are added in succession, one for each base position being interrogated. Each pool contains probes for all four possible bases at that position, each carrying a matching fluorescent tag. Only the probe matching the template is ligated, thanks to the specificity of the DNA ligase. Next, all non-ligated probes are washed away and an image is recorded.

Crucially, Drmanac says, the entire ligation complex – anchor and tag – is then removed. “We get this clean slate – we’re back at the beginning and we restart the process. We’re ready for position two.” Moreover, the SBH strategy allows bases to be read in any order with equal accuracy. Under optimal conditions, the company says its error rate is less than 0.1 percent.
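A minimal sketch of the cycle logic, as we understand the description above, illustrates why the reads are order-independent; the chemistry is abstracted away and the function names are ours, not Complete Genomics’ software.

```python
# Minimal sketch of the cPAL cycle logic as described above: each queried
# position is an independent hybridize-ligate-image-strip cycle, so nothing
# is chained and positions can be read in any order.
import random

def interrogate(template, position):
    """One cycle: anchor plus probe pool query a single base; only the matching
    probe is ligated and its fluorescent tag imaged; then the whole complex is
    stripped, leaving a clean slate for the next position."""
    return template[position]

def read_positions(template, n_positions=10):
    calls = {}
    # Read the positions in a random order to illustrate order-independence
    for pos in random.sample(range(n_positions), n_positions):
        calls[pos] = interrogate(template, pos)
    return "".join(calls[i] for i in range(n_positions))

template = "".join(random.choice("ACGT") for _ in range(10))
assert read_positions(template) == template  # same answer regardless of read order
```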

Data Dilemma
The task of building the data center to manage the SBH data and assemble genome sequences falls to vice president of software Bruce Martin, a former executive with Sun and Openwave who was between start-ups when recruited.

“I’ve built a team that is a little microcosm of what you see in the rest of the company,” says Martin. His “formidable team” includes bioinformaticians who worked with Craig Venter on genome assembly and the HapMap project, as well as experts in data mining, database indexing, and high-throughput computing.

The imaging steps involve measuring hundreds of millions of spots. “We are currently generating close to a gigabit a second off the imager, and that’s going to go up by a substantial amount in the next year,” says Martin. “If you think about multiplying that times the number of sequencers, I have not only an extremely interesting computational challenge here, but there’s just a bandwidth problem… You can’t store images at that rate onto disk drives without spending a king’s ransom in terms of storage.”

The images themselves “are a transient intermediate artifact that comes off the camera, and they get processed down to intensities and base calls,” says Martin. “Frankly, at these rates, over a short number of days, you’re shooting 50, 60, 70 terabytes of images. They don’t have a huge amount of utility.”
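The arithmetic checks out: a near-gigabit-per-second image stream sustained over a roughly week-long run lands squarely in the tens of terabytes. A rough estimate (ours, not Martin’s):

```python
# Rough check of the image-data volumes quoted above; our arithmetic.
gbit_per_s = 1.0                    # "close to a gigabit a second" off the imager
bytes_per_s = gbit_per_s * 1e9 / 8  # 125 MB/s

tb_per_day = bytes_per_s * 86_400 / 1e12
print(f"{tb_per_day:.1f} TB of images per day")      # ~10.8 TB/day
print(f"{tb_per_day * 6:.0f} TB over a ~6-day run")  # ~65 TB, in line with 50-70 TB
```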

Martin says his group has had “a very successful run” with a clustered storage system from Isilon, which he likes for its “very high performance” and ability to scale to multi-petabyte file systems. “You can manage it with a very small footprint of staff. The Broad [Institute] recently deployed them as well. I couldn’t say who got there first. We both basically have selected them for similar reasons.”

Due to space, power, and cooling considerations, Martin says he’s exploring options with several high-density blade vendors. “We want to pack as many cores and as much memory into as small a footprint as we can for economic reasons,” says Martin. He says he’s looking at “all the normal suspects,” but doesn’t have a favorite so far.

Martin says he’s made “a significant investment in an aligner” for rapid genome alignments that can scale to thousands of processors. “I went out and found some very significant expertise in Silicon Valley in terms of high-speed, large-scale search and indexing. We have many of the leading companies in the world in that area.”

Further priorities are to optimize the algorithms to extract additional signal and quality from the data sets, to the benefit of the downstream resequencing assembler. But Martin says the paired 35-base reads – as other platforms have shown – are sufficient to produce a satisfactory genome alignment. “It’s not a fundamentally groundbreaking thing.”

If the ramp-up for 2009 sounds daunting – 1,000 genomes in a center housing 5 petabytes of data – the specs for sequencing 20,000 genomes in 2010 are positively frightening. “We’ll probably be in the 60,000-processor and 30-petabyte range in that time frame,” says Martin.
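Taken at face value, those figures imply the per-genome storage footprint shrinking from roughly 5 terabytes to about 1.5 terabytes between 2009 and 2010, presumably as transient images are discarded in favor of intensities and base calls. A quick calculation (our arithmetic, not numbers Martin quoted):

```python
# Implied per-genome storage footprint from the figures above; our arithmetic.
pb_2009, genomes_2009 = 5, 1_000
pb_2010, genomes_2010 = 30, 20_000

print(f"2009: {pb_2009 * 1000 / genomes_2009:.1f} TB per genome")  # 5.0 TB/genome
print(f"2010: {pb_2010 * 1000 / genomes_2010:.1f} TB per genome")  # 1.5 TB/genome
```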

___________________________________________________

This article appeared in Bio-IT World Magazine.
