June 8, 2011 | Third-generation sequencing company Pacific Biosciences (PacBio) began commercial shipment of its PacBio RS single-molecule sequencer in April. The instrument has been in beta testing at 11 institutions in North America and elsewhere for the past year. A notable success was the recent sequencing and identification of the cholera strain sweeping Haiti after the devastating 2010 earthquake.
PacBio staffers Kevin Corcoran, Jon Sorenson and Edwin Hauw previewed the new suite of software tools on the RS sequencer. The SMRT (single molecule/real time) Analysis software suite features web-based software, an analysis pipeline framework, and algorithms for sequence alignment and de novo assembly.
“We’re accelerating the development of software with the community,” says Kevin Corcoran. “A key feature of third-generation sequencing is that [the technology] doesn’t match up with what’s out there now. The key features of the PacBio system include fast time to result, high granularity, long read lengths, and new sequencing modes, including a circular mode and strobe sequencing.”
A revelatory feature of the SMRT software portal is that it captures kinetic information—the time for each registered nucleotide to be captured and incorporated into the growing DNA strand. “This is the first time you can watch DNA polymerase in real time, so that kinetic information will provide additional applications that have never been enabled before,” says Corcoran.
The SMRT View genome browser “takes advantage of our longer reads and kinetic information,” says Sorenson. It includes strobe and consensus sequence modes, allowing the user to visualize and interact with secondary analysis sequence data. Sorenson demonstrated the kinetic data by displaying sequence from a yeast genome: a color-coded “heat map” shows the time between successive pulses, a pattern that appears highly reproducible across multiple reads. Independent runs across a given stretch of sequence can be stacked on top of each other to compare the kinetics of incorporation.
“There is sequence context to the kinetics. Much of that is short range but it is reproducible,” says Sorenson. In time, such information could in principle allow PacBio to increase sequence accuracy and to study epigenetic effects by detecting modified nucleotides.
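The idea of stacking runs to compare incorporation kinetics can be illustrated with a minimal Python sketch. It is not PacBio's code: the function names and the pulse times below are hypothetical, and it assumes each read covers the same stretch of sequence with one pulse per base.

```python
from statistics import mean

def interpulse_durations(pulse_times):
    """Return the time gaps between successive pulses in one read."""
    return [later - earlier for earlier, later in zip(pulse_times, pulse_times[1:])]

def stack_reads(reads):
    """Average the inter-pulse durations, position by position, across reads."""
    per_read = [interpulse_durations(times) for times in reads]
    return [mean(column) for column in zip(*per_read)]

# Hypothetical pulse start times (seconds) for two reads of the same stretch.
reads = [
    [0.0, 0.5, 1.0, 3.0, 3.5],
    [0.0, 0.25, 0.75, 2.75, 3.5],
]

stacked = stack_reads(reads)
print(stacked)  # the third gap is long in both reads
```

A consistently long gap at one position across independent reads is the kind of signal that, in principle, could point to sequence context or a modified nucleotide.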
The SMRT Portal is an open-source, browser-based application that supports standard sequence formats. That means that next-generation sequencing (NGS) data from other platforms, such as Illumina, Life Technologies, and Roche, can be integrated with PacBio data. This hybrid approach has been used, says Sorenson, to close difficult gaps in some microbial assemblies. “We haven’t done whole genome human hybrids yet, but we’re moving towards that,” he says.
PacBio exports sequence data in the SAM/BAM format for sequence alignments. It also uses VCF, the variant call format adopted by the 1000 Genomes Project. SMRT Portal enables third-party software analysis and collaboration, facilitated by a Python-based framework for secondary analysis functions called SMRT Pipe.
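To show why a standard format eases third-party analysis, here is a minimal Python sketch that reads one SAM alignment line into its 11 mandatory tab-separated fields. It is not part of SMRT Pipe; the `SamRecord` type, the `parse_sam_line` helper, and the example record are hypothetical.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)

def parse_sam_line(line):
    """Parse one alignment line; optional tag fields are ignored here."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)

# Hypothetical record, for illustration only.
example = "read_1\t0\tchr1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar)
```

Because the fields are fixed by the format rather than by the instrument, the same parser would apply to alignments from any platform that writes SAM.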
“We embrace openness,” says Sorenson about PacBio’s decision to make the entire secondary analysis software open source. “We have a DevNet web site, a developer-based site for getting data and information. Our APIs are very modular in how we approach the system. We want to work with ISVs and academic collaborators to either plug in their own tools or, vice versa, to promote connectivity and cooperation.”
Among the algorithms on offer is BLASR (Basic Local Alignment with Successive Refinement), which aligns sequences against the human reference genome. “It’s based on widely used strategies, but more a synthesis of several different strategies,” says Sorenson. “It can align more than several hundred megabases in an hour on a multi-core machine. It’s very fast.”
ALLORA (A Long Read Assembler) is a de novo assembler, based on an open-source package called AMOS. “We used parts of AMOS and other parts customized to our particular read types,” says Sorenson. The EviCons tool is used for consensus calling, matching sequence calls to the reference and assessing whether a given call is an error or a polymorphism.
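The question a consensus caller faces, whether a mismatch against the reference is an error or a polymorphism, can be sketched in a few lines of Python. This is not the EviCons algorithm, which the article does not describe; it is a simple vote over aligned reads, and the `min_alt_fraction` threshold, function name, and toy reads are assumptions made for illustration.

```python
from collections import Counter

def classify_column(ref_base, read_bases, min_alt_fraction=0.75):
    """Classify one aligned column by how many reads disagree with the reference."""
    alt_counts = Counter(base for base in read_bases if base != ref_base)
    if not alt_counts:
        return ref_base, "matches reference"
    alt_base, alt_support = alt_counts.most_common(1)[0]
    if alt_support / len(read_bases) >= min_alt_fraction:
        # Most reads agree on the same non-reference base.
        return alt_base, "candidate polymorphism"
    # Only a minority of reads disagree, so keep the reference call.
    return ref_base, "probable sequencing error"

# Toy example: reads already aligned to the reference, with no gaps.
reference = "ACGT"
reads = ["ACGT", "ATGT", "ATGT", "ATGT", "ATGA"]

calls = [classify_column(ref, column) for ref, column in zip(reference, zip(*reads))]
for position, (base, label) in enumerate(calls, start=1):
    print(position, base, label)
```

Here the second column is called as a candidate polymorphism because four of five reads carry the same alternative base, while the lone mismatch in the last column is treated as a probable error. A real caller would also weigh base qualities and handle insertions and deletions.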
Several notable NGS software providers have signed up to be ISV partners with PacBio, including CLC bio, Geospiza, GenoLogics, GenomeQuest, DNASTAR, and The BioTeam, not to mention Amazon Web Services.
The first commercial PacBio RS units are being shipped to sites including biotechnology companies, service providers, government and academic organizations. The National Biodefense Analysis and Countermeasures Center (part of the Department of Homeland Security) will use its new instrument for characterizing microbial pathogens.
The Wellcome Trust Sanger Institute (UK) has upgraded to the commercial hardware. Harold Swerdlow, director of sequencing technology at the Sanger, says he intends to use the instrument “to improve pathogen de novo assemblies, to increase the coverage of sequence information from organisms like the malaria parasite... [and] to explore epigenetics via direct detection of methylated sites.”