PacBio Users Share New Tools and Applications at Meeting in Baltimore
By Bio-IT World Staff
June 19, 2014 | This Tuesday, the Institute for Genome Sciences (IGS) at the University of Maryland, Baltimore played host to the annual East Coast user group meeting for Pacific Biosciences. While PacBio has never been in a position to challenge market leader Illumina on cost or throughput, its SMRT sequencer is the most differentiated device currently being sold, thanks to read lengths more than an order of magnitude longer than those of its competitors. With over one hundred SMRT sequencers now running around the world, geneticists and bioinformaticians are reporting new uses for PacBio’s technology faster than ever before. The ability to sequence genomes in fragments thousands of base pairs long has opened up projects that would be monumentally difficult, if not impossible, with standard next-generation sequencing.
The projects described at the PacBio user group meeting may be of particular interest now, as a second long-reading instrument, Oxford Nanopore’s MinION, is undergoing careful scrutiny in its early access period.
Whole Genomes from Scratch
One of the fastest-growing applications of SMRT sequencing has been de novo assembly of whole genomes, especially in microbes. “It was only last year that we published the first paper that discussed PacBio de novo assembly methods,” remembered Luke Hickey, PacBio’s Director of Business Development, in his address at the meeting. “Since then, there’s been a tremendous amount of work done.”
While short-read technologies almost always rely on a reference genome to place reads in context, longer reads make it dramatically easier to piece a genome together from scratch. As Luke Tallon, Scientific Director of the IGS, observed, at the “golden threshold” of five- to seven-kilobase reads, it becomes possible to build an entire E. coli genome in one end-to-end contig, eliminating gaps. This is well within the capabilities of the current PacBio chemistry, which usually delivers half its data in reads over 10 kilobases long. De novo assembly is more reliable in catching structural variation than reference-guided assembly, and is not vulnerable to mistakes in the reference itself.
Tallon and his colleagues at the IGS have been relying heavily on the Institute’s SMRT Sequencer for an ongoing project to build new reference genomes of clinically relevant microbes for the NCBI’s GenBank database. In one early batch of 50 different Staphylococcus aureus samples, the IGS was able to assemble 32 of the samples in single contigs, a level of completeness that makes it easier to validate genomes. The IGS will be expanding this project to over 550 different microbes, as reliable long-read data makes it possible to speed up the process of creating new reference genomes.
While the IGS works to improve the back-end resources available to microbiologists, other groups are bringing SMRT sequencing into more active settings. Sean Conlan of the National Human Genome Research Institute reported on a study of a carbapenem-resistant Klebsiella pneumoniae outbreak. Like many outbreaks, this one was complex in its origins and transmission — and further complicated by the presence of dozens of plasmids, including at least two that carried genes for carbapenem resistance.
To trace the relationships between bacteria and plasmids isolated from different patients, Conlan’s group had to sequence a large number of samples, covering enough of the genome that similarities between key regions of unrelated plasmids and chromosomes would not be misleading. PacBio instruments let the team reliably close whole plasmids and chromosomes, capturing all the data they needed to trace the outbreak. Previously, this work had been done through a painstaking process that integrated short-read data, targeted PCR and optical mapping.
In one case, Conlan believes his team’s data may have overturned the reference sequence for a complex, repetitive region in a key antibiotic resistance plasmid. “You start to doubt your references when you’re dealing with data of this quality,” he said.
Scaling to Human
While bacteria and archaea, with their small, haploid genomes, still lead the way in de novo sequencing, PacBio users are also turning their attention to more genetically complex organisms. PacBio itself has released reads for the Drosophila, spinach, goat, and human genomes, among others. The human genome in particular is an important frontier for de novo sequencing, which can be more informative than the reference-guided resequencing that is now the standard. As Richard McCombie, a psychiatric geneticist at the Cold Spring Harbor Laboratory, said, “Resequencing is great in some ways. We can do it very inexpensively; it’s under $3,000 to do a whole genome now on an Illumina machine. But it misses some structural variants, and it misses some regions of the genome.”
One intriguing project centered on SMRT sequencing of human DNA is taking place at the Genome Institute of Washington University. There, a rare type of haploid human sample, the result of an abnormal pregnancy in which only the sperm contributed DNA to the embryo, has become the basis for improving some of the roughest regions of the human reference genome. PacBio’s assembly of this sample has an N50 contig length of over 4 megabases — by far the most contiguous human genome ever constructed other than the reference genome itself.
Tina Graves-Lindsay, the leader of the reference genomes group at Washington University, reported on efforts to integrate pieces of this assembly into the reference genome. The haploid sample can provide clearer information in areas where high levels of structural variation between alleles have led to ambiguous assemblies when using diploid samples. To refine the data supplied by PacBio, the Genome Institute has been building libraries of BAC clones that cover the most disputed regions, and sequencing those with a SMRT Sequencer. “Many of these clones will actually end up in the reference, so if the region is a mess in the reference, we can use this sequence to fix it,” said Graves-Lindsay. The long-read data has already helped to resolve questions about the SRGAP2 and IGH genes.
“Our ultimate goal for this is to end up with a single-allelic representation of the entire genome,” she added, which could be used as a complete reference with very little structural ambiguity.
The Best Tools for the Job
New instruments and computational tools have been essential to improving the quality of PacBio data. At the user group meeting, the BluePippin device from Sage Science was repeatedly credited with opening up the long-read potential of SMRT Sequencers. By allowing users to choose fragment sizes during library preparation, up to the kilobase range, the BluePippin ensures that long-read sequencing is not limited by the DNA sample. More than one user reported that their N50 read lengths doubled after using a BluePippin.
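N50, the statistic quoted throughout this report for both contigs and reads, can be made concrete with a short Python sketch. It uses the common definition: sort the lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The function name and the sample lengths are invented for illustration and do not come from any data presented at the meeting.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical read lengths in base pairs, for illustration only.
    reads = [12000, 9500, 7000, 4000, 2500, 1000]
    print("N50:", n50(reads))
```

Because the statistic is weighted by bases rather than by sequence count, removing short fragments before sequencing raises N50 even when the longest reads are unchanged.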
Meanwhile, software tools created both at PacBio and by the company’s user community have helped to make sense of long reads. The HGAP tool has become the standard for de novo assembly of bacterial genomes, and PBJelly, the brainchild of Adam English at the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine, has automated the process of using long reads to fill in gaps in draft genomes. The still-experimental FALCON is PacBio’s follow-up to HGAP, for assembling diploid genomes.
Several presenters at the user group meeting shared new tools that can be added to this arsenal, all of them freely available. William Salerno, from the HGSC, described PBHoney, which tweaks PBJelly to focus on finding structural variants. In the same way that PBJelly searches for reads that span or extend into gaps in an assembly, PBHoney searches for reads that span error events in the assembly caused by structural variants. It can also identify reads that map partly to the edge of an error event, with dangling “tails” that map somewhere else, and use that information to resolve long inversions and duplications. The HGSC has tested PBHoney on both E. coli and human samples, and successfully resolved structural variants in both.
For scientists using SMRT sequencers on larger genomes, Adam Phillippy, the principal bioinformatician at the National Biodefense Analysis and Countermeasures Center (NBACC), presented a dramatically faster alternative to BLASR, the standard tool for overlapping long reads that is used as part of HGAP. The NBACC first saw the need to replace BLASR after assembling a Drosophila melanogaster genome using PacBio reads. The project itself was very successful, putting together a more contiguous assembly than the species’ reference genome in just six weeks. “There was so much data generated that we were able to assemble the genome using only reads greater than 17kb,” said Phillippy. “It was unlike anything I had seen at the time. There were entire chromosome arms that were assembled into a single piece.”
However, that assembly also underscored the high compute demands of BLASR, which took up over 600,000 CPU hours, or more than 90% of the project’s total compute time. To reduce those demands, Phillippy’s colleagues Konstantin Berlin and Sergey Koren turned to an algorithm developed for the AltaVista search engine in the 1990s. Although this algorithm had not been widely used in bioinformatics, it provided a novel way to rapidly estimate the similarity between two sets of data, whether that data is duplicate web pages or long DNA fragments that need to be overlapped.
The algorithm was repurposed as MHAP, a tool that reduces each long read to a much smaller set of values while retaining enough information to estimate how similar two reads are. MHAP divides each read into a set of k-mers, and then runs each k-mer through hundreds of hash functions, each of which outputs a value. Then, for each hash function, the program stores only the k-mer that produced the smallest value. This allows a dramatically faster search for overlaps between reads, by comparing a small set of numbers instead of a massive string of DNA bases.
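The min-hash idea described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not MHAP's actual code. The function names, the k-mer size, the number of hash functions, and the use of salted BLAKE2b as a family of hash functions are all assumptions made for the example. It stores the smallest hash value for each function rather than the k-mer itself, which is equivalent for comparison purposes.

```python
import hashlib
import random


def kmers(read, k):
    """Return the set of all k-length substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def hash_kmer(kmer, seed):
    """64-bit hash of a k-mer; each seed behaves as a separate hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def sketch(read, k=16, num_hashes=128):
    """Reduce a read to num_hashes integers: the minimum hash per function."""
    ks = kmers(read, k)
    if not ks:
        raise ValueError("read is shorter than k")
    return [min(hash_kmer(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of matching minima, an estimate of shared k-mer content."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Hypothetical random "genome" with two overlapping reads and one unrelated read.
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    unrelated = "".join(rng.choice("ACGT") for _ in range(1500))
    read_a, read_b = genome[:1500], genome[500:]
    print("overlapping:", estimate_similarity(sketch(read_a), sketch(read_b)))
    print("unrelated:  ", estimate_similarity(sketch(read_a), sketch(unrelated)))
```

Two reads that share many k-mers will tend to share the same minimum under many of the hash functions, so the fraction of matching positions approximates how much of their k-mer content overlaps. The sketch ignores sequencing errors and reverse-complement strands, both of which a real overlapper has to handle.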
The results were impressive; running the Drosophila reads again through MHAP instead of BLASR reduced the compute time from 600,000 CPU hours to just 1,000, while improving the assembly. Casey Bergman of the University of Manchester went on to show that the entire process could be performed through Amazon Web Services for under $300. Smaller, prokaryotic genomes, like E. coli, could be assembled in half an hour on a desktop computer — and as genomes get larger, MHAP realizes greater and greater efficiencies. A preliminary version of MHAP, written in Java, is available through SourceForge.
Other niche uses of SMRT sequencing were discussed at the meeting, including the ability to trace methylation patterns of DNA, and the IsoSeq method, which uses long reads of RNA to capture alternative splicing, showing which different protein isoforms are likely to be present in a sample.
Meanwhile, PacBio continues to work on its chemistry, which has repeatedly doubled the SMRT Sequencers’ average read lengths year over year. Kevin Corcoran, the company’s Senior Vice President of Market Development, said that the next release of a combined chemistry update and improved library loading is slated for early 2015, and is predicted to bring the instrument’s N50 read length to between 12 and 18 kilobases.
Luke Hickey also announced that PacBio has entered a partnership with GenDx to release an HLA typing system. HLA typing has traditionally relied on the much slower and more labor-intensive Sanger sequencing, because the complexity of the region is too great to be reliably captured with short-read next-generation technologies. The combined GenDx-PacBio system will be the first commercial solution to use next-generation sequencing to deliver complete HLA genotypes.
“We’ve been thinking about larger problems at PacBio, and more diverse problems,” said Hickey in his concluding address. While PacBio is unlikely to seize a huge share of the sequencing market for the foreseeable future, the company is finding more and more niches for its unique chemistry. As the user group meeting in Baltimore demonstrated, most problems tackled by PacBio users could not be adequately addressed by any other sequencer on the market today.