PacBio’s Focus on Human Genomics And Read Length Sweet Spot
By Allison Proffitt
July 21, 2020 | It seems like a very long time ago, now, that Illumina and Pacific Biosciences dissolved their merger agreement, but in fact their announcement is just about six months old. Since then, PacBio has been busy reestablishing itself as an independent business and looking ahead, their sights set on the human genome sequencing market.
The dissolution of the Illumina merger—called off due to “the lengthy regulatory approval process” and “continued uncertainty”—did require some back-pedaling. “We were actively moving as much as we could without the full regulatory review having completed,” Luke Hickey, Senior Director of Strategic Marketing, told Bio-IT World. But Hickey is optimistic about PacBio’s future.
“There was a fairly large breakup fee that ended up giving us significant operating capital,” Hickey said of the $98 million termination fee. “We have a healthy balance sheet and we’re—if anything—more energized to continue as an independent company within the long read space.”
As for competitive exposure, Aaron Wenger, Principal Scientist at PacBio, is not worried.
“I know that there was some sharing of information, but PacBio has been developing SMRT sequencing for, I think, 20 years,” Wenger said. “I think there are ways to protect the technology through patents that Illumina can't legally copy, and then also I think many trade secrets that they did not have access to.” An added strength, Wenger noted: “the many people who could actually execute that on the day to day basis here are still employed at PacBio.”
Hickey agreed. “We feel like we’re in an incredibly strong position with the patent portfolio and—to Aaron’s point—how many years it’s taken to develop a robust and reliable system for doing the sequencing.”
The PacBio sequencing system itself got an overhaul in 2019. About a year ago, the company launched its Sequel II platform. PacBio shipped 104 Sequel II platforms to customers between June and December 2019, Hickey said, highlighting a “healthy mix” of existing and new customers. While PacBio’s historical customers may have focused on microbial or plant and animal sequencing, “We have seen just great adoption from the human biomedical market,” he emphasized. He named a few groups that have adopted the new platform: the NIH’s All of Us program, Children’s Mercy Hospital in Kansas City, Ira Hall’s group at Washington University in St. Louis, and Evan Eichler at the University of Washington.
The Sequel II system takes advantage of a new read type—HiFi reads—that the company announced in January 2019. The corresponding paper was published in Nature Biotechnology in August 2019; Wenger is first author (DOI: 10.1038/s41587-019-0217-9). The team optimized circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb).
“What we were able to do, was significantly increase the efficiency of being able to get multiple observations of the DNA molecule,” Wenger explained, through repetitive or circular sequencing of the same DNA. “That first time through you detect with 87% accuracy. Then you go around the turn and go back in the other direction and now you get another read that’s 87% accurate. But it’s an independent read from the first one. Now you have two things, each of which are 87% accurate, but you make different mistakes in the two times through. … By making multiple observations in series of the same physical DNA molecule, you can generate—in software—a consensus of those passes,” he said. “If you do 5 passes, you can get 99% accuracy; if you get 8 passes, you get 99.9% accuracy.”
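Wenger's point about independent passes can be illustrated with a toy calculation. The sketch below assumes that each pass is simply right or wrong at a given base, that passes are independent, and that the consensus is a plain majority vote; the function name `majority_vote_accuracy` is hypothetical. PacBio's actual consensus software uses a more sophisticated model, so the numbers this produces will not match the figures Wenger quotes.

```python
from math import comb


def majority_vote_accuracy(per_pass_accuracy: float, passes: int) -> float:
    """Chance that a strict majority of independent passes is correct at one base.

    Toy model only: each pass is either right or wrong, and ties (possible
    with an even number of passes) are counted as wrong.
    """
    p = per_pass_accuracy
    needed = passes // 2 + 1  # smallest number of correct passes that wins the vote
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # 0.87 is the single-pass accuracy quoted in the article.
    for n in (1, 3, 5, 7, 9):
        print(f"{n} passes: {majority_vote_accuracy(0.87, n):.4f}")
```

Even under these crude assumptions, accuracy climbs quickly as passes are added, which is the intuition behind circular consensus sequencing.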
The result is highly accurate reads that are opening up new applications.
“Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants,” Wenger and his co-authors wrote in the August 2019 paper. “De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.”
The Sequel II platform enabled researchers to generate the new read type at greater scale than before. By April 2019, at least five early adopter institutions were using Sequel II platforms, and Luke Tallon from the University of Maryland Institute for Genome Sciences reported seeing “comparable data quality with yields 8-9 times higher than that of the Sequel.”
The new platform and read types have been included in several publications including a paper from the PacBio team with scientists from the University of Washington describing improved assembly and variant detection of a haploid human genome published in the Annals of Human Genetics (DOI: 10.1111/ahg.12364). Two additional papers are in review and published on bioRxiv.
Already there have been some interesting trends emerging, Wenger said: “some things you would have predicted, and some things we didn’t.”
Sequencing has labored under a “general wisdom,” Wenger said, that short, very accurate reads are critical to calling small variants like single nucleotide variants, while de novo assembly requires long reads, even with less accuracy. But the HiFi reads, Wenger says, have struck a middle ground. They are accurate and longer (though not as long as the longest PacBio reads): about 15,000-20,000 base pairs at greater than 99% accuracy.
The question is, how do these HiFi reads perform compared to either the short or long reads we are used to? “What’s interesting is actually seeing that, at least for human genomes, these highly accurate, pretty long reads just significantly outperform even reads that are 100,000 or 150,000 bases long,” Wenger says.
A HiFi read that is 20,000 base pairs long and 99% accurate is superior to a 60,000 base pair read that is 85% accurate, Wenger argues. “When you call something in the genome a repeat, they’re never 100% identical,” he said. “If two things are 98% identical in the genome, but your reads are 99% accurate, then you can tell those two things apart. This accuracy lets you resolve more repeats than having long reads.”
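The arithmetic behind that argument can be sketched roughly as follows. This is a back-of-the-envelope model, not anything from PacBio: it assumes substitution errors only, spread uniformly along the read, and it assumes an erroneous base is equally likely to be any of the three wrong bases. The function names are hypothetical. It compares how many mismatches a read is expected to have against the repeat copy it truly came from versus a second copy that is 98% identical.

```python
def expected_mismatches(read_length: int, read_accuracy: float, copy_identity: float):
    """Expected mismatches of a read against its true repeat copy and a near-identical copy.

    Simplifying assumptions (illustrative only): substitution errors only,
    uniform error rate, and each error yields one of the three wrong bases
    with equal probability.
    """
    error_rate = 1.0 - read_accuracy
    divergence = 1.0 - copy_identity

    # Against the true copy, every mismatch is a sequencing error.
    to_true_copy = read_length * error_rate

    # Against the other copy: where the copies agree, a mismatch needs an error;
    # where they differ, the read mismatches unless an error happens to
    # produce the other copy's base (one third of errors).
    to_other_copy = read_length * (
        (1 - divergence) * error_rate + divergence * (1 - error_rate / 3)
    )
    return to_true_copy, to_other_copy


def contrast(read_length: int, read_accuracy: float, copy_identity: float) -> float:
    """Ratio of mismatches against the wrong copy to mismatches against the true copy."""
    to_true, to_other = expected_mismatches(read_length, read_accuracy, copy_identity)
    return to_other / to_true


if __name__ == "__main__":
    # Read lengths and accuracies from the article; 0.98 is the repeat identity Wenger cites.
    print("HiFi 20 kb, 99%:", contrast(20_000, 0.99, 0.98))
    print("Long 60 kb, 85%:", contrast(60_000, 0.85, 0.98))
```

Under these assumptions the accurate read shows a much larger relative gap between the right and wrong copy, because its own errors do not swamp the true differences between the repeats.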
Wenger says this balance between read length and accuracy is not only scientifically interesting, but also somewhat surprising. “At this current point [with the Sequel II], what the platform can deliver is likely enough to completely assemble human genomes from telomere to telomere,” he says.
There is a group working on that. The Telomere to Telomere Consortium is an open, community-based effort headed by Adam Phillippy of NIH/NHGRI and Karen Miga of UC Santa Cruz. Their goal is to sequence human chromosomes from end to end, including tricky centromeres, filling hundreds of gaps across the genome including segmental duplications, ribosomal RNA gene arrays, and satellite arrays.
“We aim to finish these remaining regions and generate the first truly complete assembly of a human genome,” the group writes on their website. “The ultimate goal of this effort is to drive technology to dramatically increase the throughput of complete, high quality telomere-to-telomere assemblies from diploid human genomes.”
At AGBT in 2019 (Feb 27-March 2), Phillippy presented the sequence of the full human X chromosome which was assembled, then “iteratively polished” with a combination of technologies from Oxford Nanopore, PacBio (both the continuous long reads and the HiFi reads), 10x Genomics/Illumina, and Bionano Genomics. Since then, Wenger says, HiFi has been used to sequence through more centromeres. “Now with the HiFi reads, a number of groups have been able to assemble centromeres. Now, I think, they’ve assembled nearly half the human centromeres using 15,000 to 20,000 base pairs and 99% accuracy rate,” he said. (The sequenced X chromosome was published July 14 in Nature, DOI: 10.1038/s41586-020-2547-7.)
The Human Pangenome Reference Sequence Project is another project making use of the combination of long reads plus accuracy, Hickey added. That project includes researchers from the University of Washington, Washington University and the University of California, Santa Cruz. The two sequencing centers—University of Washington and Washington University—are using the PacBio HiFi reads as the primary sequencing technology for assembly, Hickey said. The group will also use Oxford Nanopore for scaffolding work. “The group has done 30 human reference genomes this year,” Wenger added—a huge increase from output in the past.
The Sequel II and HiFi reads mark a shift in PacBio’s technical focus moving forward. Wenger says that 20,000 bases is “likely sufficient for anything you’d want to do in a human genome,” so pushing for longer read lengths isn’t a priority. “There are still a few centromeres that have to be finished, so maybe you’ll need 30,000 for those. But I’m kind of skeptical. I think that 20,000 is going to be plenty and that it’s just continued algorithmic improvements that will take it the rest of the way,” he said.
He also pointed out a practical reason that 20,000 base pairs is an ideal read length. “We found that the 20,000 base-pair-range is the comfortable range where people can collect samples, consistently process them, and generate good libraries at that read length. When you try to push to more extreme read lengths, there become a lot more limitations on sample collection and libraries fail a lot more. That’s not a problem with the sequencing technology, that’s a problem with the upstream DNA handling.”
Hickey agreed. “The next step for us is not necessarily to drive up read lengths, because we think that is sufficient, but to drive down the cost. That’s our roadmap for 2020… We are trying to get to one human genome, on one SMRT cell, in one day.” That would run just over $1,000 in consumable costs, Hickey points out, and could give you “all of the answers you want.”
Hickey sees clinical utility in the platform’s future. The NIH-funded research has an eye toward eventual clinical applications and he mentions a HudsonAlpha group exploring the Sequel II for diagnostics.
“As those studies get published, that will really set us up for maybe more of a push into the clinical market for using the whole genome sequencing with HiFi in a more routine way,” Hickey said. “But certainly for the next few years, we’ll be in this basic human genetics research, and then all of this translational, clinical, exploratory research realm to demonstrate utility.”