PacBio at AGBT

By Bio-IT World Staff

March 3, 2015 | The 2015 Advances in Genome Biology & Technology conference wrapped up over the weekend in Marco Island, Florida, after four days of presentations from the front lines of genome analysis. With less than the usual amount of razzle-dazzle on display in this year’s product launches, the event was stolen by some outstanding scientific achievements pulled off with existing platforms. Pacific Biosciences, this year’s gold sponsor, highlighted several of these in a star-studded workshop Friday afternoon to show off the feats that can be accomplished with its SMRT (single molecule real time) sequencers, the instruments of choice for recovering long-range structural information on the genome. Speakers included J. Craig Venter, who runs the world’s largest genome sequencing center at his company Human Longevity, Inc., and is best known for competing with the Human Genome Project to produce the first whole human genome sequence; Deanna Church, who has helped shape improvements to the human reference genome in her work with the Genome Reference Consortium; and Gene Myers, one of the world’s premier bioinformaticians and co-author of the foundational genome analysis tool BLAST.

In a piece this January looking back on last year’s milestones in genomics, Bio-IT World wrote that “2014 could be looked at as the Year of PacBio, when the [midsize] company proved there was room in the market for a pricier instrument that won’t flinch at high GC coverage, large indels, or de novo assembly.” The present moment might eventually come to be seen as the peak of PacBio’s powers, a window in which the company was truly producing the most comprehensive, highest-quality genomes money could buy.

PacBio’s commercial future is murky: companies like 10X Genomics are toying with more affordable ways to get reliable long-range genomic information, and if Oxford Nanopore gets a handle on its error rates and releases the production-scale PromethION, they’re likely to undercut PacBio on price while delivering the same top-of-the-line features. But whatever its market prospects, scientifically PacBio is driving some of the most innovative sequencing projects going on today. Among other accomplishments, the PacBio workshop at AGBT presented multiple users’ de novo assemblies of whole human genomes — until very recently, a vanishingly rare type of project because no high-throughput instrument could deliver the type of data needed to put together a whole human genome without aligning reads to a reference genome.

De Novo Assemblies as a Commodity?

Today, the very presence of SMRT sequencers on the market has encouraged bioinformaticians to build a whole suite of analytical tools to deal with multi-kilobase reads. As the AGBT workshop made clear, PacBio users now have something like a standard pipeline for going all the way from raw reads to a whole genome. A typical workflow might use Gene Myers’ DALIGNER to find local alignments between reads, FALCON for assembly, and Quiver for variant calling. As PacBio CEO Mike Hunkapiller announced in his opening remarks, DNAnexus recently used this DALIGNER-FALCON pipeline to create a new diploid assembly of J. Craig Venter’s genome, following a sequencing effort that took less than a month to generate all the required raw data on SMRT instruments.

Diploid assembly, correctly distinguishing between the maternal and paternal copies of each chromosome, is the gold standard for a full genome sequence. This ability sets FALCON assemblies apart from even the human reference genome — which, as Deanna Church memorably pointed out in her own presentation, has historically included “Franken-alleles” stitched together from different copies of the same chromosomes.

DNAnexus also appears to have set a world record for the fastest human genome assembly last week, patching together the genome of a peculiar breast cancer cell line, SK-BR-3, in less than 21 hours. The process wrapped up at 10:30 on Friday morning, just in time for a shout-out at the workshop from W. Richard McCombie of the Cold Spring Harbor Laboratory. DNAnexus will now be making this workflow available to all customers through its cloud-based informatics service, offering rapid assembly to any labs with the sequencing capacity to drive through enough PacBio reads.

All this is starting to make de novo assembly look less like a titanic enterprise, and a little more like a commodity. Venter, giving the first talk at the workshop, revealed plans to produce an extraordinary 30 new reference genomes at Human Longevity, Inc., combining two SMRT sequencers with his bank of 20 ultra-high-throughput Illumina HiSeq X instruments. “I’m delighted with the focus I’m hearing here, on getting back to assembled genomes,” said Venter. “If we’re going to understand each of our genomes, we need to do de novo assembly.”

The collection of new reference-grade assemblies at Human Longevity isn’t just a matter of showing off; getting new reference genomes from donors with diverse ethnic and geographical backgrounds will help with all future interpretation of large structural variants, which differ widely between human populations and are difficult to square with a single reference assembly. (Sadly unmentioned was whether and when Human Longevity might share its reference genomes with the wider scientific community.)

Venter, of course, has a knack for thinking big. His 30 reference assemblies will represent just a small fraction of the one million whole genomes he intends to sequence by 2020. In his presentation, Venter even spoke glibly about the pace at which he hopes to see his massively expensive bank of sequencers (an investment in excess of $21 million) become obsolete, based on the historical trend toward ever-cheaper sequencing. “We’re counting on $30 genomes in three or four years, and hopefully we can truck away to the dumpster all the machines we have [now],” Venter said.

Many of our readers should also be interested to hear that Venter casually mentioned looking to hire around 200 new bioinformaticians for his company in 2015.

The second speaker, Gene Myers, was also keenly interested in the possibilities PacBio has opened up for relatively straightforward de novo assembly. Myers spent many years in the 2000s more or less out of the limelight, reportedly because he was dissatisfied with the industry’s trend toward using short-read sequencers and reference alignment for most applications. However, he reemerged at AGBT last year, after a conversation with Hunkapiller in which Myers learned that SMRT sequencers deliver long reads with both random sampling of the genome and random, unbiased errors at any point in the genome.

“As a mathematician, when Mike used this word ‘random’ in those two places I got incredibly excited,” said Myers at this year’s workshop. “Because I understood, from theory alone, that what that meant was immediately that perfect assembly was back on the table.”
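To see why randomness in the errors matters so much, consider a deliberately simplified model, offered here as an illustration rather than as Myers’ own math. Assume each read covering a given position is wrong there independently with the same probability, and that a consensus call fails only if more than half of the reads are wrong. The 15% per-read error rate below is an assumed figure chosen for illustration, and the function name is hypothetical.

```python
from math import comb


def majority_error(per_read_error: float, coverage: int) -> float:
    """Probability that more than half of `coverage` reads are wrong at one position.

    Assumes errors are independent between reads and equally likely everywhere,
    which is the "random, unbiased" property described in the text.
    """
    threshold = coverage // 2 + 1  # strictly more than half of the reads
    return sum(
        comb(coverage, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


ASSUMED_ERROR_RATE = 0.15  # illustrative assumption, not a figure from the workshop

for depth in (1, 11, 31):
    print(depth, majority_error(ASSUMED_ERROR_RATE, depth))
```

Under these assumptions, the chance of a wrong consensus falls off steeply as coverage grows. If errors were instead systematic, with the same mistake recurring in every read at a given site, piling on more coverage would not help, which is why the word “random” carries so much weight in Myers’ argument.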

Since then, Myers has been hard at work making perfect assembly a reality. In addition to building DALIGNER, he has started work on a new tool called DAscrub, which was a major focus of his workshop presentation. The purpose of DAscrub is to clean up raw PacBio reads, which are error-prone and vulnerable to sequencing artifacts, without sacrificing valuable data. Myers presented an E. coli assembly, built from 30x coverage of the sample, that yielded a complete circular genome with no correction steps between running DALIGNER and performing full assembly, other than using DAscrub to clear out artifacts.

Key Genomes

None of these advances in de novo assembly will do much to advance science if we don’t choose samples that truly have something to teach us. The last three speakers at PacBio’s AGBT workshop rounded out the afternoon with some compelling applications for this burgeoning technology.

Deanna Church, formerly of the National Center for Biotechnology Information and now Senior Director of Genomics and Content at genetic diagnostics company Personalis, shared her thoughts on using long-read data to update the human reference genome, and in particular to deal with regions of high structural complexity and large differences between human haplotypes. This is a subject Church has spoken about with Bio-IT World before; in fact, in his opening remarks Hunkapiller quoted an interview we ran with Church in April 2013, in which she said that “if we are truly going to be successful in having genomics affect clinical medicine and we want to understand variation within individuals, we have to have de novo assembly.”

At AGBT, Church noted that the reference genome is essential even when working with de novo assemblies, both as a resource for calling variants, and as a coordinate system for describing those variants. That means missing or confounded sequence in the reference can cause problems for interpretation no matter how scrupulous a new genome may be.

Church touted the addition of many alternate loci in the latest update to the human reference genome, which allow geneticists to consider multiple “paths” through variable regions. She also urged bioinformaticians to update their tools to take these alternate loci into account, something that few groups have done to date. “In aggregate, these alt loci contribute an additional 3.6 megabases of novel sequence that contain 153 unique genes,” said Church. “So if you are not using these sequences in your analyses, you are missing part of the exome, and you are missing some important sequence.”

At the same time, Church acknowledged that the patchwork of alternate loci, in the long term, is not the most efficient way to represent large structural variants across the genome. In a question-and-answer session, she mentioned the Global Alliance for Genomics and Health, which is working on an alternative way to represent chromosomal positions as a branching graph that spans an entire chromosome. “I think this movement to this graph-based representation is really the way we have to go,” she said, “because it allows us to represent this complexity in a much more natural way.” While Church expects it to take some time before this structure is ready to be as widely adopted as the current standards for representing genetic variation, she did say that the alternate loci provide a “graph-lite” approach in the current human reference assembly.
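The idea of multiple “paths” through a variable region can be made concrete with a toy example. The sketch below is not the Global Alliance’s format or any real reference data; the segment names and sequences are invented for illustration. It models one region as a small branching graph, with a shared flank on either side of two alternative middle segments, and spells out the sequence of every path through it.

```python
from typing import Dict, List

# Hypothetical toy region: shared flanks around two alternative middle segments.
SEGMENTS: Dict[str, str] = {
    "left": "ACGT",
    "ref_mid": "TTTT",    # the path a single linear reference would record
    "alt_mid": "GGCCGG",  # an alternate locus for the same region
    "right": "CATG",
}
EDGES: Dict[str, List[str]] = {
    "left": ["ref_mid", "alt_mid"],
    "ref_mid": ["right"],
    "alt_mid": ["right"],
    "right": [],
}


def enumerate_haplotypes(
    start: str, end: str, segments: Dict[str, str], edges: Dict[str, List[str]]
) -> List[str]:
    """Return the sequence spelled by every path from start to end.

    Assumes the graph has no cycles, as in this toy example.
    """
    if start == end:
        return [segments[start]]
    results = []
    for successor in edges[start]:
        for tail in enumerate_haplotypes(successor, end, segments, edges):
            results.append(segments[start] + tail)
    return results


haplotypes = enumerate_haplotypes("left", "right", SEGMENTS, EDGES)
print(haplotypes)
```

A linear reference can hold only one of these paths at each position, which is why alternate loci have to be bolted on as separate sequences. In a graph, both versions sit side by side in the same coordinate space.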

The fourth speaker, Jeong-sun Seo of Seoul National University and Macrogen, presented on a critical new resource for genomics, a diploid assembly of a whole Asian genome. “We have to consider seriously ethnic differences for personalized medicine,” Seo reminded the audience. Ultimately, Seo’s work on this new assembly, of a genome donated by an Altaic Korean individual, is meant to support an Asian Genome Project recruiting 10,000 patient volunteers for whole genome sequencing across South Korea, Japan, China, and Mongolia.

Like Human Longevity, Macrogen has a bank of HiSeq X instruments and has been using a cross-platform approach to generating new reference assemblies. Interestingly, Seo mentioned that his team is also using an Irys device from BioNano, which uses fluorescent markers to map out very large structural variation on the order of hundreds of kilobases. In an interview with Bio-IT World, BioNano CEO Eric Holmlin recently told us that the Irys has been paired with SMRT sequencing but declined to reveal more details; Seo’s presentation offers at least one example of both techniques for getting long-range genomic information being used in parallel.

Highlighting the magnitude of difference between the Korean assembly his group performed and the standard reference genome, Seo noted that on chromosome 20 alone, he was able to pinpoint nearly 500 structural variants, totaling over 210 kilobases inserted or deleted relative to the reference. He also shared one example of a phenotypic difference that appears to be traceable to one of these structural variants, an 8-kilobase insertion in the NINL gene related to pigmentation. “NINL is the most significantly differentially expressed gene between Asians and Caucasians,” Seo observed, a fact that can likely be attributed to this large insertion. Other structural variants that differ widely between ethnic groups are likely to have direct relevance to health and disease risks.

The final speaker was W. Richard McCombie, whose own assembly of interest was the previously mentioned SK-BR-3 cell line, collected from a HER2-positive case of breast cancer. The SK-BR-3 genome is profoundly disordered, so much so that Hunkapiller, introducing McCombie’s talk, said that looking at this genome, “you wonder how in the heck was this thing alive?”

McCombie, much like Myers, believes that short-read sequencing has been a mixed blessing for the genomics community, offering more data than ever before but at the cost of distracting researchers from profoundly important sources of variation. He quoted Evan Eichler’s term “the seduction of next-gen sequencing,” which he called “very appropriate. You can get really good SNP data from a very large number of individual genomes… but you do miss… a lot of the structural variants.”

Turning to the SK-BR-3 genome, McCombie showed some detailed data, derived from SMRT sequencing, on complex translocations between chromosomes 8 and 17, which occurred across multiple different sites on both chromosomes. With more detailed information on how these regions are arranged, which translocations have undergone inversion, and the complete sequence of gene fusions, McCombie’s team is now trying to reconstruct the history of the structural events that produced the SK-BR-3 chromosome 17, particularly at the locus where the HER2 gene resides. Happily, McCombie announced that all his data on this genome is publicly available online, and that he will soon be releasing methylation data as well, something that can be recovered routinely from SMRT sequencers.

PacBio is still very much a niche player in sequencing, and with a notably lower throughput and higher costs than its competitors, that’s unlikely to change any time soon. Nonetheless, the company has done a remarkable job drawing attention to features like haplotypes and structural variants that cannot be captured by short-read sequencing. While the genomics community never really forgot about these factors, they have been shortchanged in the drive for more and cheaper data in the next-generation sequencing era.

Today, it seems possible that projects like those presented at PacBio’s AGBT workshop are just the leading edge of a cultural shift in genomics toward full representations of genomic variation and more routine use of de novo assembly. The full force of that shift will have to wait for technology that brings long-read data within reach of the average user. But whether that comes from future PacBio instruments, a new contender like Oxford Nanopore, a parallel platform like 10X Genomics, or a combination of all three, this year’s AGBT demonstrated that the groundwork has been laid to make the best use of this data once we have it.