Long-Read Sequencing Data Enables Structural Variant Discovery For Clinical And Disease Research

Contributed Commentary By Luke Hickey

September 11, 2017 | In recent years, scientists have become increasingly aware of the clinical implications of structural variants in the human genome. These larger variants, typically defined as >50 bp, are known to cause many genetic conditions, including fragile X syndrome, Duchenne muscular dystrophy, ALS, and Tay-Sachs disease. However, due to their size, structural variants have not been as easy to study as single-nucleotide variants using short-read sequencing methods. Whether they are insertions too long for such methods to span, repetitive regions that create mapping ambiguity for short reads, or GC-rich regions that challenge technologies with a systematic coverage bias, these important genomic elements have been missed by most efforts to sequence the human genome.

The advent of single molecule, real-time sequencing, with average read lengths exceeding 12 kb, has allowed scientists to reliably detect structural variation for the first time. Recent studies using this technology have shown that previous human genome data sets dramatically underrepresented the number of structural variants, along with their effect on health and disease.

Medical Genetics

In-depth information about structural variation has already proven beneficial for clinical researchers looking to increase solution rates for rare and undiagnosed diseases. A recent paper from Stanford’s Euan Ashley and collaborators, for example, demonstrated that long-read sequencing enabled detection of a pathogenic structural variant in a particularly challenging case.

The patient involved in this study had been in and out of hospitals for 20 years, beginning at the age of 7, with a disease that defied diagnosis. While symptoms were consistent with a rare disease known as Carney complex, genetic testing and short-read whole genome sequencing failed to detect a causative mutation. Ashley and his team turned to long-read sequencing and found what had eluded scientists and clinicians for two decades: a 2.2 kb deletion in PRKAR1A, the gene implicated in Carney complex. That discovery was based on 8-fold coverage of the human genome, indicating that even low-fold, long-read coverage can provide important clinical insights about structural variation.

In another example, scientists at the University of California, Davis, used long-read sequencing to plow through a repeat expansion associated with fragile X syndrome. It was the first time the region had ever been fully sequenced; the CGG repeat in some cases spanned kilobases and was intractable with earlier sequencing methods. Having an accurate count of the repeat is essential for diagnosis not just of fragile X but also of many other repeat expansion disorders, such as ALS and Huntington disease.

Disease Research Studies

Comprehensive profiling of structural variation is also important in genetic disease research. Finding the genes responsible for unsolved genetic conditions will require methods that can detect the full range of human genetic variation, from single-nucleotide variants to larger structural variants.

Significant progress toward the goal of fully representing structural variation has been achieved by scientists from the University of Washington and the McDonnell Genome Institute. They used single molecule, real-time sequencing to find structural variants in two haploid human cell lines, cataloguing insertions, deletions, and other types of variants as short as 2 bp and as long as 28 kb. Analysis of those 460,000-plus structural variants and indels determined that more than 89% had been missed by the 1000 Genomes Project. The scientists performed extensive follow-up studies to validate the structural variants detected and concluded that most are in fact common among the human population.

“The theoretical amount of genetic variation in a single human diploid genome far exceeds expectations established by previous whole-genome studies,” the scientists reported in their paper. “Although this represents only a fraction of variant sites between two haplotypes, this missing variation accounts for most of the variant base pairs between two human genomes.”

This new class of structural variation information will be particularly useful for large-scale population studies, such as the many country-specific genome projects now under way. Adding this new type of variation will improve overall genetic discovery power and increase the number of associations found between phenotypes and genotypes across the larger cohort.

Moving Forward

Recognition of the coming need to understand and interpret structural variation has led to calls for establishing a complete, publicly available database of this class of variants. Once created, such a database would be invaluable for rare disease research and genetic association studies, much like dbGaP and gnomAD. I join those urging the community to establish such a public repository. There are already private initiatives endeavoring to create similar databases, such as the recently announced project by Novogene to sequence 1,000 Chinese individuals and create a population-specific database of common structural variants.

Now that we have a reliable and affordable method for interrogating structural variation, it is time to start including the full range of human genetic variation in both clinical and disease research studies. Only then will we be able to fully power our studies for genetic discovery and realize the full promise of precision medicine.

Luke Hickey is a senior director for human biomedical applications at PacBio, a company that provides long-read sequencing solutions based on its Single Molecule, Real-Time (SMRT) technology. He can be reached at lhickey@pacb.com.