Getting to Know the New Reference Genome Assembly

By Aaron Krol

January 27, 2014 | Last month, the Genome Reference Consortium (GRC) released a new human reference genome for the first time since early 2009. Although the previous reference genome, GRCh37, had been “patched” on a quarterly basis beginning in 2010, fixing isolated problems and adding significant new information, the new assembly is a much more fundamental update. The new reference, GRCh38 (or “build 38” for short), describes far more human genetic variation than ever before, and offers access to large areas of the genome that had previously been almost complete mysteries.

The reference genome provides a template by which sequencing reads can be mapped to their chromosomal locations. It is an indispensable resource for geneticists worldwide, who use it to piece together sequences, understand the context of reads, and find areas of genetic variation by comparing genomes against a “standard” sequence.

It’s also a one-of-a-kind operation, an artifact from the earliest days of human whole-genome sequencing. It still consists mainly of DNA reads from the donors of the first human genome assembly, and it is still maintained by four organizations that were leading players in the original Human Genome Project: the National Center for Biotechnology Information (NCBI), the Genome Institute at Washington University, the Wellcome Trust Sanger Institute, and the European Bioinformatics Institute. These organizations make up the GRC, which is responsible for keeping the reference genome at the forefront of accuracy and completeness, and which is still plugging gaps and making refinements more than a decade after the Human Genome Project was declared complete.

To understand the changes in GRCh38, Bio-IT World spoke to Valerie Schneider, the leader of the NCBI team working on the reference genome. She stressed three major kinds of updates: repair of incorrect reads, the inclusion of model centromere sequences, and the addition of alternate loci.

Gaps and Errors

The reference genome is by far the most accurately sequenced human genome in the world. It doesn’t correspond to any single human’s DNA – a mélange of a few dozen donors made up the original sequence, and new donor sources have been added since – but across the hybrid sequence, incorrect reads are extremely rare. They have to be, because any false bases in the reference genome have an impact on all sequences assembled against it.

“The sequence of the reference assembly is based on Sanger sequencing,” says Schneider, “which is still considered to be somewhat the gold standard. It’s a finished assembly, which in terms of quality means that every base is sequenced to 99.99% accuracy.”

That’s around ten times as accurate as most of the next-generation sequencing technology in use today. Nevertheless, in a sequence of 3 billion bases, one missed call in ten thousand still adds up. The GRC relies on a vast user community to call attention to errors in the reference assembly. “I really can’t stress enough how much we value contributions from collaborators,” says Schneider. “While we have excellent curators working on the reference assembly, there are a lot of groups out there who are experts in specific genomic regions, who are doing genome-wide analyses, that even if it’s not their intent can often spot potential trouble spots in the reference assembly.” The GRC’s homepage contains a link for users to “Report an Issue,” and also lists hotspots where the GRC suspects the reference could be improved.
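To put “one missed call in ten thousand” in concrete terms, here is a back-of-the-envelope sketch using only the round figures quoted above. It treats 99.99% as the actual per-base error rate, which is an assumption: a finishing standard is a floor, so the result is best read as a rough ceiling rather than a count of known errors. The function name is illustrative.

```python
# Round figures quoted in the article; not measured values.
GENOME_LENGTH_BASES = 3_000_000_000   # "a sequence of 3 billion bases"
REFERENCE_BASES_PER_ERROR = 10_000    # 99.99% accuracy = 1 miscall per 10,000 bases
NGS_BASES_PER_ERROR = 1_000           # "around ten times" less accurate

def expected_miscalls(length, bases_per_error):
    """Expected wrong bases if errors occur once per `bases_per_error` bases."""
    return length // bases_per_error

reference_errors = expected_miscalls(GENOME_LENGTH_BASES, REFERENCE_BASES_PER_ERROR)
ngs_errors = expected_miscalls(GENOME_LENGTH_BASES, NGS_BASES_PER_ERROR)

print(f"Reference-grade accuracy: up to {reference_errors:,} miscalls")
print(f"Ten times less accurate:  up to {ngs_errors:,} miscalls")
```

Even at the finished-sequence standard, the ceiling runs to hundreds of thousands of bases, which is why user reports of suspect positions matter.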

Build 38 alters around 8,000 single nucleotides across the genome, which “in many cases will improve the annotation or analysis of clinically relevant genes,” says Schneider. These aren’t massive issues, but they can cause genome assemblers to call single-nucleotide polymorphisms (SNPs) where they don’t exist, complicating the task of deciding which SNPs are functionally important.

On a more foundational level, a few large areas that had been misassembled in GRCh37 have been dramatically retiled. “These are genic regions,” Schneider emphasizes, not just non-coding DNA. They include the pericentromeric region of 10q11*, which borders and may occasionally interact with the RET proto-oncogene, and the srGAP region of 1q21 involved in neuron development.

GRCh38 also adds new sequences to help fill those pieces of the genome that have never been fully captured. New DNA donors, new technologies, and new analytical methods are all helping to close the gaps that still confound total coverage of the human genome.

One novel resource that GRCh38 takes advantage of is a hydatidiform mole, a type of abnormal pregnancy that occurs when a sperm fertilizes an enucleated egg. That sperm then reduplicates its DNA, resulting in two identical copies of each chromosome. Since the resulting cell has no allelic variation, it can be used to generate unambiguous reads in regions of very high diversity. “Because there’s only a single haplotype in a genomic mole,” says Schneider, “it eliminates the problem that we have in complex regions of distinguishing allelic sequence differences from paralogous sequence differences.” With typical donor sources, if researchers find two very similar sequences, it can be unclear whether those reads are different alleles, or repetitive sequences that belong together on the chromosome. With a haploid mole, any slight variations must come from the same chromosome, and can be recorded as duplications in the sequence.

The GRC is also beginning to take advantage of new sequencing technologies, some of which can access long repetitive regions, or regions with very high or low GC content, which are difficult even for Sanger sequencing. “We are very excited and interested in the long-read sequencing technologies that are coming out via PacBio, or hopefully with Oxford Nanopore,” says Schneider. Most sequences generated on PacBio sequencers aren’t yet publicly available, but the GRC did incorporate some changes to the MUC5AC gene based on a PacBio sequence posted in GenBank. As more sequences generated on long-read technology are placed in public repositories, the GRC will have more opportunity to benefit from advances in capturing these difficult regions.

The biggest leap in coverage in GRCh38 comes from the centromeres, which are modeled for the first time in the reference genome’s history. Once thought to be important only during cell division, centromeres have generated new interest as the evidence mounts that their sequences may have real impacts on function.

“These centromeres have been really a black box to most biologists for a long time,” says Schneider. “But they’re really a very important part of our genome. And they’re not just structural, and playing a role in chromosome segregation, but they’re actually a major source of sequence variation – and there are analyses out there that suggest they might have roles in human disease, and they’re likely subject to epigenomic regulation.”

Unfortunately, centromeres are also millions of bases long and hugely repetitive, making them all but impossible to sequence with current technology. GRCh37, like all previous builds of the reference genome, represented them as standard 3-megabase gaps, a convention that fails even to express the variation in size between different chromosomes’ centromeres. Build 38 hasn’t quite cracked the centromere puzzle yet, but it does include model sequences that are more representative of centromeres’ makeup, thanks to the efforts of Karen Miga and her colleagues at UC Santa Cruz.

Miga’s group took reads from a previous whole-genome sequencing project and applied a new analytical approach to create estimated centromere sequences. Centromeres are defined largely by very slight variations on a massively repeated 171-base sequence. The UC Santa Cruz team gathered all the fragments that aligned with that pattern and then, without trying to piece the entire set of reads together in a linear fashion, determined which variations on the basic pattern were present, how frequently each variant occurred, and, where possible, the positions of the variants relative to one another. The result is a model centromere for each chromosome that is accurate for size and contains approximately true proportions of the variation found in the centromere, albeit in a somewhat arbitrary order.
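The general idea of tallying repeat variants without assembling them can be illustrated with a toy sketch. This is not the UC Santa Cruz group’s method or code. It assumes reads that already start on a repeat-unit boundary, uses a 4-base unit in place of the real 171-base one, and every name and read in it is invented for illustration.

```python
from collections import Counter

MONOMER_LEN = 4  # toy value; the repeat unit described above is 171 bases

def split_monomers(read, k=MONOMER_LEN):
    """Cut a read into consecutive k-base units (assumes it starts on a unit boundary)."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]

def summarize(reads):
    """Count how often each variant occurs and which variants sit side by side."""
    variant_counts = Counter()
    neighbor_counts = Counter()
    for read in reads:
        units = split_monomers(read)
        variant_counts.update(units)
        neighbor_counts.update(zip(units, units[1:]))
    return variant_counts, neighbor_counts

def build_model(variant_counts, n_units):
    """Lay out roughly n_units monomers in the observed proportions.

    The order is arbitrary (alphabetical here), and rounding means the
    total can differ slightly from n_units for other inputs.
    """
    total = sum(variant_counts.values())
    model = []
    for variant, count in sorted(variant_counts.items()):
        model.extend([variant] * round(n_units * count / total))
    return "".join(model)

# Invented toy reads containing two variants of the repeat unit.
reads = ["ACGTACGTACGA", "ACGTACGAACGT", "ACGAACGTACGT", "ACGTACGT"]
variant_counts, neighbor_counts = summarize(reads)
model = build_model(variant_counts, n_units=11)

print(variant_counts)
print(neighbor_counts)
print(model)
```

The neighbor counts are recorded but not used in this toy layout. A fuller model would draw on them wherever relative positions are known, which is the “where possible” caveat in the description above.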

“Because of the variation that’s represented in centromeres,” says Schneider, “including these models in the reference assembly provides a reference for analysis... It opens up the field in a way that has never really been accessible before.”

A Wealth of Diversity

GRCh38 also continues something of a philosophical shift in how the reference genome is developed: a growing emphasis on alternate loci. When build 37 was first released in 2009, it contained just nine alternate scaffolds covering three regions of the genome. Alternate loci are used for regions where the human population contains genetic variants so divergent that the GRC has to include more than one sequence to make sure all the different variants can align to the reference assembly. As time passed, the GRC identified more and more areas where the human genome can contain radically different sequences, and the various patches to GRCh37 eventually added 60 new alternate loci.

Build 38 contains 261 alternate loci across 178 regions – one of which, the KIR locus involved in the differentiation of natural killer cells, has been given 35 distinct sequences. This is a huge leap forward in representing the richness of human genetic variation. “We don’t see a need to provide an alternate locus for every SNP or simple structural variant,” says Schneider, since these small variations don’t lead to problems assigning reads to their appropriate chromosomal locations. But the GRC has come a long way in recognizing that huge swathes of the human genome can’t be reduced to a single sequence even for the purposes of alignment.

Although Schneider is optimistic that the new reference captures a large percentage of human diversity, she knows there is still work to be done. “I don’t think there’s an exponential rise ahead of us,” she says, “but I do think that there will continue to be an increase in the number of alternate loci, at least for the time being.”

She also sees a need to raise awareness of the alternate loci among researchers. “They’re not just add-ons,” she says. “And occasionally we hear reports about users saying, ‘I don’t see any reason to use the alternate loci in my analysis.’” That can be problematic, because users who exclude the alternate loci can end up tossing out reads that don’t align to the “primary” sequence of the reference assembly – when those same reads would map perfectly well to one of the alternate scaffolds. In many cases, the reads won’t even be discarded, but will actually be mapped to the wrong area of the chromosome, where they’re likely to be interpreted as huge hotspots of mutation.

Yet there are also challenges to using the alternate loci with current informatics tools. “One of the issues that we see facing the community is that there aren’t tools and tool chains that are capable of handling the alternate loci,” says Schneider. With most assembly tools, a read that maps to multiple areas of the genome will be discarded as “ambiguous.” Because many sequences fit equally well in more than one alternate scaffold, alternate loci can add ambiguous reads: the assembler doesn’t distinguish between a read that fits in two different areas of the genome, and one that fits in two versions of the same locus.
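The distinction that current tools miss can be sketched in a few lines. This is an illustration of the idea only, not a description of SRPRISM or any other existing tool. It assumes each alternate scaffold comes with the primary-assembly interval it represents, and the scaffold names and coordinates are made up.

```python
# Hypothetical alternate scaffolds and the primary-assembly interval each represents.
# Names and coordinates are invented for illustration.
ALT_REGIONS = {
    "alt_A1": ("chr1", 1000, 2000),
    "alt_A2": ("chr1", 1000, 2000),
    "alt_B1": ("chr2", 5000, 6000),
}

def locus_of(hit):
    """Return the locus a hit belongs to; a hit is a (sequence_name, position) pair."""
    name, pos = hit
    if name in ALT_REGIONS:
        return ALT_REGIONS[name]
    for chrom, start, end in set(ALT_REGIONS.values()):
        if name == chrom and start <= pos < end:
            return (chrom, start, end)
    return hit  # a primary hit outside every alternate region is its own locus

def classify_naive(hits):
    """Treat any read with more than one hit as ambiguous."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"

def classify_alt_aware(hits):
    """Treat hits on different versions of the same locus as a single placement."""
    if not hits:
        return "unmapped"
    loci = {locus_of(hit) for hit in hits}
    return "unique" if len(loci) == 1 else "ambiguous"

same_locus = [("chr1", 1500), ("alt_A1", 40), ("alt_A2", 42)]
two_places = [("chr1", 1500), ("chr2", 5500)]

print(classify_naive(same_locus), classify_alt_aware(same_locus))
print(classify_naive(two_places), classify_alt_aware(two_places))
```

In the first case the read fits three versions of one locus, so only the naive rule throws it away. In the second it genuinely fits two different places in the genome, and both rules agree that it is ambiguous.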

The NCBI is working on its own tool, SRPRISM, to address this conflict, under the direction of developer Richa Agarwala. SRPRISM has not yet been published, but like all of NCBI’s genomic analysis resources, the tool is publicly available online. Tools like this will be necessary to realize the full benefit of the new reference genome, which more than ever is focused on making the whole spectrum of human haplotypes accessible to research.

Prepping for Use

One month after its release, build 38 is so far available only as raw sequence. Organizations like UC Santa Cruz, the Gene Ontology Consortium, and the Havana Group at the Sanger Institute, which annotate the reference genome to include information about gene function and variants, need time to process the new assembly.

Some of these projects take longer than others. The Havana Project is 100% manually curated, so it requires exhaustive work by its personnel to annotate genes one by one. By contrast, the NCBI Genome Browser collects transcript data from public databases and uses predictive software to model gene function. “This is an automated pipeline that has curator oversight into it, where tweaks can be made if issues are identified,” says Schneider. For the essential information about genes, RNA transcripts and proteins, the process moves very quickly, and the NCBI expects to release an updated Genome Browser this February, including annotation of all the new sequences and alternate loci in GRCh38. More complex information, like gene expression data or the inclusion of SNPs and other small variants, will be added in the following months.

In the meantime, an “analysis set” of GRCh38 is available on the GenBank FTP site, in versions with and without the alternate loci. “For GRCh37, we saw a lot of groups rolling their own version of the assembly,” says Schneider. “We wanted to standardize an analysis set, so when users are reporting their data, they’re all referring to the same collection of sequences.” The analysis set is compatible with most informatics pipelines for next-generation sequencers. Different pipelines have different expectations for details like naming conventions and sequence order in the reference assembly, so the analysis set packages the assembly in a format that all these pipelines will be able to recognize and access.
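The kind of mismatch involved is easy to picture. Two naming styles in common use label the same chromosome “1” or “chr1” and the mitochondrial genome “MT” or “chrM”, and tools also differ on whether chromosome 10 sorts before or after chromosome 2. The sketch below shows one way to normalize both. The function names are illustrative, and the choices made here are assumptions rather than a description of what the GRCh38 analysis set actually uses.

```python
def to_chr_style(name):
    """Map '1', 'X', 'MT' style names to 'chr1', 'chrX', 'chrM'; leave others untouched."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    if name.isdigit() or name in ("X", "Y"):
        return "chr" + name
    return name

def karyotype_key(name):
    """Sort key: numbered chromosomes numerically, then X, Y, M, then everything else."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    special = {"X": 1, "Y": 2, "M": 3}
    if core in special:
        return (special[core], 0, "")
    return (4, 0, name)

# Invented input in a different naming style and in plain string-sort order.
names = ["1", "10", "2", "MT", "X", "scaffold_7"]
standardized = sorted((to_chr_style(n) for n in names), key=karyotype_key)
print(standardized)
```

A shared analysis set spares each group from writing its own version of this step, and from picking a slightly different convention while doing so.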

The need to make all research using the reference genome share a common language underlines how far-reaching changes to the assembly can be. There is no alternative to the GRC’s reference genome, so any errors in the sequence will have ripple effects on genetic research worldwide. On the other hand, if the GRC doesn’t keep pace with evolving information about the genome, it can slow the pace of discovery for groups that rely on the reference assembly. There’s a fine balance to be struck between delivering as much of the human genome as possible, and maintaining a scrupulous standard of accuracy.

“It’s the best mammalian genome that’s out there,” says Schneider. “So we really only try to make changes where we think change is needed.”

*10q11: region 1, band 1 on the long arm (q) of chromosome 10.