Scientists Finally Finish The Quest For A Gapless Human Genome

April 7, 2022

By Deborah Borfitz

April 7, 2022 | There was no escaping last week’s announcement that the final 8% of the human genome, the equivalent of an entire chromosome, has been sequenced, filling a gap that had remained open for nearly 20 years after the Human Genome Project declared victory. More than 100 researchers from around the world collaborated on the initiative, and many of them, along with the institutions they represent, have weighed in on the feat.

Where the original $3 billion project failed was in reading the many big chunks of DNA containing highly repetitive sequences, which in 2003 were largely dismissed as junk. “It turned out that many of the regions I was interested in were in the gaps,” remarks University of Washington investigator Evan Eichler, in a news report posted on the website of the Howard Hughes Medical Institute (HHMI).

The new reference genome, dubbed T2T-CHM13, adds nearly 200 million base pairs of novel DNA sequences, including 99 genes likely to code for proteins and nearly 2,000 candidate genes that need further study. It also corrects thousands of structural errors in the current reference sequence.

Eichler’s work was heavily referenced in a research article (one of six) published March 31 in Science (DOI: 10.1126/science.abj6987) heralding the first-ever sequencing of an entire human genome. His team had shown the value of using Pacific Biosciences technology to resolve more complex forms of genetic variation. Ultimately, long-read sequencing data from Pacific Biosciences machines were combined with data from Oxford Nanopore to fill all the gaps.

Most of the 23 chromosomes were assembled via a “hackathon” with scientists working remotely over Zoom and Slack during the pandemic, according to the HHMI report. Members of the 3-year-old Telomere-to-Telomere (T2T) Consortium spearheading the latest sequencing achievement are now working to replicate it using a genome with different chromosomes inherited from each parent, and to launch a pan-genome effort to read the entire DNA sequences of hundreds of people globally.

“The new human genome is incredibly accurate at the base level, allowing us to flag hundreds of thousands of variants that had been misinterpreted by mapping them to the standard reference,” says Karen Miga, assistant professor of biomolecular engineering at the University of California, Santa Cruz, who helped organize the T2T Consortium, in a news article posted by the university. “Many of these new variants are in genes known to contribute to disease. We can now spot those because we have a more complete and accurate reference genome.”

Miga’s research has focused on satellite DNA, the long stretches of repetitive DNA sequences found mostly in and around telomeres and centromeres. “By far the largest portion of new sequences added to the reference are centromere satellite DNAs,” she says. “For the first time, we can study ‘base-by-base’ the sequences that define the centromere and can start to understand how it works.”

As explained in an article put out by the University of California, Davis, home to several investigators contributing to the gapless sequence, the new reference genome comes from a single human sample, although not exactly a person. The DNA came from a cell line derived from a bundle of cells (a “hydatidiform mole”) that forms when an egg in the uterus loses its own genome but gets fertilized by a sperm.

The resulting cell ends up with two identical copies of each chromosome, unlike most human cells, which carry two slightly different copies. Despite its odd origin, there’s nothing to suggest anything out of the ordinary with the cell line’s genome, says Megan Dennis, assistant professor of biochemistry and molecular medicine. The original human reference genome was stitched together from several people, creating some errors and artifacts.

Genomes collected by the 1000 Genomes Project are now being reanalyzed using the new reference sequence to discover and verify thousands of new variants, Dennis says in the post. The 1000 Genomes Project is an international collaboration to create a catalogue of human genetic variation.

According to a report posted by the University of California, Berkeley, scientists there are using the new reference genome as a scaffold to compare the centromeric DNA of 1,600 individuals from around the world, revealing major differences in both the sequence and copy number of repetitive DNA around the centromere—a pattern previously seen with genetic variants when groups of ancient humans migrated out of Africa to the rest of the world.

“What we found is that in individuals with recent ancestry outside the African continent, their centromeres, at least on chromosome X, tend to fall into two big clusters, while most of the interesting variation is in individuals who have recent African ancestry,” states Nicolas Altemose, a postdoctoral fellow at the university. “This isn’t entirely a surprise, given what we know about the rest of the genome. But what it suggests is that if we want to look at the interesting variation in these centromeric regions, we really need to have a focused effort to sequence more African genomes and do complete telomere-to-telomere sequence assembly.”

Many early-career researchers and trainees played pivotal roles in the T2T project. Among those hailing from the University of Connecticut, one of the major contributions was developing a compendium of the repeats in the genome, according to a post on the university’s website. They found the repetitive sections contained mobile elements capable of jumping from one part of the genome to another, as well as viruses and new repeats no one had identified before, including some that carry genes.

More than a dozen researchers from Johns Hopkins University worked on different aspects of the international initiative, including how the completed genome will enhance understanding of gene regulation and expression. One group of investigators, for example, showed how it improves the ability of scientists to understand variations in the genomes of individuals from different populations. As stated in a report on the Johns Hopkins website, they used a cloud-based platform of the National Human Genome Research Institute to analyze data from more than 3,200 people from around the world to reveal more than a million previously unknown genetic variants.

As pointed out by biomedical engineer Justin Zook in a post made by the National Institute of Standards and Technology, much work remains to be done before clinics put the T2T reference into practice—specifically, “to understand the accuracy of DNA sequences of many individuals in regions of the genome that this reference now makes accessible.” Researchers have used the current reference to analyze millions of genomes and will now need to grasp the ins and outs of the new reference in the same way to properly interpret results when using it, he says.