Researchers Assemble Pan-Genome From 910 Humans Of African Descent Using Deep Sequencing

November 20, 2018

By Bio-IT World Staff

Researchers from Johns Hopkins University and other institutions recently used a deeply sequenced dataset of 910 individuals of African descent to construct a set of DNA sequences that are missing from the current human reference genome. According to the researchers, the African pan-genome contains close to 10% more DNA than the current human reference genome. The results of the study were published in Nature Genetics (DOI: https://doi.org/10.1038/s41588-018-0273-y).

Despite efforts to improve it, the authors argue, the current human reference genome is of limited use for genetic studies because it is derived primarily from a single individual.

“The lack of diversity in the reference genome poses many challenges when analyzing individuals whose genetic background does not match the reference,” the authors write. “Findings from the 1000 Genomes Project indicate that differences between populations are quite large; examination of 26 populations across five continents revealed that 86% of discovered variants were present in only one continental group. In that study, the five African populations examined had the highest number of variant sites compared with the remaining 21 populations.”

Pan-genomes, which are collections of sequences representing all of the DNA in a particular population, are crucial in addressing this disparity. They are also very rare, according to the authors.

“Although multiple pan-genomes have been created for bacterial species, as of yet, there are no pan-genomes for any other animal or plant species,” the researchers write. “The lack of pan-genomes is due in part to the technical challenges of assembling many deeply sequenced genomes de novo and combining them into one genome.”

The researchers at Johns Hopkins looked to create a pan-genome building on the work done by the Danish Genome Project, an effort to establish a high-quality Danish reference genome. The authors write that they chose to focus their study on a heterogeneous group of admixed individuals.

“Because the human reference genome is largely complete (the sequence has very few gaps), our strategy for creating a pan-genome focused on finding large insertions. This approach, although computationally demanding, made the African pan-genome assembly process described here feasible,” they write.

The researchers aligned 1.19 trillion reads from the 910 individuals, who were selected from 19 populations across the Americas, the Caribbean, and continental Africa, to the reference genome. They then collected all reads that failed to align and assembled them into contiguous sequences. Sequencing was performed using an Illumina HiSeq 2000.
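The "collect all reads that failed to align" step can be illustrated with a minimal Python sketch. It assumes the alignments are available as SAM-format text, where bit 0x4 of the FLAG field marks a read as unmapped. The example records and the function name `unaligned_reads` are hypothetical, and this is not the authors' actual pipeline, which the article does not describe in detail.

```python
# Illustrative sketch only: the SAM records below are made up and this is
# not the study's pipeline. It shows how unaligned reads can be picked out
# of SAM-format alignment output before being handed to an assembler.

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unaligned_reads(sam_lines):
    """Yield (read_name, sequence) for records whose FLAG has the unmapped bit set."""
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED_FLAG:
            yield name, seq


# Hypothetical input: two mapped reads (flags 0 and 16), two unmapped (4 and 77).
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACCAA\tIIIIIIII",
    "read3\t16\tchr2\t500\t60\t8M\t*\t0\t0\tGGGTTTAA\tIIIIIIII",
    "read4\t77\t*\t0\t0\t*\t*\t0\t0\tCCCCGGGG\tIIIIIIII",
]

# These reads would be the input to a de novo assembly step.
candidates = list(unaligned_reads(example_sam))
for name, seq in candidates:
    print(name, seq)
```

At the scale reported in the study, this filtering would normally be done with dedicated tools on compressed alignment files rather than line by line in Python; the sketch only shows the selection logic.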

The authors discovered 296.5 Mb of novel DNA, distributed across 125,715 sequences assembled from the 910 individuals of African descent, an amount equal to close to 10% of the current reference genome.
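As a rough check on these figures, the short Python calculation below relates the 296.5 Mb of novel sequence to the size of the reference genome and to the number of assembled sequences. The reference length of about 3.1 billion base pairs is an assumption added here for illustration; the article does not state it.

```python
# Back-of-the-envelope check of the reported figures.
novel_bp = 296.5e6        # novel DNA reported in the article (296.5 Mb)
num_sequences = 125_715   # assembled sequences reported in the article
reference_bp = 3.1e9      # ASSUMPTION: approximate GRCh38 length, not from the article

# Novel DNA as a fraction of the assumed reference length.
fraction_of_reference = novel_bp / reference_bp

# Average length of one assembled novel sequence.
mean_sequence_bp = novel_bp / num_sequences

print(f"Novel DNA relative to reference: {fraction_of_reference:.1%}")
print(f"Mean novel sequence length: {mean_sequence_bp:,.0f} bp")
```

Under that assumption the novel DNA comes to a little under 10% of the reference length, consistent with the "close to 10%" figure, and the assembled sequences average a few thousand base pairs each.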

The findings, the authors write, demonstrate that the standard human reference genome lacks DNA sequence that is present in other human populations. The novel DNA found would “necessarily be missed by any efforts relying only on GRCh38 [the current iteration of the human reference genome] to study human variation, as nearly all studies do at present.”

“Overall these results suggest that a single reference genome is not adequate for population-based studies of human genetics,” the authors write. “Instead, a better approach may be to create reference genomes for all distinct human populations, which over time will eventually yield a comprehensive pan-genome capturing all of the DNA present in humans.”