BGI Scientists Expand Reference Genes for Human Microbiome

July 14, 2014

By Aaron Krol

The Beijing Genomics Institute (BGI), China's gene sequencing powerhouse, has released a set of reference genes for the human gut microbiome, in a catalogue that is substantially larger and covers a greater diversity of human populations than any previous resource. The work is described in a recent Nature Biotechnology paper, "An integrated catalog of reference genes in the human gut microbiome," whose senior author is Jun Wang of BGI-Shenzhen; the reference itself is freely available at meta.genomics.cn.

A reference set of genes that have been found in organisms living in the human gut is an essential resource for profiling the species present in a person's microbiota, and can also help to estimate their abundance and phylogenetic relationships, or to identify species that are correlated with aspects of human health. However, as the authors note, "there has been no comprehensive and uniformly processed database that can represent the human gut microbiota around the world." The two largest previous reference catalogues, from the MetaHIT project and the Human Microbiome Project (HMP), contain imperfectly sequenced and redundant genes, and draw only on samples taken from individuals in Europe and the U.S., respectively. The BGI team combined sequencing data from both of those projects with hundreds of Chinese samples from a study of diabetes, plus 249 newly sequenced samples from Europe. Some organisms occur commonly in the human gut, but at such low abundance that few reads can be recovered from them. To cover their genomes adequately, the team also integrated reference genomes of bacteria and archaea from the NCBI and EMBL databases for any species that were 90% covered by the combined samples used in this project.

The resulting catalogue, the Integrated Gene Catalogue (IGC), contains nearly 10 million unique genes, a greater than 70% increase over either the MetaHIT or HMP resources. Because of a stricter quality control pipeline, the IGC also eliminates large proportions of short or fragmented genes from the prior databases. When the IGC was used as a reference for metagenomes from both the sample sets used in its creation and three independent sample sets, between 74% and 81% of sequencing data could be mapped to it in every case. The authors suggest that this is "close to the maximum achievable mapping rates," given the estimate that genes make up, on average, 87% of a prokaryotic genome.
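To make the "close to the maximum" claim concrete, the short Python sketch below divides the reported mapping rates by the 87% gene-content estimate. It is only illustrative arithmetic on the figures quoted above; the function name and the idea of treating 87% as a hard ceiling are assumptions made here, not the authors' method.

```python
# Assumption: the 87% average gene content of prokaryotic genomes acts as
# an upper bound on the share of reads that can map to a gene catalogue.
GENE_CONTENT_CEILING = 0.87


def fraction_of_ceiling(mapping_rate, ceiling=GENE_CONTENT_CEILING):
    """Express an observed mapping rate as a share of the assumed ceiling."""
    if not 0.0 <= mapping_rate <= 1.0:
        raise ValueError("mapping_rate must be between 0 and 1")
    return mapping_rate / ceiling


# The lowest and highest mapping rates reported for the IGC.
for rate in (0.74, 0.81):
    print(f"{rate:.0%} mapped -> {fraction_of_ceiling(rate):.0%} of the ceiling")
```

Under that assumption, the reported range works out to roughly 85% to 93% of what a perfect gene catalogue could capture.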

The impressive breadth of the IGC allows for some interesting observations. Individual samples used in the project contained roughly 760,000 genes on average, and any two samples shared roughly one third of those genes. Each sample contributed an average of 469 genes found in no other sample. As in other microbiome references, the species identity of most genes remains a mystery; only around 16% could be confidently assigned to a genus. While nearly all species found in a large proportion of samples were already known from previous studies to be part of the human microbiota, the wine-fermenting genus Oenococcus, found in 13.5% of samples in the IGC, had never previously been shown to live in the human gut.
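For readers who want to see how sharing statistics of this kind can be computed, here is a minimal Python sketch on toy data. The sample names, gene labels, and the choice to divide each pair's shared genes by the pair's mean gene count are all assumptions for illustration; the paper may define these quantities differently.

```python
from collections import Counter
from itertools import combinations


def mean_pairwise_shared_fraction(samples):
    """Average, over all sample pairs, of shared genes / mean genes per sample."""
    fractions = []
    for (_, a), (_, b) in combinations(samples.items(), 2):
        mean_size = (len(a) + len(b)) / 2
        fractions.append(len(a & b) / mean_size)
    return sum(fractions) / len(fractions)


def sample_specific_counts(samples):
    """Count, per sample, the genes that appear in no other sample."""
    seen_in = Counter(gene for genes in samples.values() for gene in genes)
    return {
        name: sum(1 for gene in genes if seen_in[gene] == 1)
        for name, genes in samples.items()
    }


# Hypothetical toy data: three samples with four genes each.
toy_samples = {
    "sample_1": {"a", "b", "c", "d"},
    "sample_2": {"c", "d", "e", "f"},
    "sample_3": {"d", "g", "h", "i"},
}

print(mean_pairwise_shared_fraction(toy_samples))
print(sample_specific_counts(toy_samples))
```

In this toy set the pairs share one half, one quarter, and one quarter of their genes, so the average is one third, and the samples hold two, two, and three sample-specific genes respectively.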

Based on their experience creating the IGC, the BGI team offer a number of suggestions for future investigation of the human gut microbiome. They speculate that "we may have reached saturated coverage of core gene content and functions, but rare genes will continue to be discovered," adding that most of the new genes included in the IGC were found in only a small minority of individuals. They also propose that, while deeper sequencing of individuals is a tempting way to get better read depth of low-abundance species, it may in fact be more cost-effective to simply sequence more samples at current read depths. In the case of the low-abundance genus Enterococcus, the IGC was able to improve coverage by over 70% thanks to a handful of samples where the genus was found in unexpectedly high abundance, a finding that may be repeated with other organisms.

Discovering more of these rare genes, the authors suggest, may shed a great deal of light on important functional differences between humans' commensal organisms. While the genes of known function that are found at high frequency in the IGC tend to cover basic processes like metabolism and signal transduction, those found in fewer than 1% of individuals tend to be involved in adaptive processes, like DNA repair, antibiotic resistance, and responses to phages and the human immune system. Covering more human populations is also likely to yield new functional insights: in a comparison of Danish to Chinese samples, using the IGC as a map, genes highly divergent between the two groups tended to be involved in the metabolism of specific carbohydrates, amino acids, and vitamins, strongly suggesting a relationship with human diet.
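The "fewer than 1% of individuals" cutoff mentioned above is a simple prevalence filter. The Python sketch below shows one way to apply it, assuming a hypothetical table of how many samples each gene was detected in; the gene labels and counts are invented for illustration and do not come from the IGC.

```python
def rare_genes(sample_counts, n_samples, threshold=0.01):
    """Return genes detected in less than `threshold` of all samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sorted(
        gene
        for gene, count in sample_counts.items()
        if count / n_samples < threshold
    )


# Hypothetical detection counts across 1,000 samples.
detections = {
    "gene_core": 950,
    "gene_mid": 120,
    "gene_edge": 10,   # exactly 1%, so not below the cutoff
    "gene_rare_a": 4,
    "gene_rare_b": 9,
}

print(rare_genes(detections, n_samples=1000))
# prints ['gene_rare_a', 'gene_rare_b']
```

Genes that pass such a filter could then be checked against functional annotations to see whether adaptive processes are over-represented among them.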

"Similar to the field of human genetics, where the search for new alleles has progressed from common to rare," the authors conclude, "our data indicate that cataloging of our 'other genome,' the human gut microbiome, is also entering the stage for identification of rare or individual-specific genes."

With the IGC freely available online to researchers around the world, new studies using it as a reference map are likely to appear in the coming months, helping to show whether outside groups find the catalogue a useful and reliable tool for studying the human microbiome.