March 16, 2010 | Certain signs of trait selection among human populations are easy to recognize—skin pigmentation, height, shape, and the ability to digest lactose are some examples.
In a superb paper published in Science, Pardis Sabeti and colleagues at Harvard University and the Broad Institute describe a powerful computational strategy for systematically searching across the human genome to pinpoint signatures of positive selection.
Sabeti’s team has harnessed three complementary approaches that, when combined, highlight discrete regions that have undergone positive selection in human populations dating back some 30,000 years. By applying this “Composite of Multiple Signals,” or CMS approach, Sabeti’s team dramatically refined genome regions under positive selection, typically narrowing the interval to just tens of kilobases—a single gene—and just a handful of putative variants.
Iranian-born Sabeti, currently an assistant professor of organismic and evolutionary biology at Harvard University, has been working on methods to detect “the footprint of natural selection” since she was a Rhodes Scholar.
Positive natural selection is the means by which advantageous traits become more common in a given population, perhaps by increasing an individual’s ability to adapt to shifting environmental conditions and/or reproduce. Sabeti’s interest is in understanding what factors have shaped human evolution, especially genes that may impart resistance to infectious diseases such as Lassa fever and malaria in Africa and other places.
In 2002, working with Eric Lander, Sabeti developed a long haplotype method that could detect such footprints. “When [an] advantageous mutation arises in the population, it takes with it the entire [genomic] region around it,” she explains. But the candidate regions were very large, typically spanning hundreds of thousands of bases and dozens of candidate genes. “You couldn’t really make heads or tails of what was going on,” she admits. Later, while working on a review article for Science, Sabeti focused on the black box: “how are you going to isolate what you care about?” Of the five types of genomic patterns used by researchers to detect selection, three in particular were relevant to human population data (rather than, say, other species). In addition to her long haplotype method, the others were population differentiation and the derived allele effect.
In 2007, Sabeti published a paper in Nature in which she proposed combining these various approaches. Analyzing data from the International Haplotype Map (HapMap) Project, she reported an intriguing hit in the Asian population: a gene called EDAR, which governs hair, sweat gland and tooth formation.
Since setting up her lab as an assistant professor at Harvard’s Center for Systems Biology, Sabeti has set about combining those three approaches to produce a formal statistical measure of positive selection. The CMS method applies three independent approaches to identify signatures of positive selection, as follows:
- Long Haplotype: This method tracks the age of a mutation using genetic recombination as a clock. “You can essentially track the age of the mutation by how much breakdown has occurred on that haplotype,” says Sabeti. “By identifying outliers to the general pattern across the genome (a prevalent variant on a very long haplotype), that’s a signal [of] selection.”
- Derived Frequency: A spike in the frequency of the new, or derived, allele is a second signal, because without selection it generally takes a long time for a new mutation to spread in a population.
- Population Differentiation: “If you scan the genome and you see a place where there’s a spike in a highly differentiated variant, that’s another signal [of] selection.” The classic example is the Duffy blood group locus, where a variant protects against the malaria parasite Plasmodium vivax.
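To make the three signals concrete, here is a minimal Python sketch of one simple statistic for each. It is an illustration under stated assumptions, not the software used by Sabeti’s team: haplotypes are assumed to be equal-length strings in which '1' marks the derived allele and '0' the ancestral one, and the function names (`ehh`, `derived_allele_frequency`, `fst`) are hypothetical. The long haplotype signal is represented by extended haplotype homozygosity among carriers of the derived allele, and population differentiation by a basic two-population Fst with equal weights.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """Extended haplotype homozygosity among carriers of the derived core allele.

    Returns the probability that two randomly chosen carrier haplotypes are
    identical at every site from index `core` to index `end` (inclusive).
    """
    lo, hi = min(core, end), max(core, end)
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == "1"]
    n = len(carriers)
    if n < 2:
        return 0.0  # homozygosity is undefined with fewer than two carriers
    counts = Counter(carriers)
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


def derived_allele_frequency(haplotypes, site):
    """Fraction of haplotypes carrying the derived allele ('1') at `site`."""
    return sum(1 for h in haplotypes if h[site] == "1") / len(haplotypes)


def fst(p1, p2):
    """Simple two-population Fst from derived-allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # the site is not variable in either population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Toy data (made up for illustration): four haplotypes, four sites.
haps = ["1100", "1100", "1101", "0010"]
print(ehh(haps, core=0, end=3))
print(derived_allele_frequency(haps, site=0))
print(fst(0.9, 0.1))
```

In a genome scan, statistics like these would be computed for every variant and compared against the genome-wide background, since each signal is defined by being an outlier.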
While the long haplotype method is in some ways the most powerful, it lacks the refinement of the other two methods in localizing the signal. The actual causal variant will have all three signals, whereas surrounding variants will show little overlap.
“The [CMS] test performed better than we ever expected, especially in the way it localizes spatially,” says Sabeti. “It’s able to distill the causal variant from the non–causal variants, but what’s beautiful about it is the breakdown of the correlation with the other variants is so rapid, you can localize to tens of kilobases.”
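The localization idea, that only the causal variant scores highly on all three tests at once, can be illustrated with a toy composite. The sketch below is not the published CMS statistic; as a simplified stand-in, it converts each test’s scores into empirical rank fractions across the variants in a region and multiplies them. The function names and the example numbers are hypothetical.

```python
def rank_fraction(values):
    """Map each value to the fraction of values that are <= it.

    The largest value maps to 1.0; ties share the same fraction.
    """
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]


def composite_scores(long_hap, derived_freq, pop_diff):
    """Multiply the three per-test rank fractions, variant by variant."""
    ranks = [rank_fraction(x) for x in (long_hap, derived_freq, pop_diff)]
    return [a * b * c for a, b, c in zip(*ranks)]


# Made-up scores for four neighboring variants. The long haplotype score
# is similar across the whole region, while the other two tests single
# out the variant at index 1.
long_hap = [2.0, 2.1, 2.2, 2.0]
derived_freq = [0.3, 0.9, 0.4, 0.2]
pop_diff = [0.1, 0.8, 0.2, 0.1]

scores = composite_scores(long_hap, derived_freq, pop_diff)
best = scores.index(max(scores))
print(best, scores)
```

Multiplying the scores rewards agreement: a variant that ranks highly on only one test is pulled down by the other two, which is why a broad long haplotype signal can be narrowed to a few candidates.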
After performing simulations to test the CMS method, Sabeti’s “perfect grad students” and co-first authors, mathematician Sharon Grossman (daughter of Princeton economics professor Gene Grossman) and Russian computer scientist Ilya Shlyakhter, applied it to more than 180 regions of the human genome deemed to be under positive selection, initially containing some 1,500 candidate genes. CMS whittled this down to 64 hotspot regions containing a single gene, 35 stretches containing multiple genes, and, curiously, 79 genome regions apparently bereft of protein-coding genes.
The Harvard team found potentially interesting functional gene groupings, such as sensory perception genes enriched in East Asia, immune system genes in West Africa, and metabolism genes in all three populations, as well as strong hits for several known skin and eye color genes. One of several sensory perception genes under selection in East Asia is protocadherin 15, which functions in hearing and vision, specifically in the development of inner ear cells and the maintenance of retinal photoreceptors. Its functional significance is unclear.
Another gene under selection in East Asia is the leptin receptor, which is linked to body mass index (BMI), blood pressure and obesity. “The [causal] mutation that we found is associated with an increased BMI, increased absorption of fat and nutrients. That suggests that 10,000 or 30,000 years ago, it was better to basically gain more weight,” Sabeti suggests. Meanwhile, in West Africans, variants affecting the expression of two genes, PAWR and USF1, have come under selection, presumably linked to infectious disease.
Sabeti is still tweaking the CMS test, but reckons it’s close to the theoretical limit. “It’s very hard to distill just by population genetic data alone what is the causal variant.” Sabeti is already extending CMS to sequence data from the 1000 Genomes Project, which will by definition contain the causal variant in any given region. But in many cases, pinning down the causal change will come down to functional investigation. There are also possibilities of extending the analysis further back in human history and to other species.
The software tools for CMS analysis are all homegrown and should soon be available, perhaps wrapped into Sweep, a program for long haplotype analysis.
This article also appeared in the March-April 2010 issue of Bio-IT World Magazine.