New CMS Method Pinpoints Positive Selection Signatures in Human Populations



By Kevin Davies

January 7, 2010 | Certain signs of trait selection among human populations are fairly easy to recognize -- skin pigmentation, height, shape, and the ability to digest lactose are some examples. The prevalence of deadly genetic diseases such as thalassemia and possibly cystic fibrosis can be traced to the advantage conferred to carriers of the faulty gene against infections by malaria and (perhaps) cholera or tuberculosis.

In a new paper published online this week in Science, Pardis Sabeti and colleagues at Harvard University and the Broad Institute describe a powerful computational strategy for systematically searching across the human genome to pinpoint signatures of positive selection. Sabeti’s team has harnessed three complementary approaches that, when combined, highlight discrete regions that have undergone positive selection in human populations dating back some 30,000 years. By applying this “Composite of Multiple Signals,” or CMS approach, Sabeti’s team dramatically refined genome regions under positive selection, typically narrowing the interval to just tens of kilobases -- a single gene -- and just a handful of putative variants.

Finding Footprints

Iranian-born Sabeti, currently an assistant professor of organismic and evolutionary biology at Harvard University, has been working on methods to detect “the footprint of natural selection” since she was a Rhodes Scholar. Positive natural selection is the means by which advantageous traits become more common in a given population, perhaps by increasing an individual's ability to adapt to shifting environmental conditions and/or reproduce. Sabeti’s interest is in understanding what factors have shaped human evolution, especially genes that may impart resistance to infectious diseases such as Lassa fever and malaria in Africa and other places.

In 2002, working with Eric Lander, Sabeti developed a long haplotype method that could detect such footprints. “When advantageous mutations arise in the population, it takes with it the entire [genomic] region around it,” she explains. But the candidate regions were very large, typically spanning hundreds of thousands of bases and dozens of candidate genes. “You couldn’t really make heads or tails of what was going on,” she admits. Researchers would weave a story around their favorite gene, without providing any compelling supporting evidence.

Later, while working on a review article for Science, Sabeti focused on the black box: “How are you going to isolate what you care about?” Of the five types of genomic patterns used by researchers to detect selection, three in particular were relevant to human population data (rather than, say, data from other species). In addition to her long haplotype method, the others were population differentiation and the derived allele effect. “While they’re tracking the same pattern, they’re pretty independent in what aspect they’re picking up,” she says.

In 2007, Sabeti published a paper in Nature in which she proposed combining these various approaches. Analyzing data from the International Haplotype Map (HapMap) Project, she reported an intriguing hit in the Asian population – a gene called EDAR, which governs hair, sweat gland and teeth formation.

Triple Score

Since setting up her lab as an assistant professor at Harvard’s Center for Systems Biology, Sabeti has set about combining those three approaches to produce a formal statistical measure of positive selection. The CMS method applies three independent approaches to identify signatures of positive selection, as follows:

·         Long Haplotype. This method tracks the age of a mutation using genetic recombination as a clock. “When a mutation arises on one chromosome, one haplotype background, it’s in complete correlation with all the other variants on that chromosome. With every generation that mutation’s around, there’s a higher probability that a recombination event will break down that haplotype. You can essentially track the age of the mutation by how much breakdown has occurred on that haplotype.” In other words, when a variant becomes common in the population against a highly conserved background, that signals a rapidly arising mutation. “By identifying outliers to the general pattern across the genome – a prevalent variant on a very long haplotype -- that’s a signal [of] selection.”

·         Derived Frequency. In this method, Sabeti looks for a spike in the frequency of the new, or derived, allele. “Generally it takes a long time for a new mutation to spread in a population. The new mutations -- the derived state of the mutation -- are going to drift very slowly to high prevalence in the population. When you look at derived alleles across the genome, they’re going to be rare, the minor allele. But the new mutation we detect using the long haplotype signal will likely be one that happened in the last tens of thousands of years. It will rise in prevalence and take with it all the other derived alleles in the population. So you see a local peak where they all rise in prevalence.”

·         Population Differentiation. As human populations separated relatively recently, patterns of genetic variation generally look quite similar. “It’s very unusual to see something present in one population but absent in another,” says Sabeti. Positive selection, when it occurs, is generally a response to a localized environmental pressure (e.g. temperature, infectious disease), driving differences between populations. “If you scan the genome and you see a place where there’s a spike in a highly differentiated variant, that’s another signal [of] selection.” The classic example is the Duffy blood group locus, where a mutation that protects against Plasmodium is at 100% frequency in parts of Africa, and absent elsewhere.
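To make the three signals concrete, here is a minimal Python sketch on invented toy data. It is not the code or the exact statistics used by Sabeti's team: the function names, the two six-haplotype populations, and the simple two-population differentiation formula are all assumptions chosen for illustration. Haplotypes are written as strings in which '1' marks the derived allele and '0' the ancestral one.

```python
from collections import Counter
from math import comb

def derived_freq(haps, site):
    """Fraction of haplotypes carrying the derived allele ('1') at a site."""
    return sum(h[site] == "1" for h in haps) / len(haps)

def fst(p1, p2):
    """Toy two-population differentiation from allele frequencies.

    Compares heterozygosity of the pooled populations with the average
    heterozygosity within each one (equal weights; an assumed simplification).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # site is not variable in either population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

def ehh(haps, core, site, allele="1"):
    """Chance that two haplotypes carrying `allele` at `core` are identical
    over the whole stretch between `core` and `site` (haplotype homozygosity)."""
    lo, hi = sorted((core, site))
    carriers = [h[lo:hi + 1] for h in haps if h[core] == allele]
    if len(carriers) < 2:
        return 0.0
    identical_pairs = sum(comb(n, 2) for n in Counter(carriers).values())
    return identical_pairs / comb(len(carriers), 2)

# Hypothetical data: five sites, with the candidate variant at index 2.
pop_a = ["10110", "10110", "10110", "10110", "01001", "00010"]
pop_b = ["01001", "00010", "01000", "00001", "01010", "00000"]
CORE = 2

freq_a = derived_freq(pop_a, CORE)       # derived allele is common in A
freq_b = derived_freq(pop_b, CORE)       # and absent from B
differentiation = fst(freq_a, freq_b)    # large gap between populations
long_hap = ehh(pop_a, CORE, 0)           # carriers share an unbroken haplotype
background = ehh(pop_a, CORE, 0, "0")    # non-carriers do not
```

In this toy, the derived allele at the core site is common in population A, absent from population B, and sits on a haplotype that recombination has not yet broken down, so all three signals point at the same site.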

While the long haplotype method is in some ways the most powerful, it lacks the refinement of the other two methods in localizing the signal. The actual causal variant will have all three signals, whereas surrounding variants will show little overlap.
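The localizing effect of requiring all three signals at once can be illustrated with a small Python sketch. This is a deliberately simplified stand-in, not the published CMS statistic (which is defined in the Science paper): here each signal is converted to a percentile rank across variants and the ranks are multiplied. The function names and the per-variant scores are hypothetical.

```python
from math import prod

def percentile_ranks(values):
    """Map each value to the fraction of values that are less than or equal to it."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]

def composite_scores(signals):
    """Multiply per-signal percentile ranks, variant by variant.

    `signals` maps a signal name to a list of scores, one per variant,
    with every list in the same variant order.
    """
    ranked = [percentile_ranks(vals) for vals in signals.values()]
    return [prod(per_variant) for per_variant in zip(*ranked)]

# Hypothetical scores for five neighboring variants; index 2 is the causal one.
# The long haplotype signal is broad, while the other two are sharper.
signals = {
    "long_haplotype": [0.9, 0.8, 1.0, 0.85, 0.7],
    "derived_frequency": [0.2, 0.6, 0.9, 0.3, 0.1],
    "population_differentiation": [0.1, 0.3, 0.8, 0.5, 0.2],
}

scores = composite_scores(signals)
best = scores.index(max(scores))  # variant with the strongest combined evidence
```

Because a neighboring variant that ranks highly on one signal is pulled down by its lower ranks on the others, the combined score falls off faster around the causal variant than any single signal does.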

“The [CMS] test performed better than we ever expected, especially in the way it localizes spatially,” says Sabeti. “It’s able to distill the causal variant from the non-causal variants, but what’s beautiful about it is the breakdown of the correlation with the other variants is so rapid, you can localize to tens of kilobases.”

Time Signatures

After performing simulations to test the CMS method, Sabeti’s “amazing duo” of co-first authors -- mathematician Sharon Grossman (daughter of Princeton economics professor Gene Grossman) and Russian computer scientist Ilya Shlyakhter -- applied it to more than 180 regions of the human genome deemed to be under positive selection, initially containing some 1,500 candidate genes. CMS whittled this down to 64 hotspot regions containing a single gene, 35 stretches containing multiple genes, and curiously, 79 genome regions apparently bereft of protein-coding genes.

The Harvard team found potentially interesting functional gene groupings, such as sensory perception genes enriched in East Asia, immune system genes in West Africa, and metabolism genes in all three populations, as well as strong hits for several known skin and eye color genes. One of several sensory perception genes under selection in East Asia is protocadherin 15, which functions in hearing and vision – the development of inner ear cells and maintenance of retinal photoreceptors. Its functional significance is unclear.

Another gene under selection in East Asia is the leptin receptor, which is linked to body mass index (BMI), blood pressure and obesity. “The [causal] mutation that we found is associated with an increased BMI, increased absorption of fat and nutrients. That suggests that 10,000 or 30,000 years ago, it was better to basically gain more weight,” Sabeti suggests. Meanwhile, in West Africans, variants affecting the expression of two genes, PAWR and USF1, have come under selection, presumably linked to infectious disease.

Sabeti is still tweaking the CMS test, but reckons it’s close to the theoretical limit. “It’s very hard to distill just by population genetic data alone what is the causal variant.” Simulations show that CMS identifies the causal variant flanked by just 20 proxies on average. Greater resolution could come from studying more subjects or complete sequence data (rather than just SNPs). Sabeti is already extending CMS to sequence data from the 1000 Genomes Project, which will by definition contain the causal variant in any given region. But in many cases, pinning down the causal change will come down to functional investigation. There are also possibilities of extending the analysis further back in human history and to other species.

The software tools for CMS analysis are all homegrown, and should soon be available, perhaps wrapped into a program called Sweep for the long haplotype analysis. Another challenge is identifying the causal variants in the regions that lack genes, which presumably affect the regulation of genes some distance away. Sabeti has unpublished data that several of these signals occur in regions harboring long non-coding RNAs (lincRNAs), a finding her team is following up with Broad Institute colleague John Rinn.

Sabeti concludes her paper by noting that as the selection data emerge, “It may be possible to observe much of evolution’s most recent handiwork and identify many of the functional adaptations that work to shape species.” In the meantime, she is busy recording her own handiwork, somehow finding the time to lay down new tracks for her band’s next CD.  

Editor's Note: Pardis Sabeti will be presenting her latest results at this year's Bio-IT World Conference & Expo on Thursday April 22.

Further Reading:

Grossman, S., Shlyakhter, I. et al. “A composite of multiple signals distinguishes causal variants in regions of positive selection.” Science DOI: 10.1126/science.1183863

 
