The difference is in the layers. Eric Schadt believes that genome-wide association studies are just the beginning.
By John Russell and Kevin Davies
Oct. 8, 2008 | 2007 was proclaimed as the annus mirabilis for genome-wide association studies (GWAS)—a tipping point where scientists were able to identify scores of susceptibility genes for common diseases. But scientists at Seattle’s Rosetta Inpharmatics, a subsidiary of Merck, led by Eric Schadt, are aiming for something altogether grander.
Schadt’s assertion that GWAS “doesn’t really take you very far” borders on heresy these days. His view, however, is that while an association study provides a telling signpost in the genome for a specific trait, it doesn’t necessarily pinpoint the causative gene to target in treating that disease. “Even if it does lock you onto the [susceptibility] gene, it doesn’t tell you: do I activate the gene? Do I inactivate the gene? And it doesn’t tell you what is the broader context in which that gene’s operating in? What’s the relevant tissue? And how is it actually leading to this disease state or drug response phenotype?”
And so Schadt, a mathematician by training who joined Rosetta Inpharmatics in 1999 partly out of frustration with the limitations of academic grant funding, has built an integrative genomics research program that begins by identifying key disease-related genes and then asks: What are the networks involved in driving those disease states?
Schadt sees genetics as simply one dimension of a much more complex problem. “Changes in DNA don’t directly cause obesity or diabetes. Instead they’re affecting these molecular networks—gene networks or molecular networks or protein function or transcription or metabolite concentration, whatever the case may be… It’s changes in those molecular states that go on to predispose you or increase your disease risk.”
If Schadt’s group can “fill in those middle layers, those intermediate phenotypes that respond more proximally to the DNA variation,” then it can reconstruct a more accurate picture of how disease manifests itself. And Merck, which owns Rosetta, is happy. “We can deliver completely novel, unexpected, well-characterized targets that nobody else is working [on],” says Schadt.
Schadt’s rapidly rising profile has certainly caught the field’s attention. Incoming requests are “completely out of control,” including up to 200 emails a day. He receives 100 lecture invitations each year. Even rival pharmas such as Pfizer and Amgen are asking him to give talks to their researchers.
It Pays to Layer
The idea, which Schadt has been nurturing since publishing a major paper with Merck vice president Stephen Friend in Nature in 2003, is to consider DNA variation information as but one dimension, and layer on top data on gene expression or protein interactions or DNA-protein binding. Then integrate that information into a mathematical model to construct networks that are predictive of these disease states.
With a predictive model of a disease state in hand, Schadt can ask: If we have all the genes involved in this network, what’s the best node in the network to target for disease treatment or drug response? “It solves a lot of the biology upstream, so that you can make better informed decisions earlier on about what you should be targeting, what you should be tracking to stratify patient populations into disease subtypes and responder groups… so you can make more informed decisions.”
Five years ago, Schadt’s group started a collaboration with Iceland’s deCODE Genetics, culminating in one of two landmark papers (this study looked at blood and adipose tissue in the Icelandic population) that appeared back to back in Nature in March 2008. There are about eight different projects ongoing in different tissues and diseases, including brain, cancer, and the liver (see below). Each one takes roughly a year to complete.
In a recent report in PLoS Biology, Schadt’s group looked at the “amazing” Wellcome Trust Case Control Consortium GWAS from 2007 (Nature, 07 June 2007), which mapped genes for seven common diseases in 14,000 individuals.
In type 1 diabetes, the most significant association had come with the gene ERBB3. But Schadt noted that the only evidence was that the diabetes-associated SNP was situated closest to ERBB3; it wasn’t actually in the gene region. Although ERBB3 made sense from a biological standpoint, Schadt’s team set out to examine all the neighboring genes and objectively monitor their activity in tissues relevant to diseases like obesity and diabetes.
When the team looked at gene expression levels across 400 samples, asking whether the activity of any of the 50-100 genes in that region correlated with the disease-associated SNP, ERBB3 struck out. But a nearby gene, RPS26, was significantly associated: the SNP explained 40% of the variation in RPS26 activity. Says Schadt: “Now you have this SNP that’s associated with disease that’s also associated with the activity of this gene that’s right in the region.”
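The screen described above, asking how much of each neighboring gene's expression variation a single SNP explains, can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the group's actual pipeline: the gene names GENE_A and GENE_B, the effect sizes, and the additive 0/1/2 genotype coding are all assumptions made for the example.

```python
import random

def r_squared(genotypes, expression):
    """Fraction of expression variance explained by a linear fit on genotype.

    Genotypes are assumed to be coded as 0, 1 or 2 copies of an allele.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_e = sum((e - mean_e) ** 2 for e in expression)
    if var_g == 0 or var_e == 0:
        return 0.0
    return cov * cov / (var_g * var_e)

def rank_genes_by_snp(genotypes, expression_by_gene):
    """Rank genes in a region by how much of their variance the SNP explains."""
    scores = {gene: r_squared(genotypes, values)
              for gene, values in expression_by_gene.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Simulated data (hypothetical): 400 samples, one SNP, two genes in the region.
rng = random.Random(0)
genotypes = [rng.choice([0, 1, 2]) for _ in range(400)]
expression_by_gene = {
    "GENE_A": [0.8 * g + rng.gauss(0, 0.8) for g in genotypes],  # driven by the SNP
    "GENE_B": [rng.gauss(0, 1) for _ in genotypes],              # unrelated to the SNP
}

ranking = rank_genes_by_snp(genotypes, expression_by_gene)
for gene, score in ranking:
    print(f"{gene}: {score:.1%} of expression variance explained")
```

In a real study the ranking would also need a significance test and a correction for the number of genes examined; the sketch only shows the variance-explained calculation itself.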
But even that’s not enough. Next, Schadt’s group reconstructed the information in a network of tens of thousands of genes. He asked: what’s the context in which this gene operates and does that context support type 1 diabetes? “This is the power of having constructed these predictive models—we can go into that network and say, give me the location in the network where RPS26 resides and then give me all the genes in the region of RPS26 that RPS26 is communicating with.”
Schadt combines these data into a causal probabilistic network, which predicts not only what genes are interacting or correlated with each other, but which one is actually causing others to change. The DNA variation information is critical because it is the ultimate source of perturbation. Once the network is constructed, it is represented in a database that can be searched in silico.
By querying all the genes the network predicts are directly affected by RPS26 or directly affecting RPS26, the search produces a subset of genes. Then the team asks, are the genes representative of type 1 diabetes? In a query against the KEGG database, the type 1 diabetes pathway was the number one pathway enriched in this RPS26 sub network.
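The query described above has two steps: pull the genes directly connected to a gene of interest, then ask which annotated pathway is over-represented among them. A common way to score over-representation is a hypergeometric test, sketched below. The article does not say which statistic the team used, so the test, the toy network, and the names HUB, G1 to G9, pathway_X and pathway_Y are all assumptions for illustration.

```python
from math import comb

def neighbors(edges, gene):
    """Genes directly upstream or downstream of `gene` in a directed edge list."""
    downstream = {b for a, b in edges if a == gene}
    upstream = {a for a, b in edges if b == gene}
    return downstream | upstream

def hypergeom_pvalue(population, pathway_size, sample_size, overlap):
    """P(at least `overlap` pathway genes) when drawing `sample_size` genes at random."""
    max_k = min(pathway_size, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / comb(population, sample_size)

def rank_pathways(subnetwork, pathways, population):
    """Return (pathway, overlap, p-value) tuples, most enriched first."""
    results = []
    for name, members in pathways.items():
        overlap = len(subnetwork & members)
        p = hypergeom_pvalue(population, len(members), len(subnetwork), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda row: row[2])

# Toy network and pathway annotations (hypothetical).
edges = [("HUB", "G1"), ("HUB", "G2"), ("HUB", "G3"), ("G4", "HUB"), ("G5", "G6")]
pathways = {
    "pathway_X": {"G1", "G2", "G3", "G7"},
    "pathway_Y": {"G4", "G5", "G6", "G8", "G9"},
}

subnetwork = neighbors(edges, "HUB")
ranked = rank_pathways(subnetwork, pathways, population=100)
for name, overlap, p in ranked:
    print(f"{name}: overlap={overlap}, p={p:.3g}")
```

With many pathways in play, the p-values would need a multiple-testing correction before any one pathway could be called the top hit.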
“That’s just an example of how we can show how this approach informs the discoveries coming out of the GWAS.” The integrative genomics approach provides information about what is going on in that region; what is the gene, what’s the network it’s operating in, and a functional mechanistic understanding of its effect.
According to Schadt’s new study in Nature Genetics, including more types of molecular state information leads to more predictive models. That means scoring not just DNA variation or gene expression, but protein interaction, DNA-protein binding, differential methylation, and so on.
Schadt says the tools for detecting DNA-protein binding and methylation status are definitely improving. “Within the next few years, you’ll start seeing large-scale, population-based studies being carried out where those things are being assessed. So, there already have been early papers on detection of genome-wide methylation through re-sequencing, bisulphite re-sequencing of genomes.”
As that technology matures, the detection of methylation and DNA-protein binding is also doable. But protein interaction is more difficult, as there isn’t a way to comprehensively screen protein interactions in a high-throughput fashion for thousands of individuals in multiple tissues. Screening protein interactions is too artificial, he says, so researchers can’t assess how those interactions are correlated throughout the whole system. Schadt would like to be able to assess when a protein is interacting with another protein and what other interactions are occurring simultaneously, but that’s not possible with current technology. “You can only assess one interaction at a time and that’s of limited use,” he says.
Schadt says of reconstructing networks based exclusively on gene expression data (his first attempt was a 2005 Nature Genetics paper), “you are almost doomed to fail as far as your ability to make predictive models because there are so many problems.” Correlation-based information does not enable researchers to infer how things are causally related.
“In most cases, the different sub graphs you’re considering are what are called ‘Markov equivalents.’ So, no matter how much data you had in that setting, you’re not going to resolve the right structure because statistically they’re indistinguishable.” The key, he says, is to introduce a source of systematic perturbation to break that Markov equivalence. Groups have tried doing that with gene knockouts or chemical perturbations, but Schadt says that’s still artificial and one-dimensional.
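The simplest case of Markov equivalence involves just two genes. Under a linear Gaussian model, "A drives B" and "B drives A" fit expression data equally well, whichever direction is true. The sketch below simulates data in which gene A really does drive gene B, then fits both directions by least squares. The variable names, the effect size, and the linear Gaussian assumption are choices made for this illustration, not details from the article.

```python
import math
import random

def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def residuals(target, predictor):
    """Residuals from a least-squares fit of target on predictor (both centered)."""
    slope = (sum(p * t for p, t in zip(predictor, target))
             / sum(p * p for p in predictor))
    return [t - slope * p for p, t in zip(predictor, target)]

def gaussian_loglik(errors):
    """Gaussian log-likelihood of zero-mean errors at the maximum-likelihood variance."""
    n = len(errors)
    variance = sum(e * e for e in errors) / n
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)

def loglik_direction(cause, effect):
    """Maximized log-likelihood of the two-node model cause -> effect."""
    c, e = centered(cause), centered(effect)
    return gaussian_loglik(c) + gaussian_loglik(residuals(e, c))

# Simulated expression data in which gene A truly drives gene B.
rng = random.Random(1)
gene_a = [rng.gauss(0, 1) for _ in range(500)]
gene_b = [0.7 * a + rng.gauss(0, 1) for a in gene_a]

ll_a_to_b = loglik_direction(gene_a, gene_b)
ll_b_to_a = loglik_direction(gene_b, gene_a)
print(f"A -> B: {ll_a_to_b:.6f}")
print(f"B -> A: {ll_b_to_a:.6f}")
```

The two log-likelihoods agree up to floating-point rounding, and collecting more samples does not separate them. That is the sense in which the competing structures are statistically indistinguishable from expression data alone.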
By contrast, “The DNA variation data provide a naturally occurring source of systematic perturbation. Ultimately, it is DNA changes that drive how the system varies between species, between individuals within the species. So, it is the ultimate source of causal perturbation, but it’s multi-factorial and occurring in a context where the complex phenotypes actually manifest themselves.”
“This is the revolution of what we’re doing—the DNA variation information has not systematically been leveraged as a source of perturbation to put together the causal network… It’s only by including that kind of information that you can get to something predictive.”
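A toy example shows why a DNA variant can settle a question that expression data alone cannot. Genotype is fixed before expression, so a SNP can only sit upstream. If the SNP acts on gene B only through gene A, then the SNP and B should be uncorrelated once A is accounted for, while the SNP and A should stay correlated even after accounting for B. The sketch below checks this with partial correlations on simulated data. It is a simplified stand-in for the group's probabilistic network methods, and the chain SNP -> A -> B, the effect sizes, and the variable names are assumptions for illustration.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = correlation(x, y), correlation(x, z), correlation(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated chain (hypothetical): SNP -> gene A -> gene B.
rng = random.Random(2)
genotype = [rng.choice([0, 1, 2]) for _ in range(2000)]
gene_a = [1.0 * g + rng.gauss(0, 1) for g in genotype]
gene_b = [0.8 * a + rng.gauss(0, 1) for a in gene_a]

# If A mediates the SNP's effect on B, this should be near zero.
snp_b_given_a = partial_correlation(genotype, gene_b, gene_a)
# If B does not mediate the SNP's effect on A, this should stay clearly non-zero.
snp_a_given_b = partial_correlation(genotype, gene_a, gene_b)

print(f"SNP vs B, controlling for A: {snp_b_given_a:+.3f}")
print(f"SNP vs A, controlling for B: {snp_a_given_b:+.3f}")
```

The asymmetry between the two numbers is what points to A as the upstream gene. Real data are messier, since a SNP may act on both genes independently or through unmeasured intermediates.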
When Schadt’s team tried constructing a probabilistic causal network on those yeast data using just gene expression data, it got “something that was completely not predictive. It could not predict the key causal regulators that we describe in [the Nature Genetics] paper. But when we layered in the DNA variation information along with protein DNA binding, we were able to make accurate predictions that we then tested prospectively.”
Billion Dollar Question
The billion dollar question is whether Schadt’s efforts will have a meaningful impact on Merck’s drug discovery and development programs. “Obviously, we are motivated to apply all of these technologies toward that end,” says Schadt. For example, roughly one-third of all the novel targets being pursued by the diabetes and obesity franchise (programs with medicinal chemistry support that are actually developing a drug) have stemmed from his group’s efforts. “These are all completely novel, not characterized in the literature, just completely unexpected. You never would have locked onto them without doing the type of integrative thing that we did.” And that’s exactly the goal, of course.
“There’s such a massive amount of data being generated, dumped into the public domain. The real question of the 21st century is, who are going to be the right kinds of groups to integrate that information in ways that best informs the biology at the earliest stages? [It] is an information-based game and who’s going to emerge as the information brokers, the Google of the biomedical and life sciences, to tie that together?”
Whether people think the time is right today or five or ten years from now, Schadt says, “I think you’re seeing a growing acceptance that that has to happen. There’s no other way to mine the data.”
Many people ask Schadt why his group has been so productive of late. “I get asked the question all the time. ‘Why? You seem kind of academic. You’re publishing these amazing papers. Why are you at Merck? Why aren’t you at the Broad or something like that?’” Schadt responds that, given the scale of his group’s studies, often involving thousands of patients (or mice) and multiple tissues, this work couldn’t really have happened anywhere else. “It was really the support of Merck. Having a more visionary company and leadership like Peter Kim and Stephen Friend say, ‘Man, we have got to better understand biology and we’ve got to understand it earlier in the process of drug discovery to increase our probability of success downstream.’”
Man and Machine
Aside from managerial support, there is a long list of credits. The computing architecture is a key aspect of this endeavor. “That’s all part of the equation,” says Schadt. “There’s the money and the will to generate that scale of data, but then there’s processing that data. What it takes to process, to build these kinds of models is non-trivial on the computational side.”
Merck supported the building of a world-class, high-performance compute cluster—a 7000-CPU IBM Blade system—that ranks among the top 200 supercomputers. “It wasn’t just cobbled together,” says Schadt. “It was a very intentional design. It’s a huge disk storage system integrated with [all] the software that’s built around that to handle queuing and load balancing.”
“The problems we’re pursuing are as computationally demanding as anything you’ll see in physics or climatology or any of the other fields... When we’re reconstructing these probabilistic causal networks, there are 10,000 genes interacting with one another. That is the n-body problem.”
Besides the raw computing resources, Schadt pays tribute to a diverse group of about 40 researchers “who buy into a common vision that it is through these integrative methods and generating this scale of data and integrating it to produce these models that we can mine.”
There’s a network reconstruction group consisting of “mathematics, physics, computing type guys who are just driven to integrate data that way.” The biology data mining group consists of “hard-core biologists working hand in hand with the network group to help put the networks together and then mine them.” Other groups handle classic statistical genetics, data processing to feed the network reconstruction group, and bioinformaticians doing data integration. “The key, though, is that they’re all working as one coherent team.” Schadt also gives credit to Merck’s “world class” informatics infrastructure that brings in the data and the wet lab biology groups that are close partners.
Schadt’s group builds most of its tools either internally or in collaboration with academic centers. But commercial efforts from companies like Gene Network Sciences, Genstruct, and Entelos are high on his radar. “We don’t have any illusion that we’re going to solve all the problems,” he says. That’s one reason his group is involved in the DREAM Project, for example.
Schadt says that Merck—largely through Kim—is encouraging an open approach to publishing “the methods and the thinking,” because external validation complements what Merck can do internally. With the field moving so fast, it’s crucial to receive advice from colleagues. “The strategy has always been to be very open with the kinds of methodological developments, how it gets applied and … not think we’re going to solve everything.”
He particularly values some “great systems type people” in academia such as Andrea Califano (Columbia), Daphne Koller (Stanford) and Trey Ideker (UCSD). Obviously the openness has to stop somewhere. Says Schadt: “Those things that Merck decides to place its bets on we’re not going to be out talking about, because that’s the value Merck’s going to realize. But everything that helped inform that decision you want to be open.”
One of the advances Schadt is most excited about is to “model the entire system at the molecular level to get at networks that actually predict physiological states.” Most of the current work focuses on a single tissue. But Schadt’s group is starting to look at multiple tissues. “For example, in mouse, we can look at six to nine tissues from the same animal over hundreds of animals and start figuring out how these different tissues are communicating with one another via the molecular network.”
For example, “how do changes in the hypothalamus induced by DNA variation change molecular states in the brain that go on to cause molecular state changes in adipose tissue or pancreas or liver or stomach?” Work on such cross-tissue networks will feature in one of the more exciting papers coming next: how behaviors in one tissue affect behaviors in another to drive toward a disease state.
For now, these models are mainly descriptive, but Schadt is driving toward applying the techniques used for single tissues to multiple tissues. “We do have these concepts emerging of what we call module-to-module causality, where we’re actually fitting models. So, a module would be just a sub network within a tissue and when you perturb that module with say a DNA change and that module changes, can we predict what other modules in other tissues are going to change?” A recent paper in collaboration with a group at the University of Wisconsin (Genome Res. 2008 May;18(5)) began to explore that idea with a network diagram “showing module-to-module or sub network-to-sub network connectivity both within and between tissues.”
The Rosetta team hopes to be able to modulate that system to see what changes. Says Schadt: “We’re not quite there yet, but I think that over the next couple of years that’s exactly where we want to be.”
Zhu, J. et al. “Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks.” Nat Genet 2008
Schadt, E.E. et al. “Mapping the genetic architecture of gene expression in human liver.” PLoS Biol 2008
Chen, Y. et al. “Variations in DNA elucidate molecular networks that cause disease.” Nature 2008
Emilsson, V. et al. “Genetics of gene expression and its effect on disease.” Nature 2008
Schadt, E.E. et al. “An integrative genomics approach to infer causal associations between gene expression and disease.” Nat Genet 2005
Schadt, E.E. et al. “Genetics of gene expression surveyed in maize, mouse and man.” Nature 2003
This article appeared in Bio-IT World Magazine.