By John Russell
Sept. 11, 2008 | Eric Schadt, the scientific executive director of genetics at Merck Research Laboratories and Merck's Rosetta Inpharmatics subsidiary, is a leading proponent of integrative genomics who is decisively proving his point by delivering novel targets to Merck – roughly one-third of Merck’s new targets in diabetes and obesity.
Given the tough times technology advocates have encountered inside biopharma – largely because tech bets haven’t produced as expected – Schadt’s work is seen by many as the bow of an advancing icebreaker starting to shatter attitudes long frozen against predictive technologies. A mathematician by training, Schadt practices a brand of integrative genomics that builds predictive models of disease by integrating a wide variety of data (SNP, gene expression, etc.).
“I think there’s a growing belief that there’s such a massive amount of data being generated, dumped into the public domain, and the real question of the 21st century is, ‘Who are going to be the right kinds of groups to integrate that information in ways that best inform the biology at the earliest stages?’ I mean that is an information-based game, and who’s going to emerge as the information brokers, the Google of the biomedical and life sciences, to tie that together?” he says.
One of the most widely-published authors on integrative genomics approaches, Schadt spoke recently with Predictive Biomedicine editor John Russell about integrative genomics and its impact at Merck.
JR: Let’s start with a snapshot of ‘integrative genomics’ and how it relates to genome-wide association studies (GWAS), which have become so popular?
Schadt: [GWAS] work is going to give you a signpost in the DNA that says something in this region is associated with the disease phenotype or drug response or whatever your trait of interest is. It doesn’t tell you which gene you should target to treat that disease. Even if it does lock you onto the gene, it doesn’t tell you [whether to] activate the gene, [or] inactivate the gene, and it doesn’t tell you what is the broader context in which that gene is operating and what’s the relevant tissue and how is it actually leading to this disease state or drug response phenotype.
We want to translate that into something that impacts human health [and] to identify what are the genes involved, and more generally what are the networks involved that are driving these disease states. The integrative genomics idea is not to view things as one-dimensional; so genetics is one dimension, that’s important, but it’s not the end-all. The changes in DNA don’t directly cause obesity or diabetes. Instead they’re affecting these molecular networks, these gene networks or molecular networks of protein function or transcription or metabolite concentration, whatever the case may be. It’s changes in those molecular states that predispose you or increase your disease risk.
The idea is if we can fill in those middle layers, those intermediate phenotypes that respond more proximally to the DNA variation, then we can reconstruct a more accurate picture of how disease manifests itself. So the idea in integrative genomics is to take the DNA variation information as one dimension, but layer on top of that gene expression data or protein interaction data or DNA protein binding data – a lot of this was just described in a paper that came out [in June] in Nature Genetics* from my group – and see if we can integrate all of that information together into one mathematical model to construct networks that are predictive of these disease states.
JR: What kinds of things can you do with the model?
Schadt: You can use that model to ask, okay if all the genes involved in this network and the ways they’re interacting with these other [elements] give rise to disease, what are the best points in that network to target to treat the disease? [You can ask] what are the best nodes to track disease progression or drug response; it solves a lot of the biology upstream so you can make more informed decisions earlier on what you should be targeting, what you should be tracking to stratify patient populations into disease subtypes and responder groups.
JR: Can you give a concrete example?
Schadt: In the example presented in the PLoS Biology** paper on how this can work, we took a look at the big Wellcome Trust Case Control Consortium study, which is an amazing genetic study carried out on 17,000 individuals, seven different diseases, and Type 1 diabetes was one of the diseases they focused on. From that big association study they identified a number of loci for Type 1 diabetes. They had a followup paper in Nature Genetics that replicated the heck out of all the Type 1 diabetes associations they made in their initial study and then claimed all these novel loci for Type 1 diabetes.
One of the genes they identified explicitly as the most novel and most significantly associated with Type 1 diabetes was ERBB3. But what was the evidence they had that it was ERBB3? The only evidence they had was that the SNP they identified as associated with Type 1 diabetes in that region was closest to ERBB3; so it wasn’t in the gene region, it was just the closest gene to that SNP region. Then they just used their biological knowledge, what they thought they knew about the different genes in that region, and ERBB3 from the biology perspective seemed to make the most sense, but that’s a pretty subjective path.
We said, okay let’s look at all of the genes in the region and then instead of just looking at DNA variation information, let’s look at the activity of these genes in human tissues that are actually relevant to metabolic disease traits like obesity and diabetes.
JR: That’s when you profiled, I guess, 400 samples?
Schadt: Of liver, right. We’re looking at the expression levels of these genes over the 400 samples and asking whether the expression of any of the 50 to 100 genes in that region correlates with this SNP the way Type 1 diabetes does. When we did that, ERBB3, even though it was differentially expressed and so active in this population, was not correlated with the SNP. But a gene close by ERBB3, RPS26, was very significantly associated with the Type 1 diabetes SNP: 40 percent of the in vivo variation in RPS26 was explained by the SNP that was associated with Type 1 diabetes.
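The cis-eQTL test Schadt describes – asking whether a disease-associated SNP also explains expression variation in a nearby gene – amounts to a single-SNP regression of expression on genotype. A minimal sketch on simulated data; the gene names, allele frequency, and effect sizes here are illustrative stand-ins, not the study’s actual numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400                                   # liver samples, as in the study

# Simulated minor-allele counts (0/1/2) at the disease-associated SNP.
snp = rng.binomial(2, 0.3, size=n)

# Hypothetical expression traits: one gene cis-regulated by the SNP,
# one that is expressed in the tissue but not linked to the SNP.
expr_rps26 = 1.2 * snp + rng.normal(0.0, 0.95, size=n)
expr_erbb3 = rng.normal(0.0, 1.0, size=n)

def variance_explained(genotype, expression):
    """R^2 from a single-SNP linear regression of expression on genotype."""
    r = np.corrcoef(genotype, expression)[0, 1]
    return r ** 2

print(variance_explained(snp, expr_rps26))   # a large fraction of variance
print(variance_explained(snp, expr_erbb3))   # near zero: no eQTL signal
```

The same comparison, run over every gene in the associated region, is what separates a gene that merely sits near the SNP from one whose activity the SNP actually drives.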
Now that becomes kind of interesting because now you have this SNP that’s associated with disease that’s also associated with the activity of this gene that’s right in the region of the SNP. But even that’s not enough to sort of claim, okay it must be RPS26. Now we go into the integrative genomics stuff and say, okay, if we reconstruct combining all this information together and build these networks, we can look at any given gene and ask what’s the context in which this gene operates and is that context supportive of Type 1 diabetes? We did that for RPS26. We reconstructed a network; the network consists of tens of thousands of genes. This is the power of having constructed these predictive models: we can go into that network and say, give me the location in the network where RPS26 resides and give me all of the genes in the region of RPS26 that RPS26 is communicating with.
JR: What kinds of data are you using to build the networks?
Schadt: We take the gene expression, we take the genotype data, and then we take other kinds of data, described in this Nature Genetics* paper, that can help inform how genes might be related. We actually make a causal probabilistic network. Once the network gets constructed, it is represented in a database so you can search it in an in silico fashion. The database query for this network, constructed in this population, would be: Give me all of the genes that our network predicts RPS26 is directly affecting, or genes that are directly affecting RPS26. So we come up with some subset of genes from that query. We can then ask: Are the genes in that RPS26 subset representative of Type 1 diabetes? The number one pathway enriched in that subset when we did a query against the KEGG database was Type 1 diabetes. When we did the same thing for ERBB3, we got no such enrichment. In fact, no pathways were predicted to be associated with Type 1 diabetes.
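The enrichment query Schadt describes is, in standard practice, a hypergeometric test: given the genes the network places around the query gene, is a pathway over-represented among them relative to chance? A minimal stdlib sketch with made-up counts (not the actual study’s numbers):

```python
from math import comb

def hypergeom_enrichment_p(total_genes, pathway_size, subnet_size, overlap):
    """P(X >= overlap) when drawing subnet_size genes at random from
    total_genes, of which pathway_size are in the pathway (upper tail
    of the hypergeometric distribution)."""
    denom = comb(total_genes, subnet_size)
    upper = min(pathway_size, subnet_size)
    return sum(
        comb(pathway_size, k) * comb(total_genes - pathway_size, subnet_size - k)
        for k in range(overlap, upper + 1)
    ) / denom

# Illustrative numbers only: 20,000 genes in the network, a 40-gene
# Type 1 diabetes pathway, a 100-gene subnetwork around the query gene.
# By chance you'd expect an overlap of 100 * 40 / 20000 = 0.2 genes.
print(hypergeom_enrichment_p(20000, 40, 100, 8))   # tiny p: strong enrichment
print(hypergeom_enrichment_p(20000, 40, 100, 0))   # p = 1: no enrichment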
JR: Is there anything on the horizon (data types or technologies) you expect will enhance this integrative approach?
Schadt: It’s really going to be all about the intermediate phenotypes. So scoring not just DNA variation in a given population or gene expression, but looking at protein interactions and looking at DNA-protein binding and looking at methylation, differential methylation, and that kind of thing. As those technologies mature and make more comprehensive looks possible in each one of those dimensions, then tying those [pieces] together is going to lead to more accurate predictive models.
JR: How close are the technology advances needed?
Schadt: I think the ones for detecting DNA protein binding and methylation status are definitely close. Within the next few years you’ll start seeing large-scale, population-based studies being carried out where those things are being assessed. There have already been early papers on detecting genome-wide methylation through bisulfite re-sequencing of genomes. [Also] genome sequencing technologies [will] get to the point where we can sequence entire human genomes in a day. The protein interaction is a tougher problem [whose solution] is maybe further off.
JR: What’s the stumbling block there, false-positives?
Schadt: Well that is a big problem, but the bigger problem is there isn’t a way to comprehensively screen protein interactions in a high-throughput fashion, where you could do it over thousands of individuals in multiple tissues. The whole problem with the protein interactions is you’re screening interactions in a very artificial space, like when you’re doing, say, yeast two-hybrid; you’re in a completely artificial setting and a completely artificial environmental context, and you’re not assessing how these protein interactions are correlated throughout the whole system. [It’s] minimally useful in my opinion as far as reconstructing the network.
JR: Do you expect it will be some sort of chip technology to make it a high-throughput process?
Schadt: You’ve got to be able to score them – you want to be able to assess when this protein is interacting with this other protein, and what other protein interactions are going on at the same time. With the technology that exists now you don’t get that. You can only assess one interaction at a time, and that’s of limited use. It’s also of limited use because you’re assessing that interaction in a completely artificial context. You need to be in a context that’s important for the manifestation of the phenotypes you care about, and that’s the power of the gene expression: We can get the relevant tissue at the relevant time where the disease is actually manifesting itself and we can monitor all the genes simultaneously, and you know that that’s very powerful.
JR: What are your thoughts on difficulties associated with inferring networks from gene expression data? I’m thinking of the DREAM Project where they put these blinded datasets out for researchers to infer networks from and the results were not especially good.
Schadt: It’s a great point and we, of course, have been involved in DREAM, helping that get going. If you’re just basing it on correlation-based information, gene expression data alone, we’ve written a number of papers showing that you are almost doomed to fail as far as your ability to make predictive models because there are so many problems; [e.g.] that the correlation-based information does not have any sort of systematic source of perturbation to enable you to infer how these things are causally related.
You just don’t have the right kind of information. As one example, using just correlation-based information, you run into a problem called Markov equivalence: as you try to make finer gene-to-gene connections and add directions to those edges, in most cases the different sub-graphs you’re considering are Markov equivalent. No matter how much data you had in that setting, you’re not going to resolve the right structure, because statistically they’re indistinguishable. The key is you have to introduce a source of systematic perturbation to break that Markov equivalence. The way groups have typically done that is through gene knockouts. But that’s still artificial and very one-dimensional. It’s not multi-factorial; you’re doing it one at a time or maybe two at a time.
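The Markov-equivalence problem is easiest to see with two variables: from observational correlation alone, the graphs X → Y and Y → X fit the data identically, so no amount of expression data picks a direction. A small illustrative simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(size=1000)   # the true model: x causes y

def r_squared(a, b):
    """R^2 of a least-squares fit of b on a."""
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2

# Both causal orderings explain the observed data equally well,
# so they cannot be distinguished statistically.
print(r_squared(x, y))   # fit under the model x -> y
print(r_squared(y, x))   # fit under the model y -> x: identical value
```

The two fits are exactly equal because squared correlation is symmetric; only an external perturbation of one variable can reveal which direction is causal.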
What DNA variation allows you to do is it’s a naturally occurring source of systematic perturbation. Ultimately, it is DNA changes that drive how the system varies between species and between individuals within the species. So it is the ultimate source of causal perturbation, but it’s multi-factorial and occurring in a context where the complex phenotypes actually manifest themselves. This is the revolution of what we’re doing; [in the past] the DNA variation information has not systematically been leveraged as a source of perturbation to put together the causal network. When we constructed a probabilistic causal network on the yeast data with the expression data alone, we got something that was not at all predictive. It could not predict the key causal regulators that we describe in that paper, but when we layered in the DNA variation information along with protein interaction and protein-DNA binding data, we were able to make accurate predictions that we then tested prospectively.
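The way a DNA variant breaks the equivalence can be sketched as a conditional-independence test. Under the causal chain SNP → X → Y, the SNP correlates with Y only through X, so the correlation should vanish once X is conditioned on; under the alternative SNP → X ← Y it would not. A hedged sketch of that test on simulated data (the effect sizes and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
snp = rng.binomial(2, 0.5, size=n).astype(float)

# Causal chain: the SNP perturbs gene x, and x in turn drives trait y.
x = snp + rng.normal(size=n)
y = 0.9 * x + rng.normal(size=n)

def partial_corr(a, b, given):
    """Correlation of a and b after regressing `given` out of both."""
    def residual(v):
        beta = np.cov(v, given, bias=True)[0, 1] / np.var(given)
        return (v - v.mean()) - beta * (given - given.mean())
    return np.corrcoef(residual(a), residual(b))[0, 1]

print(np.corrcoef(snp, y)[0, 1])   # SNP correlates with y (via x)
print(partial_corr(snp, y, x))     # near zero once x is conditioned on
```

Because the genotype is fixed at conception, the arrow from SNP to gene can only point one way, which is what makes naturally occurring DNA variation usable as the systematic perturbation Schadt describes.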
JR: What impact has your work had on Merck’s pipeline?
Schadt: Obviously in the end we are motivated to apply all of these technologies towards that end. Just as one example, at present, roughly one-third of all targets being pursued by the diabetes and obesity franchise – these are programs with medicinal chemistry support, so this is serious right there, they’re actually developing a drug against these – come from this effort. And these are all completely novel, not characterized in the literature, just completely unexpected. You never would have locked onto them without doing the type of integrative thing that we did.
JR: Do you think your publishing these approaches will stimulate broader interest in them?
Schadt: There’s no question about that. Right now I get invited to more events; it’s just completely out of control, where I get 150 to 200 emails a day. Even from companies: Pfizer and Amgen are inviting me to give talks to their key scientists on these types of approaches. So it’s pretty clear that the type of approach we’re taking is on the radar.
I think there’s a growing belief that there’s such a massive amount of data being generated, dumped into the public domain, and the real question of the 21st century is, who are going to be the right kinds of groups to integrate that information in ways that best inform the biology at the earliest stages? I mean that is an information-based game, and who’s going to emerge as the information brokers, the Google of the biomedical and life sciences, to tie that together? So whether people believe the time to do that is now, or five or ten years into the future, I think you’re seeing a growing acceptance that it has to happen. There’s no other way to mine the data.
JR: Why is your group so successful at doing this?
Schadt: I get asked the question all the time: Why? You seem kind of academic. You’re publishing these amazing papers. Why are you at Merck? Why aren’t you at the Broad [Institute] or something like that? The response is that none of this work we’re doing could have happened anywhere else. It was really the support of Merck, having leadership like Peter Kim and Stephen Friend say, ‘We have got to better understand biology and we’ve got to understand it earlier in the process of drug discovery to increase our probability of success downstream.’ So they were hugely backing the very large-scale experiments you needed to carry out, which cost real money: profiling thousands of individuals and thousands of experimental crosses across multiple tissues, genotyping at a genome-wide scale, and clinically characterizing them. That scale of work was undoable anywhere else.
JR: How about computing power?
Schadt: That’s all part of the equation. Merck also supported our building a world-class high-performance computing resource [a 7,000-CPU IBM Blade compute cluster] that still ranks among the top 200 supercomputers on the planet. I think the problems we’re pursuing are as computationally demanding as anything you’ll see in physics or climatology or any of the other fields. That’s what it’s going to take to solve some of these problems; when we’re reconstructing these probabilistic causal networks, there are 10,000 genes interacting with one another. That is an n-body-type problem, and the computing resource it takes to be even marginally productive on it is huge.
The other ingredient is absolutely a very heterogeneous group of people who buy into a common vision: that the path forward is through these integrative methods, generating data at this scale and integrating it to produce these models that we can mine. So it takes everybody. One of the groups is the network reconstruction group, which is mathematics, physics, and computing types who are driven to integrate data that way. Another core component is the biology data mining group. They’re hard-core biologists working hand-in-hand with the network group to help put the networks together and then mine them to actually find things of value.
Then there’s sort of the classic statistical genetics component that will do a lot of the raw processing of the data to feed to the network reconstruction group, and then the classic sort of bioinformatics group doing a lot of the data integration. The key, though, is they’re all working as one coherent team. It’s not like at the university where these are different departments and how do you get these departments to not just collaborate, but to work really intimately together towards this common, big science problem of finding out how best to target a given disease.
JR: How many people are in your group?
Schadt: It’s about 40. And they operate in a really enriched environment of an informatics group that’s world-class. They’re not part of the 40, but they’re certainly highly able to bring in data, and then the wet lab biology groups that are also well partnered with us. It’s both having this coherent group that’s very diverse, but it’s hooked into the right environment that’s very supportive of the vision that we’re driving.
JR: I’m sure you’re aware of the other companies trying to provide similar technology services and tools. Do you use them at all?
Schadt: Right now almost everything we’re doing is built by ourselves or in collaboration with other academic centers. But we absolutely are tracking all of the commercial efforts [such as] Gene Network Sciences, Genstruct, Entelos. Those are high on the radar. A lot of stuff is going on and we don’t have any illusion that we’re going to solve all the problems, and so we’re happy to collaborate. That’s part of the reason for being involved in the DREAM stuff and working with people like Andrea Califano and others to really track a lot of what’s happening in that space.
JR: Is part of your group’s strength proprietary algorithms, and would you ever test some of your algorithms against a DREAM challenge?
Schadt: We’ve started discussing [participation]. One of the other visionary things about Merck – and a lot of this is driven by Peter Kim – is that there is an advantage to being open and publishing the methods and the thinking, because it not only helps validate that for Merck – there are not so many groups within Merck that can evaluate the kinds of things we’re doing – [but also] external validation is very important.
Also the field is moving so fast that if you’re not hooked up in talking to the world of people who are focused on this kind of problem, having them advise you, and being able to bounce stuff off of those kinds of guys, I just don’t think you’re going to win. The strategy has always been to be very open with the kinds of methodological developments and with how they get applied. We wouldn’t say our main strength is a proprietary advantage on methods. I think actually the advantage is being open and highly connected to people like Andrea Califano and Daphne Koller and Trey Ideker, all of these great systems-type people, as well as others, and our advantage is in knowing how to apply that.
The value to Merck is how you apply that information resource, that platform, to problems. Merck cares a lot about that. So I would say that there are effectively no algorithms that we would claim are proprietary.
* Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks, Nature Genetics, published online, June 15, 2008.
** Mapping the Genetic Architecture of Gene Expression in Human Liver, PLoS Biology, May 2008.
This article first appeared in Bio-IT World’s Predictive Biomedicine newsletter.