John Quackenbush, Dana-Farber Cancer Institute
April 2012 | Ten years ago, I remember people talking about how having the human genome sequence was going to rapidly lead to a nearly complete understanding of human disease. But, in reality, that was about as unrealistic as thinking that a map of the world might make travel instantaneous. Nevertheless, new technologies coming from the genome project are transforming biomedical research by providing data on a scale that even a short time ago few of us could have imagined. As a result, biomedical research is rapidly evolving into an information science—one in which the winners in the race to understand disease are increasingly those best able to collect, manage, analyze, integrate, and interpret a wide range of data and information.
Genomic data has turned the world of computing on its head. We used to talk about high-performance computing in terms of petaflops. Now the currency is petabytes: how much online storage we have, how much onboard memory we can access, how much bandwidth our connections have. The irony is that, despite all the talk about the information superhighway, hard drives on FedEx trucks are often the most efficient way to transfer data. And the problem is only getting worse. Genomic data is just one part of it. To interpret what is encoded in the genome, we need access to even more information on phenotype, disease outcome, gene function, biological pathways, and on and on. And once we start feeling good about solving the big data problem, we have to recognize that we still haven’t addressed the big interpretation problem: how do we leverage what we know about the underlying biology to infer relevant, reasonable, robust predictive models that are clinically useful? It’s a brave new world, and that is what makes being a genomic explorer so much fun.
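The back-of-envelope arithmetic behind the FedEx-truck observation is easy to sketch. The figures below (a sustained 1 Gbps link and an overnight courier) are illustrative assumptions, not numbers from the piece, but they show why shipping drives wins at the petabyte scale.

```python
# Back-of-envelope comparison: network transfer vs. shipping drives for 1 PB.
# All figures are illustrative assumptions, not measurements.

PETABYTE = 1e15          # bytes
link_gbps = 1.0          # assumed sustained network throughput
shipping_days = 1.0      # assumed overnight courier transit time

link_bytes_per_sec = link_gbps * 1e9 / 8
network_days = PETABYTE / link_bytes_per_sec / 86_400

print(f"1 PB over a {link_gbps:.0f} Gbps link: ~{network_days:.0f} days")   # ~93 days
print(f"1 PB on shipped drives: ~{shipping_days:.0f} day in transit")
```

At these assumed rates the courier's effective bandwidth is roughly two orders of magnitude higher than the network link, which is the whole point of the quip.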
The problem we are all trying to solve, fundamentally, is how to leverage the data we can generate and access to build robust predictive models that are relevant to the questions we are studying. We want to understand how mutations direct phenotype, how perturbations lead to disease, or how a new drug might perturb a cellular network. And to address this, we are starting to recognize that big data by itself isn’t enough. We need truly enormous data, on a scale that is frightening to think about. But I am convinced that the next large cohort study has to be everybody. Sure, people are going to do smaller studies and chip away at the problem. But just as we needed the Human Genome Project to get past the limitations of gene-by-gene searches, we need a Humanity Genomics Project to catalog the spectrum of genetic, environmental, and biological variation, and to begin to interpret what it all means. Only then will we have any hope of solving many of the fundamental unsolved problems in biology.