With assists from Oracle and InforSense, Dana-Farber’s John Quackenbush finds ways to mine vast patient data collections to better understand human cancer.
By Alissa Poh
March 24, 2009 | How does a physicist wind up at the vanguard of translational medicine, bridging genomics, bioinformatics, and IT in an effort to shed light on cancer biology? That’s among the chief responsibilities of John Quackenbush, professor of biostatistics and computational biology at the Dana-Farber Cancer Institute (DFCI) in Boston. An affable personality with slightly greying, shoulder-length hair, Quackenbush is a theoretical physicist by training. He rose to prominence during an eight-year stint at The Institute for Genomic Research (TIGR), founded by Craig Venter, developing and sharing a range of software tools and databases for microarray analysis (see “John Q: Life After TIGR”).
“People think I love building databases; actually, I hate it,” he says from his roomy office. “I’m never going to show up on the front page of the New York Times with the headline ‘Quackenbush Builds Integrated Database.’ If I get there, it’ll be because of the discoveries such integrated databases allow me to make. So you could say I want to be able to build the tools that will allow me and everybody else here to show up on the front page of the Times.”
Whether Quackenbush’s efforts become fit to print remains to be seen, but the work he is spearheading at DFCI will likely be felt far and wide in the field of translational medicine.
Bringing Bioinformatics to Cancer
In 2002, Quackenbush was considering leaving TIGR, which had begun focusing on microbial sequencing and annotation, whereas his interest was increasingly turning to the clinical space, beginning with a collaboration with Timothy Yeatman at the H. Lee Moffitt Cancer Center in Tampa, Florida.
Back then, the trouble facing interdisciplinary scientists like Quackenbush was that no one knew where they fit in the traditional hierarchy of academia.
“We really want you here, we just have to figure out where to put you,” Quackenbush would hear. “People were very excited about my work in genomics, but they didn’t know quite what to do with the bioinformatics part.”
He interviewed at several places, even turning down a position as professor of urology at the University of British Columbia in Vancouver, before accepting an offer from the Department of Biostatistics and Computational Biology at DFCI, moving to Boston in 2005. (He also holds a faculty appointment at Harvard’s School of Public Health.)
Quackenbush calls DFCI “one of the most progressive places I’ve seen in terms of thinking about ways to advance science. And I can honestly say it’s the least pathological place I’ve ever worked.” New in Boston and driving to work one morning, he spotted a girl selling lemonade by the roadside. She told him one of her classmates had been treated at “the Farber” and her class was raising money for the Jimmy Fund [DFCI’s charity]. The level of community support for and patient involvement in cancer research is incredible, he says.
Before hiring Quackenbush, DFCI had recognized that even as genomics became democratized, there were opportunities to do new, cross-disciplinary things. The institute decided to adopt an entrepreneurial model, establishing research centers that would work across departments and awarding each five years of start-up financial support.
“My message during [interview] presentations was consistently about data integration and its value in propelling science forward, which really resonated here,” Quackenbush says. Aside from his own research, he has devoted much of the past three years to building the infrastructure necessary for his other mandate: creating a Center for Cancer Computational Biology at DFCI. That could be viewed as a service, but “it’s a service to allow me to do the things I want to do,” he says.
Quackenbush reasoned that the success of such a center would require integrating genomic information with clinical data, as one step toward improving cancer diagnosis and tailoring treatment for individual patients. There would, however, be stiff challenges in linking microarray data with not only clinical information but also public archives such as GenBank, OMIM, and HapMap, while ensuring quality control and reliability.
“Web services are all very well, but you’re relying on someone else to maintain the data and not change their protocols,” he explains. “Even for GenBank, where things are supposed to be fairly stable, you frequently see them violating their rules for data entry and standards.” Cloud computing could not possibly work in this space, he adds, given the confidential nature of much of the data. Rather than build a large web services model, Quackenbush elected to integrate all this information in a database unique to DFCI.
Around then, Quackenbush crossed paths with Edie Weller, a senior research scientist in his department, during a faculty meeting. Weller, the lead statistician for multiple myeloma, was trying to merge data from different sources—relational databases, raw text files—a nightmarish and time-consuming process involving many Excel spreadsheets. It was particularly frustrating, when designing whole-genome gene expression studies of chemotherapy response for this disease, that she and her colleagues couldn’t obtain immediate access to data on their own patient samples, even for information as simple as sample storage location.
“I knew there had to be better ways of merging information and allowing investigators direct access to the data,” Weller says. “So although I was initially hesitant to bring it up at the meeting, I finally described to John how we were linking our data. He looked at me like I was crazy.”
“It was nuts, madness on multiple levels,” Quackenbush recalls. It also clearly illustrated the need for merging different data sources together in cancer research. The multiple myeloma researchers invited him to use their case as a framework for creating a data integration warehouse that could potentially be extended to other types of cancer.
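The kind of ad hoc merging Weller was wrestling with can be sketched in a few lines of pandas. Everything below is invented for illustration: the patient records, sample inventory, and column names are hypothetical stand-ins for the real DFCI sources, not their actual schemas.

```python
import pandas as pd
from io import StringIO

# Invented stand-ins for the real sources: a relational export of
# patient records, and a pipe-delimited raw text file of sample storage.
patients_csv = StringIO(
    "patient_id,regimen,response\n"
    "P001,bortezomib,partial\n"
    "P002,bortezomib,complete\n"
    "P003,lenalidomide,none\n"
)
samples_txt = StringIO(
    "patient_id|sample_type|freezer_location\n"
    "P001|tissue|F2-R4\n"
    "P002|blood|F1-R7\n"
)

patients = pd.read_csv(patients_csv)
samples = pd.read_csv(samples_txt, sep="|")

# A left join keeps every patient and flags those with no stored sample,
# replacing the copy-and-paste step a spreadsheet workflow would need.
merged = patients.merge(samples, on="patient_id", how="left", indicator=True)
print(merged[["patient_id", "response", "sample_type", "_merge"]])
```

A data warehouse does this join once, centrally, instead of every statistician repeating it by hand in Excel.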
Oracle, with its expertise in capturing and managing clinical data, came to mind immediately as a potential partner in this data integration venture. “There was no point in reinventing the wheel,” Quackenbush says. After submitting a proposal for one of the enterprise software giant’s commitment grants, he was quickly offered $1 million spread over two years rather than the three he had requested.
And it was about more than just a financial grant. “We also volunteered technical and subject matter expertise to jump-start Quackenbush’s plans for data integration at DFCI,” says Vijay Pillai, director of strategic planning and business development at Oracle’s health sciences division.
Quackenbush, Weller, and Joseph White, the lead database developer in Quackenbush’s group, attended half-a-dozen workshops over a couple of months, led by Steve Jepsen, Oracle’s senior director for health industries. “We focused a lot on data security and scalability [during the workshops],” Pillai says. “In moving beyond multiple myeloma, you want to be able to adapt to such growth very dynamically, rather than rebuild your environment. From a security perspective, although clinical studies in different therapeutic areas might reside in one data layer, you want to be sure that the investigators can still only access information they’re authorized to. So we helped Quackenbush’s group think about these design implications.”
The result of these workshops and additional brainstorming was a brand new translational research infrastructure utilizing Oracle’s Healthcare Transaction Base (HTB) and Fusion Middleware components. HTB creates an integrated data repository, whereby researchers access clinical and patient sample information via a single platform and can seamlessly connect this with experimental data. The Fusion Middleware suite, on the other hand, lets them get to their data securely from any location. And a third Oracle component, the BPEL (Business Process Execution Language) Process Manager, allows for safe and, if necessary, multiple transfers of complex clinical data across the entire infrastructure.
At the same time, Oracle hand-picked a long-term partner in the intelligent software business—UK-based InforSense—to add what Pillai calls their “great visualization capabilities” to the collaboration.
“When you bring different data sources together, you need not just analytics but visualization tools—such as charts and correlation graphs combining thousands of data points—on top of the base integration layer,” he says. “We decided that InforSense’s applications could help keep us completely in sync, in terms of data integration and interpretation.”
Experts at InforSense suggested ClinicalSense, a web-based tool for clinicians and researchers to get summary statistics about patient populations by fashioning row-and-column matrices out of patient sample attributes. “You could construct a query where ‘regimen response’ represents the rows, while ‘sample count’ and ‘sample type’ are chosen for the columns,” explains Mick Correll, InforSense’s director for clinical solutions. “This would result in a cross-tabulation matrix showing the number of available samples broken down by type—tissue or blood, for instance—and grouped according to the patient’s response to a particular regimen.” In other words, users can build more sophisticated queries by defining a hierarchy of attributes, which then enables them to “drill down” into the results matrix, further stratifying the population.
“It provides, I think, a very rich and interactive web experience that makes the data come alive in the hands of clinicians and researchers,” Correll says.
ClinicalSense leverages InforSense’s next-generation business intelligence platform and provides an advanced clinical data model. It’s easy to use and intuitive, thanks in large part to direct feedback from clinicians throughout the product development process. “The multiple myeloma study at DFCI is precisely the type of problem ClinicalSense was built to solve,” Correll says. “It provided the right balance between out-of-the-box functionality and the flexibility that will enable it to adapt to changing needs in research.”
Like most large-scale collaborations, this one wasn’t without its hitches, particularly with regard to the people responsible for Information Systems (IS) at DFCI. “With these folks, whenever you ask them a question—no matter how benign—their first answer is always ‘No,’ since no access is the most secure access,” Quackenbush says, only half-jokingly. “The word ‘fragmentation’ has been used to describe this whole problem of having data in different places; I like ‘Balkanization’ instead, because not only are the data being broken apart, there are all these people actively fighting against integrating it. We spent more time and effort negotiating transferring data into this warehouse than we did actually building the warehouse.”
“There was a lot of confusion with IS about the scope of this project, and how it fit or conflicted with other IS initiatives,” Weller says. “Given how quickly research moves, we felt it was imperative to have individuals who understand the biology, as well as the systems aspects, working on the project. Once we discussed this in detail and described our data security model to IS, things were much easier.”
Nevertheless, Weller adds, such regulatory issues—especially those involving redistribution of data collected from different hospitals—are hardly minor. “I think the time we spent sorting these out will benefit not only our myeloma project, but other initiatives too,” she says.
Upon overcoming these hurdles, the collaborators rolled out their prototype for an integrated data warehouse in May 2008. The warehouse’s full production system has been up and running since November, after two training sessions—one for statisticians, data managers, and researchers; the other for clinicians—to teach them the art of accessing and querying the database.
“In both cases, I think the new system was well received, and the feedback we’ve had since has all been very positive,” says Correll, who led the training. “I’m sure modifications will be necessary as it moves forward—this is research, after all—but we’re clearly on the right track.”
Several of Quackenbush’s colleagues, including a few skeptics, were invited to sit in on both sessions. “I remember Beverly Ginsburg-Cooper [senior vice-president for research at DFCI] grabbing me by the sleeve, five minutes into the second session,” he recounts with a smile. “This was after she had watched us go from nothing to a group of patient samples with certain clinical characteristics based on karyotype and trial response, to their gene expression profiles, to a set of genes correlating with response, to PubMed records describing the genes—all as ad hoc queries.” It dawned on the skeptics how this would excite young people doing research at DFCI. “This system presents information in a way they’re comfortable with; they feel invested in the process and better able to participate in data analysis, to see how they can drive things forward,” Quackenbush adds. Ginsburg-Cooper even called it “transformative for research.”
DFCI’s first integrated data warehouse has been constructed architecturally so there’s a path to move forward, Quackenbush says. The institute will pour $8 million into his cancer computational biology research center over the next five years, which he considers “less than we need, although the center’s built on a model I think we can expand.” He is thus seeking additional funds to accomplish his goal of moving the data integration system beyond multiple myeloma, with breast cancer as the next likely candidate area.
Quackenbush also recently applied for a grant that, if approved, will include funding to create a pilot implementation for data from the Nurses’ Health Study (NHS) at Harvard, the largest and longest-running investigation of factors influencing women’s health. And he has been communicating with multiple myeloma researchers at the University of California, San Francisco, about the possibility of a mirror installation at their end to facilitate data sharing between both groups.
“I don’t know where this will go next, to be honest, but it’s likely to go somewhere,” Quackenbush says. “Our successful collaboration with Oracle and InforSense has put us in a position to think about reaching beyond DFCI and gradually pulling in a lot of Harvard’s multi-institutional spores and their data collections. So there’d be some method to the madness.”
He greatly appreciates how DFCI nurtures the importance of continued research. “You always hear about Harvard eating its young,” he muses. “I came here a little worried that I was going to face all these prima donnas. It’s not to say that there aren’t those with pretty big egos, but there’s a really high level of collaboration here, which is both astonishing and impressive.”
Quackenbush embraced interdisciplinary research early on, and now observes that many scientists in his area of interest are crossing traditional disciplinary boundaries, which he wholeheartedly endorses. “A computational model is just that; a model plus validation is a discovery,” he remarks. “People are really trying to drive the latter, rather than being held captive to someone else’s experiments. It isn’t true of everyone in the field, but I think it’s an emerging trend; a systems biology approach that is evolving naturally out of genomics and bioinformatics.”
John Q: Life After TIGR
John Quackenbush made his foray out of physics in 1992, when he became intrigued by an initiative from the National Human Genome Research Institute (NHGRI) seeking experts outside biology to work on the Human Genome Project. He spent two years working on the physical map of human chromosome 11 at the Salk Institute, before being hired to set up large-scale sequencing at Stanford’s Human Genome Center. When promotion prospects dimmed, Quackenbush headed east to Maryland and Craig Venter’s The Institute for Genomic Research (TIGR) in 1997. “The mandate for me at TIGR—going beyond the genome and establishing a microarray laboratory—was really my growing interest,” he says.
At TIGR, Quackenbush quickly recognized that there was a woeful lack of tools for collecting, managing, and analyzing the reams of genomic data being amassed. “Our first publication on gene expression in colon cancer included nine arrays, and it was a year’s worth of work just to analyze and generate that data,” he says. “It really opened my eyes to the challenges and problems with assumptions people have made about biological systems.”
Quackenbush recalls early microarray experiments showing that expression of cyclin A1 was a much more appropriate choice of housekeeping gene than the traditional GAPDH, which fluctuated sharply. “What you start to understand,” he elaborates, “is that assumptions in biology are often based on little more than gut feelings or historical approaches, and there’s nothing better than data to drive a real understanding of what’s going on.”
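The consequence of choosing a stable versus a fluctuating reference gene can be illustrated with a toy normalization. The intensity values below are invented for illustration; only the qualitative behavior (cyclin A1 stable, GAPDH fluctuating) follows the anecdote.

```python
import numpy as np

# Toy raw intensities for three arrays (rows: genes, columns: arrays).
# All numbers are invented to illustrate the point.
genes = ["TP53", "MYC", "CCNA1", "GAPDH"]
raw = np.array([
    [120.0, 240.0,  60.0],   # TP53
    [300.0, 610.0, 150.0],   # MYC
    [100.0, 200.0,  50.0],   # CCNA1: stable relative to the rest
    [ 80.0, 400.0,  30.0],   # GAPDH: fluctuates sharply across arrays
])

def normalize_to_reference(raw, genes, reference):
    """Divide each array's intensities by that array's reference-gene signal."""
    ref_row = raw[genes.index(reference)]
    return raw / ref_row

# Normalizing to the stable gene keeps a flat gene flat; normalizing to
# a fluctuating one injects artifactual variation into every other gene.
by_ccna1 = normalize_to_reference(raw, genes, "CCNA1")
by_gapdh = normalize_to_reference(raw, genes, "GAPDH")
print(by_ccna1[genes.index("TP53")])  # constant ratio across arrays
print(by_gapdh[genes.index("TP53")])  # spurious ups and downs
```

The data, not tradition, decides which gene makes a sound reference.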
Piles of data are essentially worthless without proper management and analysis tools. Given his physics background, however, Quackenbush was comfortable proceeding where most genomicists feared to tread. He continued to write his own data analysis software, creating databases and a variety of open source software tools to help manage the voluminous data being generated at TIGR.
While there, Quackenbush also participated in the scientific workgroup that put forth the MIAME (Minimum Information About a Microarray Experiment) standards. The goal: to allow uniform recording and reporting of microarray data, with the overarching purpose of facilitating the development of databases, public repositories, and data analysis tools. It might not be perfect, he says, but these standards have proved handy over the years, especially for finding and correcting errors in published data.
For example, at DFCI he and his colleague Aedin Culhane recently refuted another group’s claim to have identified a lung metastasis signature in breast cancer. “Some of the genes they found resonated with us, so we compared their samples with gene signatures in our database and showed that all of the lung metastasis samples fell into the basal-like subtype of breast tumors,” he says. “Such tumors are known to have the highest propensity for metastasizing to the lung. What we recognized, looking at this paper, was that they were really suffering from confounding effects. They weren’t predicting lung metastasis; their signature was much more highly predictive of the basal subtype than anything else.”
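The confounding pattern Quackenbush describes can be sketched with simulated data: if a “metastasis signature” score separates tumors by subtype more cleanly than by metastasis status, the score is really tracking subtype. Everything below is hypothetical (sample counts, rates, score distributions), and a simple difference in group means stands in for the real statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated cohort: the signature score is driven by basal-like subtype,
# and lung metastasis merely co-occurs with that subtype.
n = 200
basal = rng.random(n) < 0.3
# Metastasis is more likely in basal-like tumors, but not driven by score.
mets = np.where(basal, rng.random(n) < 0.6, rng.random(n) < 0.1)
score = np.where(basal, rng.normal(2.0, 0.5, n), rng.normal(0.0, 0.5, n))

def separation(score, labels):
    """Difference in mean score between the two label groups."""
    return score[labels].mean() - score[~labels].mean()

# The signature separates subtype far better than metastasis status,
# which is the confounding the DFCI pair spotted in the published claim.
print("subtype separation:    %.2f" % separation(score, basal))
print("metastasis separation: %.2f" % separation(score, mets))
```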
Then and now, Quackenbush’s creed is that his software tools must be available in the public domain. “It’s my mantra: If you’re creating tools, they have to be useful, and they have to be used,” he says. “If they’re not either useful or used, the overall impact is going to be small; ditto if they’re just one but not the other.”
Quackenbush was considerably irked, then, when TIGR decided to go with licensing agreements for said tools instead. He remains convinced that attempting to write and market software tools in the genomics space is scarcely a winning proposition. Most of the companies that started out along this path have since gone belly-up. “This was hardly in the spirit of what we were trying to do; we were working to advance the science, rather than create tools,” he adds.
This led to his staging—along with two like-minded TIGR colleagues, Steven Salzberg and Owen White—what they humorously called the Open Source Revolution, in 1999. “We decided that if just one of us did it, he’d probably be canned; if all three were involved, [TIGR] couldn’t do anything,” Quackenbush grins. The trio’s efforts to release software to the public domain were mostly welcomed at TIGR, since this eliminated the cost of prying licensing agreements from potential users, and increased the number of successfully funded grants. Nor did it hurt that TIGR was then experiencing a lull from soap-operatic drama, with Venter occupied at Celera Genomics.
But Venter eventually returned, and between the ensuing chaos and TIGR’s shifting climate, Quackenbush decided to make his escape. He joined DFCI on March 14, 2005—a date he remembers well for two reasons. His son Adam was born exactly one year later, and March 14, as good geeks know, is also Pi Day.
This article also appeared in the March-April 2009 issue of Bio-IT World Magazine.