From Predictive Modeling to Personalized Medicine
June 10, 2008 | BOSTON—David Jordan, data warehouse architect at the Lineberger Comprehensive Cancer Center at the University of North Carolina, Chapel Hill, discussed how to design a bioinformatics data warehouse with Pentaho, which UNC selected to consolidate its tumor data. Despite a $250,000–$300,000 price tag, the project gathered information that had been dispersed across numerous databases, including separate silos for each tumor type. The UNC team developed a “primitive ontology tool” to manage and map differing vocabularies into a uniform database. The prototype launched last fall, focusing on breast and lung clinical data, a tumor registry, and microarray data.
Les Jordan, technology strategist in Microsoft’s life science unit, said Microsoft is keen to engage the open source world with renewed gusto. “We want to play nice in the sandbox,” he said. As part of this trajectory, Jordan’s team hopes to take data and “move it seamlessly,” irrespective of platform. He described a life science “ecosystem,” a catchall term for silos of information and single points of integration spanning research, development, and manufacturing. A major unsolved problem is data duplication.
Marco Ramoni, co-director of bioinformatics, Harvard-Partners Center for Genetics and Genomics, presented predictive models developed for assessing disease risks, such as the chance of stroke in patients with sickle cell anemia. Using a Bayesian network, Ramoni’s team set up tables based on genotypes to assess such risks, for example, the odds of a stroke within five years. The network scheme was extended to look at the general population, investigating myriad candidate genes. As Ramoni told Bio-IT World, “If you use one gene at a time, not all the genes together, you don’t get the predictive accuracy that you get by looking at everything together.”
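The advantage Ramoni describes can be illustrated with a minimal sketch. The genes, genotypes, and probabilities below are invented for illustration and are not from his model; the point is only that a risk looked up from the full genotype combination can reveal an interaction that a one-gene-at-a-time marginal average washes out.

```python
# Toy conditional probability table: P(stroke within 5 years | genotypes).
# Keys are (geneA, geneB) genotype pairs; all values are illustrative.
risk_table = {
    ("AA", "GG"): 0.02,
    ("AA", "GT"): 0.05,
    ("AG", "GG"): 0.04,
    ("AG", "GT"): 0.30,  # elevated risk emerges only when both variants co-occur
    ("GG", "GG"): 0.06,
    ("GG", "GT"): 0.35,
}

def joint_risk(geno_a, geno_b):
    """Look up risk from the full genotype combination."""
    return risk_table[(geno_a, geno_b)]

def marginal_risk_gene_a(geno_a):
    """Average risk over gene B -- what a one-gene-at-a-time model sees."""
    risks = [r for (a, _b), r in risk_table.items() if a == geno_a]
    return sum(risks) / len(risks)
```

Here `joint_risk("AG", "GT")` is 0.30, while the single-gene view `marginal_risk_gene_a("AG")` averages down to 0.17, blurring exactly the interaction the joint model captures.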
Vadim Sapiro, the J. Craig Venter Institute’s VP for IT, discussed grid computing and related storage capabilities developed to offset IT problems impeding genetic research. The institute’s sophisticated grid design includes standard applications for less-sophisticated users, as well as APIs for programmers. One example of these efforts is the workflow system from The Institute for Genomic Research (http://tigr-workflow.sourceforge.net), whose pipelines support discrete processes that can be executed either sequentially or in parallel.
Frank Brown, Accelrys’ CSO, discussed the company’s use of scientific business intelligence to drive decisions. He illustrated Accelrys’ vision of scientific business intelligence with a pyramid topped by a “wisdom report,” followed in descending order by “knowledge conclusion,” “information results,” and “data measurements.” The bottom of the pyramid represents the most basic information; “as you go up the pyramid, analysis and aggregation are applied to refine the many data points into knowledge.”
Maurice Manning (Lilly Singapore Centre for Drug Discovery) described a system for cataloging and integrating information from a variety of sources to allow researchers to perform more powerful searches and gain better access to data. Rather than manually compiling information from the literature, internal data, and publicly available databases, the center uses what Manning called “semantic integration.” This involves developing an ontology of the various terms used in these data sources and exploring the data with software that understands the terms and their relationships. In this way, users can perform complex searches across many databases or browse data on a given topic. Manning noted that the center’s approach to semantic integration was simpler and more efficient than the standard approach.
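The core idea of that kind of semantic integration can be sketched in a few lines: facts from different sources are stored as subject–predicate–object triples, synonyms from different vocabularies are mapped to one canonical term, and queries then work across sources. The terms, relations, and mappings below are invented for illustration, not drawn from Lilly’s system.

```python
# Triples: (subject, predicate, object) drawn from different sources.
triples = [
    ("EGFR", "is_a", "kinase"),
    ("erbB-1", "same_as", "EGFR"),      # vocabulary mapping across sources
    ("gefitinib", "inhibits", "EGFR"),
    ("kinase", "is_a", "enzyme"),
]

def canonical(term):
    """Resolve a synonym declared with same_as to its canonical name."""
    for s, p, o in triples:
        if p == "same_as" and s == term:
            return o
    return term

def query(predicate, obj):
    """Find all subjects related to obj by predicate, synonym-aware."""
    obj = canonical(obj)
    return [s for s, p, o in triples if p == predicate and canonical(o) == obj]
```

A query for `query("inhibits", "erbB-1")` still finds `gefitinib`, even though the underlying source recorded the target under the synonym `EGFR` — the kind of cross-vocabulary search manual compilation makes painful.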
Elgar Pichler, of Discovery Information at AstraZeneca R&D Boston, discussed a “protein thesaurus” that would also use semantic data. The system is open source, which could prevent companies from having to duplicate efforts to create such a thesaurus. Much of the data is extracted from the scientific literature, but Pichler encouraged the audience to support the semantic markup of data. “If you can, put me and other text miners out of business,” he said.
Michael Liebman, senior institute fellow and executive director of the Windber Research Institute, discussed novel visualization techniques including linear distance plots, which represent a 3-D protein structure as a scatter plot. Liebman explained that no single view of a 3-D protein representation can show the entire structure, whereas a linear distance plot is unaffected by rotation. He also discussed a unique family tree structure organized by time rather than generation, so that events such as the influenza pandemic of 1918 and the introduction of the polio vaccine could be included. Finally, he discussed a technology originally intended to distinguish live warheads from dummies that is now being repurposed to differentiate cancerous from benign breast masses.
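The rotation-invariance property behind a linear distance plot is easy to demonstrate: the plot is built from pairwise distances between residue positions, and those distances do not change when the structure is rotated. A minimal sketch with toy coordinates (not a real protein, and not Liebman’s implementation):

```python
import math

def pairwise_distances(coords):
    """Distance matrix over 3-D points (e.g., residue positions)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

def rotate_z(coords, theta):
    """Rotate points about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]
```

Computing `pairwise_distances` before and after any `rotate_z` yields the same matrix, which is why the plot gives one canonical view where a rendered 3-D structure needs many orientations.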
Jean Peccoud, of the Virginia Bioinformatics Institute at Virginia Tech, presented GenoCAD, a web-based application for building genetic constructs from a set of pre-made “parts.” The program was developed as a proof of concept over the past year, and the GenoCAD website says that the technology “will make the computer assisted design and fabrication of genetic systems a reality within a five-year time frame.”
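The parts-based design style GenoCAD exemplifies can be sketched as a catalog of parts plus an order rule that a valid construct must satisfy. The part names, sequences, and the single-cassette grammar below are invented for illustration and are not GenoCAD’s actual catalog or rule language.

```python
# Catalog of hypothetical parts: name -> (category, sequence fragment).
catalog = {
    "pLac":  ("promoter",   "TTGACA"),
    "rbs1":  ("rbs",        "AGGAGG"),
    "gfp":   ("gene",       "ATGAAA"),
    "term1": ("terminator", "TTATT"),
}

# Illustrative rule: a cassette is promoter -> rbs -> gene -> terminator.
GRAMMAR = ["promoter", "rbs", "gene", "terminator"]

def assemble(part_names):
    """Concatenate part sequences if the design follows the grammar."""
    kinds = [catalog[p][0] for p in part_names]
    if kinds != GRAMMAR:
        raise ValueError("construct violates part-order grammar")
    return "".join(catalog[p][1] for p in part_names)
```

Rejecting out-of-order designs before synthesis is the point of such a grammar: the tool only lets users compose constructs that are structurally valid.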
Reported by Laurie Wiegler, Ryan DeBeasi
View webcasts of the 2008 Bio-IT World Conference & Expo keynotes and the plenary panel.
This article appeared in Bio-IT World Magazine.