By Michael Swenson
June 11, 2002 | Nathan Myhrvold has been making the rounds recently. One month after he was interviewed by Bio·IT World, there he was again, popping up in Technology Review. In that interview, he made one particularly provocative statement: that heavy reliance on relational databases like Oracle, SQL Server, and DB2 within the fields of proteomics and genomics is "completely boneheaded."
Myhrvold did not elaborate, and as an entrepreneur he might not want to comment further. But let us suppose that his assessment is largely correct. What are some of the implications?
Until now, the use of relational databases within genomics and proteomics has been treated as a given, with no one prepared to question whether this was an appropriate use of the technology. That may be about to change. If the approach really is boneheaded, it will produce frustration for life science researchers, IT managers, and database designers and administrators alike. This frustration will take multiple forms: being unable to use the data structure to address important questions, being unable to design data structures that capture the most critical elements of knowledge produced by experimentation, and having to manage data types that are not well suited to the relational database structure. Are these frustrations widespread, or will life scientists rise to defend their use of relational databases for genomic and proteomic research?
If Myhrvold is correct, the growing frustration with relational databases will spark a major new market opportunity. A significant change in how genomic and proteomic data are managed could provide room for new vendors to carve out a profitable niche. We might see startups emerging from academia with truly novel and elegant approaches to managing genomic and proteomic data. Independent software vendors might develop new tools and approaches. Perhaps biotechs and pharmas would create their own alternatives. In this last scenario, it is hard to know whether the innovation would remain hidden, deployed by its creators as a competitive advantage, or would come to market as a licensable technology.
I suspect that a truly new approach to managing genomic and proteomic data will come from a company that can wholeheartedly proselytize for it without the baggage of the old. But as is often the case, the startup that figures it out probably won't survive intact to enjoy the long-term fruits of its labor. The entrenched relational database management system vendors will reverse-engineer the approach or simply acquire the startup that proves successful.
Don't Send Flowers Yet
Because the overall market for relational databases is so much larger than genomics and proteomics, or even all of the life sciences, it is unlikely that large database vendors will invest heavily in developing a better set of tools purely for managing genomic and proteomic data. It is more likely that they will partner with the developers of new solutions to deploy the new tools on top of relational databases, and thus secure their place as a component in the new solution rather than lose that business. Of course, if relational databases are as "completely boneheaded" as Myhrvold contends, then it will be much better to build tools from the ground up than to develop them as a layer on top of relational databases.
Assuming that genomic and proteomic researchers start adopting those new data management and analysis tools at a rapid rate, how disruptive would this scenario be to the database market in the life sciences? Although no vendor likes to lose share to a new technology, a substantial market for relational databases would most likely remain in other functional areas of life science research, such as clinical trials.
Even for genomics and proteomics, it is too soon to count relational databases out of the game. Pharmaceutical and biotech labs have substantial investments in relational databases as their core data management technology. Those companies have spent millions to build customized applications on top of their relational databases and invested thousands of hours in learning to use these systems. Sometimes careers are wrapped up in the continuation of these legacy systems. Relational databases have proved to be flexible and useful tools. Any new approach — no matter how elegant — will face significant barriers to adoption.
But assuming Myhrvold is indeed correct, what kind of data management and analysis tools might be more appropriate? Systems-oriented tools such as GenMAPP, which lets the user take a visual template of what is known and predicted about gene regulatory circuits or metabolic pathways and overlay gene expression data on that circuit diagram? Although GenMAPP is an interesting visualization tool, it doesn't seem to require a change to the underlying data model. Perhaps one of the approaches of companies offering data integration? Not likely, since most of those vendors take the relational database as a given. Or is it a matter of borrowing and adapting tools already deployed in other fields? In that case, interdisciplinary research centers are in the best position to develop the next generation of tools for proteomics and genomics.
Although I have raised far more questions than I have answered, I do predict that relational databases will be used in proteomic and genomic research for years to come. But competition and fresh approaches are always a welcome addition to any field.
Michael R. Swenson is a senior research analyst at IDC, and can be reached at firstname.lastname@example.org.