Watching data management in genomics over the past 15 years has been a little like witnessing a Neoterics experiment. Each cycle of data floods, new and incompatible technologies, and integration issues has resulted, after the initial disruption, in simpler, faster, and more capable systems.
Sample metadata such as species, strain, and provenance were stored in standard lab notebooks until the first insult: the Human Genome Project. Lacking teams of software developers or large software budgets, sequencing labs invented new server-side data management tools such as AceDB (enabled by open-source Unix systems). These systems were more accessible than existing laboratory information management systems (LIMS), but scientists were still forced to try to communicate their needs to bioinformaticians, who had scant understanding of many-to-many relationships and primary keys. At the same time, client-side solutions like Excel spreadsheets became commonplace, affording scientists control of their data, at least until they tried to manage more than a few thousand records.
The next insults were the twin challenges of shotgun sequencing and the commercialization of genomics. Between commercial genomics firms (Incyte, Celera, etc.) and the ramp-up of capacity in pharma companies, the Neoterics of data management were suddenly presented with new dimensions. Pharma turned to commercial LIMS like SQL*LIMS and ActivityBase and/or designed entirely new internal systems. The typical server-side system involved Oracle sitting on a Solaris machine and serving up either thick Java clients or thin web form applications. While these systems were capable in terms of data robustness and stability, scientists struggled to work with database and software specialists. When the collaboration succeeded, data could be properly structured, but workflow changes could require lengthy cycles of testing.
At the same time, client-side database management systems (DBMS) like FileMaker Pro and Microsoft Access became prevalent. These improved on spreadsheets by providing controlled data entry through forms and structured field definitions. While these systems could be developed by lab personnel, they shared a downside with the big Oracle systems: many system elements needed to be changed to accommodate schema alterations. Sharing was a problem as well, with crude file-locking mechanisms often leaving a single lab member responsible for all data entry.
The most recent insult, of course, has been the advent of next-generation sequencing (NGS). 454/Roche, Illumina, and SOLiD platforms have huge data needs. While process and sample metadata issues were initially dwarfed by the management of sequence and mapping results, more recently, sample and experimental data have begun to eclipse issues surrounding the data themselves, due to the increasing use of NGS machines for transcript and multiplexed runs. These machines have also pushed into smaller labs, where specialized software development is mostly impractical.
One way to meet this challenge is via wikis, especially the MediaWiki platform, which allows page contents to be defined directly by a non-technical author through a simplified markup. With the development of the Semantic extensions to MediaWiki, user-defined content has entered the realm of structured fields: users can define typed properties and group them into an entity with associated pages and forms. As with client-side DBMS, data definition can be done by a lab scientist, but, unlike those systems, multi-user and shared operation is a fundamental part of the software platform.
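As a rough illustration of what "typed properties grouped into an entity" can mean in practice, here is a minimal Python sketch. It is not the wiki's markup or API; the names EntityType, Property, add_property, and new_record are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Property:
    """A user-defined property: a name plus the Python type its values must have."""
    name: str
    kind: type


@dataclass
class EntityType:
    """A named group of typed properties (for example, a 'Sample')."""
    name: str
    properties: dict = field(default_factory=dict)

    def add_property(self, name, kind):
        # Adding a field is one call: no table migration, no form rebuild.
        self.properties[name] = Property(name, kind)

    def new_record(self, identity, **values):
        # Reject values for unknown properties or of the wrong type.
        for key, value in values.items():
            prop = self.properties.get(key)
            if prop is None:
                raise KeyError(f"{self.name} has no property {key!r}")
            if not isinstance(value, prop.kind):
                raise TypeError(f"{key!r} expects {prop.kind.__name__}")
        return {"type": self.name, "id": identity, "values": dict(values)}


# A lab scientist defines the entity and its fields...
sample = EntityType("Sample")
sample.add_property("species", str)
sample.add_property("collected", date)

# ...and then records data against that definition.
record = sample.new_record(
    "S-001", species="Mus musculus", collected=date(2009, 1, 15)
)
```

The point of the sketch is that the definition of a "Sample" is itself ordinary data that a lab member can change, rather than a schema that only a database specialist can alter.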
My BioTeam colleagues adapted structured MediaWiki to laboratory data management with the WikiLIMS platform (see “WikiLIMS - Next-Gen Data Management,” Bio•IT World, April 2008), a combination of Semantic MediaWiki and custom extensions used to support sample, library, and experiment tracking. The simple property markup allows rapid construction (and destruction) of prototype pages, and new data types and updates to existing structures can be made without repeated calls to software specialists.
Recently, my colleague Michele Clamp has taken the wiki system to its logical conclusion. A downside of MediaWiki is that many layers of its code base are irrelevant to laboratory data management needs. A new product, MiniLIMS, brings user-defined properties and types into a smaller, more focused package that can adapt to lab needs using simple forms. A parsimonious data structure (property names and values along with associated object type and identity) provides flexibility for varied lab situations. A plug-in architecture enhances the base system with out-of-the-box second-generation sequencing support. Lab scientists now have the tools to manage data and adapt to new workflows. We look forward to the next insult!
Aaron Kitzmiller is the latest recruit to join The BioTeam, a bio-IT consultancy. Email: firstname.lastname@example.org