In my 28 years of involvement with pharmaceutical R&D, I’ve seen no single issue that has generated more meetings, instigated more fervent debate, consumed more dollar investment, and generated solutions with less overall impact than what to do with our study data once a report has been finalized (especially the regulated GLP data). The average tox study report contains 500-1000 pages of data and results, often collected from 5-10 different LIMS systems that cross several departmental boundaries. In addition, these studies are increasingly conducted at CROs around the globe or are licensed-in (or inherited via mergers and acquisitions) from other companies that had different processes and informatics tools. Finally, most large pharmas have study reports dating back at least to the implementation of GLP (1978), and many large pharmas were still producing paper tox reports well into the late 1990s.
The veritable ocean of safety data at any one company (especially those with multiple research sites around the world) is itself staggering, and when you consider that this is reproduced 10-20 times across the industry, the potential knowledge resource is mouth-watering to anyone wishing to build in silico models of toxicity. Yet at most companies, to my eternal amazement, this information is so difficult to get at that it makes any kind of data mining or knowledge generation activity all but impossible. What is worse, even simple questions like “Have we seen this before?” are nearly impossible to answer. Sadly and remarkably, most companies have very little access to their preclinical safety history. Here at CHA, this is the number one area that current or past heads of toxicology/preclinical safety tell us they would change if they had a magic wand. And with the advent of new high data volume safety assessment tools like high content analysis and toxicogenomics, the problem is getting worse at an alarming rate.
The Season for Success
In the mid 1990s, many in the preclinical safety field began to envisage a system that could be queried across all their safety studies to ask questions about their safety history and possibly generate new knowledge. A few brave souls began designing/building such systems based on the emerging field of data warehousing, and then started the long (and expensive) process of cleaning and moving all their LIMS data into the new warehouses. Some of these have now been in production for several years, yet the overall assessment of their value has remained fairly negative:
- They are overly complex and specific, often based on database schemas that are themselves mammoth and difficult to navigate and mine effectively (especially for users just looking to ask simple questions about their previous work).
- They are s-l-o-w, with some users reporting query times of hours to days.
- They were built by IT groups as part of some larger vision for a company-wide R&D informatics repository, and therefore the needs and objectives of the preclinical scientists were not considered or prioritized appropriately.
- While cleaning and copying data from an in-house LIMS is expensive, there are tools that can make this semi-automated. But for studies coming from CROs, predecessor companies, or paper reports, there are no good options for doing this in an automated fashion. As such, the majority of data warehouse systems currently in use are missing the large sections of the safety history that come from these sources.
- Names and terminology often differ across LIMS and departments, and have themselves evolved over the years. This makes apples-to-apples comparisons difficult when looking to compare data across systems, sites and companies.
So, what is different now that leads me to believe this time it will be worth the effort?
- There is a growing sense that a solution generated by any one company will be expensive and not as useful as one generated on an industry-wide basis with the LIMS vendors included; that is, this is probably best done collaboratively.
- There are several new tools on the market whose primary function is to gather and synthesize data from multiple sources and thus obviate the need to clean and copy data from many LIMS systems into one giant data warehouse.
- There are multiple efforts to set standard terminology to make queries across systems more effective and consistent.
- Hopefully, we will have learned from our past experiences.
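To make the second and third points above a little more concrete, here is a minimal Python sketch of how a shared vocabulary lets a simple “Have we seen this before?” question be asked across several sources in place, without first copying them into one warehouse. Everything in it is an assumption made for illustration: the synonym table, the source names, the study IDs, and the record layout are invented, and it does not represent any particular vendor tool or standards effort.

```python
# Hypothetical synonym table: source-specific term -> shared standard term.
# All terms, sources, and study IDs below are invented for illustration.
STANDARD_TERMS = {
    "alt": "alanine aminotransferase",
    "sgpt": "alanine aminotransferase",
    "hypertrophy, hepatocyte": "hepatocellular hypertrophy",
}


def normalize(term):
    """Map a source-specific term to the shared vocabulary, if known."""
    key = term.strip().lower()
    return STANDARD_TERMS.get(key, key)


def seen_before(term, sources):
    """Return (source name, study id) pairs whose records match the term.

    Each source is queried where it lives; nothing is copied or merged.
    """
    target = normalize(term)
    hits = []
    for name, records in sources.items():
        for record in records:
            if normalize(record["term"]) == target:
                hits.append((name, record["study_id"]))
    return hits


# Two hypothetical sources that use different names for the same things.
sources = {
    "in_house_lims": [{"study_id": "S-001", "term": "SGPT"}],
    "cro_export": [
        {"study_id": "C-417", "term": "ALT"},
        {"study_id": "C-418", "term": "Hypertrophy, hepatocyte"},
    ],
}

print(seen_before("alanine aminotransferase", sources))
```

In this toy version the hard part is hidden in the synonym table. Building and maintaining that mapping across companies and decades of studies is exactly the work the terminology efforts would need to take on.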
Ernie Bush is VP and scientific director of Cambridge Health Associates. He can be reached at firstname.lastname@example.org.
This article also appeared in the November-December 2010 issue of Bio-IT World Magazine.