Bringing Light—and Value—to Dark Data in Research & Design

Contributed Commentary By Graeme Dennis

June 29, 2020 | Data are often recognized as a top asset of organizations in any field, including the biosciences, but the way they are handled may not adequately reflect this. One explanation for this mismatch is that the amount of data produced every day is enormous but not necessarily accessible. Forrester estimates that 60-73% of all data within an organization go unused for analytics, instead relegated to storage and quiescence. These have been dubbed “dark data,” but there’s not much dark about them: they are simply information that companies, in their regular course of business, collect, process, and store.

There are several reasons for the volume of dark data: local storage, an explosion in the number of users at all levels of computer literacy, and essential data security. In the sciences, particularly the global pharmaceutical market, traditional boundaries have begun to dissolve, which also generates far more data. For instance, the popularity of alternative medicine in Western markets is mirrored by the massive Chinese “Healthy China” universal coverage initiative and a Japanese government focus on increasing the use of generics to reduce treatment costs. Meanwhile, informatics tools must now keep pace with new biotechnology, including immunologic, antisense mRNAi, and cell-based therapies, and with the global demand for biologic medicines and vaccines.

The R&D driving all these advances is generating a staggering amount of data. The key question is: how can our industry make these data accessible over the long term and treat them as a free-standing, central asset that reflects their intrinsic value? Leadership and IT departments frequently lament the fact that the same platform can’t serve exploratory research, preclinical pharma, and manufacturing. The question is particularly relevant to the biopharmaceutical industry, where the data may ultimately contribute to important new therapies. As we become nimbler in delivering workflows and more willing to venture outside traditional software classes, we’ve begun to solve the problem of standalone systems across siloed disciplines.

Though some dark data may have great utility, not all do. Some simply exist and accumulate because there is no suitable system to house them. For instance, many in vivo preclinical results generated by contract research organizations (CROs) are in portable document formats unsuited to analysis, or in email, which is an unshared environment by design. This choice is a matter of convenience, but the data are then not only unsharable but also invisible to the organization. Even if the data were exposed, many scientists would never make use of them without precise knowledge of how, when, by whom, and under what conditions they were gathered, preferring instead to re-run the study, which requires time, money, and resources.

There is a spectrum of positions on dark data, from not exposing any of it to strict data retention policies that resemble “hoarding.” Neither is likely the answer. Our goal is not simply to derive value from dark data, but also to store less of it: we can’t assume it’s best to keep everything just because it’s inexpensive to do so. The fact that much data persists unexposed may itself speak to how useful it really is.

Of the data that are retained, the focus should be on providing contextually rich data. Legacy methods of recording and collating data are not only error-prone; they also fail to capture full context. Sometimes called data provenance, this context can dictate the reusability or “interpretability” of data. Instrumentation, lab location and conditions, materials, sample origin, transport conditions, and custody are factors that should be considered throughout the drug development lifecycle. Capturing this information relies on an advanced informatics infrastructure and significant forethought in system design. Three key strategies allow any organization to realize incremental benefit:

Recognizing data maturity: Stratify data by level of maturity, and then by function (e.g., most active project, candidate, or biological relevance). For instance, data may be considered fully dark or sequestered, shared (perhaps on a shared drive or SharePoint), structured (stored in a database), or ultimately standardized (both structured and harmonized with internal or industry standards).

Exploring integration approaches: The closer data are captured to the time of collection, the less likely they are to become sequestered. Vendor assessments must raise integration capabilities early and often in the evaluation of candidates. Best intentions notwithstanding, data quality problems are rarely remediated.

Embracing a "Data First" cultural shift: A "Data First" strategy acknowledges the importance of data, socializes it, and provides the tools that enable success. The F.A.I.R. data principles, suggesting that data be findable, accessible, interoperable, and reusable, provide valuable implementation guidelines.
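As an illustration only, the maturity levels and provenance factors described above could be modeled along the following lines. This is a minimal Python sketch under stated assumptions, not a description of any particular informatics product: the class names (Maturity, Provenance, Dataset), the field names, the stratify helper, and the sample dataset names are all hypothetical and chosen to mirror the wording of this commentary.

```python
from dataclasses import dataclass, field, fields
from enum import IntEnum
from typing import Dict, List, Optional


class Maturity(IntEnum):
    """Hypothetical encoding of the four maturity levels, least to most mature."""
    DARK = 0          # fully dark or sequestered (e.g., email, local disk)
    SHARED = 1        # on a shared drive or SharePoint
    STRUCTURED = 2    # stored in a database
    STANDARDIZED = 3  # structured and harmonized with internal or industry standards


@dataclass
class Provenance:
    """Context factors listed in the text; None means 'not captured'."""
    instrument: Optional[str] = None
    lab_location: Optional[str] = None
    lab_conditions: Optional[str] = None
    materials: Optional[str] = None
    sample_origin: Optional[str] = None
    transport_conditions: Optional[str] = None
    custody: Optional[str] = None

    def missing(self) -> List[str]:
        """Return the names of provenance fields that were never recorded."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


@dataclass
class Dataset:
    name: str
    project: str  # the "function" used for the second level of grouping
    maturity: Maturity
    provenance: Provenance = field(default_factory=Provenance)


def stratify(datasets: List[Dataset]) -> Dict[Maturity, Dict[str, List[str]]]:
    """Group dataset names by maturity level first, then by project."""
    groups: Dict[Maturity, Dict[str, List[str]]] = {}
    for ds in datasets:
        groups.setdefault(ds.maturity, {}).setdefault(ds.project, []).append(ds.name)
    return groups


if __name__ == "__main__":
    # Invented sample inventory, for demonstration only.
    inventory = [
        Dataset("cro_report_pdf", "PROJ-A", Maturity.DARK),
        Dataset("assay_results", "PROJ-A", Maturity.STRUCTURED,
                Provenance(instrument="plate reader", lab_location="Site 1")),
        Dataset("shared_spreadsheet", "PROJ-B", Maturity.SHARED),
    ]
    for level, by_project in sorted(stratify(inventory).items()):
        print(level.name, by_project)
    for ds in inventory:
        print(ds.name, "is missing provenance for:", ds.provenance.missing())
```

Ordering the levels as integers makes it straightforward to ask which datasets sit below a target maturity, and the missing() check gives a rough indication of how much context would have to be recovered before a dataset could be reused.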

To accelerate the early stages of drug development, Big Pharma has increasingly partnered with third-party R&D firms that are technology-focused and recognize the power of data. Approaches with origins in manufacturing and QC, such as method execution and sample management, have been successfully applied to drug development.

Revealing dark data makes it possible to replicate optimal processes and conditions, which can ultimately reduce workload and increase efficiency. Error detection, for example, becomes much easier when one can pinpoint what changed in a given context. By “rewinding” to that point, it is possible to resume development rather than start from scratch. Once an organization has developed a strategy for illuminating dark data, it can recommit precious resources to a F.A.I.R. approach, incorporating smart, informed data strategy from design to acquisition to insight generation.

Graeme Dennis is commercial director, Preclinical Pharma, IDBS. He joined IDBS as the commercial director of preclinical pharma in 2018. Prior to IDBS, Graeme held scientific informatics roles in academia and industry, including Accenture, Dotmatics and Vanderbilt University Medical Center. He can be reached at GDennis@idbs.com.