The Need for a Resilient Data Foundation for Drug Discovery and Manufacturing

December 3, 2020

Contributed Commentary by Charles Fracchia

More than any other modern crisis, the COVID-19 pandemic has highlighted the complexity, interconnectedness and global nature of drug development and manufacturing. This public health challenge has also made it painfully clear how critical these workflows and processes are. It is, without exaggeration, a matter of life or death for vulnerable populations around the world. The crisis also comes at a time when many of the larger pharmaceutical companies are in the middle of company-wide digitization initiatives, modernizing their processes to take better advantage of their data. This convergence represents an opportunity to speed up and improve drug discovery, but it must be approached carefully to ensure that future infrastructure is robust to both increased demand and increased levels of attack.

The Bioeconomy is Digital

Today’s instrumentation, workflows and analysis tools are predominantly digital. From instruments generating terabytes of data per run, to machine learning algorithms tuned to predict the optimal drug candidate using this information – they all produce and consume an unprecedented amount of digital data. In recent years, a number of companies were created to take advantage of this digital data revolution, including the likes of Recursion Pharma, Insitro, Sana Therapeutics and Cellarity. The increased use of contract research organizations (CROs) is adding to this trend by generating large volumes of important digital data in more geographically and organizationally diverse environments.

Digital Opportunities and Risks

All this digital data provides new opportunities to discover more effective candidates in shorter timeframes, using approaches such as machine learning and cloud computing. However, digital information is also easier to manipulate, steal or hold for ransom, which creates new challenges for the industry. Because the data is highly distributed, both geographically and across the drug development workflow, securing it is a complex task. A company must now take into account a new perimeter of vulnerability, one that includes laboratory instruments, instrument computers, bioinformatics software and even the external resources loaded by that software. The problem is compounded by the need to establish a certain level of security with partners and CROs as well; otherwise, contaminated files, analysis results, software or even third-party access can be exploited. These are not hypothetical risks either: we have seen them play out in the power industry, industrial control systems, biological academic centers, and even actively at companies working to find countermeasures to the COVID-19 pandemic.

Building the Bioeconomy on a Strong Foundation

While the task may seem daunting from where we stand today, we can, and we must, tackle this problem and build a strong foundation that will ensure a resilient bioeconomy for decades to come. There are two primary factors we must consider in order to achieve a resilient, distributed data infrastructure for drug development: 1) integrity and security and 2) access and consumption.

No Data Security Without Data Integrity

Typically, the problems of data security and data integrity are discussed separately. The former often refers to encryption, while the latter focuses on chain of custody and regulatory requirements. However, in the current climate of pervasive cyberattacks, and given the dominance of digital data in modern drug development activities, these two aspects have grown more intertwined than ever before. Attackers are actively targeting biomedical institutions, in some cases with the goal of holding critical data for ransom. In this scenario, any weak link in the data's chain of custody and integrity verification is a position an attacker can exploit to surreptitiously modify data, affecting every step downstream. Current systems for data integrity have primarily focused on providing a compliance check. We must now be more proactive with regard to data integrity: monitor integrity and security at every link, and build systems that alert key users when deviations occur. This is a necessary evolution for the protection of our infrastructure against ransomware-style attacks.
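As a rough illustration of monitoring integrity at every link, the sketch below chains a SHA-256 digest through each custody step, so that a change at one step can be located and flagged. It is a minimal sketch under stated assumptions, not a description of any particular product: the step names, the `record_link`, `build_chain` and `first_deviation` helpers, and the in-memory chain are all hypothetical. A real deployment would also need to sign the recorded digests or store them out of an attacker's reach, since an unkeyed hash chain alone does not stop someone who can rewrite the chain itself.

```python
import hashlib
import hmac

# Hypothetical starting value for the first link in the chain.
GENESIS = "0" * 64


def record_link(prev_digest: str, step: str, payload: bytes) -> dict:
    """Record one custody step, binding its data to the previous link."""
    payload_digest = hashlib.sha256(payload).hexdigest()
    material = f"{prev_digest}|{step}|{payload_digest}".encode()
    return {
        "step": step,
        "payload_digest": payload_digest,
        "prev": prev_digest,
        "digest": hashlib.sha256(material).hexdigest(),
    }


def build_chain(steps):
    """Build the chain from (step_name, payload) pairs as data moves downstream."""
    chain, prev = [], GENESIS
    for step, payload in steps:
        link = record_link(prev, step, payload)
        chain.append(link)
        prev = link["digest"]
    return chain


def first_deviation(chain, observed):
    """Return the first step whose observed data no longer matches the chain, or None."""
    prev = GENESIS
    for link, (step, payload) in zip(chain, observed):
        expected = record_link(prev, step, payload)
        # Constant-time comparison of the recomputed and recorded digests.
        if not hmac.compare_digest(expected["digest"], link["digest"]):
            return link["step"]
        prev = link["digest"]
    return None


# Hypothetical workflow: instrument -> instrument computer -> analysis.
steps = [
    ("instrument", b"raw reads"),
    ("instrument_pc", b"raw reads"),
    ("analysis", b"candidate scores"),
]
chain = build_chain(steps)

# Simulate a surreptitious modification on the instrument computer.
tampered = [steps[0], ("instrument_pc", b"raw reads (modified)"), steps[2]]
deviation = first_deviation(chain, tampered)
if deviation is not None:
    print(f"ALERT: integrity deviation at step '{deviation}'")
```

In this sketch the alert is a print statement; in practice it would be whatever notification channel reaches the key users responsible for that link.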

Data Access and Consumption

In our field, strong encryption and data security measures have historically not been a major consideration. Instrument and software vendors rarely, if ever, implement data encryption capabilities in their products, leaving the majority of tools underpinning our infrastructure exposed. Authentication solutions are more often available, but they primarily revolve around compliance needs or user management and are rarely designed with the intent to extend security "in depth". For the bioeconomy's infrastructure to be resilient, data access and authentication must be designed into the system at every layer. This means having granular attribute-based and role-based encryption mechanisms for data, providing detailed control over individual pieces of data as well as over people's access to them. This approach must be implemented all the way from the moment the data is generated on an instrument to the point at which it is stored. Without these approaches, attackers will continue to be able to steal intellectual property easily, and at the vast scale that we see today.
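To make the idea of granular role-based and attribute-based control more concrete, here is a minimal sketch of the access decision alone. It does not implement encryption: true attribute-based encryption is a cryptographic scheme that enforces such a policy in the ciphertext itself, and this example only shows the policy check that would sit in front of a decryption key. The `User` and `Record` classes, the `can_read` function, and the role and attribute names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    roles: set
    attributes: dict = field(default_factory=dict)


@dataclass
class Record:
    record_id: str
    allowed_roles: set          # role-based part of the policy
    required_attributes: dict   # attribute-based part of the policy


def can_read(user: User, record: Record) -> bool:
    """Grant access only if the user holds an allowed role AND every required attribute."""
    has_role = bool(user.roles & record.allowed_roles)
    has_attributes = all(
        user.attributes.get(key) == value
        for key, value in record.required_attributes.items()
    )
    return has_role and has_attributes


# Hypothetical record produced by an instrument for one project at one site.
assay_run = Record(
    record_id="run-001",
    allowed_roles={"scientist", "qa"},
    required_attributes={"project": "alpha", "site": "boston"},
)

alice = User("alice", {"scientist"}, {"project": "alpha", "site": "boston"})
bob = User("bob", {"scientist"}, {"project": "beta", "site": "boston"})
cro_tech = User("cro_tech", {"contractor"}, {"project": "alpha", "site": "boston"})

for user in (alice, bob, cro_tech):
    print(user.name, "can read" if can_read(user, assay_run) else "denied")
```

Requiring both a role and matching attributes is one possible design choice; it means a partner or CRO account with the right project attribute still cannot read the record unless its role is explicitly allowed.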

Charles is dedicated to creating the smart laboratory of the future driven by the move to datacentric research. Before founding BioBright, Charles worked for IBM Research and Ginkgo Bioworks. He has been a speaker at a number of technical and policy venues, including the White House. In 2016, he was named one of 35 innovators under 35 by the MIT Technology Review. He received his graduate education between the MIT Media Lab and Harvard Medical School and obtained his bachelor’s degree from Imperial College London. He can be reached at