Bryn Roberts: Reflections on Pharma’s Scientific Computing Journey
By Stan Gloss and Loralyn Mears
November 10, 2020 | TRENDS FROM THE TRENCHES—Pharma has had a long history with data. Of course, it wasn’t always digital but there’s always been a data trail. Experiments from centuries ago, particularly the mid-19th century, which was the era of pharmaceutical breakthroughs in vaccination, pasteurization, and microbial fermentation, left a data trail. Every piece of data maps out a story and data are the ultimate renewable resource—you can use them over and over again in different ways to make new discoveries and create new hypotheses.
Around that period in history, scientists began widely embracing the practice of recording experiment details in lab notebooks. Louis Pasteur, the father of germ theory, kept meticulous records of his ground-breaking experiments which ultimately refuted the theory of spontaneous generation. The bulk of Pasteur’s lab notebooks were held in secret by his family with the exception of his 1847-1848 notebooks with his work on chirality, which went missing for nearly a century. The others were not released until the 100th anniversary of his death. Analysis has since exposed a few controversies and inconsistencies which highlight the point that data may become inaccessible, but they don’t go away.
These examples parallel pharma’s journey with scientific computing and we wanted to get a historical perspective. So we asked Bryn Roberts from Roche, a thought leader in the industry, to share his insights. Roberts, originally a pharmacologist, has spent almost three decades in the pharma industry, mainly in roles related to data science, information technology, and automation. Today he oversees a broad range of capabilities in R&D operations. Generating and collecting data has never really been a problem. Managing it, however, has always been a challenge.
“If we go back 20 years in pharmaceutical R&D, we had relatively little data. At that time, before I joined Roche, I remember us bringing the first big storage network online for a former employer. It was considered huge at that time, yet it was only around 10 Terabytes—little more than a couple of PC hard drives today. Technical people have traditionally focused on issues with storage, network bandwidth, compute power and so on, but what’s changed for me (and many other data scientists) since then, is how I think about data. Now, I think more about the value that data can bring through the insights we can extract,” said Bryn as he reflected on his personal journey with scientific computing. “It used to be all about ‘my data, my data,’ and now it’s become ‘our data’ with this mindset shift where data are now regarded as a valuable company asset.”
Interoperability has been a key theme—and goal—for decades. The rise of the FAIR data principles (Findable, Accessible, Interoperable and Reusable) has spurred pharma to make connections between and among scientific fields like biology, chemistry, and clinical information, which were typically disconnected in their own data islands. “Today,” Roberts said, “We realize that we’re more often asking questions that span these domains. We’re translating information, raising questions, and iterating as more data move back and forth between the lab and the clinic. When we talked about these concepts 20 years ago, they were quite esoteric and we’re still not quite there yet but we’re getting close to where we want to be.”
Data are now routinely being moved into data lakes to enable FAIR but Roberts has mixed feelings about the approach. “The principle of bringing data together in a way that you can access them across the different domains of an organization is, of course, very valuable. The worry is how and why it's done. Volume-based metrics are of little value and often drive the wrong behaviors. Asking key questions around the business benefits enabled by those data lakes and tracing critical decisions that have been supported through big data FAIR-ification projects is more important. I think the focus now is on the secondary use of data; 20 years ago, focus was on the primary use. Maybe this is one of the key ways that scientific computing has evolved: we now think much more about the secondary and tertiary uses of data and insights.”
Data harmonization becomes an important aspect here: making data findable and accessible across an organization can be achieved by sharing data dictionaries, for example. Contextualization influences how useful those data can be. “You want to capture the context (metadata) in which the data were created so that you can later understand the boundaries of how the data can be used and which questions can be asked of the data. Roche has been running a program on Enhanced Data & Insight Sharing (EDIS) for the last few years,” Roberts said, “where we FAIRify clinical data assets which were, like most pharmas, historically siloed by therapeutic program or disease area. Now, we can ask questions that span trials and projects, and that’s where the contextual data become important around the clinical protocols, how tissue samples were collected and treated, for example.”
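To make the idea of a shared data dictionary and captured context concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the dictionary entries, the field names, the required context keys, and the `ContextualizedRecord` class are assumptions, not a description of how Roche's EDIS program is implemented.

```python
from dataclasses import dataclass, field

# Hypothetical shared data dictionary: field name -> (description, unit).
DATA_DICTIONARY = {
    "alt_u_per_l": ("Alanine aminotransferase", "U/L"),
    "age_years": ("Patient age at enrollment", "years"),
}

# Hypothetical context keys a steward might require before data are shared.
REQUIRED_CONTEXT = ("protocol_id", "sample_collection", "sample_treatment")


@dataclass
class ContextualizedRecord:
    """A measurement record bundled with the context in which it was created."""
    values: dict
    context: dict = field(default_factory=dict)

    def undefined_fields(self):
        """Return value fields that the shared dictionary does not define."""
        return sorted(k for k in self.values if k not in DATA_DICTIONARY)

    def missing_context(self):
        """Return required context keys that are absent or empty."""
        return [k for k in REQUIRED_CONTEXT if not self.context.get(k)]


record = ContextualizedRecord(
    values={"alt_u_per_l": 42.0, "bmi": 23.1},
    context={"protocol_id": "TRIAL-001", "sample_collection": "venous blood"},
)
print(record.undefined_fields())  # fields lacking a shared definition
print(record.missing_context())   # context still to be captured
```

A check like this, run when data are deposited, would flag fields that lack a shared definition and records whose collection context was never captured, which are the gaps that limit later cross-trial questions.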
Data Citizenship and Stewardship
One of the biggest data transformations has been a cultural one where the collective mindset of researchers has shifted from “my data” to “our data” as we realize that data are a collective company asset. With data now broadly shared across research units, scientists naturally raised questions around ownership and what happens if another team within the company makes a discovery and publishes the content before the original owner does? Roberts said that they’ve had numerous discussions around these questions as part of the EDIS project.
“We discussed and developed our approach to data citizenship. That describes certain rights and certain accountabilities, obligations, if you like. As a data creator you may have a degree of control over what the data you generate are used for, even if only for a limited period of time. As the consumer of data generated by somebody else, obligations may include the need to communicate with them, discussing what you intend to do with the data, and sharing the results: general courtesy in a collaborative ecosystem.”
The concept of “our data” is inherently complex as it’s not limited to the interests of a single organization. Much of pharma’s research is enabled through collaborations, contracted work, M&A, licensing agreements and so on. There are legal, ethical and privacy considerations, for example. Beyond data citizenship, stewardship comes into play, the responsibility of caring for data while they’re under your scope. Stewards are also obligated to leave data in a state of value that is equal to, or greater than, when they inherited them.
Data Are A Renewable, Shareable Resource
That's the beauty of data as a resource: the more they’re touched and used, the more they get refined. If data are passed from steward to steward, the dataset is continually enhanced. The human genome, for example, becomes more and more valuable with each enhancement and as it gets embedded as part of the scientific culture. Given today’s computing technology environments, there isn’t any finite limit imposed on how often data can be used and iterated upon. This underscores the value of data and the important role of stewards in preserving their integrity.
Sharing real-world data and knowledge has been changing the game in clinical trials over the past few years. Doing so exposes numerous complications but, according to Roberts, it offers two significant advantages. “One, you don't have to put patients through what are essentially unnecessary comparator arms in trials, if you already have comparable, high quality, real-world data on how the standard-of-care treatments perform. And two, these data represent the real-world setting, which is the true baseline that we’re trying to improve with innovative new treatments.”
Digital twins represent the next frontier in the application of real-world data to clinical trials. Human digital twins represent an extraordinarily complex and advanced state of computing, one that we’re not quite ready for, although provocative and potentially immeasurably valuable. Roberts shared some of Roche’s efforts using digital twins for their new research buildings and ecosystems. Users can test things in the virtual setting, then implement them in the real setting. “It's really effective for things like energy saving, CO2 minimization, airflow optimization, all these kind of things. The challenge, of course, with doing a digital simulation or a twin of a human is almost infinitely complex. However, by breaking the problem down, applying the aphorism that ‘all models are wrong but some are useful’ [George Box], we already have useful models for certain biological processes, organs and systems, so-called Quantitative Systems Biology or Pharmacology.”
Perhaps one could run an algorithm for the predicted outcome of a clinical trial for an anti-cancer drug against a personal digital twin to see how it would respond. It’s an intriguing concept, yet ethically we quickly start to bump up against real, serious privacy issues. Ethical considerations will need to be addressed in parallel to the development of advanced computing scenarios that replicate natural biology—and we’re a way off from enabling either at this stage.
Conceptually, with data regarded as an asset that improves in value with each enhancement rather than expiring with consumption, capturing and sharing knowledge becomes the new edict for data scientists. What becomes essential now for data stewards is to capture what the scientists (or others who touched the data) did with them. Ideally, the goal is to preserve why they used the data in the way that they did, to enable correlation with the outcomes and insights generated as a result.
Data science has evolved beyond the point of calling out SNPs to actually manipulating the data, and Roberts emphasized the importance of capturing those manipulations and outcomes in a structured way. Verifying those insights opens new possibilities, he said. “Suddenly, what was tacit knowledge in a data science community is made explicit, because why the experts did something that way is captured, and can be reused by others in the future. Imagine a future augmented intelligence system prompting us with suggestions like ‘for similar questions of similar data the most successful strategy has proved to be to XYZ!’.”
This becomes the essence of discovery; empirical medicine is heavily based on it and facilitated through screening and testing. This is the intersection of serendipity and science, but the de novo, prospective design aspects are where it gets interesting. Bryn commented, “One of the things that does change, I think, with this huge volume of high-quality, relevant data and modern data science approaches, such as AI, is the ability to progress away from a pure discovery paradigm to include more de novo model-based design.”
The flipside of data being a renewable resource is the deliberate termination of a data journey. Do data ever go away? “That’s the call that no data scientist wants to make,” Roberts said. “When storage is relatively cheap it’s tempting to fall back on keeping everything ‘just in case’. The reality is, however, that there comes a point where data are no longer valuable to an enterprise, they may even be a net burden or liability, and should be removed from the stack—particularly data generated using old technologies and where there is insufficient contextual metadata to allow them to be re-used.”
If we theorized on how the Industrial Age was akin to the Information Age and how each preceding age supports the coming of the next age, we can consider a few provocative concepts. We’re currently in the Biological Age where generating high volumes of data is “still really important, but actually quality is king,” Roberts said. “How many of the quality parameters can be captured by instruments in the process that can support questions later, like the veracity questions of the four Vs of big data? That's really, really important.”
Looking back at pharma’s scientific computing journey, Roberts argued that the laboratory automation of the ’90s (high-throughput screening, sequencing, combinatorial chemistry, etc.) was a kind of industrial revolution for pharma research. “Things were highly automated; we had parallelization, labor arbitrage and many of the features you typically see in an industrial setting. Automated capture of metadata is super important for the next age. If you think about moving into Industry 4.0, the cyber-physical era, with ubiquitous digitalization, the Internet of Things (IoT) and connections and compute via the cloud, we’re pretty much there now moving into the Analytical Age.”
One can’t talk about analytics without raising the game-changer: machine learning. Roberts postulated, “I think one of the biggest impacts of the last five years has been deep learning. I think we have really started to understand, for example, how much information is encoded in a scientific or medical image, through the application of convolutional neural networks to large image collections.”
Is the Quantum Age beyond the Analytical Age? Quantum computing is currently of great interest to Roberts. “To date, there are a few tens of algorithms that are postulated or proven to be accelerated using quantum computing, such that what is incalculable today, even on the biggest conventional HPCs, may be within reach. Are there algorithms applicable to drug discovery (or design) that could change the rules of the game, once a universal quantum computer of sufficient scale and robustness is available? That, for me, is the killer reason to be engaged.”
He continued, “We're really at the start of this disruption, which opens up the possibility of utilizing novel algorithms that exploit quantum properties, such as superposition and entanglement. These are the phenomena where a quantum bit (or qubit) can exist simultaneously in multiple probable states and where the state of one qubit can be made dependent on another that it is paired with. Quantum simulation, such as calculating energies of interaction between potential targets and potential drug molecules would be one interesting ‘native’ use case—a quantum system modeling a quantum system.”
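For readers who want to see superposition and entanglement in the smallest possible setting, the following sketch simulates two qubits on an ordinary computer using only the Python standard library. It is an illustration under stated assumptions, not a quantum chemistry calculation and not anything attributed to Roberts or Roche: the helper names (`apply_hadamard_q0`, `apply_cnot`, `probabilities`) are hypothetical, and the state is stored as four amplitudes in the order |00>, |01>, |10>, |11>.

```python
import math


def apply_hadamard_q0(state):
    """Apply a Hadamard gate to the first qubit, creating a superposition."""
    s = 1 / math.sqrt(2)
    a00, a01, a10, a11 = state
    return [s * (a00 + a10), s * (a01 + a11), s * (a00 - a10), s * (a01 - a11)]


def apply_cnot(state):
    """Apply CNOT (first qubit controls, second is target): swaps |10> and |11>."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]


def probabilities(state):
    """Return the measurement probability of each basis state."""
    return [abs(a) ** 2 for a in state]


start = [1.0, 0.0, 0.0, 0.0]           # both qubits in state 0: |00>
superposed = apply_hadamard_q0(start)  # first qubit is now 0 and 1 at once
entangled = apply_cnot(superposed)     # second qubit now depends on the first

for label, p in zip(["00", "01", "10", "11"], probabilities(entangled)):
    print(label, round(p, 3))
```

After the two gates, only the outcomes 00 and 11 have non-zero probability, each one half: measuring one qubit fixes the other, which is the dependence between paired qubits described in the quote. A classical simulation like this needs 2^n amplitudes for n qubits, which is one reason simulating large quantum systems is out of reach for conventional machines.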
It's not clear which age is next, but one thing is certain: pharma has already had a long history with data, and the scientific computing journey has only just begun.
For more from BioTeam, see the Trends from the Trenches newsletter: https://bioteam.net/newsletter/