Data Vision: How Pharma’s View of Data is Changing Healthcare

By Allison Proffitt

June 8, 2021 | Kicking off the DECODE: AI for Pharmaceuticals forum held virtually this week, speakers from big pharma discussed the cultural challenges of becoming a data-driven business and how doing so might position us for the next horizon in personalized medicine.

Jacob Janey, scientific director, chemical and synthetic development at Bristol-Myers Squibb, immediately flagged the biggest hindrance to the data-driven business vision: garbage in, garbage out. “If you don’t have good data quality to begin with—and that means labeled, contextualized, annotated—then you’re probably not going to be getting a very good model,” Janey explained.

But of course that data sanitation takes time and effort. “Data scientists in general across the industry spend the bulk of their time just wrangling together disparate sources, cleaning the data, labeling the data, ensuring that they understand where it came from and how it was generated. The devil is always in those details. Hence my group is very invested in automation to help make that a little bit simpler,” he said.

Paul Bleicher, founder of PhaseForward, most recently at Optum Labs, and now principal at Evident Health Strategies, agreed, but cautioned against a data-first approach. People tend to focus first on data or algorithms, Bleicher said, missing the more fundamental question: What problem are you seeking to solve and how—if solved—will that create value or quality for the business and the patients? Only then, Bleicher said, do you begin to ask: “What data would you need? Which of the datasets that we have access to can be used? When will that data potentially create bias? Where will it create issues? Once you have that all together, figure out what algorithms and the way you’ll put it all together. It’s essential to really not only understand where you can get sources of clean and curated and quality data, but even if you have all of that: does the data fit the solution that you’re trying to create? And will it, in fact, create more bias than good answers?”

Raj Nimmagadda, global head of the R&D data office for Sanofi, turned the conversation to people. “Not all data types are the same; not all data sources give the same quality levels. We need to educate the people and set the expectations on what they can expect from these datasets,” she said. “There’s a perception in second user communities that the data quality has to be the same across the board, across data sources.”

Bleicher agreed and pointed out that the data don’t need to be the same. “There’s a tendency—especially when you’re thinking of collating data you may eventually use—to bring in as much data as possible and clean it up as much as possible,” he said, “but what Raj just pointed out is very important. It’s better to spend those economic resources where it will make a difference: to get the data clean to the point where it provides the value that you’re looking for.”

“What is good enough to solve the problem?” Janey challenged, though he pointed out that “good enough” data will depend heavily on the question you are seeking to answer or the problem you hope to solve. A dataset used to set a research direction can be much noisier or sparser than, say, a dataset answering clinical questions.

He also recommended a similar approach to algorithm choice. “People tend to jump to deep learning or neural nets when sometimes it could be a simple regression or a simple random forest, which has its own benefits,” he said, recommending the “tool that’s sufficient to the purpose.”

The panel warned of spending all of your available time, money, and resources getting datasets so beautifully cleaned that there is no bandwidth left for using and acting on the data.

Data in Clinical Research

Bleicher, specifically, has experience bringing data together from electronic health records, insurance claims, and clinical trial sites. Each data type, he said, has its own strengths and weaknesses. For example, claims data tends to be relatively structured, but even with ICD-10, Bleicher said, the granularity of claims data is lacking, limiting the questions you can ask of the dataset. EHR data, on the other hand, comprises care patterns that vary greatly from institution to institution, and an estimated 70% of the value in the EHR is in unstructured or text data.

While molecules aren’t quite as complicated as patients, Janey said, he also finds himself with a treasure trove of unstructured data. “It’s a very large institutional challenge to think about. A: how do we get stuff out that’s usable, that’s labeled and annotated and organized correctly, and B: going forward how do we avoid this?” There is a roadmap for moving forward, Janey said, but there is much hard work ahead. For instance: developing a robust ontology.

Data Governance

Sanofi’s Nimmagadda pointed out that the number of data sources and data types is overwhelming—even before you wrestle with the unstructured data. The solution, she said, is data governance. “How do we govern them? How do we ensure that we have proper policies in place to share the data?” This focus on ethics, consent, and standards isn’t just a bureaucratic process, she emphasized. “What we are trying to get with data governance is we are trying to accelerate the data access so we can get insights out of it.”

Gartner and others estimate we spend about 80% of our time “data wrangling”, Nimmagadda reported, leaving only 20% of our time to actually use the data. “In the industry, all of us are trying to change that percentage,” she said. “Data governance will help us to bring in the discipline, to bring those standards and policies to accelerate the data access.”

Carefully designed governance policies will enable better use of our data, she said. Clinical trials data, for instance, should have second-life uses. “We are spending millions of dollars in conducting clinical trials. Can we utilize that clinical [data] beyond the purpose of the clinical trial, in research, in commercial?”

Pre-Competitive Opportunities

These sorts of data governance frameworks are a prime opportunity for pre-competitive collaboration, Janey said. Models for data sharing and data governance are being pursued across pharma, but “we probably don’t have a lot of IP interest in the model itself,” he said. “I very much would like to see a lot more inter-company collaboration, pre-competitive collaboration on some of these modeling efforts. There are fantastic technologies out there to encode the data source and use encoded models,” he added. The industry’s default posture is closed data access, but Janey argued for a paradigm shift to defaulting to open access instead, limiting data access only when necessary.

“For sure industry can collaborate,” Nimmagadda agreed, but “it’s not an easy topic to tackle.” There are local policies and laws to adhere to when sharing data, she pointed out, but there are also cultural hurdles. “People think their data is safer in their own systems, not open to share,” she said, but she agreed with Janey. “By default, data should be open to all.”