The Great Data Science Promise For COVID-19, Opioid Addiction
By Allison Proffitt
October 12, 2020 | The Bio-IT World Conference & Expo Virtual launched last week with an opening plenary keynote from Drs. Susan Gregurick, Associate Director for Data Science at NIH, and Rebecca Baker, Director of the HEAL Initiative at NIH.
Gregurick opened the first virtual Bio-IT World event by painting a picture of the industry poised at the starting line of FAIR (findable, accessible, interoperable, and reusable) data. Wouldn’t it be wonderful, she said, if we could pull data out of papers directly into our own Jupyter Notebooks or Galaxy, or apply algorithms from GitHub?
Of course, that is rarely possible now. Instead, a large fraction of the data we generate and publish is left on the table, she said. Measuring and quantifying data use and reuse is tough, Gregurick acknowledged, but she estimated that probably 80% of data generated are not really reused in an effective way. That’s a shame, she lamented. Experiments are expensive.
Our view must shift, she argued, to viewing metadata as a “love letter to our future selves” and future researchers—making future research and replication easier. HEAL has done a particularly good job making metadata super rich and standardized in a way that can be computable, Gregurick pointed out.
She’s heard the arguments against keeping data: “Just rerun the experiment!” many say. But we are “data packrats,” Gregurick acknowledged, and the data that scientists generate are the tangible product of the work. Data and metadata provide instant-in-time snapshots of where science stands today, and publishing those data now informs the future.
At NIH, 12 institutes and centers focusing on imaging, genomics, biophysics, biomolecular simulation, and more are working together to make this process more sustainable. For instance, NIH hosts nearly 37PB of genomics data from the Sequence Read Archive on two clouds: Google Cloud Platform and Amazon Web Services, giving researchers faster access to large datasets so that data can be shared easily and connected to a variety of compute resources.
NIH is working on new ideas as well, Gregurick said, including new infrastructure, new tools to let people participate in their health, new approaches to automated health, and new medical data interpretation technologies. Some of these are coming soon; Gregurick said to expect new applications in early winter.
At a conference that is newly virtual, COVID-19 was never far from anyone’s thoughts. NIH is collecting COVID-19 data in BioData Catalyst, All of Us, COVID Cohort Collaborative, Alzheimer’s Disease Sequencing Project, and the Medical Imaging and Data Resource Center (MIDRC), Gregurick said. The challenge now is delivering all of those data to researchers to use. NIH is moving toward a connected data platform ecosystem for COVID-19 data, she explained.
NIH is launching the Research Authentication Service (RAS), a single sign-on effort to save researchers time and effort using standards created by GA4GH and others. In August, NIH deployed a RAS-dbGaP Visa and associated services that allow researchers to log in to RAS one time to access any integrated repository and run an analysis for up to 15 days without re-authenticating.
This is an important first step, Gregurick said, but there is much more to be done. NIH will be piloting a system to link data across platforms via an honest broker—across NIH sites like BioData Catalyst, or even between consumer data sites. “This is not new,” Gregurick said, but “it’s something we’ll be piloting across a number of our systems, to really make de-duplication and linkage of data across a federated system a much more feasible endeavor.”
And she also highlighted “some really interesting work” from GA4GH: their Federated Analysis Systems Project, FASP, and emphasized that there is more to be done to enable workbenches and tools across platforms to be used with Jupyter, GitHub and other resources. “Most difficult is the dynamical and always-fluctuating policy with data access,” she said. “Policy resources for data access are important!”
Finally, she called for information systems that allow users to navigate not just data but information across systems. “All of these make interoperability across data platforms a reality for the future.”
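As a rough illustration of the honest broker idea Gregurick mentioned, the sketch below links records from two sites by comparing keyed hashes of normalized identifiers rather than the identifiers themselves. It is not NIH's design: the field names (`first`, `last`, `dob`), the HMAC-SHA256 token scheme, and both function names are assumptions made for this example, and a real system would need much more careful matching and key management.

```python
import hashlib
import hmac


def link_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Derive a keyed hash from normalized identifiers (hypothetical scheme)."""
    # Trim and lowercase so trivial formatting differences do not block a match.
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(secret_key: bytes, site_a: list, site_b: list) -> list:
    """Return (id_from_a, id_from_b) pairs whose tokens match exactly."""
    tokens_a = {
        link_token(secret_key, r["first"], r["last"], r["dob"]): r["id"]
        for r in site_a
    }
    matches = []
    for r in site_b:
        token = link_token(secret_key, r["first"], r["last"], r["dob"])
        if token in tokens_a:
            matches.append((tokens_a[token], r["id"]))
    return matches
```

In this sketch only the broker would hold `secret_key`, so neither site could recompute the other's tokens. Exact-match hashing misses typos and name changes, which is one reason linkage across a federated system is harder than it looks.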
Rebecca Baker, Director of the HEAL Initiative at NIH, is hoping to live in that future. Five years ago, she said, if she had asked her team to make sure all the data they gathered could be reused, “It would have felt too hard!” Now, though, NIH has many of the foundations in place for data sharing and reuse to be accessible. “The technical challenges feel manageable.”
HEAL’s work focuses on ending opioid misuse, addiction, and overdose. Overdose deaths increased 4.6% in 2019, and the earliest, provisional data since the COVID-19 pandemic suggests a 50% year-over-year increase in overdose in 2020. Opioid misuse and overdose—which was named a public health emergency in 2017—is only getting worse.
Along with opioid misuse are sister crises. Upstream: chronic pain, Baker said. Most opioid addiction begins with a legitimate prescription for pain management, and so addressing the opioid crisis must begin with the pain crisis and how we treat and medicate pain. And downstream: the number of infants born opioid dependent each year. The HEAL Research Programs include $50M a year for prevention research, translational research, clinical trials, and implementation science.
HEAL’s work, therefore, is broad and varied: extended-release addiction medications, alternative pain treatments, clinical trials on how to best nurture babies born with opioid dependence and their mothers, and much more. The bolus of data that HEAL is generating is extremely diverse, Baker said: clinical trial data, behavioral data, genomic data, imaging data, demographic and social determinant data, and much more.
Baker is working on harmonization of data collection across HEAL, beginning with implementing common data elements (CDEs) to facilitate cross-study comparisons. They started with a core set of CDEs—nine pain domains and chronic and acute pain questionnaires. The supplemental set of CDEs is much broader: 375 supplemental questionnaires and 360 additional measures.
It is with these CDEs that the challenges of data re-use arise, Baker said. HEAL wants to make its data FAIR including using unique identifiers, labeling data, using open systems to make data easily retrievable, and being “creative about which parts we make interoperable.” But the cultural challenges still require finesse. Researchers come to studies with a great deal of expertise, Baker said, and many are very set on their study questions and endpoints. We have to find a way to bring them together, she said: marrying their questions with HEAL’s CDEs.
It’s a cultural change, she said, for researchers to run their experiments and ask their own questions, all while NIH encourages adding on additional capabilities so the research can be enriched over time.
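One way to picture that marriage of study-specific questions and shared CDEs is a mapping step that renames each study's local fields to common element names while keeping everything else intact. The sketch below is purely illustrative: the study names, field names, CDE names, and the `harmonize` function are all hypothetical and are not drawn from HEAL's actual data dictionary.

```python
# Hypothetical mapping from each study's local field names to shared CDE names.
CDE_MAP = {
    "study_a": {"pain_0_10": "pain_intensity", "sleep_hrs": "sleep_duration_hours"},
    "study_b": {"PainScore": "pain_intensity", "HoursSlept": "sleep_duration_hours"},
}


def harmonize(study: str, record: dict) -> dict:
    """Split one record into shared CDE fields and study-specific fields."""
    mapping = CDE_MAP[study]
    cde_fields = {}
    study_specific = {}
    for field, value in record.items():
        if field in mapping:
            # Rename the local field to its common data element name.
            cde_fields[mapping[field]] = value
        else:
            # Keep the investigator's own measures untouched.
            study_specific[field] = value
    return {"study": study, "cde": cde_fields, "study_specific": study_specific}
```

With records in this shape, the `cde` portion could be compared across studies while each team's own endpoints stay available under `study_specific`.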
HEAL has published a public access and data sharing policy that requires immediate sharing of all underlying data at the time of publication, and is taking the first steps to launch a Gen3 platform, allowing investigators and data generators to submit their data into data management organizations that can be put into a single platform.
These steps will all empower HEAL to begin asking new kinds of innovative questions, Baker said. For instance, how do data on opioid misuse and addiction overlap with prescription data? Which patients are most likely to develop opioid use disorder? Which are more likely to overdose?
“I think we’ve made progress. We have a long way to go, but to me the future looks bright!” Gregurick said.
Editor’s Note: Even if you missed the start of the event, Bio-IT World Conference & Expo Virtual is still live. Register now for on-demand presentations.