Bio-IT Experts on Keeping Data Secure, Accessible for Research

May 31, 2022

By Allison Proffitt 

May 31, 2022 | In a wide-ranging discussion on data security this month at the Bio-IT World Conference and Expo, panelists dug into how the data we have can be better positioned to inform research in the future—and what that means for both patients and researchers.  

Jonathan Silverstein, Chief Research Informatics Officer & Professor at the University of Pittsburgh, kicked off the conversation by repeating a goal he’s been championing since 2011: data liquidity, data that goes where it is needed. There are paths toward secure data liquidity, he said, data that are both useful to research and protected at the same time. The solution, Silverstein emphasized, is a socio-technical architecture: deep policy understanding coupled with deep technical capability to allow data to flow.  

This balance of both tools and data culture came up again and again in the discussion. There is not, the panel agreed, a magic pipeline or architecture or data environment that will solve our data problems. Instead, solutions will include tools, architectures, standards, and cultural shifts in how we view and use data. The cultural shifts, Silverstein pointed out, can be particularly challenging between the research and clinical communities, who have different goals and priorities for the data they generate.  

Silverstein outlined two examples of how the University of Pittsburgh is addressing these challenges. HuBMAP—the Human BioMolecular Atlas Program—is a consortium of 60 institutions and hundreds of investigators all working together to develop tools to create an open, global atlas of the human body at the cellular level. Researchers are funded by the Common Fund at the National Institutes of Health. Data within HuBMAP fall into tiers: highly sensitive data to be used only within the consortium, metadata that everyone can use, and data meant to be consumed by approved users. The architecture is a hybrid of Amazon Web Services with on-site computing at the Pittsburgh Supercomputing Center. Globus manages authentication and authorization as well as transfer.  
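
For teams standing up a similar hybrid architecture, the flow might look roughly like the following sketch, which uses the Globus Python SDK to authenticate a user and move a dataset between a cloud collection and an on-premises HPC collection. The client ID, endpoint UUIDs, and paths are hypothetical placeholders, not HuBMAP’s actual configuration.

```python
# Minimal sketch of a Globus-managed transfer between a cloud collection and an
# on-premises HPC collection. All IDs and paths below are hypothetical.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # hypothetical registered Globus app
AWS_ENDPOINT = "aaaaaaaa-0000-0000-0000-000000000001"   # hypothetical cloud collection
HPC_ENDPOINT = "bbbbbbbb-0000-0000-0000-000000000002"   # hypothetical on-site collection

# Authenticate the user via Globus Auth (browser-based OAuth2 flow)
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
code = input("Paste authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Submit a transfer; each collection's access policy is enforced server-side
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
tdata = globus_sdk.TransferData(tc, AWS_ENDPOINT, HPC_ENDPOINT, label="dataset pull")
tdata.add_item("/datasets/sample-123/", "/scratch/hubmap/sample-123/", recursive=True)
task = tc.submit_transfer(tdata)
print("Transfer task:", task["task_id"])
```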

The Health Record Research Request (R3) service comes from the University of Pittsburgh’s Research Informatics Office; its mission is to support investigators through innovative collection and use of biomedical data. The data source layer contains personal identifiers, but the next-step aggregated layer is stripped of HIPAA identifiers. The transformation layer, Silverstein said, includes terms and value sets. Data are delivered across the institutions with single sign-on via Globus.  
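
The R3 pipeline itself isn’t described in technical detail here, but a minimal illustration of the kind of step that sits between an identified source layer and a de-identified aggregated layer might look like the sketch below. The column names, salt, and Safe Harbor-style rules are assumptions for illustration only.

```python
# Illustrative de-identification step: strip direct HIPAA identifiers and
# replace the MRN with a stable, study-specific pseudonym.
# Column names and the per-project salt are hypothetical.
import hashlib
import pandas as pd

HIPAA_IDENTIFIER_COLUMNS = [
    "name", "street_address", "phone", "email", "ssn", "mrn", "birth_date",
]
PROJECT_SALT = "r3-project-42"  # hypothetical per-project salt


def pseudonymize(mrn: str) -> str:
    """Derive a non-reversible study ID from the medical record number."""
    return hashlib.sha256(f"{PROJECT_SALT}:{mrn}".encode()).hexdigest()[:16]


def deidentify(source: pd.DataFrame) -> pd.DataFrame:
    out = source.copy()
    out["study_id"] = out["mrn"].map(pseudonymize)
    # Keep year of birth only, dropping the full date per Safe Harbor-style rules
    out["birth_year"] = pd.to_datetime(out["birth_date"]).dt.year
    return out.drop(columns=HIPAA_IDENTIFIER_COLUMNS)
```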

NIH’s Plan to Share 

While R3 manages quite a bit of data from across the University of Pittsburgh’s 40 hospitals, Rebecca Rosen, director of the NICHD Office of Data Science and Sharing, described an even greater data volume coming. The new NIH Data Management and Sharing Policy goes into effect in January 2023.  

Rosen emphasized that the new policy requires researchers to submit a data management and sharing plan. “It’s not just ‘Thou shalt share,’” Rosen pointed out. “The requirement is ‘Thou shalt plan.’ You should plan for how scientific data and any accompanying metadata will be managed and shared—so that’s planning across the research life cycle.” This plan should consider any restrictions and limitations on sharing the data and should help researchers think through allocating resources. Final plans are reviewed by NIH and, once approved, become part of the terms of any future awards.  

“In the next five years, we are going to have an incredible wealth and diversity of new biomedical research data getting out to the research community. It’s beholden on all of us to figure out how to make it both accessible and secure those data to protect the privacy of the research participants and the confidentiality of their data,” Rosen said.  

Her team has been working to set up an ecosystem to support access to and security of those coming datasets using human-centric design practices, which means getting input from community members on how they plan to use the data; using open metadata and software standards; following privacy- and security-by-design principles; ensuring that NICHD datasets and specimens are FAIR; and—importantly—leveraging the NIH Researcher Auth Service.  

The Researcher Auth Service—RAS—is a relatively new identity broker that offers single sign-on and multi-factor authentication and allows researchers to carry their approvals with them as tokens based on GA4GH standards. “They carry that token with them wherever they go,” Rosen explained. “They get a single sign-on experience, but the system that receives the token knows exactly what they are authorized to access.”  
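
A downstream repository receiving such a token unpacks the GA4GH visas it carries to decide what the researcher may access. The sketch below, using the PyJWT library, shows roughly what that inspection could look like; it skips signature verification purely for readability (a real service must verify signatures against the issuer’s keys), and the claim names follow the GA4GH Passport specification rather than any RAS-specific documentation.

```python
# Sketch of inspecting the GA4GH visas carried in a passport-style token.
# Signature verification is skipped here for illustration only.
import jwt  # PyJWT


def list_visas(passport_jwt: str) -> None:
    passport = jwt.decode(passport_jwt, options={"verify_signature": False})
    for visa_jwt in passport.get("ga4gh_passport_v1", []):
        visa = jwt.decode(visa_jwt, options={"verify_signature": False})
        claim = visa.get("ga4gh_visa_v1", {})
        print(claim.get("type"), claim.get("value"), "asserted by", claim.get("source"))
```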

Currently, RAS accepts eRA, Login.gov, and NIH logins; additional identity providers, including ORCID, will be accepted soon.  

RAS is a paradigm shift for NIH, Rosen said. Before RAS, data repositories all managed their own authentications, which was risky and time-consuming. Centralizing authorizations reduces risk to the data repositories and across the data ecosystem because NIH has visibility into researcher activities.  

Data Access Frameworks 

Rachana Ananthakrishnan, executive director of Globus at the University of Chicago, has seen many research data environments. Globus offers unified data access, data transfer and sharing, platform-as-a-service, and other services to increase the efficiency and effectiveness of researchers.  

Picking up where Silverstein left off with HuBMAP, she explored how Globus provides federated authentication and local policy enforcement to improve the end user experience. It requires that we invest time and effort in shared security standards and cohesive semantics in how authorization is expressed, she said.  

“This becomes a force multiplier,” Ananthakrishnan continued. “You really are able to leverage that and dramatically increase the reach and be able to build up collaborations with much more ease than you can if you’re going to do this each time over. RAS is an excellent example of this.”  

Rather than building secure enclaves for data, Ananthakrishnan explained, Globus takes the approach of putting policy overlays over the shared infrastructure for HuBMAP. The “secure enclave” approach results in locked away data, usability issues, and varying security standards for different users, she said. Configurable policy overlays, instead, meet the security requirements of each particular dataset, while improving user experience.  
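
As a rough illustration of a policy overlay, the sketch below uses the Globus Transfer API to grant one consortium group read-only access to a single dataset path on a shared collection, rather than copying the data into a separate enclave. The access token, collection UUID, and group UUID are hypothetical.

```python
# Sketch of a "policy overlay": a read-only access rule for one group on one
# path of a shared Globus collection. All tokens and UUIDs are hypothetical.
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER-ACCESS-TOKEN")  # hypothetical
)

rule = {
    "DATA_TYPE": "access",
    "principal_type": "group",
    "principal": "cccccccc-0000-0000-0000-000000000003",  # hypothetical consortium group
    "path": "/protected/dataset-17/",
    "permissions": "r",   # read-only for this group on this path
}
result = tc.add_endpoint_acl_rule("dddddddd-0000-0000-0000-000000000004", rule)
print("Created access rule:", result["access_id"])
```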

But Ananthakrishnan also pointed out cases where the question isn’t how much a researcher can see, but how even a little bit of accessible data can be valuable. The Cancer Registry Records for Research (CR3) includes data from several university cancer registries, and Globus wanted to tackle the problem of cohort building. Could researchers search to see if enough data exists to build a research cohort? 

With user experience as a priority, Globus used a federated data model to let institutions decide what metadata to add to secure search indexes. They found that even when groups only contributed aggregated metadata, users were still able to determine whether there were cohorts there. A key learning from this, Ananthakrishnan pointed out, was the value of differentiated access policies for metadata and data. “It really changes how people make this data findable.”  
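
A cohort-discovery query against such a federated metadata index might look like the following sketch against Globus Search. The index UUID and metadata field names are invented for illustration, and what each user actually sees is governed by the index’s visibility policies.

```python
# Sketch of a cohort-discovery query against a federated metadata index in
# Globus Search. The token, index UUID, and field names are hypothetical.
import globus_sdk

sc = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("SEARCH-ACCESS-TOKEN")  # hypothetical
)
INDEX_ID = "eeeeeeee-0000-0000-0000-000000000005"  # hypothetical registry index

query = {
    "q": "*",
    "filters": [
        {"type": "match_any", "field_name": "diagnosis", "values": ["C50"]},
        {"type": "range", "field_name": "age_at_diagnosis",
         "values": [{"from": 40, "to": 60}]},
    ],
    "limit": 1,   # we mainly want the total count, not the records themselves
}
result = sc.post_search(INDEX_ID, query)
print("Matching records across contributing registries:", result["total"])
```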

Significant Search 

Search was the key functionality that Ari Berman, CEO at BioTeam, focused on. Search is extremely complex, Berman said; it tests your data’s findability, accessibility, and interoperability all at once. And, in many cases, it is directly contrary to the goals of data security.  

Data protection laws favor privacy over scientific progress, Berman said, and in fact, most of our ecosystem is not fit-for-purpose if the purpose is scientific discovery. For example, satisfying data protection laws often demands sacrificing speed and performance from a hardware perspective. EHRs are designed for payment and compliance, not data analysis.  

Berman argued for a more modern approach to search and privacy, one that would include storage and compute systems that work for any data and platforms that support both personal health information and public data. Interoperability and standards are key, and he advocated for faster adoption of clinical data standards such as the OMOP Common Data Model (CDM) and FHIR. 
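
As one small example of what standards-based access buys, the sketch below pulls observations from a hypothetical FHIR R4 server using the specification’s standard search parameters; only the server URL is invented.

```python
# Sketch of pulling standardized clinical data over FHIR's REST search API.
# The server URL is hypothetical; the search parameters follow FHIR R4.
import requests

FHIR_BASE = "https://fhir.example.org/R4"   # hypothetical FHIR endpoint

resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={
        "code": "http://loinc.org|4548-4",   # LOINC code for hemoglobin A1c
        "date": "ge2021-01-01",
        "_count": 50,
    },
    headers={"Accept": "application/fhir+json"},
    timeout=30,
)
resp.raise_for_status()
bundle = resp.json()
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs["subject"]["reference"], obs.get("valueQuantity", {}).get("value"))
```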

Finally, Berman emphasized the value of atomic security that is object/file based and includes metadata. It’s a point that Silverstein picked up again during the Q&A. No matter where your data are stored, Silverstein said, “I think the main point is to keep the data atomically with provenance. Know what it is; leave it in its original form. That’s where everything’s going.”
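
One way to read that advice in practice is to store each file in its original form and keep a small provenance record alongside it. The sketch below is illustrative only, with hypothetical field names.

```python
# Illustrative sketch of keeping data "atomically with provenance": the raw
# file is stored unmodified, and a small metadata record travels next to it.
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def ingest(raw_file: Path, store: Path, source: str) -> Path:
    store.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(raw_file.read_bytes()).hexdigest()
    dest = store / raw_file.name
    shutil.copy2(raw_file, dest)   # keep the original form untouched
    provenance = {
        "sha256": digest,
        "source": source,
        "original_name": raw_file.name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = dest.with_suffix(dest.suffix + ".provenance.json")
    sidecar.write_text(json.dumps(provenance, indent=2))
    return dest
```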