New Standards, Initiatives For GA4GH
By Allison Proffitt
October 23, 2019 | BOSTON—The Global Alliance for Genomics and Health (GA4GH) yesterday announced the unanimous approval of five new standards to enable responsible international genomic data sharing. The standards were approved in September and the approval announcements were made at the organization’s 7th annual Plenary Meeting held this week in Boston.
The five new standards—Crypt4GH, Variation Representation, Phenopackets, Tool Registry Service API, and the Data Security Infrastructure Policy—were developed as part of the GA4GH Connect five-year Strategic Plan, announced in 2017.
As part of a larger suite of deliverables, these standards serve as a blueprint for a federated network of responsible, secure genomic and health data sharing. The standards address issues in data security, cloud computing, phenotype and variant data exchange, and the ethical implications of personal data use.
“The collaboration and effort the work stream contributors have put towards the production of these standards is helping all of us in the genomics community as we work to advance precision medicine,” said GA4GH Chair, Ewan Birney, in a press release.
The new standards were produced by active contributors collaborating across each of the eight GA4GH Work Streams, with input from its 23 Driver Projects, to meet the present-day needs of the genomic data sharing community.
“The newly approved standards and updates are a major milestone in our work under GA4GH Connect, and we anticipate several more standards will be approved in the coming months,” said GA4GH CEO Peter Goodhand in the same statement. “We are also launching an update to the GA4GH Connect roadmap that accelerates our goal of enabling a federated, interoperable network of genomic data tools and resources.”
Leads of each project presented their standards to the group along with many other updates from GA4GH Work Streams. Some projects are well-established, like beacon APIs. Others are new standards going up for a vote at this meeting, including GA4GH Passports, RNAget, Data Repository Service API, authentication and authorization infrastructure (AAI), and more. The newest approved standards included:
Crypt4GH
Alexander Senf, Crypt4GH Product Lead and Scientific Programmer at EMBL’s European Bioinformatics Institute (EMBL-EBI), presented the Crypt4GH update. When you’re moving from general bioinformatics data into human data, you find yourself in an environment with a lot of needs: access control, a specific regulatory environment, data security, different file locations, regulations like GDPR and HIPAA, he explained. Everything must be secure.
Crypt4GH is a new standard file container format that allows genomic data to remain secure throughout their lifetime, from initial sequencing to sharing with professionals at external organizations.
The goal is to make secure use of genomic data easier, Senf said. Analysis tools can be used without decrypting data first. Existing applications for reading and analyzing data, such as SAMtools, readily support Crypt4GH, allowing users to interact with the data while it is still in an encrypted state. In addition, data access can be granted on a very personalized level with personalized keys to unlock data, he explained.
Variation Representation
Reece Hart calls variation representation a “lingua franca for exchange of information between systems.” The standard is an extensible framework of computational models, schemas, and algorithms to precisely and consistently exchange genetic variation data across communities.
The key advantage is that the shared identifier isn’t assigned; rather, it is computed from the data itself. If you are holding an instance of variation, he explained, you can compute the identifier and everyone else in the world can compute the same identifier the same way. The caveat, of course, is that the identifier is meant more for machine use than for human readers.
Hart said that the components of the computed identifier algorithm aren’t novel—though the collection may be—but this is why the standard is easy to implement.
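The idea of an identifier computed from the data can be illustrated with a minimal Python sketch. The digest function follows the truncated SHA-512 scheme (first 24 bytes, base64url-encoded) that the Variation Representation specification describes, as far as we understand it. Everything else is an assumption made for illustration: the field names, the JSON serialization, and the `example:VA.` prefix are hypothetical and are not the standard's actual schema or serialization rules.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: SHA-512, keep the first 24 bytes, base64url-encode."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def computed_identifier(variation: dict, prefix: str = "example:VA.") -> str:
    """Derive an identifier from the content of a variation record.

    Assumption: sorted-key, whitespace-free JSON stands in for the
    standard's own canonical serialization, which is more involved.
    """
    canonical = json.dumps(variation, sort_keys=True, separators=(",", ":"))
    return prefix + sha512t24u(canonical.encode("utf-8"))


# Two parties hold the same variation, with fields in a different order.
# The field names and values are hypothetical placeholders.
site_a = {"sequence": "chr1-placeholder", "start": 100, "end": 101, "alternate": "T"}
site_b = {"alternate": "T", "end": 101, "start": 100, "sequence": "chr1-placeholder"}

id_a = computed_identifier(site_a)
id_b = computed_identifier(site_b)
print(id_a == id_b)
```

Because both parties serialize the record the same way before hashing, they arrive at the same identifier without consulting a central registry. The resulting string is opaque, which fits Hart's caveat that it is aimed at machines.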
Phenopackets
We have computable encodings for genomes but not for phenotypes, pointed out Melissa Haendel from Oregon Health & Science University, but phenotypic information is just as important. We need a FASTA format for phenotypes, she said, a standard way to share phenotypic information that is not free text, and not EHR data exported via PDF!
Phenopackets is a file format that allows phenotypic information to be represented alongside genotypic and medical information for standard phenotypic data exchange within medical and scientific settings.
Clinically, Haendel expects phenopackets to help us better characterize phenotypes: What’s not there? How are these traits linked to genomic data, family phenotype, etc.? When were phenotypes first observed and how did they change over time and with treatment? For research, she hopes phenopackets will make it easier to collate more de-identified phenotypic data from around the web and query against them.
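To show why structured phenotype records are easier to query than free text, here is a small Python sketch. The record layout is only loosely modeled on the Phenopackets idea of pairing a subject with ontology-coded features. The field names (`phenotypicFeatures`, `type`, `negated`) and the helper `packets_with_term` are assumptions for illustration and may not match any released version of the schema. The term IDs are Human Phenotype Ontology codes.

```python
import json

# Two illustrative, de-identified records (all values are made up).
packets = [
    {
        "id": "example-1",
        "subject": {"id": "patient-1"},
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001250", "label": "Seizure"}},
            # A feature that was looked for and explicitly found absent.
            {
                "type": {"id": "HP:0001263", "label": "Global developmental delay"},
                "negated": True,
            },
        ],
    },
    {
        "id": "example-2",
        "subject": {"id": "patient-2"},
        "phenotypicFeatures": [
            {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
        ],
    },
]


def packets_with_term(records, term_id, negated=False):
    """Return ids of records that list term_id as observed (or as ruled out)."""
    return [
        record["id"]
        for record in records
        if any(
            feature["type"]["id"] == term_id
            and feature.get("negated", False) == negated
            for feature in record["phenotypicFeatures"]
        )
    ]


observed = packets_with_term(packets, "HP:0001263")
ruled_out = packets_with_term(packets, "HP:0001263", negated=True)
print(observed, ruled_out)

# The records are plain data, so they survive a round trip through JSON.
restored = json.loads(json.dumps(packets))
```

Coding each feature with an ontology term, and recording absence explicitly, is what makes a question like "What's not there?" answerable by a query rather than by reading clinical notes.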
Tool Registry Service API
Denis Yuen, Ontario Institute for Cancer Research, called the Tool Registry Service API “our idea for moving bits of computation around.” The API is a standard for exchanging tools and workflows to analyze, read, and manipulate genomic data, allowing genomics researchers to bring algorithms to datasets in disparate cloud environments.
Each tool exists in a Docker-like container, Yuen said, and multiple tools can be linked together into workflows. The API already has users in DNAstack, the GA4GH testbed, Terra, Seven Bridges CGC and Cromwell.
Data Security Infrastructure Policy
The Data Security Infrastructure Policy, or DSIP, is not completely new, explained Jean-Pierre Hubaux, Swiss Federal Institute of Technology, but it was due for a substantial update and revamping. SIP, the previous iteration, was meant to be a help and a move toward concrete implementation, he said. DSIP assigns responsibilities to data security stakeholders.
DSIP is a set of security best practices for standards development and implementation within the context of GA4GH to facilitate the responsible sharing and processing of genomic data. In updating the standard, the Work Stream chose to adopt GDPR terminology.
The updated standard provides recommendations for identity management, authorization and access control, privacy protection, audit logs, and more. DSIP provides a set of guidelines enabling a safe, robust, and trustworthy tech infrastructure, Hubaux said.
The five new standards weren’t the only updates. Birney also shared that Heidi Rehm, Massachusetts General Hospital and the Broad Institute, and others are working on a gap analysis effort for GA4GH. While the alliance has historically been driven by bottom-up driver projects, the gap analysis effort will assess “whether we have all that we need to build a functioning ecosystem,” Birney explained.
The strategic roadmap committee has conducted 14 calls with work streams and driver projects and issued a survey to the community, Rehm explained. While it’s still early, she did highlight some of the emerging themes coming from the conversations.
Among them: a needed focus on implementation support. We aren’t implementers, Rehm emphasized, but we want to facilitate implementation and make sure our standards are functional. Many work streams and driver projects called for more collaboration across work streams to enable an interoperable suite of standards. And more guidance was requested around data governance approaches.
A white paper is coming, Rehm said, on different aspects of federation: What does federation really mean? When should data be federated and when not? Can we still centralize some datasets?