How PetaGene 'Nailed It' With Their PetaSuite Protect

By Dan Greenfield

July 30, 2019 | In April, we announced the winners of the Bio-IT World Best of Show Awards during the Bio-IT World Conference & Expo. These awards are given with the goal of recognizing the best of the innovative product solutions for the life sciences industry on display at the conference. We wanted to highlight these products as they measurably improve workflow and capacity, enabling better research. – The Editors

With the rapid growth of genomic data and the need for access by cross-functional teams, it is vital to ensure that this sensitive data is adequately protected and monitored. This can be a significant challenge for organizations responsible for its management. It is a challenge that some do not meet, probably due in part to ignorance of the requirements and a lack of tools to enable compliance.

For example, in 2018 it was reported that 86,000 Danish blood samples from which DNA was extracted and sequenced were held by a foreign entity with no way to ensure compliance with Denmark's Personal Data Act.

Data sharing is critical to advancing scientific research. However, it is crucial to consider the minimum amount of private health data a genomic researcher requires in order to answer a specific research question. Currently, with exome and whole genome sequencing, permissions for accessing the data are granted on a whole-file or whole-object basis, unnecessarily making all the data visible. Furthermore, there is little ability to track usage.

Key questions that currently go unanswered by data stewards include:

Who is accessing what samples?
What is being accessed within these samples?
Under what authority are they accessing these regions?
What are they doing with that data?
How is compliance with applicable regulations demonstrated?

This is a matter for internal security as well as for demonstrating compliance when allowing external access, especially across jurisdictions. Fine-grained access control and the ability to audit access are key capabilities in this respect.

PetaSuite Protect enables organizations to manage access to their genomic data by internal and external teams, secured with fine-grain regional encryption and deep auditing of data usage. Moreover, it does so in a manner transparent to existing tools and pipelines, and it integrates with existing on-premises and cloud storage infrastructure with no modifications needed.

Innovative technology

PetaSuite Protect consists of three main parts:

PetaSuite - available as RPM and Debian packages, it now performs both compression and encryption of genomic data.
The PetaSuite Protect Management Server - this new element allows data stewards to set up and manage user access to genomic data files. It is accessible to data stewards and users within the organizations using the data, and optionally to external collaborators.
PetaLink library - now performs just-in-time transparent decryption as well as decompression of PetaGene-compressed genomic data into BAM and FASTQ.gz file formats.

PetaSuite Protect adds the following new capabilities to PetaSuite's already state-of-the-art compression performance:

FIPS 140-2 compliant regional AES-256 encryption for entire genomic and non-genomic files. Furthermore, unique keys per region per file can be applied to aligned genomic data.
High performance client-side decryption and decompression for scalability.
User-specific GA4GH-adherent fine-grain access permissions within the same file. For example, Alice may only have access to a subset of chromosome 1, while Bob may only have access to a subset of chromosomes 8 and 22. If they try to access other areas, they see an empty region of data.
Audit log - a searchable cryptographic ledger which records the following information every time the data is accessed:
- The identity of the user
- The file/object and region accessed by the user
- The command line of software with options used to access the data
- Date and time of access
- Data Provenance
Pipeline Protect - when an encrypted/audited file is used as an input by a process/pipeline-stage, this enforces encryption/auditing on the output file of the process.
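
The Alice and Bob example above can be illustrated with a minimal sketch. It is not PetaGene's implementation: the `Region` class, the `PERMISSIONS` table, the coordinates and the `visible_reads` function are all hypothetical names and values chosen for illustration, and the rule that a read must lie wholly inside a permitted region is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A half-open genomic interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int

    def contains(self, other: "Region") -> bool:
        # True only if `other` lies entirely inside this region.
        return (
            self.chrom == other.chrom
            and self.start <= other.start
            and other.end <= self.end
        )


# Hypothetical per-user permissions; the coordinates are made up.
PERMISSIONS = {
    "alice": [Region("chr1", 1_000_000, 2_000_000)],
    "bob": [
        Region("chr8", 0, 500_000),
        Region("chr22", 20_000_000, 21_000_000),
    ],
}


def visible_reads(user, reads):
    """Return the names of reads the user may see.

    `reads` is a list of (name, Region) pairs. Reads outside the user's
    permitted regions are simply omitted, so those areas appear empty.
    """
    allowed = PERMISSIONS.get(user, [])
    return [
        name
        for name, span in reads
        if any(region.contains(span) for region in allowed)
    ]


reads = [
    ("r1", Region("chr1", 1_500_000, 1_500_150)),
    ("r2", Region("chr8", 100, 250)),
    ("r3", Region("chr22", 20_500_000, 20_500_150)),
]
print(visible_reads("alice", reads))
print(visible_reads("bob", reads))
```

In this sketch a user with no entry in the table sees nothing at all, which mirrors the "empty region" behavior described above rather than raising an error.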

These new features complement PetaSuite's existing capabilities which include:

Transparent just-in-time access as the original file.
Full bit-for-bit preservation of the original BAM/FASTQ.gz file.
Transparent object storage integration - enabling unmodified POSIX-compliant tools/pipelines to support high-throughput access to object storage.

By adding the browser-accessible PetaSuite Protect Management Server and the encryption/decryption capabilities, PetaSuite Protect builds on existing PetaSuite compression to simplify the process of achieving compliance. It provides fine-grain AES-256 regional encryption, access control and auditing capabilities that are transparent to end-users. In this way, only the minimum necessary data is shared, and its use is deeply audited. Importantly, to assist adoption, there is no need to change existing tools and pipelines, whether data is stored on-premises or in the cloud.
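
To make the idea of unique keys per region per file concrete, the following is a rough sketch of one way such keys could be derived. It is an assumption for illustration only: the article does not describe PetaGene's key management, and `derive_region_key`, the file name and the region labels are hypothetical. The sketch derives a 32-byte key (the size AES-256 requires) from a master key using HMAC-SHA256 from the Python standard library. It performs no encryption, and a production system would rely on a vetted key-derivation function and a validated cryptographic module.

```python
import hashlib
import hmac


def derive_region_key(master_key: bytes, file_id: str, region: str) -> bytes:
    """Derive a 32-byte key bound to one file and one genomic region.

    The label joins the file identifier and region with a zero byte so
    that different (file, region) pairs cannot collide by concatenation.
    """
    label = file_id.encode("utf-8") + b"\x00" + region.encode("utf-8")
    return hmac.new(master_key, label, hashlib.sha256).digest()


# Placeholder master key for illustration; a real key would come from a
# secure random source and be held in a key-management system.
master_key = bytes(range(32))

key_chr1 = derive_region_key(master_key, "sample42.bam", "chr1:1000000-2000000")
key_chr8 = derive_region_key(master_key, "sample42.bam", "chr8:0-500000")
print(len(key_chr1), key_chr1 != key_chr8)
```

The point of the design is that handing a user the key for one region reveals nothing about the keys for other regions or other files.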

Organizations can allocate GA4GH-defined data management roles. Every user access is logged in a tamper-evident and easily searchable cryptographic ledger. Not only are user and file information recorded, but also details of which application was used for access and which genomic regions were read. Furthermore, decryption and decompression are performed on the client with a transparent high-performance library, rather than by the server. This ensures high scalability across multiple users.
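
One common way to make a log tamper-evident is to chain entries together with hashes, so that altering any past record invalidates every later one. The sketch below shows that general technique only. Whether PetaSuite Protect's ledger works this way is not stated in the article, and the field names, the `append_entry` and `verify` functions and the sample values are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(record):
    """SHA-256 over every field of the record except its own hash."""
    payload = {k: v for k, v in record.items() if k != "hash"}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(ledger, user, file, region, command, timestamp):
    """Append an access record that commits to the previous record's hash."""
    record = {
        "user": user,
        "file": file,
        "region": region,
        "command": command,
        "timestamp": timestamp,
        "prev_hash": ledger[-1]["hash"] if ledger else GENESIS,
    }
    record["hash"] = _digest(record)
    ledger.append(record)
    return record


def verify(ledger):
    """Return True if no record has been altered, removed or reordered."""
    prev = GENESIS
    for record in ledger:
        if record["prev_hash"] != prev or record["hash"] != _digest(record):
            return False
        prev = record["hash"]
    return True


ledger = []
append_entry(ledger, "alice", "sample42.bam", "chr1:1500000-1500150",
             "samtools view sample42.bam chr1", "2019-07-30T10:00:00Z")
append_entry(ledger, "bob", "sample42.bam", "chr8:100-250",
             "samtools view sample42.bam chr8", "2019-07-30T10:05:00Z")
print(verify(ledger))
```

Because each record stores the hash of its predecessor, editing an earlier entry (for example, changing which region was read) causes `verify` to fail unless every subsequent hash is also recomputed.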

With PetaSuite Protect, users see regular genomic files. When they access these files, they only see the specific regions that they have permission to view. PetaSuite Protect gives data stewards live information on how those users are using genomic data, and the ability to immediately grant or revoke access privileges.

The state of the art prior to PetaSuite Protect was to grant access to users of genomic data on a whole-file or whole-object basis, which means the person the data belongs to might be identifiable. While some file systems support auditing of access by internal users, there is very little visibility into what users are doing with this data. And when granting access to external users, there is typically no visibility at all once the data has been transferred to them. Now it is possible to actively control which regions of a genome to make visible to an individual researcher, audit their use of it and record this information for compliance purposes.

Dan Greenfield is Co-Founder and CEO of PetaGene. He has a Master’s degree in bioinformatics and a Ph.D. from Cambridge University, Cambridge. He can be reached at dan@petagene.com.