The Role Of Bioinformatics In Bringing Biobanks To The Forefront Of Scientific Breakthroughs

Contributed Commentary by Zachary Pitluk, Paradigm4

January 21, 2022 | Life science researchers and drug discovery scientists today work in what, in my view, will be remembered as the beginning of a “golden era”: one in which the power of genetic information is connected to a wide range of other healthcare and lifestyle data. That connection supports a better understanding of disease etiology, the identification of candidate drug targets, and the discovery of new biomarkers that can predict the presence or future development of a disease in an individual patient.

Leading voices have been loud in this regard in recent years, and now, several developments have come together to change the game in practice.

Crucially, biobanks have developed rapidly. They typically aggregate genetic information such as WGS (whole genome sequencing), WES (whole exome sequencing), and SNPs (single nucleotide polymorphisms) with a range of other data on the same individuals: health records like GP data, hospitalizations, diagnoses, prescriptions, MRI (magnetic resonance imaging), lab results from biochemistry and haematology, and patient-reported information such as family history, behavioural history, and socio-demographics. The number and diversity of individuals included has grown, too. Many countries either have or are developing national biobanks, including the UK (UK Biobank), China (Kadoorie), Japan (Jenger), the US (All of Us) and Finland (FinnGen). The UK Biobank (UKBB) has 7,400 categories of phenotypes along with SNP and WES data from 500,000 participants. Organizations are also now working to facilitate more collaboration between biobanks. For example, the International Hundred Thousand Plus Cohort Consortium (IHCC) brings together over 100 cohorts in 43 countries comprising over 50 million participants. (https://doi.org/10.1016/s2589-7500(20)30242-9)

The resulting massive multi-dimensional data sets can be used for computing large association analyses to relate specific genetic variations and specific phenotypes with susceptibility to, or protection from, certain diseases.
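As a rough illustration of what a single association test involves, the sketch below compares how often a phenotype appears in carriers and non-carriers of one variant, using a 2x2 table. The function names and all counts are hypothetical and are not drawn from any biobank. Real analyses typically use regression models that adjust for covariates such as age, sex and ancestry, so this is a simplified sketch rather than the method used in any study mentioned here.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table.

    a: carriers with the phenotype      b: carriers without it
    c: non-carriers with the phenotype  d: non-carriers without it
    Returns (chi-square statistic, p-value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one individual")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio(a, b, c, d):
    """Odds of the phenotype in carriers relative to non-carriers.

    Raises ZeroDivisionError if b or c is zero.
    """
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
chi2, p = chi_square_2x2(20, 80, 10, 90)
print(f"odds ratio = {odds_ratio(20, 80, 10, 90):.2f}, "
      f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

A biobank-scale analysis repeats a test of this kind for every variant and phenotype pair of interest, which is where the computational load comes from.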

A changing landscape

Against this background, it is not surprising that algorithms, database platforms and IT tools have needed to evolve, too—and the role of the bioinformatician is now, more than ever, center stage.

If I think back just a year or so, the landscape of bioinformatics possibilities was dramatically different to what we have today. However, despite many advances, challenges remain to marry computer science expertise with biological insights.

There is a significant risk that researchers spend too much time navigating complex data management, like ensuring CFR compliance when errors occur, and optimizing scalable computing, to the detriment of their scientific output. Many common genomic data analysis methods are becoming increasingly demanding computationally: for example, genome-wide association studies (GWAS) and phenome-wide association studies (PheWAS), which are used to identify potential connections between genetic variants and phenotypes. A data analytics engine that offers workflow or dataflow optimization without requiring user involvement would be a valuable option.
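To give a sense of the scale involved, the back-of-the-envelope sketch below counts the tests needed to pair every variant with every phenotype, and the Bonferroni-corrected significance threshold that follows. The 7,400 phenotype categories come from the UK Biobank figure quoted earlier. The variant count of one million is purely an assumption for illustration, and the function names are hypothetical.

```python
def count_association_tests(n_variants, n_phenotypes):
    """Number of tests when every variant is tested against every phenotype."""
    return n_variants * n_phenotypes


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


N_PHENOTYPES = 7_400      # phenotype categories quoted for UK Biobank
N_VARIANTS = 1_000_000    # assumed variant count, for illustration only

n_tests = count_association_tests(N_VARIANTS, N_PHENOTYPES)
threshold = bonferroni_threshold(0.05, n_tests)
print(f"{n_tests:,} tests, per-test threshold = {threshold:.2e}")
```

Bonferroni is the simplest and most conservative correction. It is used here only to show how quickly the number of tests and the stringency of the threshold grow.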

Organizational And Individual Challenges

What I frequently see is that organizations struggle to provide their scientists with a unified platform for systemic analysis of biobank data at multiple levels simultaneously, in a way that balances respecting patient privacy and sharing data appropriately with other collaborators. Traditional database management systems, distributed computing systems, data lakes and file-based systems all exhibit constraints that limit the value to be gained from genomic biobank analysis. For example, they are often too slow to process heterogeneous scientific datasets, and a single powerful server could easily spend weeks computing the results of one large-scale GWAS or PheWAS study. Furthermore, as systems grow in complexity, databases can easily become isolated and data sharing between systems is restricted.

In addition, when speaking to individual researchers, it’s clear that they require a solution that can reduce the time-to-result by orders of magnitude and provide them with the agility to allow many computations to be completed interactively within a single workday—and at a reasonable, predictable cost.

Bringing Biobank Data To Bear On Bartter’s Syndrome

I can bring this down to a really practical level by relating a conversation I had recently with an old friend, currently working in the Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America. He told me about a project he was involved in that looked at individual amino acid variations on a protein that have been shown to have a connection to Bartter’s syndrome, a group of rare genetic disorders in which there are specific defects in kidney function caused by defective ion transport mechanisms. There are currently no treatment options available.

He explained that the renal outer medullary potassium (ROMK) channel protein is encoded by a single gene (KCNJ1) and is essential for potassium transport in the kidney. Studies by the group (https://doi.org/10.1371/journal.pcbi.1007749) had employed a ROMK-dependent yeast growth functional assay and tested single amino acid variants selected by a series of computational tools, representative of different approaches, to successfully predict each variant’s pathogenicity.

In our discussion, we considered how amino acid variants might be identified and tested to see if they were implicated in the disease. Traditionally, in silico predictions or costly and time-consuming ‘divination/trial and error’ approaches, based on structural analysis of protein regions thought likely to be where significant amino acid variations might occur, are the norm. But we hypothesized that the power of biobank data might help to answer the key question: “Are there uncharacterized disease-associated ROMK mutations of KCNJ1 in the UK Biobank database?”

To test this, association analysis of a subset of around 337,000 individuals in the database was completed. Three novel, uncharacterized variants were identified. They were all characteristic of classic clinical manifestations of Bartter’s syndrome and subsequently confirmed to be deleterious, with one showing a growth defect in the yeast ROMK functional assay. The results were presented as a poster at the recent ASHG meeting.

Computer Science Delivering Biological Insight

Importantly, this is just one of many examples I have seen—or been part of—where, with the right bioinformatics tools, the power of biobank data can make a valuable contribution to answering fundamental questions.

The team at Pittsburgh plans to continue to explore biobank data as they look to identify additional disease-associated variants, and I am excited to follow how they, and others who have the same vision for the future of bioinformatics, continue to drive advances in disease understanding, drug development, and healthcare outcomes.

Zachary Pitluk, PhD, VP of Life Sciences at Paradigm4 (www.paradigm4.com), has worked in sales and marketing for 23 years, from being a pharmaceutical representative for BMS to management roles in Life Science technology companies. Since 2003, his positions have included VP of Business Development at Gene Network Sciences and Chief Commercial Officer at Proveris Scientific. Zach has held academic positions at Yale University Department of Molecular Biophysics and Biochemistry: Assistant Research Scientist, NIH Postdoctoral Fellow and Graduate Student, and has been named as co-inventor on numerous patents. He can be reached at zpitluk@paradigm4.com.