Is Precision Medicine Ready For Data From Billions Of Cells?
Contributed Commentary by Zachary Pitluk
February 3, 2021 | Popular opinion is that two big shifts have changed the game for bioinformatics. The first is a technology development, single cell ‘omics, that began just over 10 years ago. The other is a more recent move to a new model of drug discovery that looks to leverage the vast amounts of human genetic data now being generated in support of precision medicine.
Life science research continues to be transformed by single cell ‘omics, notably genomics, transcriptomics, epigenomics, and proteomics at the single-cell level. The technique for examining all the gene products in one cell at one timepoint—single-cell RNA sequencing (scRNA-seq)—was established in the Gurdon Institute at the University of Cambridge, UK, in 2009. Fast forward to 2019, when The Klarman Cell Observatory at the Broad Institute profiled 29.4 million cells!
However, despite this significant advancement, it is vitally important to realize that current single-cell datasets have been generated from a small number of individuals, and that statistical significance relies on the number of patients studied, rather than the number of cells. This is because cells from the same patient are “siblings” and not true biological replicates. Gene expression is influenced by genetic diversity, and the need to sample that diversity well is the reason the UK Biobank and other biobanks enroll such large numbers of individuals.
Consider this: the Immune Cell Survey in the Human Cell Atlas (HCA consortium) currently contains 780,000 cells. But these are from only 16 individuals. The HCA itself has fewer than 400 patients in total, with very few donating cells from more than one organ system.
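To make the replicate argument concrete, here is a minimal sketch using the standard design-effect approximation for clustered samples: the effective sample size is N / (1 + (m - 1) * rho), where m is the average number of cells per donor and rho is the correlation between cells from the same donor. Only the cell and donor counts come from the figures above. The function name, the equal-cluster-size assumption, and the correlation values are illustrative assumptions, not measurements from the HCA.

```python
def effective_sample_size(n_cells: int, n_donors: int, icc: float) -> float:
    """Approximate number of independent observations in a clustered sample.

    Uses the Kish design effect, 1 + (m - 1) * icc, where m is the mean
    number of cells per donor. Assumes every donor contributes the same
    number of cells, which is a simplification.
    """
    if n_cells <= 0 or n_donors <= 0:
        raise ValueError("n_cells and n_donors must be positive")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    cells_per_donor = n_cells / n_donors
    design_effect = 1.0 + (cells_per_donor - 1.0) * icc
    return n_cells / design_effect


if __name__ == "__main__":
    # 780,000 cells from 16 donors, as quoted above.
    # The icc values are hypothetical and chosen only to show the trend.
    for icc in (0.0, 0.01, 0.1, 1.0):
        n_eff = effective_sample_size(780_000, 16, icc)
        print(f"icc={icc:<5} effective sample size ~ {n_eff:,.0f}")
```

With no within-donor correlation the effective sample size equals the number of cells, and with perfect correlation it collapses to the 16 donors. Under this approximation, even a small correlation pulls the figure far closer to the donor count than to the cell count.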
The ‘cells-patients-data’ relationship becomes more demanding still as researchers study the evolution of cell populations and the effects of treatments over time. On top of this mind-bending scale, we must also consider the complexity introduced by the expanding matrix of data types being combined: data from imaging, wearables, and biochemical measurements, for example.
The bottom line is that IT tools to manage billions of cells are essential if we are to achieve datasets covering the hundreds of thousands of patients and treatment conditions needed for statistical power in pharma R&D and precision medicine.
So, Is The Bioinformatics World Ready?
Looking to the literature for a consensus on the challenges that need to be addressed to cope with this avalanche of data, I often return to three significant publications: the 2017 whitepaper from the HCA consortium, a provocative opinion piece from 2018 that discusses the need for data triangulation between different measurement modalities to test causal hypotheses (DOI: 10.1093/ije/dyw314), and a thoughtful review entitled “Eleven grand challenges in single-cell data science” published in 2020 by an eminent group of researchers (DOI: 10.1186/s13059-020-1926-6).
The message coming out of these seminal publications is clear: we need a rethink. They characterize the current state of the art as a series of craft methods that need to be transformed into more robust, higher throughput, and reproducible workflows if they are to support future research goals. For example, there is a growing requirement to juxtapose datasets from different biomolecules/assay modalities (such as imaging data with RNAseq data), but current approaches cannot combine them and so lose physiological context. In fact, most of these tools lack even the simpler ability to compare the same modality across different studies.
Moreover, current methods require repetitive extract/transform/load operations, adding time and costly computational overhead with every question asked of the data. There is no opportunity to load and QA the data once and then interrogate it often, or to run at large scales on cost-effective hardware.
Legacy systems also extend analysis time and, at worst, tools will not run without significant modification. In fact, some commonly used algorithms do not scale beyond certain low patient number limits, in certain cases fewer than 20. Researchers are often forced to use less accurate methods that combine cells before differential gene expression analysis. This reduces the decision-making power of the data and crushes the ability to precisely define cell types and states.
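Combining cells before differential expression is commonly called pseudobulk aggregation. The following is a minimal sketch of the idea, not the method of any particular tool. The function name, the in-memory tuple layout, and the toy counts are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-cell gene counts into one profile per (donor, cell_type).

    `cells` is an iterable of (donor, cell_type, {gene: count}) tuples,
    a hypothetical layout used only for this illustration.
    """
    profiles = defaultdict(lambda: defaultdict(int))
    for donor, cell_type, counts in cells:
        for gene, count in counts.items():
            profiles[(donor, cell_type)][gene] += count
    # Convert nested defaultdicts to plain dicts for the caller.
    return {key: dict(genes) for key, genes in profiles.items()}


if __name__ == "__main__":
    # Toy counts, invented for the example.
    toy_cells = [
        ("donor1", "T cell", {"GENE_A": 0, "GENE_B": 7}),
        ("donor1", "T cell", {"GENE_A": 9, "GENE_B": 1}),
        ("donor2", "T cell", {"GENE_A": 4, "GENE_B": 5}),
    ]
    for key, profile in pseudobulk(toy_cells).items():
        print(key, profile)
```

In the toy data the two donor1 cells have very different GENE_A counts, but after aggregation only their sum remains. That cell-to-cell variation is exactly the information needed to separate closely related cell types and states.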
New Structure, New Capabilities
A recent preprint from my company and researchers at BMS illustrates how rapid insights and understanding can be gained from new IT tools (DOI: 10.1101/2020.06.24.169730).
Faced with a large collection of single cell analyses from various tissues relevant to COVID-19 infection, we created and curated a dataset populated with COVID Cell Atlas data and Human Cell Atlas data. The data (2.2 million cells, 32 projects) were all normalized in order to establish the cellular and tissue distribution of the disease targets, and to allow searching for the distribution of transcripts for ACE2 and TMPRSS2, the receptor and the priming protease, respectively, that SARS-CoV-2 uses to enter cells.
In this case, searches completed within seconds, a time scale appropriate for precision medicine decision making.
Zachary Pitluk, PhD, VP of Life Sciences at Paradigm4, has worked in sales and marketing for 23 years, from being a pharmaceutical representative for BMS to management roles in life science technology companies. Since 2003, his positions have included VP of Business Development at Gene Network Sciences and Chief Commercial Officer at Proveris Scientific. Zach has held academic positions in the Yale University Department of Molecular Biophysics and Biochemistry (Assistant Research Scientist, NIH Postdoctoral Fellow, and Graduate Student) and has been named as co-inventor on numerous patents. He can be reached at firstname.lastname@example.org.