Drug Repurposing Strategy Without An ‘Answer Book’

By Deborah Borfitz

June 7, 2022 | Scientists working in the information biology lab at Hokkaido University (Japan) have come up with a drug repurposing strategy combining machine learning, unsupervised clustering of gene expression, and two-stage prediction. Machine learning, as used here, differs from standard approaches where performance of the method is measured by “repeated partitioning of data to see [the] distribution of sensitivity and specificity of prediction using known data,” according to team member Toshinori Endo, Ph.D., professor of information science and technology. “We wanted to explore an alternative that is independent of [an] answer book.”

Typically, the “hidden assumption” is that training data has been used to teach a machine learning algorithm to predict the outcome the model was designed to predict, he says. But for unsupervised learning, determining the boundaries between classes of data can be a bit more complicated. “Lack of apparent criteria is normally the case for real applications and that’s why a data analyst wants to apply machine learning, ironically.”

The new drug repurposing method, recently described in an article appearing in OMICS: A Journal of Integrative Biology (DOI: 10.1089/omi.2022.0026), may not be conclusive but it can “reasonably narrow the space for wet experiments,” says Endo.

For the study, the research team first used Uniform Manifold Approximation and Projection (UMAP), an unsupervised machine learning technique for dimension reduction, to form “dim clumps” from abnormal gene expression data, he explains. They then determined the boundaries by using the k-means method with Silhouette analysis that suggested the optimal number of classes. This served as “an objective viewpoint without a priori knowledge.”
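
To make the boundary-finding step concrete, here is a minimal sketch of choosing the number of clusters by mean silhouette width. It is not the team's code: the two-dimensional points stand in for a UMAP embedding (UMAP itself is not run here), the toy coordinates are invented, and the farthest-point initialization is an assumption chosen only to keep the example deterministic.

```python
import math

def kmeans(points, k, n_iter=100):
    """Lloyd's k-means on (x, y) tuples; returns one cluster label per point."""
    # Farthest-point initialization (an assumption, used for determinism).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette width; points in single-member clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, candidates):
    """Score each candidate k and return the best one with all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy stand-in for a 2-D embedding: three tight, well-separated clumps.
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
offsets = [(-0.4, 0.0), (0.4, 0.0), (0.0, -0.4), (0.0, 0.4), (0.2, 0.2), (-0.2, -0.2)]
embedding = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_k, scores = choose_k(embedding, range(2, 6))
print(best_k, scores)
```

On this toy embedding the silhouette score peaks at three clusters, which is the sense in which the analysis suggests a class count without prior labels. Real expression data would rarely separate this cleanly.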

Next, gene sets were filtered by statistical significance within clusters to search a drug database for compounds reversing the abnormal expression of genes, Endo continues. This drastically reduced the number of genes, making the task of finding and examining candidate compounds a “lightweight” search. For the final qualification step, the widely used Gene Ontology was searched to identify the biology likely responsible (i.e., sets of gene functions, subcellular localizations, or pathways).
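
The filtering idea can be illustrated with a short sketch. The article does not say which statistical test the team used, so the Welch t statistic, the cutoff of 4.0, the gene names, and the expression values below are all assumptions made for illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # no spread in either group; treat as uninformative
    return (mean(a) - mean(b)) / se

def split_signature(disease, control, t_cutoff=4.0):
    """Return (up, down) gene lists whose |t| meets the hypothetical cutoff."""
    up, down = [], []
    for gene in disease:
        t = welch_t(disease[gene], control[gene])
        if t >= t_cutoff:
            up.append(gene)
        elif t <= -t_cutoff:
            down.append(gene)
    return up, down

# Invented expression values for disease samples in one cluster and for controls.
disease = {
    "GENE_A": [8.1, 7.9, 8.3, 8.0],
    "GENE_B": [2.0, 2.2, 1.9, 2.1],
    "GENE_C": [5.0, 5.4, 4.7, 5.1],
}
control = {
    "GENE_A": [5.0, 5.2, 4.9, 5.1],
    "GENE_B": [4.9, 5.1, 5.0, 5.2],
    "GENE_C": [5.1, 4.8, 5.3, 5.0],
}

up, down = split_signature(disease, control)
print(up, down)
```

Only genes that differ strongly from controls survive, which is why the later compound search works from a much shorter list. A real analysis would also need a multiple-testing correction across tens of thousands of genes, which this sketch omits.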

The gene enrichment analysis step is not mandatory, he later adds, but “meaningful to see what happened inside.”

The Shortcut

So, why is this technique coming onto the scene only now? To answer, Endo quotes from Larry Wall’s “Programming Perl” (Perl being a general-purpose programming language): “Three great virtues of a programmer are laziness, impatience, and hubris.”

Perhaps more to the point, he adds, would be “Let the machine do the dirty work,” as put by Brian W. Kernighan and P. J. Plauger in “The Elements of Programming Style.” Unfortunately, biological data demands a good grasp of biological phenomena to interpret output performance.

Endo has done wet-lab work in molecular biology and data analysis in molecular evolution where he has seen accuracy and precision suffer despite the often-selfless devotion of scientists to getting it right. His takeaway was that “biological data should be carefully examined without assumption of reliability” and be evaluated from different axes. Although seemingly tedious, this is in fact a “shortcut,” he says.

Core Differences

Using this newfound shortcut, Endo and his colleagues analyzed gene expression data from 262 cases of 31 diseases and 268 controls to identify 20 drugs for repurposing against three diseases: inclusion body myositis, polymyositis, and dermatomyositis, all inflammatory conditions that cause muscle weakness. They were selected from the results of k-means clustering based on the overlap in expression of gene sets, most related to immune responses.

“A living cell expresses tens of thousands of genes for every moment and the spectrum of gene expression is expected to be different between normal and disease conditions,” he explains. “The spectra of gene expressions bear some fluctuation and error, [so] it is important to focus [only] on the significant core differences.”

UMAP projected the differences and similarities among diseases, each described by tens of thousands of gene expression values, onto a two-dimensional plane so researchers could focus on the major discrepancies, says Endo. Interestingly, UMAP often magnifies local differences in gene expression like a fisheye lens, which he likens to cognition by the human brain.

In one of the gene expression clusters suggested by k-means, he notes, genes were identified with significant differences between normal and disease conditions. The LINCS L1000 characteristic direction signatures search engine was used for the subsequent hunt for drugs to reverse abnormal gene expression.
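
The notion of a drug “reversing” abnormal expression can be sketched as a simple overlap score. This is a toy illustration and not the LINCS L1000 search engine or its scoring method; the function names, gene labels, and compound signatures are hypothetical.

```python
def reversal_score(disease_up, disease_down, drug_up, drug_down):
    """Fraction of disease genes that the drug pushes in the opposite direction."""
    disease_up, disease_down = set(disease_up), set(disease_down)
    total = len(disease_up) + len(disease_down)
    if total == 0:
        return 0.0
    # A gene counts as reversed if it is up in disease and down under the drug,
    # or down in disease and up under the drug.
    reversed_genes = len(disease_up & set(drug_down)) + len(disease_down & set(drug_up))
    return reversed_genes / total

def rank_drugs(disease_up, disease_down, drug_signatures):
    """Sort compounds from most to least reversing."""
    scored = [
        (name, reversal_score(disease_up, disease_down, sig["up"], sig["down"]))
        for name, sig in drug_signatures.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented disease signature and compound signatures.
disease_up = ["G1", "G2", "G3"]
disease_down = ["G4", "G5"]
drug_signatures = {
    "compound_x": {"up": ["G4", "G5"], "down": ["G1", "G2"]},
    "compound_y": {"up": ["G1"], "down": ["G4"]},
    "compound_z": {"up": ["G4"], "down": ["G3", "G9"]},
}

ranking = rank_drugs(disease_up, disease_down, drug_signatures)
print(ranking)
```

Here the hypothetical compound_x ranks first because it counteracts four of the five disease genes, while compound_y, which moves genes in the same direction as the disease, scores zero.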

The drug repurposing strategy could broadly inform life sciences discoveries, clinical trials, and translational medicine, says Endo. “Once a pattern of differential gene expressions is obtained for a medication, candidates of applicable disease can be searched.”

Next Steps

The team is now considering applying their two-stage drug repurposing method to other public data in hopes of finding other candidates to construct a database system, Endo says. “By expanding the obtained knowledges, we may proceed to design drugs from scratch.”

Identifying the “hidden core of [an] expressed gene set” representing the dynamics of cell status could be a future target, he adds. Fluctuations in biological data can sometimes be “just a noise” reflecting a periodic cycle of cell status or response to an external signal.

“Single-cell RNA analysis will bring a variety of such data, which need classification depending on not only cell type but also the status of [the] cell,” says Endo. “Protein expression and its modification, as well as degradation, will also be needed to reveal the mechanisms of biological phenomena [with] the support of machine learning.”

One potentially major adoption hurdle with the drug repurposing technique is that some public data will first require inspection and reformatting, which is often hard to automate, he adds. Individual steps of the methodology are not particularly difficult.

The inspection and reformatting steps for the latest study were done manually, “one by one,” to confirm the data were processed properly, says Endo. This was in part because the task fell largely to a master’s student and an undergraduate who had little experience with this type of research and programming.

Constructing an automated pipeline could help others use the method, he says. But that will require the assistance of an experienced bioinformatician.