Understanding Proteins With Natural Language Processing

By Deborah Borfitz

April 28, 2021 | New research from the departments of chemistry and physics at Cambridge University has just helped demonstrate the potential of natural language processing (NLP) for understanding biomolecular condensates. Whether proteins undergo phase separation into membrane-less compartments, and under what conditions, is governed by “surprisingly rigorous grammar,” according to Tuomas Knowles, professor of physical chemistry and biophysics at the university and co-founder of Transition Bio, Inc., a company formed last fall to build better technologies for studying biomolecular condensates.

As described in an article newly published in PNAS (DOI: 10.1073/pnas.2019053118), the grammar lessons included an “astonishing amount” of physics and chemistry that did not have to be ingested by the NLP algorithm, named DeePhase, he says. Researchers just asked the algorithm to learn the language of proteins and it outperformed several existing machine learning methods for predicting protein liquid–liquid phase separation (LLPS) using publicly available datasets.

The grammar rules referenced here describe the conditions and ordinal constraints around how cellular proteins form into biomolecular condensates. Understanding the molecular grammar of proteins, and spotting the mistakes, is a fascinating problem in fundamental sciences but also represents the first step toward the goal of developing drugs to make the needed corrections, Knowles says. The team is planning to carry out this drug development work in the context of Transition Bio.

The demonstrated ability of DeePhase to learn protein features that are not conventionally factored in “gives us huge confidence and optimism about extending this to other datasets and even understanding modulations of these condensates with small molecules,” he says.

Protein condensates can form and reform in the body and have been compared to blobs of shapeshifting wax in lava lamps. Their analysis has attracted a lot of scientific attention because the membrane-less aggregates control key events in the cell, including gene expression and protein synthesis, says Knowles.

It is also important to understand how condensates malfunction, he adds, because if they form in ways or in places they are not supposed to, or change from a liquid to a solid, they could potentially cause cancer or a neurodegenerative disease.

Doing the analysis in a hypothesis-free manner is new and could pick up the pace of advancement in the world of drug discovery and diagnostics, Knowles says. Comparing DeePhase to the more conventional approaches allowed the research team to better understand how the algorithm reached its results to counter the “black box” problem often associated with NLP algorithms, he adds.

The project was a collaboration between Knowles’ research group, which specializes in protein biophysics, and the university’s departments of physics and chemistry. Kadi Liis Saar, Ph.D., first author on the PNAS paper, Schmidt Science Fellow at the University of Cambridge and a member of the founding team of Transition Bio, was instrumental in bringing this expertise together, Knowles says. She was also the one who trained the model to look at what happens when something goes wrong with proteins inside the body to cause disease.

The quest to discover the physical rules governing the behavior of proteins is a long-term program, he notes, predating the formation of Transition Bio by 15 years. The other company co-founders are David A. Weitz, professor of physics and systems biology at Harvard University, and Peter St George-Hyslop, professor of experimental neuroscience at the University of Cambridge and professor in the department of medicine (neurology division) at the University of Toronto.

Learning The Connections

The ability to learn about the fundamental chemistry and physics of a process without having to use pre-existing knowledge as inputs is a “fundamentally new way of approaching the problem,” says Knowles. NLP is more typically used to understand human language so, for example, an email program can predict what a person wants to type next and automate the work. Similar algorithms are used by Netflix, Amazon, and Facebook to make educated guesses about what people might want or like based on their previous text-based search patterns.

DeePhase uses the same sort of algorithm to discover properties of the protein formation process, says Saar. “We don’t want to make assumptions about what is important,” she says, pointing to the value of the unsupervised approach. The algorithm was fed the entire “encyclopedia” of known protein sequences to learn the connections.

An NLP technique known as sequence embedding was used to turn the extremely high-dimensional information into a few hundred, low-dimensional representatives (feature vectors) for predicting the properties of proteins, Saar explains. “At the same time, we aimed to make sure the embedding included all the key information in the original dataset.”
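As a rough illustration of the general idea of sequence embedding (not the DeePhase code), the sketch below splits a protein sequence into overlapping three-letter "words," gives each word a fixed vector, and averages those vectors so that a sequence of any length becomes a short list of numbers. Everything here is an assumption made for illustration: the function names, the eight-dimension vector size, and especially the word vectors, which are deterministic pseudo-random stand-ins for the values a real model would learn from a large corpus of sequences.

```python
import hashlib
import random

EMBED_DIM = 8  # hypothetical size; chosen small so the output is easy to read


def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-residue 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_word_vector(word, dim=EMBED_DIM):
    """Return a deterministic pseudo-random vector for a word.

    This is a placeholder for a learned embedding: the same word always
    maps to the same vector, but the values carry no biological meaning.
    """
    digest = hashlib.sha256(word.encode("ascii")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_sequence(sequence, k=3, dim=EMBED_DIM):
    """Average the word vectors so any sequence length maps to `dim` numbers."""
    words = kmers(sequence, k)
    if not words:
        raise ValueError("sequence is shorter than k")
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(toy_word_vector(word, dim)):
            total[i] += value
    return [value / len(words) for value in total]


if __name__ == "__main__":
    # Two made-up sequences of different lengths yield vectors of equal size.
    for seq in ("MKTAYIAKQR", "MKTAYIAKQRQISFVKSHFSRQ"):
        print(len(seq), "residues ->", len(embed_sequence(seq)), "features")
```

The fixed-length output is the point: a downstream classifier can then be trained on these feature vectors regardless of how long each protein is.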

“We’re keen on being part of the effort to bring NLP to protein science and use [the algorithms] to solve key problems relevant to human health,” says Knowles. Finding features of proteins that guide phase separation was just the initial ask of DeePhase. The immediate next step, which the team is now actively working on, is to “understand what the cell is doing in a particular moment in time” when condensates develop—and the components involved across different cell lines.

Knowing the context would provide clues on how to influence this behavior and correct mistakes, he says. DeePhase could eventually be applied to “other absolutely crucial biological phenomena.”

That machine-learning technology is being used for research on cancer and neurodegenerative diseases like Alzheimer’s, Parkinson’s and Huntington’s is an “absolute game-changer,” Knowles adds. “Ultimately, the aim will be to use artificial intelligence to develop targeted drugs to dramatically ease symptoms or to prevent dementia from happening at all.”

The team has made the version of the software used in the paper available for non-commercial use.