Natural Language Processing Designs Original Proteins From Scratch
By Deborah Borfitz
February 16, 2023 | Using a natural language processing (NLP) program developed by Salesforce Research, scientists have engineered artificial enzymes that look and fold just like natural proteins and function just as well. The artificial intelligence tool, called ProGen, uses information from the evolutionary record of multiple protein families to generate new enzymes that could prove therapeutically useful, according to James Fraser, Ph.D., professor of bioengineering and therapeutic sciences in the school of pharmacy at the University of California, San Francisco.
The NLP approach, described in an article published recently in Nature Biotechnology (DOI: 10.1038/s41587-022-01618-2), involves “next-token prediction with a control tag,” in which every amino acid (token) added to a polypeptide chain is treated like a word in a sentence, conditioned on the type of protein scientists are trying to create, explains Fraser. It’s akin to language modeling, except that each token is an amino acid molecule rather than a word, and each one influences the probability of every other amino acid appearing in the final generated product.
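The idea of conditioning next-token prediction on a control tag can be illustrated with a toy sketch. This is not ProGen itself: the tag names and the per-family amino-acid biases below are invented for illustration, and a real model conditions on the full prefix through a learned neural network rather than a lookup table.

```python
import random

# The 20 standard amino acids, one letter each (the "vocabulary").
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical per-family preferences standing in for what a trained
# model would learn; values are relative sampling weights.
FAMILY_BIAS = {
    "<lysozyme>": {"K": 3.0, "G": 2.0},
    "<protease>": {"S": 3.0, "H": 2.0},
}

def next_token_probs(control_tag, prefix):
    """Distribution over the next amino acid, conditioned on the control
    tag. A real model would also condition on the generated prefix."""
    weights = {aa: 1.0 for aa in AMINO_ACIDS}
    for aa, w in FAMILY_BIAS.get(control_tag, {}).items():
        weights[aa] *= w
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

def generate(control_tag, length, seed=0):
    """Sample a sequence one token at a time, left to right."""
    rng = random.Random(seed)
    seq = []
    for _ in range(length):
        probs = next_token_probs(control_tag, seq)
        aas, ps = zip(*probs.items())
        seq.append(rng.choices(aas, weights=ps)[0])
    return "".join(seq)

print(generate("<lysozyme>", 30))
```

Changing the control tag shifts the sampling distribution without changing the generation procedure, which is the essence of tag-conditioned generation.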
In the paper, the authors used ProGen with control tags specifying the family of lysozymes—what Fraser likens to a general review versus a rating system of one to five stars. “We haven’t yet explored the power to generate a lysozyme that’s thermal-stable or acid-stable or that would work really well under high pressure or other [conditions],” he says.
But in experiments on the immediate horizon, Fraser quickly adds, ProGen will be asked to generate functional proteins based on preferences for having amino acids in different positions. “These are complex interdependencies that are a little bit black-box to us, but the model has learned that fixing an amino acid at one position influences the presence of an amino acid in another position.”
Longer term, he expects to see an integration between the worlds of structure-based reading (e.g., AlphaFold and RoseTTAFold) and sequence-based writing (e.g., ProGen) to produce improved, next-generation proteins. The protein engineering field is “incredibly fast-moving,” and most of the attention up to now has been going to better ways of modeling protein structure explicitly, as well as more recent diffusion models of protein structure prediction and design.
Those efforts are all aiming to model the structure because “structure determines function,” Fraser says. ProGen is inverting the problem by starting with the desired function and then using NLP to write out a plausible sequence to have that function.
Researchers have been using NLP models over the past couple of years to hypothesize proteins that could be made, says Fraser. “What’s new here is the fact that we actually made them, and that they work... as frequently as a random sampling of natural proteins.”
Some of them also work as well as the best natural proteins and “for the right reasons,” Fraser adds. By determining their three-dimensional (3D) structure experimentally, “we showed that they actually look and fold just like natural proteins.”
Why these artificial designs perform much better than designs inspired by the evolutionary process remains a mystery, says Fraser. But his favored rationale is that ProGen allows protein engineers to “cross-pollinate” insights between protein families.
“When we consider the evolution of a single protein family, we only have a maximum of about four billion years to draw on,” he continues. “One of the cool things about ProGen is that it allows us to share information about the evolutionary record of one family with another family, and it learns some of those preferences and peculiarities of proteins that we wouldn’t be able to understand by just studying one family.”
Deep Dive On Five
ProGen was trained on 280 million protein sequences from more than 19,000 families and is augmented with control tags specifying protein properties, such as protein family, biological process, and molecular function. It was then fine-tuned with 56,000 more sequences from five lysozyme families, along with some contextual information about these proteins.
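One way to picture how control tags enter training is to prepend them to each sequence, so the model sees the property labels before the amino acids. The format below is an assumption for illustration; ProGen’s actual tokenization and tag scheme may differ, and the tag and sequence values are made up.

```python
def make_training_example(sequence, family, process, function):
    """Build a hypothetical tag-prefixed training string: control tags
    for family, biological process, and molecular function, followed by
    the amino acids as tokens and an end-of-sequence marker."""
    tags = [f"<family={family}>", f"<process={process}>", f"<function={function}>"]
    return " ".join(tags) + " " + " ".join(sequence) + " <EOS>"

example = make_training_example(
    "MKVLA",                 # made-up sequence fragment
    "lysozyme_c",            # hypothetical family tag
    "defense_response",      # hypothetical process tag
    "lysozyme_activity",     # hypothetical function tag
)
print(example)
```

At generation time the same tags are supplied as a prompt, and the model continues the string with amino-acid tokens.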
The amino acids in a chain are ordered from the amino-terminus to the carboxyl-terminus and “it ends with a period just like a sentence,” as Fraser explains it. “The only difference is the ‘sentence’ is 200 to 400 words long, in a lot of cases, and has all these elements [analogous to independent clauses and parentheses] of secondary structure or how different amino acids will face each other in the three-dimensional sequence and interact.”
ProGen is learning the “grammar and semantics” of the underlying amino acids across all proteins, he says, as well as the peculiarities of a family when given a specific control tag like a lysozyme. “It is then able to express and generate sequences that obey, in that respect, the [rules] of the natural family of proteins while being generated in a process that exists completely outside of evolution. So, they don’t make sense on a phylogenetic tree, but they do make sense in terms of the way they function.”
For the latest demonstration, the research team made 100 proteins in the lab using the evolution-based approach and another 100 using ProGen. They were then compared to 100 randomly selected natural sequences.
The five most active artificial proteins generated by ProGen were then rigorously characterized with a systematic survey of their kinetic properties to learn how fast they’d chew up a bacterial cell wall, Fraser says. Some of them were as active as the most active natural lysozyme: hen egg white lysozyme, the first enzyme to have its 3D structure determined, back in the 1960s, and since then the model system for studying protein structure and function. One of the newly generated artificial proteins likewise had its 3D structure determined with X-ray crystallography, he adds.
Scientists typically “guess” at a protein’s function by comparing its sequence to databases of known sequences. That inference “starts to get a little fuzzier” when the match is less than 40% identity, Fraser points out.
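The percent-identity figure in that comparison is simple to compute once two sequences are aligned. A minimal sketch, assuming the sequences have already been aligned to equal length (real pipelines align first, e.g., with BLAST, and use `-` for gaps):

```python
def percent_identity(a, b):
    """Percent identity between two pre-aligned, equal-length sequences.
    Gap positions ('-') are never counted as matches."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("MKVAG", "MKVTG"))  # 4 of 5 positions match -> 80.0
```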
It was considered a good sign that the artificial enzymes generated by ProGen fell on the lower end of the matching spectrum because it suggests that the technology was generating new proteins “barely recognizable as lysozymes,” Fraser continues. It is important to note here that high percentage identities for natural proteins tend to occur in clusters that derive from the evolutionary process, he adds.
“A chicken lysozyme is going to be more like a goose lysozyme and a duck lysozyme... [while] a human lysozyme is going to look more like a chimpanzee lysozyme,” says Fraser. “Ours might be high identity to just a chicken lysozyme, but... it has these weird properties where it can have high identity to different parts of the actual evolutionary tree but not be able to be placed on it.”
The source code for ProGen is openly available and has undergone several rounds of improvement over the past few years, Fraser says. The trick is in executing the deep learning framework, which is where Ali Madani, Ph.D., a former research scientist at Salesforce Research and first author of the latest Nature Biotechnology paper, comes in. Madani has since founded a startup, Profluent Bio, to build on this line of research. Fraser serves in an advisory capacity with Profluent Bio but has no operational role.
The company’s technology can easily generate working enzymes from scratch out of the box, Madani says, which is remarkable given the almost limitless design choices. Lysozymes, which are relatively small proteins, can have up to 300 amino acids. But with 20 possible amino acids at each position, the number of potential combinations is greater than the number of humans who have ever lived, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
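That comparison can be checked with a few lines of log arithmetic. The three reference quantities below are rough, commonly cited order-of-magnitude estimates (assumptions, not figures from the article):

```python
import math

SEQ_LEN = 300        # upper-end lysozyme length cited in the article
N_AMINO_ACIDS = 20

# log10 of the number of possible sequences: 300 * log10(20)
log10_sequences = SEQ_LEN * math.log10(N_AMINO_ACIDS)

# Rough order-of-magnitude estimates for the comparison:
humans_ever = 1.2e11       # people who have ever lived
sand_grains = 7.5e18       # grains of sand on Earth
atoms_universe = 1e80      # atoms in the observable universe

# log10 of the product of the three estimates
log10_comparison = (math.log10(humans_ever)
                    + math.log10(sand_grains)
                    + math.log10(atoms_universe))

print(f"sequence space: ~10^{log10_sequences:.0f}")   # about 10^390
print(f"comparison:     ~10^{log10_comparison:.0f}")  # about 10^110
```

Even under generous estimates, the product of the three quantities is around 10^110, while the sequence space is around 10^390, so the article's comparison understates the gap by hundreds of orders of magnitude.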