Deep Proteome Sequencing Answers Longstanding Biology Question

April 18, 2023

By Deborah Borfitz 

April 18, 2023 | Debate about whether RNA splicing events code for different proteins may finally be over, thanks to international research efforts and a new sequencing method that enumerates the massive number of human protein alterations in standard proteomics experiments. The meta-scale approach, which has produced the deepest proteomics map ever collected, could help drug makers better target their therapeutics to the right version of a protein, according to Joshua Coon, Ph.D., professor of biomolecular chemistry at the University of Wisconsin-Madison and an investigator at the Morgridge Institute for Research.   

“Technologies for proteome sequencing have lagged behind DNA and RNA sequencing and our knowledge of proteins has similarly lagged behind,” says Coon, lead researcher on a study published in Nature Biotechnology (DOI: 10.1038/s41587-023-01714-x) describing the game-changing “deep proteome sequencing” method that puts the three techniques on equal footing. The team also made a fundamental contribution to science that will likely be added to biochemistry textbooks—namely, that genetic mutations “make their way down to the protein level.”    

Using deep proteome sequencing, the team has created a publicly available resource where any scientist can learn the peptides and protein modifications associated with a particular gene. These include single nucleotide changes as well as alternative splicing events, where small segments of a DNA or RNA molecule called exons are joined in different combinations and can translate to produce different proteins with distinct structures and functions, explains Pavel Sinitcyn, Ph.D., a postdoctoral researcher in the Coon Lab as well as a Morgridge interdisciplinary postdoctoral fellow. As many as 95% of human genes have this capability, and each one can code for literally hundreds of distinct proteins.

For the study, the team analyzed proteins from six different human cell types, using six proteases to break them down into peptides that were then analyzed via mass spectrometry, Sinitcyn says. The unprecedented ability to characterize proteins “opens up a new dimension in all studies” seeking to understand how their dysfunction contributes to disease.
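To picture why several proteases help, here is a minimal in-silico digestion sketch. The cleavage rules are textbook simplifications and the protein sequence is made up for illustration; neither is taken from the study, and the function name `digest` is hypothetical. Because each enzyme cuts at different residues, the same protein yields different, overlapping peptides.

```python
import re

# Simplified, textbook-style cleavage rules (assumptions for illustration,
# not the settings used in the study). Each pattern matches the zero-width
# position immediately after the residue where the protease cuts.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",  # after K or R, unless the next residue is P
    "lys_c": r"(?<=K)",            # after K
    "glu_c": r"(?<=E)",            # after E (simplified)
}


def digest(sequence, protease):
    """Split a protein sequence into peptides at the protease's cut sites."""
    cut_sites = [m.start() for m in re.finditer(CLEAVAGE_RULES[protease], sequence)]
    peptides, start = [], 0
    for site in cut_sites + [len(sequence)]:
        if site > start:  # skip empty fragments at the ends
            peptides.append(sequence[start:site])
            start = site
    return peptides


if __name__ == "__main__":
    toy_protein = "MKPAKRELK"  # made-up sequence
    for name in CLEAVAGE_RULES:
        print(name, digest(toy_protein, name))
```

Real digestion is messier (missed cleavages, enzyme-specific exceptions), so this is only a rough model of the idea.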

The researchers identified more than 1 million peptides from 17,717 different protein groups. Notably, from these data, they were able to detect about 80% of the sequences of all individual proteins within those samples, he says. Using current proteomic technology, only about 20% of proteins can be sequenced. 
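The coverage figures above can be read as the fraction of a protein's amino acids that fall inside at least one identified peptide. The sketch below shows that arithmetic under simple assumptions: exact string matching, a made-up sequence, and a hypothetical helper named `sequence_coverage`. It is not the study's software.

```python
def sequence_coverage(protein, peptides):
    """Return the fraction of residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue  # ignore empty strings
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQR"  # made-up 10-residue sequence
    print(sequence_coverage(toy_protein, ["MKT"]))                # few peptides
    print(sequence_coverage(toy_protein, ["MKT", "TAYI", "QR"]))  # more peptides
```

Overlapping peptides count each residue only once, which is why adding peptides from different proteases raises coverage without double counting.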

Shotgun proteomic methods, which use a small number of peptide proxies to determine the presence of an entire protein, make it impossible to globally identify proteoforms, the researchers say. While “top-down” proteomics allows the measurement of intact protein mass before dissociation for sequence determination, it faces technical hurdles that limit its use.

The spectrometry-generated data for the study was produced by the University of Wisconsin. Collaborators included the Max Planck Institute of Biochemistry in Germany, which did most of the algorithm development work, as well as the University of Toronto in Canada and the Garvan Institute in Australia, which provided RNA splicing expertise and aided in interpreting and formulating results, says Coon.

Sinitcyn, formerly a Max Planck scientist, led the large-scale data analysis work for the project. He developed software solutions enabling the detection of single amino acid variants and alternative splicing in the mass spectrometry data.  

The initial hurdle was finding a way to account for the high probability of generating false positives, says Sinitcyn, given that the team at the Max Planck Institute was dealing with more than five terabytes of data from heterogeneous sources. Unlike DNA and RNA sequencing, proteomics does not provide direct sequence information. Rather, it relies on the random fragmentation of peptides, which requires a solid statistical interpretation. “Failure to handle the data with care can lead to errors in the analysis.”

Speeding Measurement  

Humans are estimated to have about 20,000 genes—the same number as dogs, cats, and worms—but they clearly have more RNA editing taking place, says Coon. Those modifications were thought to give rise to different versions of proteins, but up until now there was no way to check if those conversions in fact happened. “Our study confirms that probably the majority of them do,” answering a longstanding question in biology. 

The implications here are significant, he continues. If it can be verified that one gene has two different RNA versions—where, for example, one allele is missing a domain but the other is not—researchers can then distinguish their respective functions and their potential as therapeutic targets. The challenge is in dealing with all the newfound information.  

Alternative splicing is a rapid evolutionary adaptation mechanism, says Sinitcyn, and splicing events have been associated with stresses that include cancer and other disorders. They also serve important physiological functions and support tissue and organ development.

Coon’s lab has set up an easy-to-use deep sequencing data browser where scientists can go to query any gene of interest and learn the alternative splicing changes associated with it, Sinitcyn says. The plan is to eventually extend the resource to more cell lines and tissues.  

Most immediately, the team will be attempting to streamline the measurement process, says Coon, noting that the deep proteome project required considerable effort by multiple people to collect the data and develop new computational algorithms to extract the information. Their current focus is on advancing mass spectrometry tools and technologies for deep data analysis.   

Deep protein sequencing could potentially be used to catalogue the extent of splicing at the protein level across different organisms, Coon says, which would presumably happen very little in, say, yeast. But the greater fascination right now would be the variability in cell type-specific splicing events within humans.