Deeper Mining Of Proteomics Data Using Artificial Intelligence

December 9, 2022

Contributed Commentary by Bernard Delanghe, Thermo Fisher Scientific 

Advances in mass spectrometry (MS), ultra-high performance liquid chromatography (UHPLC), and bioinformatics have transformed proteomics research in recent years, affording greater understanding of biological systems at the proteome scale. The growing power of these technologies has contributed to the widespread application of bottom-up proteomics strategies, based on the characterization of proteins by analysis of peptides released through proteolytic digestion. However, challenges in proteomics data analysis remain, with traditional search strategies limiting the ability to fully interpret data. 

Artificial intelligence (AI) provides enormous opportunities to accelerate proteomics research through more comprehensive peptide and protein identification and quantification. Here, we look at how advances in AI are facilitating more efficient proteomics data acquisition, helping scientists generate more insight from both existing and newly acquired data. 

Augmenting Traditional Database Searching With Artificial Intelligence 

Bottom-up proteomics approaches typically rely on matching tandem mass spectra of proteolyzed samples with theoretical, computationally generated peptide fragmentation patterns, a process known as database searching. Traditional search engines used for this purpose score peptide-spectrum matches (PSMs) based on the matching of fragment ion masses. However, the ability of current search engines to draw on experimentally observed fragmentation patterns and fragment ion intensities is severely limited. While searching against libraries of experimental spectra is possible for some proteomics strategies, it is typically applied to targeted or data-independent acquisition (DIA) experiments, rather than data-dependent acquisition (DDA) workflows. 
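The fragment-mass matching described above can be illustrated with a minimal Python sketch. It assumes singly charged b and y ions, standard monoisotopic residue masses, and a fixed 0.02 Da tolerance; the function names are hypothetical and not taken from any particular search engine, and real engines use more elaborate scoring than a simple match count.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O (Da)
PROTON = 1.007276   # mass of a proton (Da)


def fragment_mzs(peptide):
    """Return theoretical m/z values of singly charged b and y ions."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: N-terminal fragment of length i.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: complementary C-terminal fragment, which retains water.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions + y_ions


def count_matched_fragments(peptide, observed_mzs, tol_da=0.02):
    """Count theoretical fragments with an observed peak within tol_da."""
    return sum(
        any(abs(theo - obs) <= tol_da for obs in observed_mzs)
        for theo in fragment_mzs(peptide)
    )
```

Note that this count uses fragment masses only and ignores peak intensities, which is the limitation the commentary goes on to discuss.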

AI is helping to overcome these challenges by augmenting traditional search strategies with additional figures of merit to improve the identification and quantification of proteins across a wide range of workflows. For example, recent years have seen the development of AI-based intelligent search algorithms, built into commercial proteomics analysis software, that enable more accurate prediction of peptide fragment ion intensities. By generating additional intensity-based scores to supplement traditional database searching, these intelligent algorithms are increasing confidence in identifications to improve search results. 
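One way such an intensity-based score might be computed is the normalized spectral contrast angle between observed and predicted fragment intensities. The Python sketch below is an illustration only: it assumes the two intensity lists are already aligned to the same fragment ions, the function name is hypothetical, and the source does not state which similarity measure the commercial algorithms actually use.

```python
import math


def spectral_angle(observed, predicted):
    """Normalized spectral contrast angle between two intensity vectors.

    Both inputs must list intensities for the same fragment ions in the
    same order. For non-negative intensities the result lies between
    0 (no shared signal) and 1 (identical relative intensities).
    """
    if len(observed) != len(predicted):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0  # an empty spectrum cannot support a match
    # Clamp to guard against floating-point values just outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_o * norm_p)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi
```

A score of this kind can be passed to a rescoring step alongside the mass-based score, so that two candidate peptides with the same matched fragment masses can still be told apart by how well their predicted intensities agree with the spectrum.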

Deep learning-based rescoring workflows capable of calculating intensity-based scores by comparing PSMs with predicted spectra are already allowing researchers to explore the proteome with greater clarity. By supporting more accurate identifications, such algorithms enable more confident matching, which is particularly useful for applications that involve large search spaces and in which the separation between targets and decoys is very small, such as immunopeptidomics and metaproteomics research. In one immunopeptidomics dataset, for example, the use of AI-based rescoring resulted in a 50% increase in identified peptides compared with traditional approaches. These impressive results highlight the potential of deep learning for deeper, more comprehensive proteomic analysis. 
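Why target-decoy separation matters can be seen in a simplified sketch of target-decoy filtering, written here in Python. It assumes each PSM is a (score, is_decoy) pair and estimates the false discovery rate (FDR) as decoys divided by targets above a score threshold; the function name and the 1% default are illustrative assumptions, and production tools typically use more refined estimators.

```python
def targets_at_fdr(psms, max_fdr=0.01):
    """Count target PSMs accepted at or below max_fdr.

    psms: iterable of (score, is_decoy) pairs, where a higher score
    means a better match.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least this well.
        if targets > 0 and decoys / targets <= max_fdr:
            accepted = targets
    return accepted
```

When decoys score nearly as well as targets, they appear early in the ranking and the threshold is reached after only a few accepted targets. A rescoring step that pushes decoys further down the ranking therefore lets more target PSMs pass at the same FDR.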

Deconvolution Of Chimeric Spectra With Artificial Intelligence 

Another major challenge in peptide identification by tandem MS is the problem of chimeric spectra, where two or more precursor ions with a similar retention time and mass are co-isolated and fragmented. This problem is substantial, with more than 40% of spectra in DDA datasets estimated to contain more than one peptide. The proportion can be even higher when working with shorter gradients and higher protein loads. Traditional search engines struggle to identify more than a few PSMs in these chimeric spectra, resulting in a failure to fully interpret the data generated by mass spectrometers.  

Intelligent search algorithms that combine deep learning with powerful spectral deconvolution are capable of more accurate peptide spectrum and retention time predictions. Such algorithms enable deeper mining of proteomics data, increasing the number of PSMs found in DDA data compared with traditional techniques, leading to fewer spectra returning no PSMs, and many spectra returning three or more PSMs. As a result, intelligent search algorithms can greatly improve the number of unique peptides and proteins researchers can identify and quantify in typical proteomics datasets. Workflows utilizing these algorithms have been shown to yield as many as 70% more proteins, more than twice as many unique peptides, and approximately five times as many PSMs in a typical 90-minute run compared with standard workflows. 

Optimizing Acquisition For Faster, Fuller Proteome Coverage 

As traditional search engines often struggle to deconvolute chimeric spectra, DDA proteomics experiments have commonly relied on narrow isolation windows to reduce the co-isolation of multiple peptides. However, using narrow isolation windows limits instrument utilization, reducing performance and efficiency. 

AI-based search algorithms provide an opportunity to rethink the analysis of tandem mass spectra, improving coverage and efficiency in proteomic research. By efficiently deconvoluting complex chimeric spectra, intelligent search algorithms enable optimization of DDA methods by permitting the use of wider isolation windows. This allows the identification of more PSMs per spectrum, enabling increased instrument utilization with more identifications per scan event. Combining wide window acquisition and intelligent search algorithms in this way, researchers can achieve greater proteome coverage using the same run times, or similar coverage with substantially higher throughput. 
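The effect of isolation window width on co-isolation can be sketched in a few lines of Python. The `Precursor` class, the function name, and all m/z, retention time, and window values below are hypothetical and chosen purely for illustration; they do not come from any instrument method or dataset.

```python
from dataclasses import dataclass


@dataclass
class Precursor:
    name: str
    mz: float        # precursor m/z
    rt_start: float  # start of elution (minutes)
    rt_end: float    # end of elution (minutes)


def co_isolated(precursors, center_mz, window_width, scan_rt):
    """Return precursors that fall inside the isolation window at scan_rt."""
    half_width = window_width / 2
    return [
        p for p in precursors
        if abs(p.mz - center_mz) <= half_width
        and p.rt_start <= scan_rt <= p.rt_end
    ]


# Illustrative precursors eluting around the same time.
precursors = [
    Precursor("A", 500.20, 10.0, 11.0),
    Precursor("B", 500.90, 10.5, 11.5),
    Precursor("C", 501.60, 10.2, 10.8),
    Precursor("D", 500.30, 20.0, 21.0),  # elutes much later
]

narrow = co_isolated(precursors, center_mz=500.5, window_width=0.7, scan_rt=10.6)
wide = co_isolated(precursors, center_mz=500.5, window_width=4.0, scan_rt=10.6)
print([p.name for p in narrow])  # ['A']
print([p.name for p in wide])    # ['A', 'B', 'C']
```

With the narrow window the scan contains a single precursor, while the wide window produces a chimeric spectrum with three. That wider scan only pays off if the search algorithm can deconvolute the resulting spectrum into separate PSMs.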

A Powerful, Flexible Framework To Optimize Proteomics Data Analysis 

AI is transforming proteomics data analysis, with intelligent search algorithms substantially increasing research capabilities. By overcoming major limitations in proteomics research, these innovative solutions are opening new possibilities for DDA techniques to increase instrument utilization and optimize throughput and proteome coverage. 

Bernard Delanghe is currently Product Marketing Manager, Omics Software, at Thermo Fisher Scientific in Bremen, Germany. His primary role is bioinformatics market development, including managing strategic collaborations to develop new algorithms as well as complete workflows. In previous roles, he was Product Manager, FT-MS, and Product Manager, Biosoftware. During that time, he was responsible for Proteome Discoverer software, for which he received the 2018 Industry Award from the European Proteomics Association (EUPA). He joined Thermo Fisher Scientific in 2003 after working for several years at Applied Biosystems in Sales, Support and Business Development. He has over 20 years of experience in mass spectrometry and bioinformatics. He can be reached at