By Malorye A. Branca
October 15, 2002 | Proteomics pioneer Ruedi Aebersold has released a new software tool that promises to help the field avoid some of the consistency issues that have plagued gene expression analysis.
Researchers led by Aebersold at the Seattle, Wash.-based Institute for Systems Biology (ISB) have developed a novel program that judges the accuracy of peptide sequence matches made by database search algorithms for tandem mass spectrometry data. A paper describing the program, along with a study in which it was tested, was posted online at the American Chemical Society’s Analytical Chemistry Web site in August. Aebersold, who would like to see the tool become a standard for identifying proteins, is a co-founder of the ISB and a faculty member there.
This program addresses a key step in proteomics. The data produced by mass spectrometers are spectra that represent different protein fragments, or peptides, found in a biological sample. That information can be used to identify the proteins, if the peptides can be accurately matched to a protein sequence.
Several programs, including SEQUEST, Mascot, and Sonar, can make peptide assignments. These algorithms pore through the peptides in specific databases and evaluate their fit against each spectrum. The most likely fits are reported, but there is no widely accepted way to evaluate the accuracy of these matches.
The new ISB software employs a statistical learning approach based on the Expectation-Maximization algorithm, which allows the program to “teach” itself what constitutes a good match versus a bad one. “We borrowed heavily from the experience of the sequencing community in doing this,” says Aebersold.
The process is relatively straightforward. “Our program takes the assigned peptide sequence and score for each spectrum in a data set, and then calculates the probability that each of those assignments is correct,” says Andrew Keller, ISB research scientist and co-first author on the study. The software can be downloaded from the ISB Web site.
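To make the idea concrete, here is a minimal sketch of how Expectation-Maximization can turn a list of search scores into per-match probabilities. It is not the ISB program: it assumes a single score per spectrum, assumes that correct and incorrect matches each follow a bell-shaped (Gaussian) score distribution, and uses a hypothetical function name, em_match_probabilities, chosen only for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_match_probabilities(scores, iterations=100):
    """Fit a two-component Gaussian mixture to search scores with EM.

    Returns, for each score, the estimated probability that the match
    belongs to the higher-scoring ("correct") component.
    Illustrative assumption: one score per spectrum, Gaussian components.
    """
    n = len(scores)
    lo, hi = min(scores), max(scores)
    # Starting guesses: incorrect matches near the lowest score,
    # correct matches near the highest.
    mean_bad, mean_good = lo, hi
    sd_bad = sd_good = max((hi - lo) / 4.0, 1e-3)
    prior_good = 0.5
    probs = [0.5] * n

    for _ in range(iterations):
        # E-step: probability that each match is correct under the current model.
        probs = []
        for s in scores:
            good = prior_good * normal_pdf(s, mean_good, sd_good)
            bad = (1.0 - prior_good) * normal_pdf(s, mean_bad, sd_bad)
            total = good + bad
            probs.append(good / total if total > 0 else 0.5)

        # M-step: re-estimate both distributions from the weighted scores.
        w_good = sum(probs)
        w_bad = n - w_good
        if w_good < 1e-9 or w_bad < 1e-9:
            break  # one component has vanished; keep the last estimate
        prior_good = w_good / n
        mean_good = sum(p * s for p, s in zip(probs, scores)) / w_good
        mean_bad = sum((1.0 - p) * s for p, s in zip(probs, scores)) / w_bad
        var_good = sum(p * (s - mean_good) ** 2 for p, s in zip(probs, scores)) / w_good
        var_bad = sum((1.0 - p) * (s - mean_bad) ** 2 for p, s in zip(probs, scores)) / w_bad
        sd_good = max(math.sqrt(var_good), 1e-3)
        sd_bad = max(math.sqrt(var_bad), 1e-3)

    return probs


# Made-up scores for illustration: five weak matches and three strong ones.
example_scores = [0.1, 0.3, 0.2, 0.4, 0.25, 3.9, 4.1, 4.0]
for score, prob in zip(example_scores, em_match_probabilities(example_scores)):
    print(f"score={score:.2f}  P(correct)={prob:.3f}")
```

No labeled examples are supplied here. The procedure alternates between guessing which matches are correct and refitting the two score distributions to those guesses, which is the sense in which such a program can “teach” itself.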
“This program saves a lot of time, since you can be certain that the matches with a high probability score are real,” says Aebersold. “If everyone in the community was using this, it would essentially standardize what it means to have a protein identified.” He has recently been trying to rally support for proteomics data standards.
That’s no small matter. Gene expression data, for example, are plagued by consistency issues, because of the large number of different types of chips and analytical methods. “This is one of the central issues in proteomics,” says Ken Miller, proteomics product marketing manager at ThermoFinnigan, in San Jose, Calif. “We have this great instrumentation, but it generates spectra and the job is making heads or tails of these lists [of peptide sequences].”
Such standards will be particularly important as proteomics researchers move away from studies focused on identifying the proteins in a single sample, and begin to do more experiments that involve sample-to-sample comparisons.
Although this is the first widely available program of its kind, other groups are developing similar software. “We have a very significant effort along those lines,” says Miller.
Standards are always a contentious topic, and debate will surely arise about whether the ISB group’s software should become the method of choice. But the idea of a standard is widely regarded as a good one.
“Everyone’s life would be much simpler if there was a standard means of identifying proteins,” says Miller. InforMax Inc. CEO Andrew Whiteley concurs: “The emergence of standards will drive informatics in the future, because people want to compare and contrast results from different experiments.”
A. Keller, A.I. Nesvizhskii, E. Kolker, & R. Aebersold. “Empirical Statistical Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search.” Analytical Chemistry, 2002.