
Swinging Through the Proteomic Data Jungle

By Kevin Davies

HUPO stresses the need for standardized analysis methods and a workflow focus.

Dec. 17, 2007 | Data processing, validation, standardization, and protein quantification were among the central themes of the 6th annual Human Proteome Organisation (HUPO) World Congress, held in Korea in October.

HUPO was founded in February 2001, the same week as the publication of the first draft of the human genome. The organization's council now has 48 members from 19 countries, with its headquarters located at McGill University and Genome Quebec Innovation Centre in Montreal. HUPO has 2000 founding members from 69 countries.

According to Bruker Daltonics' director of bioinformatics, Herbert Thiele, "We've all learned that in bioinformatics, we have to address different proteomics workflows. The extreme complexity of the proteome calls for different multistep approaches." These approaches usually pair electrophoresis and liquid chromatography (LC) techniques with different MS and MS/MS methods.

"Any kind of software solution for data warehousing and analysis should address these different workflows in a flexible manner," says Thiele. Bruker's ProteinScape platform supports various discovery workflows through a flexible analyte hierarchy concept, as well as addressing scientists' needs in biomarker profiling and quantification. Thiele points out: "A database solution is the only way to compare experiments to one another and to extract knowledge based on past experiments."

"Quantitation is becoming more and more important," says Thiele. "All vendors are working hard on quantitation tools." ProteinScape fully supports all current label chemistries for protein quantification, and the software will handle future label technologies. Interactive validation of protein quantification based on raw LC/MS data is now simple and straightforward.

Recent improvements in MS instrumentation make a label-free MS-based quantification approach feasible. This technology has the potential to become a significant complement to current quantification methods, such as label-based MS methods. Because a label-free approach is compatible with high-throughput operation, large numbers of samples can be processed. Handling these workflows, from data preprocessing to statistical validation of quantification results, is a big challenge.

Brain Proteome Project
An important takeaway from HUPO 2007 was the need for standardized analysis methods and result validation techniques. One of nine official global HUPO consortia is the Human Brain Proteome Project, headed by Helmut E. Meyer (Medizinisches Proteom-Center, Bochum, Germany), which aims to map the "proteomic landscape of the brain" using mouse and human samples "to get deeper insights into neurodegenerative diseases, and produce an inventory of proteins in the human brain."

The brain consortium established guidelines for data processing for protein identification, which Thiele calls a "very important step forward." It will allow researchers to "compare results and statistical relevance for all generated data, within the huge jungle of proteomics data." A data warehousing system including a data processing pipeline is mandatory for data comparison and validation.

Thiele says: "The fundamental problem of protein identification is [that] you get a long list of potential protein identifications, but nobody tells you which proteins are actually correct. The decoy approach allows you to measure the rate of false positives by mixing artificial protein sequences into the database."
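The decoy idea Thiele describes can be illustrated with a short Python sketch. It is not taken from ProteinScape or any other product named here; the `Hit` record, the score scale, and the helper names are hypothetical, and it assumes the decoy sequences are about as numerous as the real ones, so that the number of decoy hits above a score cutoff approximates the number of false target hits above the same cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result (hypothetical record, not a vendor format)."""
    protein: str
    score: float    # assumed: higher score means a better match
    is_decoy: bool  # True if the match is to an artificial decoy sequence


def estimate_fdr(hits, threshold):
    """Estimate the false discovery rate at a score cutoff as decoys / targets."""
    targets = sum(1 for h in hits if h.score >= threshold and not h.is_decoy)
    decoys = sum(1 for h in hits if h.score >= threshold and h.is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def lowest_threshold_for_fdr(hits, max_fdr):
    """Return the lowest observed score cutoff whose estimated FDR is <= max_fdr."""
    best = None
    for cutoff in sorted({h.score for h in hits}, reverse=True):
        if estimate_fdr(hits, cutoff) <= max_fdr:
            best = cutoff
    return best


# Made-up example data, for illustration only.
hits = [
    Hit("P1", 95.0, False),
    Hit("P2", 90.0, False),
    Hit("P3", 80.0, False),
    Hit("DECOY_1", 70.0, True),
    Hit("P4", 60.0, False),
    Hit("DECOY_2", 50.0, True),
]
```

With this toy list, a cutoff of 60 keeps four target hits and one decoy hit, for an estimated false discovery rate of 0.25, while a cutoff of 80 keeps three target hits and no decoys.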

For protein identification and characterization, ProteinScape uses complementary tools. Running several search engines in parallel cross-validates the identifications automatically and improves sensitivity, yielding more protein IDs. The resulting peptide identifications are analyzed by the ProteinExtractor tool, which can merge data from different search engines as well as from different experiments (ESI and MALDI) into an integrated result. The use of decoy strategies minimizes the need for manual validation.
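As a rough illustration of cross-validating identifications across search engines, the following Python sketch records which engines reported each peptide and keeps those reported by at least two. This is a deliberately simple consensus rule chosen for the example, not the algorithm used by ProteinExtractor; the engine names, peptide strings, and function names are hypothetical.

```python
from collections import defaultdict


def merge_identifications(results_by_engine):
    """Map each peptide to the set of engines that reported it."""
    support = defaultdict(set)
    for engine, peptides in results_by_engine.items():
        for peptide in peptides:
            support[peptide].add(engine)
    return dict(support)


def cross_validated(support, min_engines=2):
    """Return, in sorted order, peptides reported by at least min_engines engines."""
    return sorted(p for p, engines in support.items() if len(engines) >= min_engines)


# Made-up example input: two hypothetical engines with overlapping results.
results = {
    "engine_a": {"PEPTIDEK", "SAMPLER"},
    "engine_b": {"PEPTIDEK", "ANOTHERK"},
}
merged = merge_identifications(results)
consensus = cross_validated(merged)
```

Here `merged` holds all three peptides (the gain in sensitivity from the union), while `consensus` holds only the one peptide both engines agree on.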

"In the near future," Thiele continues, "it will be a must that all protein identifications will come with a statistical significance, so everyone can judge the validity of the information. We need reproducibility and standardized ways to create confidence in the generated results."

Much like the genomics arena, the variety of LC/MS techniques in proteomics is producing vast volumes of data, posing two major issues in bioinformatics. First, "Do we need all the raw data in the database?" To cope, the processing pipeline has to be able to condense the data, and dedicated software tools must validate the results. "These tools should be able to visualize selected raw data, and correlate the results," says Thiele. "Especially for applying quantitation algorithms, access to MS raw data is mandatory to make sure the information contained in the raw data is not disturbed by processing."

The other data handling issue concerns software for data visualization based on different workflows, for example, gel images and LC/MS data sets. "For the huge map of LC/MS data, a user-friendly navigation through large volumes of data is needed," says Thiele. "You need visualization tools for fast multi-resolution visualization of the data as an image ensuring seamless transition from a global overview of all spectra to selected isotopic peaks." Examples include MSight from Gene Bio, and SurveyViewer from Bruker.
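One common way to get a multi-resolution view of a large LC/MS map is to bin the data points into a grid and keep the strongest signal in each cell, using coarse cells for the overview and fine cells when zooming in. The Python sketch below shows that idea only; it is not how MSight or SurveyViewer work internally, and the function name, tuple layout, and bin sizes are assumptions made for the example.

```python
def downsample_max(points, rt_bin, mz_bin):
    """Bin (retention_time, mz, intensity) points into a grid.

    Each grid cell keeps the maximum intensity seen in it, so strong peaks
    remain visible in a coarse overview.
    """
    grid = {}
    for rt, mz, intensity in points:
        key = (int(rt // rt_bin), int(mz // mz_bin))
        if key not in grid or intensity > grid[key]:
            grid[key] = intensity
    return grid


# Made-up data points: (retention time, m/z, intensity).
points = [
    (10.2, 500.1, 100.0),
    (11.8, 501.9, 300.0),
    (25.0, 700.3, 50.0),
]

overview = downsample_max(points, rt_bin=10.0, mz_bin=100.0)  # coarse grid
detail = downsample_max(points, rt_bin=1.0, mz_bin=1.0)       # fine grid
```

In the coarse grid the first two points fall into the same cell and only the stronger one is kept; in the fine grid all three points occupy separate cells.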

Machine-Readable Experiments
Of course, producing large protein lists is not the end point in proteomics research. To enable result assessment and experiment comparison, the experimental conditions must be documented in a concise, reproducible, and machine-readable way. This is done by PRIDE, the PRoteomics IDEntifications database at the European Bioinformatics Institute.

Thiele says: "The ideal would be to handle, distribute, and archive proteomics data in a data repository and incorporate the publishers of science journals to set up specific guidelines. In the past, all manufacturers had their own file formats, with software running just on the vendor's machine. Nowadays, the vendors are participating with consortia to support initiatives in data standardization." That helps researchers generate data on one instrument and use dedicated software tools to turn data into knowledge.

The European Commission-funded ProDaC consortium (Proteomics Data Collection) will finalize data storage and standards, implement conversion tools, and establish standardized submission pipelines into the central data repository. For example, the Brain Proteome Project has already uploaded the ProteinScape data reservoir into PRIDE.

In Thiele's opinion, IT has an important role in both "computer clustering and computer grid technology. Automatic parallel processing of large MS data sets in a distributed computing environment, combining compute resources at different locations to do specific tasks, is the great challenge in near future."

Sidebar: Gold Standard
Another HUPO-related data validation initiative involves Invitrogen, which is launching the HUPO Gold Mass Spectrometry protein standard sampling program.

Designed to serve as the first commercial, all-recombinant human protein standard for mass spec, the HUPO Gold MS standard is a defined mixture of known human proteins that can serve as a benchmark to judge data quality and allow researchers to cross-reference their results. The standard will work regardless of the type of mass spectrometer used.

"With a variety of published mass spectrometry workflows, as well as the large number of instruments and data-analysis software packages available for use, researchers today face major challenges validating and comparing their published data," said John Bergeron, chair of HUPO scientific initiatives. The new standard, he says, together with HUPO training, "should lead to field-generated data of greater run-to-run accuracy and reproducibility."

Paul Predki, Invitrogen's VP of R&D, adds that current mass spec standards could contain contaminants or vary slightly in mass based on natural genetic variations. "We have designed a valuable resource that will aid scientists in making their substrate identification more definitive and will allow them to reference their efforts on a global research scale," says Predki.

The HUPO Gold MS Protein Standard samples are available to HUPO members, with the full release set for early 2008.
Subscribe to Bio-IT World  magazine.

