A More Efficient Approach to Biostatistical Data Analysis


By Kristen Amuzzini

Sept 15, 2005 | Traditional biostatistical analysis focused on identifying and understanding the functions of individual genes, proteins, and cells. Today, the emphasis has shifted to studying the organism as an integrated network of genes, proteins, and biochemical reactions.

The information needed to fully understand the functioning of an organism typically comes from a wide range of sources and instruments: DNA microarrays, for evaluating the expression of a large number of genes; mass spectrometers, for identifying larger, more complex proteins; nuclear magnetic resonance (NMR) spectrometers, for identifying smaller molecules such as metabolites; and Web-based molecular biology databases from organizations such as the National Center for Biotechnology Information (NCBI).

This multiplicity of information sources complicates the job of the biostatistician. Data obtained from the many sources used in biological research today typically come in incompatible formats. For example, most instrument manufacturers provide software that can be used only to interpret and analyze the data produced by their instruments. These software packages rarely support the latest data analysis algorithms. As a result, data from different sources must be analyzed separately in their native environment, a fragmented approach that makes it difficult to gain a systems-level understanding of an organism.

The increased diversity of data sources demands more flexible biostatistical analysis methods, as well as software tools that enable the interactions of the numerous genes, proteins, mechanisms, and the organism’s external environment to be integrated, analyzed, and visualized in a single environment.

One approach that meets these requirements uses statistical analysis and visualization tools in the MATLAB software environment, which make it easy to access data from a wide range of sources: sequence data in standard formats, such as FASTA and PDB; microarray data from Affymetrix, Agilent, and other platforms; and information from major Web-based databases, such as GenBank and NCBI BLAST. The new tools support genomic and proteomic data formats and provide analysis techniques and specialized visualizations for sequence and microarray analysis.

With this integrated approach, the biostatistician can take advantage of the strengths of many different instruments and data sources without investing time in manual data processing. For example, a single statistical analysis can be performed on a data set containing microarray, mass spectrometry, and NMR data.

Analyzing and Visualizing Microarray Data
The example discussed here uses microarray data from voxel A1 of the brain of a mouse in which a pharmacological model of Parkinson’s disease was induced using methamphetamine.*

The data are read into the MATLAB workspace, where a spatial plot of the microarray image is created, together with a field showing median pixel values in the various color channels. Spatial effects are readily apparent in the background intensities. The data are normalized to remove this spatial bias. Next, scatter plots of the microarray data are generated to compare expression levels. Points above the diagonal in this plot correspond to genes with expression levels that are higher in the A1 voxel than in the brain as a whole.
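
The normalization and above-the-diagonal steps can be illustrated with a minimal sketch in plain Python rather than MATLAB. It assumes block-wise median centering of log2 ratios as a simple stand-in for spatial normalization; the article does not state which normalization method was used. The spot layout, gene names, and function names below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def blockwise_normalize(spots):
    """Median-center log2(voxel/reference) ratios within each array block.

    spots: list of (gene, block, voxel_intensity, reference_intensity).
    Assumption: each block has its own additive bias on the log scale,
    so subtracting the block median removes it.
    """
    by_block = defaultdict(list)
    for gene, block, voxel, reference in spots:
        by_block[block].append((gene, math.log2(voxel / reference)))

    normalized = {}
    for entries in by_block.values():
        center = median(ratio for _, ratio in entries)
        for gene, ratio in entries:
            normalized[gene] = ratio - center
    return normalized

def above_diagonal(normalized):
    """Genes whose normalized log ratio is positive (higher in the voxel)."""
    return sorted(gene for gene, ratio in normalized.items() if ratio > 0.0)

# Hypothetical spots: block 1 is uniformly twice as bright as block 0.
spots = [
    ("geneA", 0, 800.0, 400.0),
    ("geneB", 0, 400.0, 400.0),
    ("geneC", 0, 200.0, 400.0),
    ("geneD", 1, 1600.0, 400.0),
    ("geneE", 1, 800.0, 400.0),
    ("geneF", 1, 400.0, 400.0),
]
normalized = blockwise_normalize(spots)
higher_in_voxel = above_diagonal(normalized)
```

After centering, geneE no longer appears overexpressed, because its raw ratio only reflected the brighter block.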

These same analysis and visualization tools are used to process raw mass spectrometry data, which are stored in text files with two columns: the mass/charge (M/Z) values and the corresponding intensity values. Our example uses spectrograms taken from one of the low-resolution ovarian cancer NCI/FDA data sets. The spectra are generated using the WCX2 protein-binding chip. Resampling the mass spectrometry data homogenizes the M/Z vector, making it possible to compare different spectra against the same reference and at the same resolution.
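
As a rough illustration of the resampling step, the sketch below parses a two-column spectrum and interpolates it onto a shared M/Z grid. It is written in plain Python and assumes simple linear interpolation, which may differ from the resampling method used in the article; the sample values and function names are hypothetical.

```python
from bisect import bisect_right

def parse_spectrum(text):
    """Parse two-column text (M/Z, intensity) into two lists of floats."""
    mz, intensity = [], []
    for line in text.strip().splitlines():
        x, y = line.replace(",", " ").split()
        mz.append(float(x))
        intensity.append(float(y))
    return mz, intensity

def resample(mz, intensity, grid):
    """Linearly interpolate a spectrum onto a common M/Z grid.

    Assumes mz is sorted in increasing order. Grid points outside the
    measured range take the nearest endpoint intensity.
    """
    resampled = []
    for x in grid:
        if x <= mz[0]:
            resampled.append(intensity[0])
        elif x >= mz[-1]:
            resampled.append(intensity[-1])
        else:
            j = bisect_right(mz, x)  # mz[j - 1] <= x < mz[j]
            x0, x1 = mz[j - 1], mz[j]
            y0, y1 = intensity[j - 1], intensity[j]
            resampled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return resampled

# Hypothetical raw spectrum with unevenly spaced M/Z values.
raw = """
1000.0 10.0
1002.0 30.0
1005.0 0.0
"""
mz, intensity = parse_spectrum(raw)
common_grid = [1000.0, 1001.0, 1002.0, 1003.5, 1005.0]
resampled = resample(mz, intensity, common_grid)
```

Once every spectrum is expressed on `common_grid`, intensities at the same index can be compared directly across samples.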

Mass spectrometry data usually show a varying baseline, caused by chemical noise in the matrix or by ion overloading. An integrated data analysis and visualization environment makes it easy to estimate a low-frequency baseline, which is hidden among the high-frequency noise and signal peaks and can be subtracted from the spectrogram.
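
A minimal sketch of baseline subtraction follows, in plain Python. It assumes the baseline can be approximated by a sliding-window minimum, a deliberately simple estimator chosen for illustration and not necessarily the one used in the article; the window size and sample intensities are hypothetical.

```python
def estimate_baseline(intensity, half_window):
    """Estimate a slowly varying baseline as the minimum in a sliding window.

    The window spans half_window points on each side and is truncated
    at the ends of the spectrum.
    """
    n = len(intensity)
    return [
        min(intensity[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]

def subtract_baseline(intensity, half_window):
    """Return the spectrum with the estimated baseline removed."""
    baseline = estimate_baseline(intensity, half_window)
    return [y - b for y, b in zip(intensity, baseline)]

# Hypothetical spectrum: two peaks sitting on a slowly falling baseline.
intensity = [5.0, 5.0, 9.0, 5.0, 4.0, 4.0, 10.0, 4.0, 3.0, 3.0]
corrected = subtract_baseline(intensity, half_window=2)
```

The window must be wider than the peaks, or the estimator will follow the peaks themselves and flatten them.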

Similarly, mass spectrometers that have been calibrated differently exhibit variations in the relation between the observed M/Z vector and the true time of flight of the ions. As a result, systematic shifts can be observed in repeated experiments. Misaligned spectrograms can be corrected by providing a set of M/Z values where reference peaks are expected to appear. A heat map is used to observe the alignment of the spectra before and after applying the alignment algorithm. In these experiments, systematic differences are also observed in the total amount of desorbed and ionized proteins. To compensate for this, the relative intensities of the spectrograms are normalized. After preprocessing, the data are ready for biomarker detection, which can be performed with the same analysis tools.
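
The alignment and normalization steps can be sketched in plain Python under two simplifying assumptions that are not taken from the article: the miscalibration is modeled as a single constant M/Z offset, and intensities are normalized by rescaling each spectrum to the median total intensity. All names and values are hypothetical.

```python
from statistics import median

def find_peak_near(mz, intensity, expected, tolerance):
    """M/Z of the most intense point within +/- tolerance of expected."""
    candidates = [(y, x) for x, y in zip(mz, intensity)
                  if abs(x - expected) <= tolerance]
    return max(candidates)[1] if candidates else None

def align_constant_shift(mz, intensity, reference_peaks, tolerance):
    """Shift the M/Z axis by the median offset from the reference peaks.

    Assumption: a constant offset is enough; real calibration errors
    may also need rescaling of the axis.
    """
    offsets = []
    for expected in reference_peaks:
        observed = find_peak_near(mz, intensity, expected, tolerance)
        if observed is not None:
            offsets.append(observed - expected)
    shift = median(offsets) if offsets else 0.0
    return [x - shift for x in mz]

def normalize_total_intensity(spectra):
    """Rescale each spectrum so its total equals the median total."""
    totals = [sum(s) for s in spectra]
    target = median(totals)
    return [[y * target / total for y in s]
            for s, total in zip(spectra, totals)]

# Hypothetical spectrum whose peaks appear 1.0 M/Z unit too high.
mz = [float(x) for x in range(100, 111)]
intensity = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0]
aligned_mz = align_constant_shift(mz, intensity, [102.0, 107.0], tolerance=1.5)

# Hypothetical spectra with the same shape but different total signal.
spectra = [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [4.0, 8.0, 4.0]]
normalized = normalize_total_intensity(spectra)
```

Using the median offset and median total keeps a single outlying peak or spectrum from dominating the correction.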

Using this efficient, integrated approach, data from many different sources can easily be acquired, preprocessed to ensure their integrity, and then combined into single tables that biostatisticians can use to search for patterns regardless of the source of the data. The completed statistical analysis application can be deployed to researchers as an Excel spreadsheet or as a standalone executable with a graphical user interface.
__________________________

*Brown, V.M. et al. "Multiplex three-dimensional brain gene expression mapping in a mouse model of Parkinson's disease." Genome Res 12, 868-84; 2002.

Kristen Amuzzini is the biotech and pharmaceutical industry marketing manager for The MathWorks, where she specializes in computational biology. Kristen spearheads an effort to foster industry and academic adoption of MATLAB-based tools for biological data analysis. She would like to thank Sam Roberts and Rob Henson of The MathWorks for their contributions to this article. E-mail: kristen.amuzzini@mathworks.com.
