A More Efficient Approach to Biostatistical Data Analysis



By Kristen Amuzzini

Sept 15, 2005 | Traditional biostatistical analysis focused on identifying and understanding the functions of individual genes, proteins, and cells. Today, the emphasis has shifted to studying the organism as an integrated network of genes, proteins, and biochemical reactions.

The information needed to fully understand the functioning of an organism typically comes from a wide range of sources and instruments: DNA microarrays, for evaluating the expression of a large number of genes; mass spectrometers, for identifying larger, more complex proteins; nuclear magnetic resonance (NMR) spectrometers, for identifying smaller molecules such as metabolites; and Web-based molecular biology databases from organizations such as the National Center for Biotechnology Information (NCBI).

This multiplicity of information sources complicates the job of the biostatistician. Data obtained from the many sources used in biological research today typically come in incompatible formats. For example, most instrument manufacturers provide software that can be used only to interpret and analyze the data produced by their instruments. These software packages rarely support the latest data analysis algorithms. As a result, data from different sources must be analyzed separately in their native environment, a fragmented approach that makes it difficult to gain a systems-level understanding of an organism.

The increased diversity of data sources demands more flexible biostatistical analysis methods. It also demands software tools that allow the interactions among an organism's numerous genes, proteins, and mechanisms, and between the organism and its external environment, to be integrated, analyzed, and visualized in a single environment.

One such approach uses statistical analysis and visualization tools in the MATLAB software environment, which make it easy to access data from a wide range of sources: for example, sequence data in standard formats, such as FASTA and PDB; microarray data from Affymetrix, Agilent, and other platforms; and information from major Web-based databases, such as GenBank and NCBI BLAST. The new tools provide genomic and proteomic data formats, analysis techniques, and specialized visualizations for genomic and proteomic sequence and microarray analysis.

Using this integrated approach, the biostatistician can take advantage of the strengths of many different instruments and data sources without investing time in manual data processing. For example, a single statistical analysis can be performed on a data set containing microarray, mass spectrometry, and NMR data.

Analyzing and Visualizing Microarray Data
The example discussed here uses microarray data from voxel A1 of the brain of a mouse in which a pharmacological model of Parkinson’s disease was induced using methamphetamine.*

The data are read into the MATLAB workspace, where a spatial plot of the microarray image is created, together with a field showing median pixel values in the various color channels. Spatial effects are readily apparent in the background intensities, so the data are normalized to remove this spatial bias. Next, scatter plots of the microarray data are generated to compare expression levels. Points above the diagonal in this plot correspond to genes with expression levels that are higher in the A1 voxel than in the brain as a whole.
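The scatter-plot comparison can be illustrated with a minimal sketch. It is written in Python rather than MATLAB, the gene names and intensities are invented for illustration, and it assumes strictly positive, already normalized intensities with the voxel on the vertical axis. It is not the toolbox code used for the article.

```python
import math

def log2_ratios(voxel, reference):
    """Per-gene log2(voxel / reference); positive means higher in the voxel."""
    return [math.log2(v / r) for v, r in zip(voxel, reference)]

def genes_above_diagonal(genes, voxel, reference):
    """Names of genes whose voxel intensity exceeds the reference intensity."""
    return [g for g, v, r in zip(genes, voxel, reference) if v > r]

# Hypothetical, made-up intensities for three genes.
genes = ["geneA", "geneB", "geneC"]
voxel = [800.0, 150.0, 300.0]
whole_brain = [200.0, 600.0, 300.0]

print(genes_above_diagonal(genes, voxel, whole_brain))  # ['geneA']
print(log2_ratios(voxel, whole_brain))                  # [2.0, -2.0, 0.0]
```

A gene that sits exactly on the diagonal has a log ratio of zero, which is why log ratios are a convenient way to rank genes by how far they fall from it.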

These same analysis and visualization tools are used to process raw mass spectrometry data, which are stored in text files with two columns: the mass/charge (M/Z) values and the corresponding intensity values. Our example uses spectrograms taken from one of the low-resolution ovarian cancer NCI/FDA data sets. The spectra are generated using the WCX2 protein-binding chip. Resampling the mass spectrometry data homogenizes the M/Z vector, making it possible to compare different spectra under the same reference and at the same resolution.
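As a rough illustration of resampling, the sketch below linearly interpolates a spectrum onto a shared, evenly spaced M/Z grid. It is a Python stand-in, not the MATLAB routine used for the article; the function names are hypothetical, and linear interpolation is an assumption (the article does not say which resampling method is applied).

```python
from bisect import bisect_right

def common_grid(lo, hi, n):
    """Return n evenly spaced M/Z values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def resample_spectrum(mz, intensity, new_mz):
    """Linearly interpolate a spectrum onto new_mz.

    mz must be sorted in ascending order. Grid points outside the
    measured range take the nearest end value.
    """
    out = []
    for x in new_mz:
        if x <= mz[0]:
            out.append(intensity[0])
        elif x >= mz[-1]:
            out.append(intensity[-1])
        else:
            j = bisect_right(mz, x)          # mz[j-1] <= x < mz[j]
            x0, x1 = mz[j - 1], mz[j]
            y0, y1 = intensity[j - 1], intensity[j]
            t = (x - x0) / (x1 - x0)
            out.append(y0 + t * (y1 - y0))
    return out

# Two spectra measured at different M/Z values become comparable
# once both are resampled onto the same grid.
grid = common_grid(1.0, 4.0, 4)              # [1.0, 2.0, 3.0, 4.0]
a = resample_spectrum([1.0, 2.0, 4.0], [10.0, 20.0, 40.0], grid)
b = resample_spectrum([1.0, 3.0, 4.0], [5.0, 15.0, 5.0], grid)
```

Once every spectrum shares one M/Z vector, intensities can be compared position by position or stacked as rows of a matrix.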

Mass spectrometry data usually show a varying baseline, caused by chemical noise in the matrix or by ion overloading. An integrated data analysis and visualization environment makes it easy to estimate a low-frequency baseline, which is hidden among the high-frequency noise and signal peaks and can be subtracted from the spectrogram.
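One simple way to estimate such a baseline is to take the minimum intensity in a sliding window and then smooth the result. The sketch below does this in Python. The windowed-minimum method, the window size, and the function names are assumptions chosen for illustration; the article does not specify the estimator used.

```python
def estimate_baseline(intensity, half_window):
    """Estimate a slowly varying baseline.

    Step 1: take the minimum within a window around each point, which
            ignores narrow peaks.
    Step 2: smooth those minima with a moving average of the same width.
    """
    n = len(intensity)
    mins = [
        min(intensity[max(0, i - half_window): min(n, i + half_window + 1)])
        for i in range(n)
    ]
    baseline = []
    for i in range(n):
        seg = mins[max(0, i - half_window): min(n, i + half_window + 1)]
        baseline.append(sum(seg) / len(seg))
    return baseline

def subtract_baseline(intensity, half_window):
    """Return the spectrum with the estimated baseline removed."""
    baseline = estimate_baseline(intensity, half_window)
    return [y - b for y, b in zip(intensity, baseline)]

# A single peak sitting on a constant offset of 5.
corrected = subtract_baseline([5, 5, 5, 9, 5, 5, 5], half_window=1)
print(corrected)  # [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]
```

The window should be wider than the widest peak of interest; otherwise part of the peak is treated as baseline and subtracted.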

Similarly, mass spectrometers that have been calibrated differently exhibit variations in the relation between the observed M/Z vector and the true time of flight of the ions. As a result, systematic shifts can be observed in repeated experiments. Misaligned spectrograms can be corrected by providing a set of M/Z values where reference peaks are expected to appear. A heat map is used to observe the alignment of the spectra before and after applying the alignment algorithm. Repeated experiments also show systematic differences in the total amount of desorbed and ionized proteins. To compensate for this, the relative intensities of the spectrograms are normalized. After preprocessing, the data are ready for biomarker detection, which can be performed with the same analysis tools.
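The two corrections described here can be sketched as follows. This Python example makes strong simplifying assumptions: misalignment is modeled as a single constant M/Z offset, estimated from the tallest point near each expected reference peak, and normalization rescales each spectrum to a fixed total intensity. The function names and the tolerance parameter are hypothetical, and the alignment algorithm used for the article may well be more sophisticated.

```python
def estimate_shift(mz, intensity, reference_peaks, tolerance):
    """Mean offset (expected minus observed) over all reference peaks.

    For each expected peak, the observed peak is taken to be the
    highest-intensity point within +/- tolerance of the expected M/Z.
    Reference peaks with no points in range are skipped.
    """
    offsets = []
    for ref in reference_peaks:
        window = [(y, x) for x, y in zip(mz, intensity)
                  if abs(x - ref) <= tolerance]
        if window:
            _, observed = max(window)
            offsets.append(ref - observed)
    return sum(offsets) / len(offsets) if offsets else 0.0

def align(mz, shift):
    """Apply a constant shift to the M/Z axis."""
    return [x + shift for x in mz]

def normalize_total_intensity(intensity, target=1.0):
    """Rescale so the intensities sum to target (unchanged if all zero)."""
    total = sum(intensity)
    if total == 0:
        return list(intensity)
    return [y * target / total for y in intensity]

# The peak is observed at 102 but expected at 101.
mz = [100.0, 101.0, 102.0, 103.0, 104.0]
intensity = [1, 2, 10, 2, 1]
shift = estimate_shift(mz, intensity, reference_peaks=[101.0], tolerance=1.5)
aligned_mz = align(mz, shift)   # peak now sits at 101.0
scaled = normalize_total_intensity(intensity)
```

After both steps, differences between spectra are more likely to reflect the samples than calibration or the total amount of ionized protein.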

Using this efficient, integrated approach, data from many different sources can easily be acquired, cleaned to ensure their integrity, and then combined into single tables that biostatisticians can use to search for patterns regardless of the source of the data. The completed statistical analysis application can be deployed to researchers as an Excel spreadsheet or as a standalone executable with a graphical user interface.
__________________________

*Brown, V.M. et al. "Multiplex three-dimensional brain gene expression mapping in a mouse model of Parkinson's disease." Genome Res 12, 868-84; 2002.

Kristen Amuzzini is the biotech and pharmaceutical industry marketing manager for The MathWorks, where she specializes in computational biology. Kristen spearheads an effort to foster industry and academic adoption of MATLAB-based tools for biological data analysis. She would like to thank Sam Roberts and Rob Henson of The MathWorks for their contributions to this article. E-mail: kristen.amuzzini@mathworks.com.
