CONVERSATION · David Mount discusses the new edition of Bioinformatics, sequence analysis, and teaching.
By Kevin Davies
November 19, 2004 | David Mount is director of bioinformatics at the Arizona Cancer Center and the Southwest Environmental Health Sciences Center at the University of Arizona. His research interests include the development of computational tools for pattern analysis, gene expression, and proteomics data; the plant genome project; and the development of Web-based databases for cancer research (e.g., www.biorag.org). Mount is also the author of the widely used textbook Bioinformatics: Sequence and Genome Analysis. Having sold almost 20,000 copies of the first edition, Cold Spring Harbor Laboratory Press recently released a second edition. Mount spoke with Kevin Davies about his priorities and challenges in teaching bioinformatics.
Q: The second edition of your book is just out. So the first edition did OK?
A: The first edition sold very much beyond our expectations. We hadn't appreciated the need for a textbook for all of the new bioinformatics courses that were starting up, and there weren't many other textbooks available. The book was advanced enough to serve as a reference text.
Who is your primary audience? Biologists who want to learn computational skills, IT students who want to explore biology, or both?
We try to appeal to both audiences by giving enough details of the biological side and the computational side of the subject — even more so in the second edition. There's an introduction to each chapter for biologists that gives an overview of the computational side and a similar introduction for computational students.
What sets your book apart from other bioinformatics texts?
First, the book is based on an extensive literature review with substantial reference lists in each chapter. Second, we try to explain both the computational and the biological backgrounds of bioinformatics. A lot of detail on population genetics and sequence variation in populations is given in the genome chapter, for example. Third, we try to include the whole subject area of bioinformatics in one text but in enough detail to be useful. The new edition has chapters on Perl programming and statistical analysis of microarray data.
Your book is subtitled Sequence and Genome Analysis, but it's about much more than that, isn't it?
One could initially propose that analysis of RNA and protein structure does not fall into the general area of sequence and genome analysis, although these topics are included because they have been important areas historically in bioinformatics. However, new genomes are scanned using models of RNAs and proteins in order to catalog the genes that are present. Similarly, phylogenetic analysis is not primarily directed toward sequence and genome analysis but toward understanding evolutionary relationships among organisms. However, the same computational methods used in phylogenetic analysis are also used to discover gene families and sequences that have the same function in different organisms. Microarray analysis is also concerned with function and interactions among genes. These areas are covered in enough detail that students can appreciate their application to sequence and genome analysis.
You've expanded sections on proteomics and microarray analysis. What areas have you focused on?
In the proteomics chapter, we've added descriptions of the latest methods for structure prediction based on conserved amino-acid patterns, energy calculations, and solvation energies for amino-acid substitutions within the protein core. In the new microarray chapter, we offer guidance on analyzing microarray data, which is needed because of the many sources of experimental variation in the data. We emphasize the importance of experimental design, replication of experiments to discover the most reliable expression data, and the use of appropriate statistical models. We also describe how to interpret the data biologically and how to use genetic data, such as sequence variations between individuals, to assess regulatory relationships.
What are the current and future operational challenges in terms of comparative sequence analysis?
There are two challenges. The first is designing databases for storing the data and learning how to write computer scripts that will automate this process. We've added detailed information on how to use Perl programming for this purpose. The second challenge is including all of the new biological data that are becoming available in a flexible metadata format that can be mined to discover new relationships that could, for example, lead to a new diagnostic tool or drug. This area is a new one in bioinformatics and will be a subject of future textbooks and papers.
Are you satisfied with the level of bioinformatics teaching?
My goal in bioinformatics teaching is that if you are going to use a computer program or tool, you should understand what it is doing and the assumptions that underlie the analysis. Otherwise, one may get led badly astray or miss out on an important result. This problem is most profoundly seen in the analysis of microarray data, in which methods are often used to find a list of genes without any understanding of the underlying analysis. The problem is just as acute in the rest of bioinformatics. In some cases, a bioinformatics course may gloss over this hard stuff in order to spend more time on using the available tools. Admittedly, most students don't want the details, but we should still try to teach the underlying assumptions as much as possible and test that the students understand them.
A second point is that bioinformatics is expanding in size and complexity. It is very difficult to give just one course anymore. Basic training should include mathematical modeling, statistics and probability, computer programming, and data management. There should also be more advanced training in areas of specialty in bioinformatics.