By Beth Schachter
June 12, 2002 | At conferences on anything from drug development to biodiversity, complaints abound about the plethora of data and the paucity of people with the skill sets needed to use and interpret it. Hybrid scientists, facile in biology as well as math and computer science, are rare.
Indeed, the current shortfall of bioinformaticians could be as much as fiftyfold, says Eric Lander, founder and director of the Whitehead Institute Center for Genome Research. "We're just kindergartners reading this [genomic] information," Lander told the keynote audience at the BioITWorld Conference & Expo in March.
Observers from the IT community agree. "Today, biology and biochemistry are really all about pattern matching, pattern discovery, sequence homology, and algorithms for molecular dynamics," says Jeffrey Augen, director of strategy for IBM Life Sciences. Building a workforce capable of using today's IT tools to solve life science problems is a huge challenge.
The good news: The number and scope of resources for academic training in areas such as bioinformatics and computational biology are growing rapidly. A strong commitment from public and private sources may help train a new breed of life scientist for the post-genomic era. What's more, for those scientists willing to acquire the necessary skills, the starting salaries are impressive.
Demand for combined biology and IT skills is a recent phenomenon, and employers place the higher priority on a biological background, according to Boston-area recruiting firm Remington International, which also notes that starting salaries of $60,000 to $90,000 are typical for qualified applicants.
Not surprisingly, location influences earning potential. In the Boston area, newly minted life science Ph.D.s with needed IT skills can land annual salaries of $75,000, says recruiter John Good of John Good Associates in New Jersey. In San Francisco, a life science postdoc with just a year of experience can earn $90,000 or more. Outside of Massachusetts and California, salaries decline somewhat, says Good, whose firm is one of the few specializing in finding candidates with bio-IT skills including bioinformatics, pharmacogenomics, and preclinical computation.
There is a caveat. Although Lander's prediction of the tremendous need for bio-IT staff will no doubt prove true in the long term, few solid statistics exist about the size of the current bioinformatics job market. The general economic slowdown and the specific troubles of the bioinformatics industry (witness the recent folding of DoubleTwist Inc. and Entigen Corp.) are making prospective employers pickier than ever.
"We're building the software to enable scientists to do their science," says John Murphy, CIO of genomics research company CuraGen Corp. "It's important to understand the domain, the genetics, and biological processes." When hiring bioinformaticians, Murphy favors specialists in computational chemistry and computational biology or computer science majors who have gone on to obtain advanced life science degrees.
Roots in Medical Informatics
Many current bioinformatics and computational biology training programs trace their history to medical informatics, a discipline supported for more than two decades by training grants to medical schools from the National Library of Medicine (NLM) at the National Institutes of Health (NIH). These programs initially trained medical professionals and computer scientists to develop electronic patient databases, clinical decision support systems, and intelligent agents. Indeed, intelligent agents - programs that perform tasks such as retrieving and delivering information and automating repetitive tasks - sound a lot like today's genome database searching programs. Bioinformatics began to emerge as a distinct discipline roughly 10 years ago.
Current bioinformatics programs offer mainly postdoctoral training for individuals in fields such as computer science, psychology, or biology. The NLM also awards individual postdoctoral fellowships in bioinformatics and medical informatics, geared for scientists seeking advanced training at places that lack institutional training grants. Twelve institutions currently offer these programs; the number will grow to 18 in July.
Stanford University, which had an early NLM-funded postdoctoral program in medical informatics, now offers training that integrates biology and medicine with computer science and math. In 2000, the program officially became the Biomedical Informatics Training Program, and it now grants master's and doctoral degrees. Part of the program's funding comes from a training grant from the NIH's National Institute of General Medical Sciences (NIGMS).
The University of California at San Diego and Washington University in St. Louis also offer NIGMS-funded pre-doctoral training programs in bioinformatics and computational biology, according to James Cassatt, director of NIGMS' division of cell biology and biophysics.
Russ Altman, a graduate of Stanford's original medical informatics program, now directs the Stanford graduate training program. In this hybrid program, everyone studies math and IT. "Our biology students learn a bit about downstream medical applications, and the clinical students are made aware of the molecular medicine revolution," says Altman. Stanford recently approved an independently designed major program for undergraduates. "We thought that informatics might be too specialized for the undergraduate level," says Altman. The new program is in biocomputation.
Making Midcareer Moves
Cassatt, who recently became the acting director for the new Center for Bioinformatics and Computational Biology, an NIH-wide group, says that the NIH "has extramural programs for just about everything." The NIH is striving to encourage people who have quantitative skills - in computer science, IT, math, or physics - to move into biology.
"For people with established careers who want to make this switch - for example, a physicist working in industry - we have a Mentored Career Development Award," says Cassatt. Senior fellowship programs are offered to academics who want a sabbatical to train in a different discipline.
The University of Pennsylvania in Philadelphia also has a long-standing training record in bioinformatics. In the early 1990s, faculty members realized that genome sequence data was going to be "the next big thing," according to evolutionary biologist Warren Ewens. In 1994, the school got its first training grant from the National Science Foundation to support doctoral students and postdoctoral fellows from both biology and computer science, says Ewens, who directs the training program. Penn also has programs at the master's and bachelor's levels.
One challenge for students and scientists seeking bio-IT programs has simply been finding them. Now, in addition to a few high-profile initiatives such as those cited here, a new crop of interdisciplinary courses is sprouting at colleges in the United States and other countries. All of these programs must bridge cultural gaps between biology and computer sciences.
"Concern for theory, an interest in applying quantitation and making models, which is now becoming extremely important in biology, was not popular among [the previous generation of] molecular biologists," says David Galas, chief academic officer of the new Keck Graduate Institute of Applied Life Sciences in Claremont, Calif. "They had a culture that said that if you need statistics to understand your experiment, you need a better experiment." That's hardly practical today when most biological research produces a flood of data. "[Now] quantitation is the only thing that's important," says Galas. "You can't build models without being able to predict things quantitatively."
Even when biologists recognize the importance of using statistical rigor, they often don't appreciate the difficulty of applying complex IT analytic tools to large data sets, says Gio Wiederhold, a founder of Stanford's medical informatics program. Computer scientists sometimes successfully integrate two or three databases. They show the results to their biologist friends, who then want to integrate many databases, assuming that process is a linear exercise, he says.
"Computer scientists have not shouted fast enough, 'We can't do that!'" says Wiederhold. "Going from three to 100 may be a million times more complicated because their interaction may be N² where N equals the number of databases." Wiederhold suspects that computer scientists haven't argued assertively to biologists about this limitation because the problem is obvious to computer scientists.
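The growth Wiederhold describes can be illustrated with a little counting. The sketch below assumes, purely for illustration, that every pair of databases needs its own mapping, so the work grows with the square of N; the function name and the database names are hypothetical. It counts only pairwise mappings, so it shows the quadratic trend rather than the full difficulty he has in mind.

```python
from itertools import combinations


def pairwise_mappings(n_databases: int) -> int:
    """Count distinct database pairs, assuming each pair needs its own mapping."""
    return n_databases * (n_databases - 1) // 2


# Hypothetical database names, used only to list the pairs explicitly.
small_set = ["db_a", "db_b", "db_c"]
explicit_pairs = list(combinations(small_set, 2))
print(explicit_pairs)

# The closed-form count, N(N-1)/2, grows on the order of N squared.
for n in (3, 10, 100):
    print(n, "databases ->", pairwise_mappings(n), "pairwise mappings")
```

Under this assumption, three databases need 3 mappings and 100 databases need 4,950, which is why adding sources is not a linear exercise.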
When Penn was planning its introductory bio-IT courses, the faculty contemplated having separate courses for biologists and computer scientists/mathematicians because of the need to cover basic information in each of the disciplines. Penn instead opted to bring together students with different backgrounds from the start in one course. Students helped to teach each other through projects as part of hybrid teams. The experiment seemed to work well; Ewens says the course gets more applicants than the school can handle.
Developing faculty expertise to teach informatics is yet another thorny issue. Although many schools are eager to offer bioinformatics courses, few are willing or able to hire new faculty devoted to such a narrow specialty. One solution is to develop the needed interdisciplinary expertise within existing faculty.
Phillip Meneely, professor of biology at Haverford College, just a few miles from Penn, heads a program funded by the Howard Hughes Medical Institute (HHMI) to foster science curriculum development at liberal arts colleges. Last year Haverford decided to use some HHMI funds to create a bioinformatics program.
"Bioinformatics is inherently interdisciplinary," says Meneely. "This makes it a natural field for a liberal arts college, where relationships between faculty members in different departments are already strong, and students usually take advanced courses in more than one department." Rather than hire a new faculty member trained in bioinformatics, Meneely invited Ewens and his Penn colleagues to teach Haverford's current science faculty, which they did, with a 10-week lecture and computer lab course.
"We were very pleased to find that many of us already knew quite a lot," says Meneely. "For example, one mathematician who works on probability and alignment lacked the background in molecular biology to apply this to amino acid alignments. Several of us in biology had varying degrees of familiarity with BLAST but did not realize that he would understand some of the mathematics behind it." The course, says Meneely, has led to "fruitful and interesting conversations" among faculty members.
HHMI has helped to foster several computational biology programs at various educational levels. HHMI facilitated the creation of one of the first undergraduate bioinformatics majors at Rensselaer Polytechnic Institute in Troy, N.Y. An HHMI-funded program at Harvey Mudd College in Claremont, Calif., applies quantitative studies across the curriculum, including biophysical modeling, ecology, demography, and genomics. Vassar College in Poughkeepsie, N.Y., has used HHMI money toward a program in cellular visualization technologies, and Harvard University has devoted a portion of its grant to give bioinformatics training to high school teachers in the Boston area.
The growth in number and range of bio-IT programs bodes well for the industry. Just a few years ago, such courses were relatively rare. With many informatics positions unfilled and prospects for the shortfall to grow, both academia and the NIH are acting aggressively to fill the void.
Beth Schachter is a writer in New York City whose Web site is www.nasw.org/users/bschachter.
Sidebar: Informatics Introduction: Pondering the Paper Clip
Consider the ubiquitous paper clip - a slender strand of twisted tin, used unthinkingly, then tossed in a bin. What is its place in the world of paper fasteners and why?
Don't laugh. That's a question students in Jessica Kissinger's University of Pennsylvania bioinformatics class must try to answer. Effective pattern matching is the key to solving many bioscience problems, and understanding the rules for identifying patterns is essential to developing the algorithms needed for effective bioinformatics.
Kissinger gives her students five minutes to diagram the relationships among the paper fastening devices. "The 'straight' piece of metal is meant to be the ancestor. Students wrangle with how to group and relate the fasteners. What is more important: color, shape, size, composition? They quickly learn to see conflict within the data - not all fasteners of the same shape are the same color, not all fasteners of the same color are the same shape or size," says Kissinger.
"I use the exercise to show students that they intuitively relate the fasteners using parsimony criteria; that is, they make groups based on 'shared derived characters,'" she says. In essence, students are learning to build phylogenetic trees, or branching diagrams depicting the evolutionary relationships of organisms.
The exercise readies them to understand key concepts:
Maximum parsimony: A method for choosing the phylogenetic tree needing the fewest substitutions (steps, or changes) from among all possible phylogenetic trees.
Consistency index: A number describing "the fit" of a single character (or of the whole set of characters) on a tree. It is calculated by dividing the minimum possible tree length by the observed tree length. If the data map perfectly to the tree, i.e., there is no conflict caused by characters evolving more than once independently, then the consistency index equals 1.0.
Synapomorphic: A shared derived character.
Autapomorphic: A unique derived character, not shared with other taxa.
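The first two definitions can be tried on a toy data set. The sketch below is only an illustration: the four fasteners, their character codings, and all function names are hypothetical, and the step counting uses Fitch's method, which the article does not mention. For four taxa there are three possible groupings, written here as balanced rooted trees (the Fitch step count does not depend on where the root is placed).

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, changes needed) for one character."""
    if isinstance(tree, str):  # a leaf is a taxon name
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    shared = left_set & right_set
    if shared:  # children agree on a state: no extra change
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, characters):
    """Total number of changes over all characters (the observed tree length)."""
    return sum(fitch_steps(tree, c)[1] for c in characters)


def consistency_index(tree, characters):
    """Minimum possible tree length divided by the observed tree length."""
    minimum = sum(len(set(c.values())) - 1 for c in characters)
    observed = tree_length(tree, characters)
    return 1.0 if observed == 0 else minimum / observed


# Hypothetical fasteners scored for three characters (0 or 1 each).
shape = {"clip_a": 0, "clip_b": 0, "clip_c": 1, "clip_d": 1}
size = {"clip_a": 0, "clip_b": 0, "clip_c": 1, "clip_d": 1}
color = {"clip_a": 0, "clip_b": 1, "clip_c": 0, "clip_d": 1}  # conflicts with shape
characters = [shape, size, color]

candidate_trees = [
    (("clip_a", "clip_b"), ("clip_c", "clip_d")),
    (("clip_a", "clip_c"), ("clip_b", "clip_d")),
    (("clip_a", "clip_d"), ("clip_b", "clip_c")),
]

# Maximum parsimony: keep the candidate tree that needs the fewest changes.
best_tree = min(candidate_trees, key=lambda t: tree_length(t, characters))
print(best_tree, tree_length(best_tree, characters))
print(consistency_index(best_tree, characters))
```

With these made-up codings, grouping clip_a with clip_b needs 4 changes, against 5 and 6 for the alternatives. Its consistency index is 3/4 = 0.75 rather than 1.0 because color conflicts with shape and size, the kind of conflict Kissinger's students run into.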
In the described exercise, Kissinger says all of the objects are "paper fasteners since all of the items fasten paper, i.e., they have the same function, but they are not directly related. Two of the fasteners are not paper clips; they are 'bull-dog' clips. They provide examples of convergent evolution toward a common function and actually must be excluded from the analysis." - B.S.
Sidebar: Making the Grade
No time for a Ph.D. in bioinformatics? A number of universities offer bioinformatics programs designed for working professionals. Northeastern University's Graduate Certificate in Bioinformatics Essentials, for example, can be completed in a year or less, depending on experience. Classes meet once a week for three consecutive terms, and may be taken in any order. Here's the typical breakdown of the NU syllabus:
Genomics and Pharmacogenomics
Topics: Mendelian and bacterial genetics; mapping methodology, sequence technologies and comparative genomics; analysis of relationship between gene activity and cell function in health and disease states; modeling of metabolic regulatory pathways; expressed sequence tags; pharmacogenomics.
Sequence Analysis with Structure Predictions and Signal Processing
Topics: Current and future genomic data; local and global alignments; phylogenetics and evolution; multiple sequence alignments; base calling and fragment assembly; dynamic programming concepts; Hidden Markov Models; profiles; motifs; differences between nucleic acid and amino acid analysis; ORF coding region predictions. Students perform hands-on database searching.
Topics: Quantitative PCR; strengths and weaknesses of the expression measurement method; data visualization; clustering and analysis; applications to toxicology; drug target and diagnostic marker discovery; applications of microarrays other than expression measurement (SNPs, resequencing, and others); expression information database use.
Topics: Systematic analysis of protein profiles of healthy and diseased tissue; 2-D gel electrophoresis; mass spectrometry; useful analysis methods; database issues; mass screening of proteins; 2-hybrid systems and protein-protein interactions; proteomics/genomics relationship; new technologies.
Combinatorial Chemistry and the Drug Discovery Process
Topics: Introduction to combinatorial libraries; parallel synthesis: solid-phase and solution phase techniques; solid supports for organic synthesis; solid-phase synthesis and polymer supported quench reagents; compound libraries; liquid chromatography-mass spectrometric techniques and data handling; the mix and split solid-phase method and decoding using chemical or physical tagging; diversity analysis; Tanimoto coefficients; peptide libraries; multiple component condensations; medicinal chemistry; databases for building block and product archiving; creation of a virtual library and afferent library design software. - Amanda Fox