But two researchers — computer scientist Golan Yona of Cornell University, and computational biologist William Taylor of the National Institute for Medical Research (NIMR) in London — are both on the verge of seeing order and individuality in these massive clumps, discerning familial relationships others cannot. While their colleagues often get lost in debating the infinite subtlety and origami-like folds as one clump bends over another, Yona and Taylor are looking for proteins' ancestral ties, for categories broad and narrow. Their inspiration dates to 1869 and the creation of the periodic table, which provided clues about groups of elements from noble gases to radioactive minerals.
Just as each element can be placed in one rightful spot in the periodic table, Yona and Taylor each hope to tuck any protein into a systematic, logically organized schema of his own devising. The two are unfamiliar with each other's work. But like Russian chemist Dmitri Mendeleev, who first conceived the periodic table, they hope that pigeonholing proteins will help manage the outpouring of protein data.
This is not the first time that biologists have sought inspiration from the periodic table. Back in 1996, Eric Lander, director of the Whitehead Institute Center for Genome Research, penned a commentary in Science in which he likened the Human Genome Project to biology's periodic table, containing "not 100 elements, but 100,000 [sic] genes; not a rectangle reflecting electron valences, but a tree structure depicting ancestral and functional affinities across human genes. The biological periodic table will make it possible to define unique 'signatures' for each building block."
There is no easy way, however, to gain insight into the biological functions or close chemical relatives of any of the newly sequenced or structured proteins. "Just because the proteins are in the database doesn't mean we know what they do," says Yona. "The object I would like to map is protein, not protein sequence or protein structure."
Most existing databases, Yona says, are confined to sequences and structures. But physicians and drug companies know a protein sequence is not enough. It would be more alluring, and more lucrative, to understand whether a new protein can be sopped up by a medicine or snipped in half by a molecule already present in the body. Improved classification is a major step toward that goal.
But as Yona points out, any attempt to sort and categorize proteins can be frustrating. Much of the time, an online search will produce no clue as to which family a protein belongs to, no hint of which folds it shares with other proteins. Yona predicts that this will change. "There is a structure in the protein space," he notes. "Proteins are being grouped in superfamilies."
To make the family tree of proteins a reality, Yona is using computers. He has 68 dual-processor Intel machines with 20 GB of memory and a terabyte of disk storage. He's also refining a program called ProtoMap, which, when given a protein, will supply its "location." "This is automatic classification without human intervention," says Yona. "One can submit his or her protein and the analysis will suggest a putative classification." This location will be a position on an abstract landscape that relates one protein to other proteins already mapped. "We are trying to construct, from all this local information, a global picture," says Yona. "Your protein will be mapped into a specific 'area.'"
Yona is drafting his map with pre-existing knowledge about each protein, such as its linear amino acid sequence, interactions with other proteins, structural information, and biochemical pathways. To do so, he relies on algorithms, scoring schemes, and statistical models. But because it is a virtual map, it will be possible for him to chart additional dimensions as new types of protein-related data become available.
As an analogy, suppose you were considering a vacation in Florida. Just by knowing that location, you would have a few clues: a sunny climate and the proximity of elderly people who tend to drive slowly. In similar fashion, Yona's map will provide clues about proteins, helping identify a protein's relatives based on what scientists already know. "Even if you don't know anything about your protein," Yona explains, "you get a list of protein clusters. If your protein is in the center of a cluster, you can get a lot of information about the protein." The closer a protein is to the heart of a cluster, the more closely it is related to, or ancestral to, the surrounding proteins.
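One way to picture "the heart of a cluster" is to rank each member by its average distance to the others. The sketch below is an illustration only: the article does not say how ProtoMap measures closeness, so the distance table, the protein names, and the `rank_by_centrality` helper are all assumptions.

```python
def rank_by_centrality(cluster, distance):
    """Return cluster members sorted by mean distance to the others.

    `cluster` is a list of protein names; `distance` maps a frozenset of
    two names to a non-negative number (smaller means more similar).
    The most central member comes first. Hypothetical helper, not ProtoMap.
    """
    if len(cluster) < 2:
        return list(cluster)

    def mean_distance(protein):
        others = [q for q in cluster if q != protein]
        return sum(distance[frozenset((protein, q))] for q in others) / len(others)

    return sorted(cluster, key=mean_distance)


# Toy data (made up): A sits between B and C, so it is the most central.
toy_distance = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 1.0,
    frozenset(("B", "C")): 2.0,
}
ranking = rank_by_centrality(["A", "B", "C"], toy_distance)
```

In this toy cluster, A comes first in the ranking, which matches the intuition in the text: a protein near the center has the most to say about its neighbors.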
Yona's track record suggests he may succeed. While a postdoctoral fellow at Stanford University, he created a precursor called BioSpace, which enabled him to map 400,000 proteins and assemble detailed 3-D structural models of 160,000 proteins. Yona started with two algorithms (Smith-Waterman and PSI-BLAST) to create an exhaustive list of neighboring sequences for each protein sequence in the SWISS-PROT and TrEMBL databases. But he also developed an algorithm to find higher-level structure, creating a statistical model of each protein cluster that could be compared to other protein clusters.
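For readers curious what a list of "neighboring sequences" involves, here is a minimal sketch of Smith-Waterman local alignment scoring, followed by a naive all-against-all comparison. It is a simplification, not Yona's pipeline: production tools score residue pairs with substitution matrices such as BLOSUM62 and use affine gap penalties, and the match, mismatch, gap, and threshold values here, along with the `neighbor_lists` helper and the toy sequences, are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def neighbor_lists(sequences, threshold):
    """Map each name to the names whose pairwise score meets the threshold.

    Hypothetical helper: compares every pair, which is quadratic in the
    number of sequences and only practical for small toy inputs.
    """
    names = list(sequences)
    neighbors = {name: [] for name in names}
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if smith_waterman_score(sequences[x], sequences[y]) >= threshold:
                neighbors[x].append(y)
                neighbors[y].append(x)
    return neighbors


# Toy sequences (made up for illustration).
toy = {"p1": "MKVLAAG", "p2": "MKVLAAG", "p3": "WWWWWWW"}
toy_neighbors = neighbor_lists(toy, threshold=10)
```

Here p1 and p2 end up as each other's neighbors, while p3 has none. Comparing every pair this way grows quadratically with the number of sequences, which helps explain why a survey of hundreds of thousands of proteins calls for a room full of machines.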
That early work allowed Yona to create a graphical representation of the proteins, with clusters and edges vaguely reminiscent of any large company's organizational chart. The National Science Foundation was sufficiently impressed to award Yona a five-year, $1.1-million Early Career Development grant to tackle even more proteins. Just as mapping software on the Internet can direct you to the nearest McDonald's, ProtoMap's next version will tell scientists where a particular protein falls on his map. In theory, every protein will land somewhere on that map.
Oddly enough, William Taylor had a related idea across the Atlantic. Head of the division of mathematical biology at the NIMR in London, Taylor is trying to bring a measure of objectivity to protein classification. The conventional attempts to classify proteins remain surprisingly subjective, he says, with experts often categorizing proteins one at a time. Protein classification is in the eye of the beholder.
Scientists from different disciplines dissect the same domains on the same protein in different ways, he explains, some basing their decisions on the amino acid sequence, others on function. "People have chopped them off differently," he says. "Once they start doing that, it gets pretty messy. I wanted something that linked them all together, to get all the proteins under one roof."
Taylor's solution is an unnamed but freely available Linux program that matches a large number of biologically significant proteins against three of his own "Forms." The software runs on his laptop. These Forms are not complex or arbitrary, but simple combinations of alpha helices and beta sheets, the schematic corkscrews and ribbons that form the building blocks of most large proteins.
In the April 11 issue of Nature, Taylor published a paper that places proteins into a rudimentary grid that roughly resembles a simple periodic table. Once a particular protein has been compared to one of his Forms, it can be placed in a series of charts that define the twists and turns (he calls them "curl" and "stagger") of any protein fold. As a test of the software's ability to process large batches of proteins, Taylor tried to classify 2,230 proteins. His Forms accounted for 80 percent of the shape of each protein.
As Taylor admits, the analogy to the periodic table of elements is not perfect. For example, unlike the periodic table, where the position of an element is governed by the number of protons, there is no single fact about any protein that determines its location in his grid.
But there is a rough correspondence between proteins and elements, he says. Just as atomic orbitals fill with electrons in predictable ways, Taylor sees proteins filling up with predictable layers and combinations of alpha helices and beta sheets. What makes that approach unique is not so much the algorithms or the Forms, but the way they allow him to sidestep a bewildering profusion of corkscrews and ribbons. The study of such shapes, the analysis of the clumping, is called topology.
By scientific standards, Taylor's language in his Nature article is fairly brazen. He says he has figured out a shortcut to protein topology, which has bedeviled thousands of scientists for decades: "The previously difficult issue of topology becomes almost trivial," Taylor writes, "because two proteins matching the same ideal Form will either have the same connection of their secondary structures or will not."
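Taylor's claim can be pictured with a small sketch. Suppose, purely as an assumption, that each protein matched to a Form is recorded as the order in which its chain visits the Form's helix and strand positions. Two proteins then share a topology exactly when they match the same Form and their visiting orders are equal. The Form names, position labels, and `group_by_topology` helper below are hypothetical and are not taken from Taylor's software.

```python
from collections import defaultdict


def group_by_topology(matches):
    """Group proteins that share both a Form and a connection order.

    `matches` maps a protein name to (form_id, path), where `path` lists
    the Form positions in the order the chain visits them. This
    representation is an assumption made for illustration.
    """
    groups = defaultdict(list)
    for name, (form_id, path) in matches.items():
        groups[(form_id, tuple(path))].append(name)
    return {key: sorted(names) for key, names in groups.items()}


# Toy matches (made up): p1 and p2 connect the same positions in the same
# order; p3 matches the same Form but wires its elements up differently.
toy_matches = {
    "p1": ("form_1", ["b1", "a1", "b2", "a2"]),
    "p2": ("form_1", ["b1", "a1", "b2", "a2"]),
    "p3": ("form_1", ["b2", "a1", "b1", "a2"]),
}
toy_groups = group_by_topology(toy_matches)
```

Under this assumed encoding, the comparison reduces to an equality check: p1 and p2 fall into one group and p3 into another.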
If Taylor can indeed sort proteins automatically, computers could take on a larger share of the work, categorizing proteins rapidly in large batches. That would be a big change from the present. "There is a lot of hand-editing," he says of one protein database, CATH, in London. "We're expecting an awful lot of protein structures coming in. It would be nice to sort them without doing a lot of editing." Haggling about the positions of various folds and domains could be delayed until after the protein has been appropriately grouped with its chemical cousins.
With time, and with computers crunching through more proteins, all proteins could be viewed as variations on a few basic themes, not as one-of-a-kind structural behemoths like the cathedral at Chartres. In a sense, Taylor is trying to view proteins more as nature does: as large molecules with simple subcomponents that can be combined in a large but finite number of ways.
During evolution, after all, proteins have been conserved in nearly identical shapes from one species to another over vast periods. But those shapes are so amorphous to the untrained eye that it is difficult to discern the rules that have been in nature's playbook all along. Knowing the rules could let scientists predict protein structures more accurately on computers (see "Computational Biologists Join the Fold"), rather than work out the structures in the lab — a grueling and lengthy process.
There's one limitation to Taylor's pigeonholes: He can't quite classify some proteins that have internal repetition. These can have propeller motifs, or triangular themes resembling three-leaf trefoils, prisms, and helices.
For now, Taylor is working on a propeller Form to accommodate those. But he hesitates to say whether every last protein will eventually be fitted into a box in his or anyone else's chart. "Our experience with proteins so far," says Taylor, "is they can always come up with something weird and wonderful."