


Computational Biologists Join the Fold

CASP5 competitors compare the best algorithms for modeling the 3-D structure of proteins—an exercise that could lead to new insights into the pathogenesis and treatment of disease.

By Karen Hopkin 

June 12, 2002 | Think of summer, and an idyllic image of lazy days filled with sunshine and seashells, high dives and ice cream might spring to mind.

Unless you happen to be part of a cadre of chemists, biologists, and computer scientists dedicated to developing programs for predicting the structure of cellular proteins.

Man and Machine 
While research groups spent the summer of 2000 scrambling to generate their predictions for CASP4, automated servers at two dozen sites around the world coolly submitted their best guesses for the same set of target proteins — less than 48 hours after the sequences were posted.

These researchers have swapped their beach towels for computer time because summer is the season of CASP, the loose acronym for Critical Assessment of Techniques for Protein Structure Prediction. In this biennial competition, participants put their algorithms to the test by trying to predict the precise 3-D shape of an array of target proteins whose structures have been solved (or are about to be solved) experimentally but are not yet public knowledge. The submissions for CASP5, the fifth CASP competition, will be evaluated this fall and the results reviewed at a post hoc conference in December.

"You download the targets, upload your predictions, and get to be publicly embarrassed every other year," says Jeffrey Skolnick, director of computational and structural biology at the Donald Danforth Plant Science Center in St. Louis and CASP veteran.

But the exercise is more than an amusing academic parlor game. Algorithms that accurately describe how proteins fold into their 3-D conformations would help scientists in understanding how mutant proteins lead to disease, designing drugs to correct these defects, and deciphering the information locked in the genomes of every organism on earth, from an anthrax bacterium to an Olympic athlete.

Doing the Math 
To understand why determining a protein's structure is so computationally intensive, imagine enumerating and sifting through the potential structures that might be adopted by a modestly sized protein only 100 amino acids in length.


Predicting protein structure has been considered a grand challenge in biology since the 1960s, when Nobel laureate Christian Anfinsen demonstrated that an unfolded protein could refold into its biologically active conformation spontaneously in a test tube. This observation led Anfinsen to conclude that all the information a protein needs to fold properly is contained in its amino acid sequence. A protein is synthesized as a linear chain of amino acids, like a string of beads. Based on the order and identity of these amino acid residues, the protein folds itself into a 3-D shape that dictates its function. One set of curves might allow a protein to shimmy along a piece of DNA and switch on genes; another might form a selective channel through which sought-after sugars slip into the cell.

But how does the protein translate its linear amino acid sequence into a 3-D conformation? And how can researchers do the same? "It's like doing a jigsaw in 3-D," says David Baker, a Howard Hughes Medical Institute investigator at the University of Washington at Seattle, whose protein-folding prediction program — called Rosetta — blew away the competition at CASP4 in 2000.


Unleashing the Algorithms 
Although participating in CASP can be intense and exhausting — one wag calls the experience "combat science" — the exercise offers an excellent opportunity for researchers to test their methods. And because the structures are not known in advance, the test is truly blind. "There's no way to hide, no way to fudge your data to look better," says David Deerfield, director of biomedical research at the Pittsburgh Supercomputing Center. "In a few months we'll know who did well, who did OK, and who sucked."

More importantly, CASP provides a means to assess the overall development of the field of structure prediction: tracking the progress of algorithms and evaluating the approaches and methods that work well or show the most promise.

What Big Clusters 
Computers have come a long way in the past 20 years.

"The first CASP [in 1994] was a bit of a shock," recalls Ram Samudrala of the University of Washington, who was then a graduate student. No novel structures were predicted correctly. "We were zero percent successful," says David Osguthorpe, a chemist at the University of Bath in England.

And the community did not fare much better in CASP2. "It was very clear that [protein folding] was still an unsolved problem," Baker says.

By CASP4 two years ago, researchers had honed their algorithms and were hitting about 30 percent of their targets — some even spot on. The greatest progress has come in the ab initio approaches, which rely largely on simulating the forces that guide the interactions among the atoms of a protein. "Folding is a physical problem," Osguthorpe says. "If you represent every single atom, and model the forces between them, you should be able to fold the protein."

Ay, there's the rub. Representing all the atoms — maybe 5,000 in the protein and another 500,000 in the water around it — and shuffling them around in space would take too much computer power. So researchers generally start by simplifying their simulation of the protein sequence, shrinking certain chemical groups down to points or converting them into ellipses endowed with certain telltale properties, such as whether they are attracted to or repelled by the water molecules that surround them.
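The simplification step can be illustrated with a minimal sketch, assuming the crudest possible scheme: every residue's atoms are collapsed into a single point at their average position. The function name `coarse_grain`, the tuple layout, and the coordinates are all invented for illustration; real prediction programs use more elaborate reduced models, and a bead could also carry a property such as whether it is attracted to or repelled by water.

```python
from collections import defaultdict

def coarse_grain(atoms):
    """Collapse atoms into one bead per residue, placed at the mean atom position.

    `atoms` is a list of (residue_id, x, y, z) tuples. This layout is an
    assumption made for the sketch, not a standard file format.
    """
    groups = defaultdict(list)
    for residue_id, x, y, z in atoms:
        groups[residue_id].append((x, y, z))

    beads = {}
    for residue_id, coords in groups.items():
        n = len(coords)
        # zip(*coords) regroups the points into all x's, all y's, all z's.
        beads[residue_id] = tuple(sum(axis) / n for axis in zip(*coords))
    return beads

# Made-up coordinates: five atoms belonging to two residues.
atoms = [
    (1, 0.0, 0.0, 0.0), (1, 2.0, 0.0, 0.0),
    (2, 4.0, 1.0, 0.0), (2, 4.0, 3.0, 0.0), (2, 4.0, 2.0, 3.0),
]
beads = coarse_grain(atoms)
print(f"{len(atoms)} atoms reduced to {len(beads)} beads: {beads}")
```

Even in this toy case the number of moving parts drops from five to two, which is the point of the exercise: fewer particles to shuffle around means far less computer time per simulated fold.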


Shake, Rattle, and Roll 
Next, they add the forces and let the protein start to fold. Some algorithms get the ball rolling by assigning crude or random shapes to the target sequence before letting these structures writhe, jiggle, and flex until they reach a comfortable conformation. The process is repeated tens or hundreds of thousands of times, generating a veritable bestiary of folded proteins.

The challenge then becomes choosing which of these shapes is the correct one for the target protein — arguably the toughest part of structure prediction. The "best" shape is generally believed to be the one that winds up with the lowest free energy, a sort of atomic "relaxed fit" configuration. Researchers use a variety of different approaches to scour the energy landscape in search of the structure or structures with the lowest energy. If the process is akin to exploring the Earth's surface in search of the Grand Canyon, the trick is to avoid landing in a ditch and declaring success.
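The ditch-versus-canyon problem can be shown on a toy landscape. The sketch below is an assumption-laden illustration, not any CASP group's method: the one-dimensional `energy` function is invented, and simple downhill stepping from several starting points stands in for the varied search strategies researchers actually use.

```python
def energy(x):
    """Toy 1-D landscape: a shallow 'ditch' near x = +1 and a deeper 'canyon' near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def descend(x, step=0.01, max_steps=10_000):
    """Walk downhill in fixed steps until neither neighbour has lower energy."""
    for _ in range(max_steps):
        best = min((x - step, x, x + step), key=energy)
        if best == x:  # no downhill neighbour: we are sitting in a minimum
            break
        x = best
    return x

# A single search started on the right-hand slope gets stuck in the ditch.
single = descend(2.0)

# Starting from several places and keeping the lowest result finds the canyon.
starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
best = min((descend(s) for s in starts), key=energy)

print(f"single start: x = {single:.2f}, energy = {energy(single):.3f}")
print(f"multi-start : x = {best:.2f}, energy = {energy(best):.3f}")
```

A real protein's landscape has thousands of dimensions rather than one, so a handful of restarts is nowhere near enough, but the failure mode is the same: a search that only ever goes downhill will happily stop in the first ditch it finds.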

At this point, the algorithm has narrowed the field to a few thousand potential structures that appear to have a biologically plausible, low-energy conformation. Some researchers apply a "knowledge-based" filter, scoring the structures based on an understanding of how proteins of known structures fold. Do the predicted folds match anything in the protein structure database? Are the water-fearing amino acids buried on the inside of the protein? Are the water-loving residues exposed?
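One of those questions, whether the water-fearing residues are buried, can be turned into a simple score. This is a rough sketch under stated assumptions: the `HYDROPHOBIC` set is one common simplified grouping of amino acids, the buried/exposed flags would in practice be computed from each candidate's 3-D coordinates, and the six-residue sequence and both candidates are made up.

```python
# ASSUMPTION: a simplified grouping of water-fearing amino acids (one-letter codes).
HYDROPHOBIC = set("AVILMFW")

def burial_score(sequence, buried):
    """Fraction of residues placed 'correctly': hydrophobic ones buried, the rest exposed."""
    if len(sequence) != len(buried):
        raise ValueError("need one buried/exposed flag per residue")
    matches = sum(
        (residue in HYDROPHOBIC) == is_buried
        for residue, is_buried in zip(sequence, buried)
    )
    return matches / len(sequence)

sequence = "LKVEAD"  # hypothetical six-residue stretch

# Two hypothetical candidate folds, described only by which residues end up buried.
candidate_a = [True, False, True, False, True, False]   # hydrophobic residues inside
candidate_b = [False, True, False, True, False, True]   # everything inside out

for name, flags in (("A", candidate_a), ("B", candidate_b)):
    print(f"candidate {name}: burial score {burial_score(sequence, flags):.2f}")
```

A filter of this kind does not pick the winner on its own; it simply pushes implausible, inside-out structures toward the bottom of the list.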

Finally, the algorithms look to see whether the structures cluster. "If they do, that suggests you're in the right area," Skolnick says. The correct structure is likely to be the one that the folding program produces again and again.
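The clustering idea can be sketched as a neighbour count: the structure with the most look-alikes wins. Everything specific here is an assumption made for illustration, including the `cutoff` value, the two-point "structures," and the shortcut of comparing coordinates directly without first superimposing the structures, which real comparisons would do.

```python
import math

def rmsd(a, b):
    """Root-mean-square distance between two equal-length lists of 3-D points.

    Simplification: the structures are compared as given, with no superposition.
    """
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))

def most_populated(structures, cutoff):
    """Index of the structure with the most neighbours (itself included) within `cutoff`."""
    counts = [sum(rmsd(s, t) <= cutoff for t in structures) for s in structures]
    return max(range(len(structures)), key=counts.__getitem__)

# Four made-up candidates: three near-identical shapes and one outlier.
decoys = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
]
winner = most_populated(decoys, cutoff=0.5)
print(f"structure {winner} sits in the most crowded region")
```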

Furthermore, seeing where and when structures accumulate on the energy landscape may suggest something about the pathway the protein follows as it folds, says Harold Scheraga, a chemist at Cornell University who has been in the protein-folding biz for 40 years.

Scheraga and his colleagues have been working with Protein A, a small protein that contains three alpha helices wrapped in a bundle. Each alpha helix is a winding staircase-like structure that forms when an amino-acid chain twists itself into a corkscrew. By examining the structures that congregate on the energy landscape in their algorithm, Scheraga and his team can see that one of Protein A's helices appears to form before the others, a finding that is backed up by experimental studies of how the protein folds.


The Rosetta Affair 
If prediction algorithms can shed light on protein-folding pathways, the opposite is also true. In a sense, the Rosetta program that was CASP4's clear winner mimics how real proteins are thought to fold, with small stretches of sequence flickering in and out of favored structures before the whole protein settles into its preferred shape.

To simulate this process, Rosetta first breaks a target protein into segments about nine amino acids in length. The algorithm then consults the protein structure database to determine which structures — alpha helix, beta-pleated sheet, or loose coil, for example — each sequence is likely to adopt. Most short sequences are not strictly monogamous when it comes to choosing a preferred structure; instead they flirt with a small coterie of local conformations. Rosetta then generates a selection of mix-and-match shapes that contain random combinations of these short corkscrews, sheets, and coils. The most favorable of these combinations should represent the protein's native structure.
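The mix-and-match step described above can be sketched in a few lines. This is a toy illustration and not Rosetta's actual code: the target sequence is made up, the `library` dictionary is a hypothetical stand-in for the lookup in the protein structure database, and cutting the chain into non-overlapping nine-residue pieces is a simplification.

```python
import random

def split_into_segments(sequence, size=9):
    """Cut a sequence into consecutive segments of `size` residues (the last may be shorter)."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]

def assemble(segments, library, n_models, rng):
    """Build `n_models` combinations, choosing one candidate shape per segment at random."""
    return [
        tuple(rng.choice(library[segment]) for segment in segments)
        for _ in range(n_models)
    ]

target = "MKTAYIAKQRQISFVKSHFSRQLEERL"  # made-up 27-residue sequence
segments = split_into_segments(target)

# HYPOTHETICAL lookup results: the local shapes each segment "flirts with."
library = {
    "MKTAYIAKQ": ["helix", "coil"],
    "RQISFVKSH": ["sheet", "coil"],
    "FSRQLEERL": ["helix"],
}

models = assemble(segments, library, n_models=5, rng=random.Random(0))
for model in models:
    print(" + ".join(model))
```

Because each segment is limited to a short list of local shapes, the number of combinations to examine is far smaller than if every residue were free to do anything. Each combination would still have to be scored for energy, a step this sketch leaves out.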

By checking the structure database so early in the process, Rosetta restricts the number of potential structures that later need to be searched for the conformation with the lowest energy. That is helpful because sifting through a haystack of structures looking for that low-energy needle takes the most time and computational muscle. Even on large clusters of computers that can do this sampling in parallel, generating the potential structures might take hours, and finding the best ones a few days.

The problem? "The big space is big," says Alberto Segre, a computer scientist at the University of Iowa, who hopes to iron out the mathematical kinks in his algorithms in time to join in CASP5. As a protein grows in length, the number of possible conformations it can adopt grows exponentially. "It's hard to explore a space that's exponentially large," Segre says. "Maybe it can't be done."
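A back-of-the-envelope calculation shows how quickly that space grows. The figure of three possible states per residue is an illustrative assumption, not a number from the researchers quoted here; the real count depends on how finely the chain's flexibility is modeled.

```python
def conformation_count(n_residues, states_per_residue=3):
    """Number of distinct chains if every residue independently picks one local state.

    ASSUMPTION: a fixed, small number of states per residue; real chains are messier.
    """
    return states_per_residue ** n_residues

for length in (10, 50, 100):
    count = conformation_count(length)
    exponent = len(str(count)) - 1  # order of magnitude, from the digit count
    print(f"{length:>3} residues: roughly 10^{exponent} conformations")
```

Each added residue multiplies the total by the same factor, so a chain ten times longer has a search space that is astronomically larger, not merely ten times larger.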


Wal-Mart Wisdom 
In fact, Segre sees protein folding as a problem in optimization — like working out the most cost-efficient trucking schedules for Wal-Mart. "You don't always have to find the best solution," he says. "Sometimes finding a pretty good solution is good enough." And if protein folding algorithms can get close to the right answer, or get the right answer 90 percent of the time, that might suffice.

In the course of predicting a protein's conformation, a program will spawn thousands upon thousands of potential structures. Many of these structures are not obviously erroneous or necessarily unnatural. They might be the wrong conformation for the target protein, but the perfect fold for another polypeptide chain.

The trick comes in teaching the algorithm to select the right one. "Even if the best structure is in the top 10, how do you figure out which one it is?" asks Troy Wymore of the Pittsburgh Supercomputing Center.

One way that researchers train their algorithms to recognize false folds involves exposing the programs to decoy data and then tuning them to avoid generating or selecting these incorrect shapes. Some have stocked a public database with such conformational red herrings, but the menagerie of false structures generated with each prediction should work just as well.

In addition to educating algorithms, the aberrant structures may reveal something about how proteins fold, or how they misfold. "If this zoo of alternative topologies is available to a protein in the cell, perhaps misfolding occurs more often than we realize," says the Danforth Center's Skolnick. Each faulty structure may represent a wrong turn that real proteins might take on the road to finding their correct fold.

For example, David Baker's Rosetta algorithm often produces structures in which the polypeptide chain zigzags back and forth, forming stacks of beta sheets. Although such a structure is never seen in a normal protein, beta strands do appear to pile up in amyloid plaques — the protein deposits found in people with Alzheimer's disease or cattle with "mad cow" disease.

"The structural mistakes we see on the computer may correspond to mistakes that proteins make in the body in real life," Baker says. Thus even the chaff generated by the current crop of algorithms may provide insights into the pathogenesis of disease.

Researchers ultimately hope to use these predicted structures to design better drugs, or even better proteins. Until then, CASP organizers hope that participants in CASP5 will continue to get better at predicting the structures of larger and more complex proteins, eventually achieving the sort of accuracy and precision that crystallographers and spectroscopists have come to expect when they determine protein conformations in the laboratory.

"We're still not doing well enough to impress an experimentalist," says Michael Levitt of Stanford University, a longtime devotee of CASP. "We want to be able to hand a crystallographer a diskette and say, 'Take a look at this,' and, after reviewing our predictions, he'll think, 'Oh, I wasted my time solving these structures.'

"We're not there yet," Levitt says, "but maybe in two or three CASPs' time."

Karen Hopkin is a writer based in Somerville, Mass. She can be reached at khopkin@nasw.org. 


