Annual Reviews -- Levitt et al. (Biochemistry 66: 549)

PROTEIN FOLDING: The Endgame

Michael Levitt, Mark Gerstein, Enoch Huang, S. Subbiah,* and Jerry Tsai
Department of Structural Biology, Stanford University School of Medicine, Stanford, California 94305;

Molecular Biophysics and Biochemistry, Yale University, Bass Center, New Haven, Connecticut 06520;

*Wistar Institute, 3601 Spruce Street, Philadelphia, Pennsylvania 19104,

and Bioinformatics Center, National University of Singapore, Kent Ridge, Singapore

KEY WORDS: protein folding, packing, side-chains

ABSTRACT

INTRODUCTION

WHAT CAN WE LEARN FROM X-RAY STRUCTURES?

WHAT DOES EXPERIMENT HAVE TO SAY?

MODELING THE PACKING OF SIDE-CHAINS

GETTING TO THE ENDGAME

CONCLUSIONS

LITERATURE CITED

	ABSTRACT

The last stage of protein folding, the "endgame," involves the ordering of amino acid side-chains into a well defined and closely packed configuration. We review a number of topics related to this process. We first describe how the observed packing in protein crystal structures is measured. Such measurements show that the protein interior is packed exceptionally tightly, more so than the protein surface or surrounding solvent and even more efficiently than crystals of simple organic molecules. In vitro protein folding experiments also show that the protein is close-packed in solution and that the tight packing and intercalation of side-chains is a final and essential step in the folding pathway. These experimental observations, in turn, suggest that a folded protein structure can be described as a kind of three-dimensional jigsaw puzzle and that predicting side-chain packing is possible in the sense of solving this puzzle. The major difficulty that must be overcome in predicting side-chain packing is a combinatorial "explosion" in the number of possible configurations. There has been much recent progress towards overcoming this problem, and we survey a variety of the approaches. These approaches differ principally in whether they use ab initio (physical) or more knowledge-based methods, how they divide up and search conformational space, and how they evaluate candidate configurations (using scoring functions). The accuracy of side-chain prediction depends crucially on the (assumed) positioning of the main-chain. Methods for predicting main-chain conformation are, in a sense, not as developed as those for side-chains. We conclude by surveying these methods. As with side-chain prediction, there is a great variety of approaches, which differ in how they divide up and search space and in how they score candidate conformations.

	INTRODUCTION

The endgame of protein folding refers to the final stage in the folding process. It is believed that at this point in the process the overall fold has already been determined and the side-chains are close to their final positions. The previous steps in the folding process, especially those that determine the shape of the overall fold, are thought to be greatly, if not completely, dictated by hydrophobic interactions (1, 2, 3). However, here we argue that the endgame transition to the native structure is governed by somewhat different interactions: tight close-packed contacts between amino acid side-chains. The creation of these contacts has been compared to crystallization (4). Clearly, such tight packing is related to the most important characteristic of native protein structures: their unique and precisely determined yet highly complex three-dimensional shapes. The packing process is likely to be energetically difficult as side-chains prefer to be disordered. The process, therefore, will have a high activation barrier and will be slow.

Packing as a phenomenon is easily visualized and is commonplace in everyday experience. It is dominated by a simple universal energy term, the strong repulsion between atoms that approach each other too closely. Packing is a short-range phenomenon, which allows a more local treatment considering only surrounding neighbors. Richards was one of the first researchers to emphasize the importance of close-packing in protein structure (5, 6), and his point of view is becoming increasingly accepted.

With its focus on packing, this review on protein folding considers both experimental work and theory, and is by necessity selective, given the large body of theoretical and computational work. Attention is focused on the type of packing observed in proteins and on the prediction of packing for side-chains. We also focus on the more difficult problem of generating main-chain conformations close enough to make side-chain packing predictions possible.

We have deliberately not dealt with certain related topics including molecular dynamics (MD) simulation and homology modeling. Realistic MD simulations of protein folding or unfolding in solution are not reviewed, in spite of the recent work in this field (7, 8, 9). Homology modeling is also not reviewed, in spite of the close connection to side-chain packing using a main-chain "borrowed" from a protein with a homologous sequence. The intent here is to concentrate more on basic principles rather than on applications. Furthermore, homology modeling has been recently reviewed (10, 11).

This review is divided into several sections. The first section deals with the observed packing in protein structures as determined by X-ray crystallography and shows that proteins are more tightly packed than almost any other organic matter. The second section extends the review of experimental work to solution studies by relating close-packing to stability and considering how such close-packing arises during the folding process. The third section shows how close-packing has led to the effective solution of the side-chain prediction problem when a sufficiently native-like main-chain conformation is known. The fourth section considers how to generate sufficiently accurate main-chain conformations, primarily by searching large spaces of possible conformations with appropriate energy functions.

	WHAT CAN WE LEARN FROM X-RAY STRUCTURES?

The best source of information on packing in protein molecules comes from the hundreds of highly refined high-resolution protein structures that have been determined over the past three decades. These structures show a high degree of order in all the residues, except occasionally those on the surface of the protein.

How Is Packing Characterized?
The packing efficiency of a given atom is defined as the ratio of the volume of its van der Waals (VDW) envelope to the amount of space it actually occupies (5, 12, 13). This simple definition masks considerable complexity. First of all, how does one determine the volume of the VDW envelope (14)? This obviously requires knowledge of what the VDW radii of atoms are, a subject on which there is no universal agreement (12, 15), particularly for water molecules and polar atoms (16, 17). Second, how does one determine how much space an atom occupies? Or, equivalently, how much additional "cavity" volume should be associated with a particular atom in addition to its envelope volume? These latter questions can be addressed by various geometric constructions, discussed in the following section.

The absolute packing efficiency of an atom is most useful in a comparative sense, e.g. when comparing equivalent atoms in different parts of a protein structure. In calculating the ratio of packing efficiencies, the VDW envelope volume remains the same and cancels. One is left with just the ratio of space an atom occupies in one environment to the space it occupies in another.
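The cancellation of the VDW envelope volume can be checked numerically. The following is a minimal sketch: the function names and the example volumes (in cubic angstroms) are illustrative assumptions, not values taken from the studies cited.

```python
def packing_efficiency(vdw_volume, occupied_volume):
    """Packing efficiency: VDW envelope volume divided by the space occupied."""
    if vdw_volume <= 0 or occupied_volume <= 0:
        raise ValueError("volumes must be positive")
    return vdw_volume / occupied_volume


def relative_packing(occupied_a, occupied_b):
    """Efficiency in environment A relative to environment B.

    The VDW envelope volume is the same in both environments and cancels,
    so only the two occupied volumes are needed.
    """
    if occupied_a <= 0 or occupied_b <= 0:
        raise ValueError("volumes must be positive")
    return occupied_b / occupied_a


# Hypothetical volumes for one atom type (assumed values, for illustration).
VDW = 20.0
CORE, SURFACE = 27.0, 30.0

ratio_direct = packing_efficiency(VDW, CORE) / packing_efficiency(VDW, SURFACE)
ratio_short = relative_packing(CORE, SURFACE)
print(ratio_direct, ratio_short)
```

Both routes give the same number, which is why a comparison between environments does not depend on the choice of VDW radii.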

VORONOI CONSTRUCTION
Voronoi volume calculations are geometrically rigorous methods that determine how much space an atom occupies. These calculations were originally developed by Voronoi (18). They were first applied to molecular systems by Bernal & Finney (19) and to proteins by Richards (5). Since then they have been used successfully in the calculation of standard volumes of protein residues, in characterizing protein-protein interactions, in understanding protein motions, and in analyzing cavities in protein structure (6, 12, 20, 21, 22, 23, 24, 25). They have also been used in the analysis of liquids (26, 27), and the faces of Voronoi polyhedra have been used to characterize protein accessibility and to assess the fit of docked substrates in enzymes (28, 29).

The Voronoi procedure allocates all space amongst a collection of atoms. Each atom is surrounded by a polyhedron and allocated the space within it. The faces of Voronoi polyhedra are formed by constructing dividing planes perpendicular to the interatomic vectors between atoms, and the edges of the polyhedra result from the intersection of these planes.
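The allocation of space can be illustrated with a small numerical sketch. It does not build the polyhedra explicitly. Instead it assigns each cell of a fine grid to the nearest atom, which is equivalent to placing each dividing plane at the midpoint of the interatomic vector. The midpoint placement (all atoms treated as equal in size), the bounding box, the grid resolution, and the function name are assumptions made for illustration only.

```python
import itertools


def voronoi_volumes(atoms, box_min, box_max, steps=20):
    """Estimate the Voronoi volume of each atom inside an axis-aligned box.

    The box is cut into steps**3 small cells, and each cell is given to the
    atom nearest to the cell centre. Accuracy improves as steps grows.
    """
    n = len(atoms)
    spacing = [(hi - lo) / steps for lo, hi in zip(box_min, box_max)]
    cell_volume = spacing[0] * spacing[1] * spacing[2]
    volumes = [0.0] * n
    for idx in itertools.product(range(steps), repeat=3):
        point = [box_min[d] + (idx[d] + 0.5) * spacing[d] for d in range(3)]
        nearest = min(
            range(n),
            key=lambda a: sum((point[d] - atoms[a][d]) ** 2 for d in range(3)),
        )
        volumes[nearest] += cell_volume
    return volumes


# Two hypothetical atoms placed symmetrically in a 4 x 4 x 4 box.
atoms = [(1.0, 2.0, 2.0), (3.0, 2.0, 2.0)]
volumes = voronoi_volumes(atoms, (0.0, 0.0, 0.0), (4.0, 4.0, 4.0))
print(volumes)
```

In this symmetric test case the dividing plane lies at x = 2, so each atom is allocated half of the box. The bounding box stands in for the surrounding neighbors that a real calculation would need.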

The Voronoi procedure requires the location of all neighboring atoms. This is possible in the protein core, but on the protein surface many of the neighbors of a protein atom are water molecules, which are often not well localized in crystal structures. A variety of approaches have been developed to deal with this difficulty. The simplest is to surround the protein with a shell of water molecules generated on a regular grid (5). It is also possible to use predefined boundary shapes (such as the snub cube) to truncate the "open" polyhedra at the protein surface (23). This sort of truncation can be smoothly and rigorously achieved by using a particular generalization of the Voronoi construction called the alpha-shape (30, 31). In MD simulations employing periodic boundary conditions, all atoms are completely surrounded by solvent, circumventing this problem (17, 27).

OTHER CONSTRUCTIONS
A number of methods for measuring volumes and packing are not based on Voronoi polyhedra (6). Connolly developed a method for the determination of volumes based on the direct integration of the space inside of the molecular surface envelope (32, 33, 34). Gregoret & Cohen (35) developed a simplified way of evaluating the packing in a structure at a residue level, rather than at the atomic level.

All the other approaches have concentrated on the explicit identification and measurement of cavities in protein structures (36, 37, 38, 39, 40, 41, 42). The advantage of cavity identification algorithms is that the exact location of cavities is often of great interest. However, because the association between a particular cavity and a particular protein atom is somewhat arbitrary, one cannot directly calculate packing efficiencies for individual atoms as with the Voronoi procedure. Another difficulty with cavity identification algorithms is that many of these algorithms model cavities in terms of idealized spherical shapes. Such modeling does not allow a complete partition of space; after the volumes of the spherical cavities and the atoms' VDW envelopes are accounted for, there is still leftover space.

How Tightly Packed Is the Protein Core?
Packing calculations on protein structure were done first by Richards more than two decades ago (5) and then soon after by others (20, 21). These initial calculations revealed some important facts about protein structure. First, in the protein core, atoms and residues of a given type have a roughly constant (or invariant) volume because the atoms inside proteins are packed together tightly, with the interior of the protein better resembling a close-packed solid than a liquid or gas. This high packing efficiency ratio of internal protein atoms is roughly what is expected for the close-packing of hard spheres (0.74).

More recent calculations measuring the packing in proteins (25) have shown that the packing inside proteins is somewhat tighter than observed initially (~4%) and that the overall packing efficiency of atoms in the protein core is greater than in crystals of organic molecules. When molecules are packed this tightly, small changes in packing efficiency are quite significant. In this regime, the limitation on close-packing is hard-core repulsion, so even a small change is quite substantial energetically. Furthermore, Richards & Lim (13) pointed out that the number of allowable configurations that a collection of atoms can adopt without hard-core overlap drops off very quickly as these atoms approach the close-packed limit.

The exceptionally tight packing in the protein core seems to require a precise jigsaw puzzle–like fitting together of the residues inside proteins. This appears to be true for the majority of atoms inside proteins (34). However, there are exceptions, and some studies have focused on these, showing how the packing inside proteins is punctuated by defects or cavities (39, 42, 43). If these defects are large enough, they can accommodate buried water molecules (44, 45, 46).

Researchers using highly simplified two-dimensional lattice models to study protein structure have pointed out that tight packing in the protein core may drive or force the formation of secondary structures (2, 47, 48). This conjecture has been tested on somewhat more realistic off-lattice models of protein structure (49, 50). The results have been mixed in the sense that these models do observe high packing density driving the formation of secondary structure but to a much lesser degree than in the lattice models.

How Tightly Packed Are Other Parts of Proteins?
THE SURFACE
Measuring the packing efficiency inside of the protein core provides a good standard, and a number of other studies have compared this efficiency to that in other parts of the protein. The most obvious thing to compare with the protein inside is the protein outside, or surface. This comparison is particularly interesting from a packing perspective because the protein surface is covered by water, which is known to be packed much less tightly than protein and in a distinctly different fashion [the tetrahedral packing geometry of water molecules gives a packing efficiency ratio of ~0.34, less than half that of hexagonal close-packed solids (51)].
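The two reference values quoted for hard spheres, roughly 0.74 for close-packing and roughly 0.34 for tetrahedral packing, follow from simple unit-cell geometry. The sketch below reproduces them under two stated assumptions: the close-packed case is computed on a face-centred cubic cell (which has the same density as hexagonal close-packing), and the tetrahedral case is modelled as touching spheres on a diamond-type lattice. The function name is illustrative.

```python
import math


def packing_fraction(spheres_per_cell, edge_in_radii):
    """Fraction of a cubic unit cell filled by touching spheres of unit radius."""
    sphere_volume = 4.0 / 3.0 * math.pi
    return spheres_per_cell * sphere_volume / edge_in_radii ** 3


# Face-centred cubic cell: 4 spheres, cell edge = 2*sqrt(2) sphere radii.
close_packed = packing_fraction(4, 2 * math.sqrt(2))

# Diamond-type (tetrahedral) cell: 8 spheres, cell edge = 8/sqrt(3) radii.
tetrahedral = packing_fraction(8, 8 / math.sqrt(3))

print(round(close_packed, 2), round(tetrahedral, 2))
```

Under these assumptions the tetrahedral value comes out at less than half the close-packed value, in line with the comparison made in the text.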

Calculations based on crystal structures and simulations have shown that the protein surface has an intermediate packing, being packed less tightly than the core but not as loosely as liquid water (15, 17). One can understand the packing being looser at the surface than in the core in terms of a simple trade-off between hydrogen bonding and close-packing. In the absence of interactions other than van der Waals attractions and repulsions, liquids (and solids) tend to pack closely, and the geometry of their interaction can be described in terms of a simple hard-sphere (i.e. billiard-ball) model (52). However, if there are also highly directional interactions, such as the hydrogen bond in water, the situation is more complicated. Often the close-packing has to be explicitly traded off to maintain hydrogen bonding. This trade-off can be visualized in simulations of the packing in simple toy systems (53, 54, 55).

An important aspect of the looser packing at the protein surface is how this packing is expected to change when the protein surface binds to another molecule, particularly another protein. Calculations measuring the packing in protein-protein interfaces have been done, such as those in antibody-antigen and protease-inhibitor complexes (56, 57). These calculations have shown that the packing at protein-protein interfaces is roughly comparable to that in the protein interior and is tighter than the packing usually observed at the surface. Thus, the formation of a close-packed interface may be a driving force in docking. Simple shape complementarity (in the sense of a close-packed jigsaw puzzle) is an integral part of many docking programs (58, 59, 60, 61).

INTERNAL INTERFACES
A comparison of the packing at various internal interfaces inside of proteins, particularly at domain-domain interfaces, is also interesting. Such comparisons are often closely coupled with analysis of protein flexibility.

It has been argued that motion is possible across a close-packed interface such that the close-packing is maintained throughout the motion. To prevent the atoms from bumping into one another, the motion has to be fairly small and parallel to the plane of the interface. There cannot be large torsion angle changes, so side-chains maintain the same rotamer configuration (62). A large motion is achieved by concatenating many of these small motions at many different interfaces. This sort of small, sliding motion has been dubbed "shear motion" (63, 64), and it has been carefully documented in numerous cases (65, 66; see 64 for a list). Moreover, physical studies have shown that a folded protein does not have a single perfectly defined conformation (67). Rather, it has some intrinsic flexibility and can readily jump among many nearly energetically identical micro-states without significantly changing its packing. This sort of small-scale flexibility is what makes shear motions possible.

Following a somewhat different line of reasoning, it has also been proposed that certain interfaces may be particularly mobile, precisely because they contain defects and are not close-packed. This idea was suggested in the 1970s (22). Since then a number of workers have noted that there are relatively more cavities at interdomain interfaces (36, 68) than elsewhere in protein interiors. Hubbard & Argos (68), in particular, claim that these cavities have a functional role in the mechanisms of protein movements.

Packing is also expected to be important in protein motions involving hinges. Numerous studies have emphasized how critical the packing at the base of the hinge is [in the same sense that the "packing" at the base of a door hinge determines how easily the door can close (69, 70, 71, 72, 73)]. Hinge motions often involve creating a new protein-protein interface (e.g. a new domain-domain interface is formed during hinged domain closure). Calculations have shown that these interfaces are close-packed in the same manner as the interfaces involved in protein-protein recognition (72). This conclusion suggests that the formation of a new close-packed interface may be a driving force for hinge motions.

	WHAT DOES EXPERIMENT HAVE TO SAY?

Clearly proteins are close-packed in the crystal state. Such close-packing is also seen in protein structures determined in solution by nuclear magnetic resonance (NMR), but these proteins are generally rather small (<100 residues) and do not always have a large core region. This section further considers proteins in solution. We first examine whether close-packing stabilizes proteins in solution and then review experimental work on how proteins fold to achieve such close packing.

Proteins Are Well Packed in Solution
In solution, volumetric studies of both amino acids (74, 75) and whole proteins (76, 77, 78, 79, 80, 81) have been common. The most recent study (82) is quite comprehensive, covering 15 proteins within a temperature range of 18–25°C. The results show that studying whole proteins is more accurate than measuring the individual amino acid volumes in solution. By studying whole proteins, the authors derive some useful relationships on the basis of the molecular weight of a protein without prior knowledge of the crystallographic data. As rough estimates, the van der Waals volume V_w, the molecular volume V_m, and the accessible surface area S_a can all be related to the molecular weight M_r as follows (in Å³): V_w = [100 (± 300)] + [0.77 (± 0.01)] M_r; V_m = [1200 (± 500)] + [1.04 (± 0.02)] M_r; and S_a = -[1200 (± 200)] + [14.5 (± 0.25)] M_r^(2/3). The authors also show that packing efficiencies are relatively constant between 0.72 and 0.78. This range is very similar to the previously mentioned packing efficiencies computed from protein structures solved by X-ray crystallography (6, 20, 25). The fact that packing efficiencies are not limited to a single fixed value suggests that the packing in individual proteins is not so rigid as the jigsaw model would have us believe (82A). With extensive studies of T4 lysozyme packing mutants, Matthews and coworkers (for a review see 82B) have shown that the protein's backbone accommodates changes to the size of the protein core. While losing some stability, these lysozyme mutants are still chemically active. Therefore, proteins possess a well-packed, plastic interior, meaning that the core can tolerate a certain amount of variation in packing density.
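As a rough illustration of how the three empirical relationships are applied, the sketch below evaluates them for a given molecular weight. It uses only the central coefficients quoted above and ignores the stated uncertainties. The function name and the example molecular weight are assumptions chosen for illustration.

```python
def estimate_from_molecular_weight(m_r):
    """Central estimates of V_w, V_m and S_a from molecular weight M_r.

    Coefficients are the central values quoted in the text; the quoted
    uncertainties (for example, +/- 300 on the V_w intercept) are ignored.
    """
    if m_r <= 0:
        raise ValueError("molecular weight must be positive")
    return {
        "V_w": 100.0 + 0.77 * m_r,                    # van der Waals volume
        "V_m": 1200.0 + 1.04 * m_r,                   # molecular volume
        "S_a": -1200.0 + 14.5 * m_r ** (2.0 / 3.0),   # accessible surface area
    }


# Hypothetical protein with M_r = 10,000 (assumed example value).
estimates = estimate_from_molecular_weight(10000.0)
for name, value in estimates.items():
    print(name, round(value, 1))
```

Because the intercepts carry uncertainties of several hundred units, such estimates are best treated as order-of-magnitude guides, especially for small proteins.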

Good Packing Leads to Greater Stability
Improving the packing of the protein interior has recently become a method for increasing stability (13, 83, 84). Nature uses this principle in the design of thermostable proteins (85), and several groups have successfully applied it to protein design. Thus far, researchers have been able to create more stable proteins by intentionally increasing the packing efficiency for ribonuclease H1 (86), T4 lysozyme (87), and λ-repressor (88). Most recently, Munson et al (89) have re-engineered the internal packing of the four-helix-bundle protein, Rop. Their results further support the idea that increasing the core packing efficiency can increase stability; however, it has also been found that sometimes the increased stability caused a decrease in function. In a related experiment, Ramachandran & Udgaonkar (90) added significant nonpolar volume to the core of the protein barstar by chemically modifying its two free cysteines. They showed that the change caused an increase in protein stability without a decrease in activity or major alteration in structure as measured by circular dichroism (CD). Until the crystal structure of the altered barstar is solved, they reason that this extra stability might be attributed to increased core packing efficiency.

How Does Good Packing Arise?
From an unfolded conformation, proteins must somehow establish their high degree of side-chain packing. Two descriptive models of protein folding, initially proposed in the early 1970s, provide insight into this process. The nucleation model (91, 92) argued that protein folding begins with a kernel of residues making specific native-like contacts. Once the protein forms this rate-limiting configuration, the remaining structure quickly folds into place. Alternatively, in the hydrophobic collapse model (93), the protein first aggregates its nonpolar groups to form a structure with a loose hydrophobic core. Then secondary structural elements develop around this core, hypothesized to be similar to a molten globule, which finally folds in a slow step to form the tightly packed native structure. In the framework model (94), a slightly different formulation of the hydrophobic collapse model, the secondary structure forms first, and then the hydrophobic groups aggregate. Therefore, in the nucleation model, the tight packing forms rapidly with no intermediates, whereas for both collapse models, the tight packing occurs only after the formation of a molten globule–like state.

Folding Pathways
A current topic of debate is whether the molten globule is an intermediate on or off the folding pathway (for a review see 95). Studying the kinetics of intermediate formation can distinguish between these possibilities. Put simply, if the molten globule is part of the folding pathway, its accumulation speeds up the formation of the native conformation (the folding rate is proportional to the fractional concentration of the intermediate). For off-pathway molten globules, formation of these structures inhibits the formation of the native conformation because the protein must fold back through the unfolded state to reach the native one (or the folding rate is proportional to 1 minus the intermediate's fractional concentration). In alternative or parallel pathways (96), a certain fraction of the unfolded species folds quickly into the native state, while the remaining molecules follow a slower on-pathway model. The same researchers have shown that the molecules on the slower pathway form an intermediate with helical secondary structure that is just slightly more energetically stable than the unfolded state, and this minor increase in stability retards the folding reaction (G Wildegger & T Kiefhaber, submitted).
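The two proportionalities that separate on-pathway from off-pathway intermediates can be written out as a short sketch. The rate constant and the function name are hypothetical; the code only encodes the two relationships stated in the text and is not a kinetic model of any particular protein.

```python
def relative_folding_rate(intermediate_fraction, on_pathway, rate_constant=1.0):
    """Folding rate as a function of the intermediate's fractional concentration.

    On-pathway:  rate is proportional to the fraction f of intermediate.
    Off-pathway: rate is proportional to (1 - f).
    rate_constant is an assumed proportionality constant.
    """
    if not 0.0 <= intermediate_fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    if on_pathway:
        return rate_constant * intermediate_fraction
    return rate_constant * (1.0 - intermediate_fraction)


# As more intermediate accumulates, the two scenarios diverge.
for fraction in (0.1, 0.5, 0.9):
    on = relative_folding_rate(fraction, on_pathway=True)
    off = relative_folding_rate(fraction, on_pathway=False)
    print(fraction, on, off)
```

The opposite trends in the two columns are what a kinetic experiment would look for when varying the conditions that populate the intermediate.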

Furthermore, in almost all the equilibrium and kinetic studies, the authors assumed a sequential pathway for protein folding. This view assumes that folding proceeds similarly to a chemical reaction (98, 99). The intermediates along this path help guide the protein to its native state (100). More recent theoretical developments suggest that folding follows an energy landscape (for a review see 101, 102). In this model, the intermediates arise because of kinetic traps where the protein is actually slightly misfolded. To continue, the protein needs to unfold only somewhat. The model is able to explain the behavior of small, fast-folding proteins (usually <80 residues), which fold on the order of milliseconds instead of the usual seconds and without distinguishable intermediates (103, 104, 105, 106, 107, 108, 109, 110). Because these proteins are too small to form stable intermediates, they avoid the kinetic traps and therefore fold directly to the native state. Another way to make sense of the rapid folding of small proteins is that the combinatorial search for correct side-chain packing in a small protein is much simpler and faster than in a large one. Baldwin (101) notes that this model could be thought of as an extension of the jigsaw puzzle folding model (82A). Here, the initial starting state is not fixed, and energetics coupled to a certain amount of randomness determine the folding pathway.
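The remark that the combinatorial search is much simpler in a small protein can be made concrete with a back-of-the-envelope count. The sketch below assumes, purely for illustration, that every residue independently chooses among the same small number of discrete side-chain states (three here). Neither that number nor the independence assumption comes from the text.

```python
import math


def configuration_count(n_residues, rotamers_per_residue=3):
    """Number of side-chain combinations if each residue has the same
    number of discrete states and all choices are independent."""
    if n_residues < 0 or rotamers_per_residue < 1:
        raise ValueError("invalid residue or rotamer count")
    return rotamers_per_residue ** n_residues


def log10_configurations(n_residues, rotamers_per_residue=3):
    """Base-10 logarithm of the count, convenient for very large numbers."""
    return n_residues * math.log10(rotamers_per_residue)


# Hypothetical small and large proteins.
for n in (80, 300):
    print(n, "residues: about 10^%.0f combinations" % log10_configurations(n))
```

The count grows exponentially with chain length, so even a modest increase in size enlarges the search space by many orders of magnitude.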

Equilibrium Experiments—The Molten-Globule State
The molten globule (112) has yielded a great deal of experimental information regarding the structure of intermediates during protein folding. This conformational state, an equilibrium folding intermediate induced under mild denaturing conditions, has the following characteristics: (a) It is less compact than the native state. (b) It is more compact than the unfolded state. (c) It contains extensive secondary structure. (d) It has loose tertiary contacts without tight side-chain packing. Recently, increasing evidence supports the idea that the molten globule may possess defined tertiary contacts (for a review see 113). It has been argued that the molten globule state contains water molecules or is "wet" (114), but an experiment by Kiefhaber et al (115) found that an unfolding intermediate with molten globule attributes is dry. Strong support for either case has yet to be found. Beyond these similarities, the molten globule conformations are very diverse among proteins and even among different molten globules induced from the same protein (116, 117). For this reason, we discuss each molten globule system individually.

CARBONIC ANHYDRASE
The low pH form of carbonic anhydrase shows characteristics of a molten globule (118). Like others, this molten globule resembles a kinetic folding intermediate (119). Besides the molten globule, carbonic anhydrase provides evidence for an interesting second equilibrium intermediate (120). Because this state occurs at higher concentrations of denaturant and is less compact than the molten globule, the authors believe that it represents a premolten globule. They also show that this intermediate still contains considerable secondary structure and liken it to the burst intermediate seen in kinetic studies (121, 122).

α-LACTALBUMIN
The protein α-lactalbumin can produce two forms of the molten globule under different conditions, both of which have been well characterized (123, 124); the acid form is produced at low pH and the apo form at neutral pH in the absence of calcium. Dissecting the protein to study only the alpha helical domain, Peng & Kim (125) showed that at low pH this domain contains enough of a tertiary fold that native disulfides could be found when they oxidized a reduced species in the molten globule state. On the basis of these results, along with CD and NMR data, the authors believe that the molten globule is an expanded native state with no specific side-chain interactions. Further investigation by the same group showed that the beta sheet domain is largely unstructured in the low pH molten globule (126). Such a bipartite structure is interesting because small-angle solution X-ray scattering showed a unimodal distribution, which implies that the molten globule is roughly spherical in solution (127). Using Raman optical activity measurements and studying both α-lactalbumin molten globules, Wilson et al (117) also found that both molten globules are native-like but that the apo form is less sensitive to temperature denaturation since it is more ordered.

CYTOCHROME C
Cytochrome c requires low pH and addition of salt to form a molten globule (128). The salt screens repulsive electrostatic interactions caused by the acidic conditions and allows the protein to collapse. This state has been characterized as possessing an increased volume (129) and increased compressibility (130). Jeng et al (131) have shown that the N- and C-terminal helices are responsible for most of the molten globule's secondary structure. These two helices form during the early stages of folding (132) and contact each other in the native structure (133). Two groups (134, 135) have shown that packing interactions between these terminal helices are just as important to the stability of the molten globule as they are to the native state. They mutated residues important to the interaction of the N- and C-terminal helices and found destabilization of both the native and molten globule states. This result implies that the molten globule of cytochrome c uses some native packing contacts for stability. As an overall picture, results from small-angle X-ray scattering (136) suggest that the cytochrome c molten globule best fits a structure containing a compact core with random coils extending from it.

MYOGLOBIN
Depending on its environment, myoglobin in its apo form can fold into a number of molten globular states. Like cytochrome c, apomyoglobin collapses from a largely unfolded conformation at pH 2 into a molten globular form upon addition of salt (137). This form of the molten globule is assumed to be similar to the one at pH 4.2 in the absence of salt (138) and has been characterized by Hughson et al (139). Their NMR analysis showed that the A, G, and H helices arrange themselves in a native-like conformation. These helices also form during the initial stages of apomyoglobin refolding (140). In the folded state, these three helices pack against each other with large hydrophobic contact areas (141, 142), while independently they have very little helical content (143, 144, 145). At pH 2 with sodium trichloroacetate, apomyoglobin forms another molten globule state with more helical structure (146). This form is considered to be further along in the folding pathway (140). Studying both molten globular forms, Nishii et al (138) found cold and heat denaturation of the two forms, indicating that hydrophobicity contributes to the molten globules' stability. Using small-angle X-ray scattering to measure radius of gyration, they also showed that the molten globules were less compact. Hughson et al (143) mutated residues important to the packing between the A, G, and H helices of the pH 4.2 molten globule and found no perturbation of stability from acid denaturation. In fact, overpacking the interface caused an increase in stability. Approaching the problem from a different angle, Kiefhaber & Baldwin (147) created mutations that increased the helical structure of the pH 4.2 molten globule. This mutant required higher concentrations of urea to become denatured from a molten globule state, showing that increasing the secondary structure stabilizes the molten globule.

So far, these studies suggest that myoglobin folds according to the hydrophobic collapse model, but work published this past year supports an alternate view. The same lab that performed mutational studies on the pH 4.2 molten globule repeated these experiments (148) using urea, instead of acid, to denature the protein. They found that the mutations at the A, G, and H helical interfaces destabilized the molten globule as well as the native conformation. From their measurements the investigators computed that packing interactions in the molten globule are about half as strong as in the native state. Kataoka and coworkers (149) presented solution X-ray data that suggest the pH 2 trichloroacetate-stabilized molten globule consists of a single hydrophobic core surrounded by a disordered polypeptide chain. The evidence comes from the calculation of a distance distribution function. The trichloroacetate-stabilized molten globule at pH 2 showed a bimodal distribution, which is indicative of two different domains in this molten globule. Since this apomyoglobin contains only a single folding center, the authors attributed the second mode in the distribution to the unfolded portions of the chain. Native holomyoglobin and apomyoglobin, as well as other molten globules [cytochrome c (149) and α-lactalbumin (127)], possess unimodal distance distribution functions characteristic of a globular protein with a generally spherical shape in solution. Altogether, these experimental results lend support to the nucleation model.

Kinetic Experiments
While the previous studies looked at stable, equilibrium intermediates, the following experiments analyzed transient, kinetic intermediates found during refolding or unfolding of the protein. Using methods such as CD or NMR coupled to stopped-flow techniques to monitor the folded state of the protein, these experiments usually find a quick burst phase of folding during which intermediates cannot be detected (121, 122). After this initial burst, there is a slow phase while the molecule searches for its native state.

As discussed above, an early kinetic intermediate of both cytochrome c and apomyoglobin has been found that contains characteristics similar to its related molten globule (131, 140). Investigators have found the same in other systems. For ribonuclease A, Yamaguchi et al (150) found a negative change in volume as the protein went from a folded to an unfolded state by measuring the Gibbs free energy difference during pressure denaturation. Refolding of the solvent-denatured protein produces two identifiable intermediates: The near-native intermediate requires a conformational change due to a proline isomerization to reach a completely folded conformation (151, 152, 153). The other intermediate occurs early in refolding and resembles a molten globule state (154). Studies of the volume change upon refolding (155) and unfolding (156) of ribonuclease A indicate that an intermediate possesses an increased volume akin to a molten globule, while NMR analysis provides evidence that an intermediate has features of a dry molten globule (96). Further investigation of the early intermediate (157) corroborates results from equilibrium folding studies. Because the authors discovered that the early intermediate is able to bind inhibitor, possesses hydrogen protection factors similar to the near-native intermediate, and has a developed β-sheet, they believe that this intermediate also contains significant tertiary structure.

Using staphylococcal nuclease, Vidugiris et al (158) found that pressure denaturation formed a transition state with a positive activation volume (basically an increase in volume of the protein/water system). The authors liken this swollen intermediate to a molten globule state. In another study looking at apomyoglobin unfolding, Barrick & Baldwin (159) describe an intermediate state with developed helices, no strong tertiary structure, and a Gibbs free energy closer to the unfolded state than the native. From these results, they conclude that side-chain packing is responsible for most of the stability of the native state. This apomyoglobin intermediate can be thought of as the initial burst state, seen in much of the kinetic work (121, 122), in which the protein is compact and yet contains secondary structure. As discussed above, Uversky & Ptitsyn liken the burst intermediate to a premolten globule state (120). Eliezer et al (160) provided a more general view of the solution structure of apomyoglobin's folding intermediate. Their small-angle X-ray scattering showed that the initial folding intermediates at 20 and 100 ms are as compact as the molten globule and almost as compact as the refolded native state. In a quite recent analysis of dihydrofolate reductase refolding, Hoeltzli & Frieden (manuscript submitted) monitored the resolved resonances of 6-¹⁹F-tryptophan and found strong evidence that the search for the correct residue packing causes the slow rate-limiting step of refolding. In contrast, new techniques able to look at the formation of the burst phase intermediate suggest that it contains secondary structure and residues with native tertiary contacts (for a review see 161). Although these results are still preliminary, they provide support for nucleation events in folding.

Conclusions from Experiment
Analyses of protein crystal structures (6, 20, 25), as well as solution measurements (82), show that proteins in their native conformations possess tightly packed cores. Experimental results are not as clear as to when or how this well-packed core arises. It is clear that proteins follow more than one folding pathway; however, for all pathways, collapse occurs early. With the caveat that the data come from a limited set of proteins and experiments, we can construct the following general folding progression. From a denatured state, a protein collapses into an initial burst phase intermediate (or for the small, fast-folding proteins, folds directly to the native state). This proposed premolten globular state contains a certain amount of secondary structure and tertiary contacts, but the protein's overall topology is incomplete. Next, development of the general chain topology occurs. As yet, not all the side-chains have packed well. In the end, the protein attains its native conformation with a tightly packed core. Simulations of folding (for a review see 162) as well as examination of hinge motions (72) and mutational studies (86, 87, 88, 89) support the idea that packing can drive the last step in folding. Kinetically trapped intermediates could occur at any point along this pathway. Although still speculative, this picture does point out that both nucleation and hydrophobic collapse play important roles in protein folding. It is uncertain exactly where and to what extent either affects each stage of the folding process. In any case, experiments show that a well-packed core is essential to achieving the native state during protein folding.

MODELING THE PACKING OF SIDE-CHAINS


Early Work—Defining the Problem
The difficulty of ab initio protein structure prediction originates from the enormous number of three-dimensional conformations that a chain of amino acids can adopt. A 100-residue protein has approximately 400 degrees of freedom: Each residue has two main-chain single-bond torsion angles, ψ and φ, and on average two side-chain single-bond torsion angles, χ₁ and χ₂ (small side-chains have one χ angle; large ones have four). Crudely assuming that a torsion angle accuracy of 10° is sufficient, each residue has 36 × 36 = 1296 independent (φ, ψ) main-chain conformations, giving a main-chain combinatorial complexity of 1296¹⁰⁰ ≈ 10³¹¹. Making the same assumption for the two side-chain torsion angles also gives a complexity of 10³¹¹. The two conformational spaces are the same size. However, the main-chain torsion angle errors propagate throughout the protein and are sequentially amplified. Side-chain angle errors only affect the local conformation and propagate less directly.

In 1987, Ponder & Richards (163) pointed out that using the criterion of "good packing" against the rigidly fixed native main-chain rules out the majority of side-chain rotamer conformations for residues in core regions. Side-chain rotamers, which are a tabulation of frequently observed conformations, have been proposed for many years (164), but Ponder & Richards (163) reduced these to a set of 67 different conformations that could account for most side-chains observed in real proteins (assuming an angle tolerance of ±20°). While enumeration of these conformations is computationally feasible over a few neighboring residues, the task of enumerating all possibilities for each residue in a 100-residue protein is computationally intractable. (Specifically, there are on average 3.35 rotamers per amino acid (67/20), and this gives 3.35¹⁰⁰ ≅ 10⁵³ combinations.)
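The size estimates above are easiest to check in logarithms, since the raw counts overflow ordinary floating-point numbers. The following is a minimal sketch; the function name `log10_conformations` is ours, introduced only for illustration.

```python
import math


def log10_conformations(states_per_residue: float, n_residues: int) -> float:
    """Return log10(states_per_residue ** n_residues) without overflow."""
    return n_residues * math.log10(states_per_residue)


# Main-chain grid: phi and psi each sampled every 10 degrees, so 36 * 36 states.
main_chain = log10_conformations(36 * 36, 100)

# Rotamer library: 67 rotamers spread over 20 amino acid types (average 3.35).
rotamers = log10_conformations(67 / 20, 100)

print(f"10-degree torsion grid, 100 residues: about 10^{main_chain:.1f}")
print(f"67-member rotamer library, 100 residues: about 10^{rotamers:.1f}")
```

Even the rotamer library, which removes more than 250 orders of magnitude relative to the torsion grid, leaves far too many combinations to enumerate.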

One of the first attempts to actually predict the side-chain conformation given the correct conformation for the main-chain involved manual modeling (165). Working with the known X-ray conformation of the main-chain of flavodoxin, this test study yielded a final side-chain prediction error of 2.41 Å RMS (root mean square). Nevertheless, many large aromatic side-chains deep within the core of the protein were very badly predicted. This in turn led to an error propagation cascade throughout, causing satisfactory prediction for only 30–40% of the side-chain conformations.

Several investigators have performed local energy minimization of a very few residues in the field of otherwise fixed protein atoms (166, 167, 168). By restricting interest to situations where only a limited number of side-chains were replaced (e.g. by assuming that conserved residues remain in similar conformations when two sequences have very high sequence similarity), these methods effectively focused their efforts on neighboring residues. Their success suggested that, if the problem could be separated into small sets of residues that interact little with each other, the daunting combinatorics of the side-chain packing problem could be surmounted.

A Possible Solution?
In 1991, four groups working independently each discovered a method that naturally broke the combinatorial problem into manageable pieces (169, 170, 171, 172). When a protein is stripped of all its side-chains, and the native main-chain is used as a rigid constraint to repack all the side-chain atoms, these varied methods could achieve an accuracy of 1.8 Å RMS error over all side-chain atoms.

These four methods all rely on the van der Waals energy to eliminate bad side-chain arrangements. They differ very much in how they generate possible side-chain conformations and how they choose between them. The method of Lee & Subbiah (169) utilizes no database information, making it the most physically based method of the four. Side-chains are allowed to explore torsion angles in 10° intervals, and simulated annealing is used to optimize the arrangement of neighboring side-chains by minimizing the van der Waals energy. Two of the methods use a set of rotamers taken from known protein conformations and optimize an energy function [which can include hydrogen-bonding and electrostatics (171)] using Monte Carlo (MC) minimization (170) or a genetic algorithm (171). The fourth method (172) relies more heavily on known protein structures and the surprising finding of Jones & Thirup (173) that almost all segments of main-chain conformation recur in proteins. In this method, van der Waals packing energy is used to select plausible segments of known protein structure, borrowing the side-chain conformation. Rather than optimizing the side-chain conformations, it introduces some variability in selecting chain segments, averages atomic coordinates to enhance the signal from the common conformations, and then regularizes the stereochemistry with energy refinement.
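To make the annealing idea concrete, here is a toy sketch of simulated annealing over discrete side-chain states. It is not the implementation of any of the four methods: the pairwise energy callable `pair_energy`, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration, and a real application would supply a van der Waals energy computed from atomic coordinates.

```python
import math
import random


def anneal_rotamers(pair_energy, n_states, steps=20000,
                    t_start=5.0, t_end=0.05, seed=0):
    """Toy simulated annealing over discrete side-chain states.

    pair_energy(i, a, j, b) is a hypothetical, symmetric callable giving the
    interaction energy of residue i in state a with residue j in state b.
    n_states[i] is the number of allowed states for residue i.
    Returns the lowest-energy assignment seen and its energy.
    """
    rng = random.Random(seed)
    n = len(n_states)
    state = [rng.randrange(k) for k in n_states]

    def local_energy(i, a):
        # Energy of residue i in state a against all other current states.
        return sum(pair_energy(i, a, j, state[j]) for j in range(n) if j != i)

    current_e = sum(pair_energy(i, state[i], j, state[j])
                    for i in range(n) for j in range(i + 1, n))
    best, best_e = list(state), current_e

    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i = rng.randrange(n)
        a = rng.randrange(n_states[i])
        delta = local_energy(i, a) - local_energy(i, state[i])
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state[i] = a
            current_e += delta
            if current_e < best_e:
                best, best_e = list(state), current_e
    return best, best_e
```

Because each trial move changes one residue, only that residue's interactions need to be recomputed, which is what keeps such a search cheap relative to enumeration.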

Since all these methods primarily rely on only extremely simple van der Waals packing in their energy functions, a better assay of accuracy is the predicted error in the well-buried side-chains. Considering only the half of all residues that are less solvent-exposed [<30% surface area accessible to the solvent (174)] significantly improves the prediction accuracy. The only ab initio method, using simulated annealing to minimize the van der Waals energies in a finely discretized torsional space (10° for the χ angles), was accurate to 1.25 Å RMS (169). The genetic algorithm approach (175, 176) that combinatorially mates rotamers selected from a 109-member rotamer database (171) was accurate to 1.54 Å RMS. The MC energy minimization over a similar rotamer database was accurate to 1.6 Å RMS (170). The segment-matching method was accurate to 1.37 Å RMS, in spite of its use of only the native Cα positions rather than the entire main-chain. This success by four different methods that all rely on packing to eliminate bad choices proved that the foresight of Ponder & Richards (163) was indeed correct.

Recent Refinements—A Classification
Over the past four years, a flood of new methods, as well as improved versions of the early ones, have been reported. The best of these, like those of Lee (177) and Vásquez (178), consistently break the 1 Å RMS barrier over a large set of proteins, while a few others (179, 180, 181) hover between 1 and 1.1 Å RMS error. Of the remaining recent methods, all report average errors of less than 1.45 Å RMS over a test set of 10–60 proteins (182, 183, 184, 185, 186).

The four methods discovered independently between 1991 and 1992 employ surprisingly different approaches. Classifying these and the newer methods helps highlight what is necessary for successful prediction. Methods that predict side-chain conformation from a known backbone conformation involve two steps: (a) choosing a set of possible conformations for each side-chain, and (b) choosing the conformations of each side-chain to optimize packing for a given fixed main-chain.

POSSIBLE CONFORMATIONS
The set of possible conformations is either knowledge based (taken from known three-dimensional structures of proteins) or defined by simple geometrical considerations. Most methods are knowledge based (170, 171, 178, 179, 180, 183, 187), following the use of rotamer libraries by Ponder & Richards (163). Variants in both the size and content of these libraries have been attempted (180, 187, 188, 189). The latter include some studies that use a rotamer set customized to match the local main-chain of the particular side-chain (180, 189). Others (172, 181, 190) take one or a small number of fragments from known protein structures using a local fit to the main-chain to choose fragments. A few investigators (169, 177) disregard these database approaches and instead vary the side-chain single-bond torsion angles in 10° increments.

OPTIMIZING PACKING
Most methods use some type of search strategy to find the combination of side-chain conformations that optimizes packing. Good packing is generally assumed to correspond to a favorable value of the van der Waals energy, with its strong steric repulsion and weak long-range attraction, but more complicated energy terms are sometimes included (171). Of greater importance than the energy is the search method used to find the best combination of side-chain conformations. Simulated annealing is surprisingly effective at finding the optimal packing corresponding to side-chain arrangements found in native proteins (169, 186, 191), as is the related MC minimization method (170). Genetic algorithms have also been used (171, 192). More elaborate search methods have also been used, such as "dead end elimination" (184, 193) and the A* algorithm (194), and these have been combined with other heuristics (187, 193). More physically based methods search with MD simulations (179, 180, 189, 192, 195), self-consistent mean fields (177, 183), and Gibbs sampling utilizing heat baths (178). One method (172) simply pastes together segments found in known proteins, subject to their packing well into the growing structure.
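As an illustration of how dead end elimination prunes the search, the sketch below applies the simplest single-rotamer criterion: rotamer r at a position can be discarded if its best possible energy (every neighbor taking its most favorable rotamer) is still worse than the worst possible energy of some alternative rotamer t at the same position. The data layout (`self_e`, `pair_e`) and the brute-force checker are our assumptions for the example, not taken from the cited papers.

```python
from itertools import product


def dead_end_elimination(self_e, pair_e):
    """Prune rotamers that cannot be part of the global energy minimum.

    self_e[i][r]       : energy of rotamer r at position i with the fixed main-chain
    pair_e[i][j][r][s] : energy between rotamer r at i and rotamer s at j (i != j)
    Returns, for each position, the list of surviving rotamer indices.
    """
    n = len(self_e)
    alive = [list(range(len(self_e[i]))) for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                # Best case for r: each neighbor takes its most favorable rotamer.
                best_r = self_e[i][r] + sum(
                    min(pair_e[i][j][r][s] for s in alive[j])
                    for j in range(n) if j != i)
                for t in alive[i]:
                    if t == r:
                        continue
                    # Worst case for t: each neighbor takes its least favorable rotamer.
                    worst_t = self_e[i][t] + sum(
                        max(pair_e[i][j][t][s] for s in alive[j])
                        for j in range(n) if j != i)
                    if best_r > worst_t:
                        alive[i].remove(r)
                        changed = True
                        break
    return alive


def brute_force_minimum(self_e, pair_e):
    """Exhaustively find the lowest-energy assignment (small inputs only)."""
    n = len(self_e)

    def energy(choice):
        return (sum(self_e[i][choice[i]] for i in range(n))
                + sum(pair_e[i][j][choice[i]][choice[j]]
                      for i in range(n) for j in range(i + 1, n)))

    return min(product(*(range(len(e)) for e in self_e)), key=energy)
```

Because the inequality is strict, the globally optimal combination always survives the pruning; the brute-force function is included only so that this can be checked on small examples.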

AB INITIO METHODS
Only a handful of methods that do not rely on protein-derived knowledge have worked well. One that relies on MD "annealing" of successively added atoms beyond the Cβ atoms enjoyed some success (196) but has since been reported to be inferior to rotamer-based methods (192). A related method of annealing "sprouted" side-chain atoms, again using MD, has only been reported to work on small peptides (197). The most successful ab initio methods (169, 177), mentioned above, rely on simple van der Waals energy in conjunction with complete sampling of torsion angle space.

Assessing the Accuracy
RANDOM OR WORST RMS
The success of these methods must be put into context by considering the RMS expected if all side-chain conformations were (a) randomly predicted or (b) predicted as badly as possible. The random RMS was estimated to be 3.1 Å and the worst RMS to be 4 Å for a 100-member rotamer library (169, 170). Later work gave similar random RMS between 3.3 Å and 3.5 Å, depending on the size of the rotamer library (187). Many studies have answered the opposite question of how well the best rotamer-based prediction can represent the native structure: RMS values range from about 0.5 Å for the large 624-member rotamer libraries (170, 178, 179, 187) to 1.0 Å for the original 67-member rotamer library (163).

EXPERIMENTAL ACCURACY
The answer to the question "What error value corresponds to an excellent prediction?" can be found in a rotamer-independent manner. It has long been known that when X-ray structures of the same protein are determined by two different laboratories or in two different crystal forms, the main-chain atoms differ by about 0.5 Å RMS (198, 199). The side-chains can differ by as much as 1.5 Å RMS (199), but for the more buried side-chains not involved in crystal contacts, the difference can be up to 1 Å (198). Judged against a side-chain RMS of 3.1 Å being random and an RMS of 1 Å being the best possible, the fact that automatic methods routinely achieve values as low as 1.25 Å RMS suggests that the side-chain packing problem may be solved.
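The RMS errors quoted throughout this section compare predicted and experimental coordinates of the same atoms. A minimal sketch follows, assuming that the two coordinate sets are already in the same frame (as they are when side-chains are rebuilt on a fixed main-chain, so no superposition is performed) and that atoms are listed in matching order; the function name is ours.

```python
import math


def side_chain_rms(predicted, native):
    """RMS deviation between matched atom coordinates, in the input units.

    predicted and native are equal-length sequences of (x, y, z) tuples for
    the same atoms in the same order. No superposition is applied.
    """
    if len(predicted) != len(native) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((p - q) ** 2
                  for a, b in zip(predicted, native)
                  for p, q in zip(a, b))
    return math.sqrt(squared / len(predicted))
```

Restricting the atom lists to buried residues before calling the function gives the "buried side-chain" figures discussed above.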

TORSION ANGLES
Another measure of fit is the percentage of side-chains for which the torsion angles are correctly predicted. For the buried residues, the better side-chain packing algorithms usually predict correctly (within 40°) 90% of the χ₁ angles and 80% of (χ₁, χ₂) angle pairs (169, 177, 178, 181, 187). When all residues are considered, these figures drop to 80% and 70%, respectively. The percentage correct obviously depends on the match criteria: With stricter criteria (within 20° or 30°), these values are reduced by about 10% (170, 172, 181, 200). These predicted values must be compared with the best that can be achieved by rotamer libraries. Allowing a deviation of less than 40° from the angle derived from X-ray information, even the smaller rotamer libraries can often correctly capture the native side-chain conformations for some 95% of the χ₁ angles and 90% of the (χ₁, χ₂) pairs (178). With the stricter criterion of being within 20° of the angle from X-ray structures, these values drop to 85% and 75%, respectively (200). It is encouraging that for the buried side-chains, the success rate of prediction is only 10% less than the best possible with rotamer libraries.
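The torsion angle criterion can be sketched as follows. The one subtlety is that angles are periodic, so 170° and -170° differ by 20°, not 340°. The function names and the tuple layout are assumptions for illustration, and the sketch ignores the symmetry of side-chains such as phenylalanine, for which a 180° flip of the last angle is equivalent.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_correct(predicted, native, tolerance=40.0):
    """Fraction of residues whose listed chi angles all deviate by less than tolerance.

    predicted and native are equal-length sequences of tuples of angles in
    degrees, e.g. (chi1,) to score chi1 alone or (chi1, chi2) to score pairs.
    """
    hits = sum(
        all(angle_difference(p, q) < tolerance for p, q in zip(pred, nat))
        for pred, nat in zip(predicted, native))
    return hits / len(native)
```

Passing one-element tuples scores χ₁ alone, while two-element tuples score the (χ₁, χ₂) pair, which is why the pair figures are always the lower of the two.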

PREDICTIVE SUCCESS
In terms of claimed accuracy, the ab initio method of Lee (177) and the rotamer-based method of Vásquez (178) are marginally superior to all others. Lee has published predictions prior to experimental X-ray determination that have proved to be accurate. He has reported RMS errors of 0.68–0.89 Å in side-chain prediction for T4 lysozyme mutants (201), 1.11 Å on λ-repressor mutants (202), and 0.97 Å RMS on polymeric HLA alleles (203). While some caution should be expressed since these predictions are only for a few buried residues, the results do suggest that the best side-chain packing methods can be useful.

Why Is This An Easy Problem?
Since it appears that the packing of side-chains can be well predicted, some investigators have suggested the problem is not really combinatorial in that the allowed side-chain conformations depend on the local main-chain environment (180, 187, 204). Methods that choose the side-chain conformation based only on the local main-chain are about 20% less accurate than methods that allow full combinatorial packing (178, 183). This remaining 20% in accuracy can only be obtained by considering combinatorial packing (178, 180, 183, 187).

Main-Chain Movement
It is becoming increasingly clear that the assumption of a fixed main-chain during combinatorial repacking is not generally valid. Attempts have been made very recently to relax this assumption. A clever method, which allows main-chain and side-chain flexibility, has been applied to the special case of repeating coiled-coil structures; it is able to predict the buried side-chains almost as accurately as when the perfect main-chain is available (195). Koehl & Delarue (205) have applied the mean-field approach, so successful for side-chains (177, 183), to the main-chain with promising results. Wilson et al (179) have proposed the use of rounds of alternating side-chain packing onto a fixed main-chain and full MD minimization. The multiple copy and mean-field approaches also appear to be particularly well suited to allowing main-chain shifts (183, 206, 207).

GETTING TO THE ENDGAME


How Close Is Close Enough?
In order to get to the endgame, one needs a backbone that is very close to native. How close is close enough? The question of whether packing optimization schemes can model side-chains accurately upon fixed imperfect backbones has been under intense scrutiny. Many studies have considered the repacking of correctly aligned target sequences onto fixed homologous template structures, employing the same algorithms used when the ideal backbone was provided (168, 179, 180, 183, 188). Recently, in a more systematic study that spanned the full range of possible sequence identities within certain protein families, Chung & Subbiah (208, 209) observed a monotonic decrease in buried side-chain prediction accuracy as the sequence identity diverged and backbone deviation increased. They estimate that when the template is more than 2 Å RMS error from the native backbone (corresponding to ~25% sequence identity), the side-chain prediction accuracy approaches the random expectation of 3.1–3.3 Å RMS (169, 187, 209).

In the absence of general methods that accommodate movable backbones in side-chain prediction, it appears that backbones within 2 Å RMS of the native structure are required for accurate modeling. Backbones as accurate as these are sometimes available if the structure of a close sequence homologue is known and the two sequences are correctly aligned. In the general case, how is it possible to obtain folded backbones that are sufficiently accurate?

Threading Methods
In one approach, known as threading or fold recognition, a new sequence is aligned upon a known three-dimensional structure, and each sequence-structure alignment is scored via an energy function. Threading has identified compatible folds that are undetectable by conventional sequence alignment methods (210, 211). However, success in recognizing a related fold does not imply success in building an accurate model using the related fold as a template. The alignment of the new sequence on the known backbone has to be almost perfectly correct to get the required 2-Å accuracy (adjacent residues are about 4 Å apart). Results from the threading predictions illustrated the various shortcomings of available alignment and/or scoring methods (212). Moreover, even given perfect alignments, backbones generated by threading methods may not be useful if the aligned sequences show less than 30% identity (208). Threading specializes in finding such folds, so it is unlikely to provide acceptable backbones for standard side-chain prediction methods, even if the alignment were optimal. In any case, at the present time many proteins of interest are new folds for which there is no threading target. Hence, we do not regard threading in its current form as a viable pathway to the endgame of folding.

Ab Initio Folding
In ab initio methods, a fold for the new sequence is generated without directly using the known fold of any other protein. This is accomplished either by (a) a broad and even sampling of conformational space by an energy-independent method, followed by screening of the resulting candidate folds by an energy function or (b) minimizing the conformational energy of a polypeptide as it folds through an approximately continuous conformational space. In either case, the level of detail included in the structural representation must balance the computational tractability and geometric accuracy of the model. Lattice models can be computationally feasible, even to the point of enumerating folds exhaustively (213), but such folds sacrifice secondary structure features and are generally less accurate than 5 Å RMS. Off-lattice discrete models, such as those possessing six states per residue, can reproduce the native backbone to 2 Å RMS (214), but generating such folds exhaustively is beyond the power of today's computers (a chain of length 100 has 6¹⁰⁰ ≈ 10⁷⁸ folds). For minimization methods, complex lattice and discrete representations hinder the search for the energy minimum because they make the energy landscape more rugged and increase the number of moves necessary to traverse the conformational space. In spite of these limitations, ab initio folding approaches have made progress, routinely achieving structures with accuracy up to ~4 Å RMS error (215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227). Other methods, especially those that use experimental secondary structure constraints, fare even better. In the next section, we review approaches to ab initio folding that have produced folds within 2–4 Å RMS of the native structure.

Discrete-State Models and Energy Functions
A simple discrete-state model has been described by Park & Levitt (228). Their optimized four-state off-lattice representation is able to build backbones within 2 Å RMS from the native backbone. Even with this model, exhaustive enumeration is impossible, as 4¹⁰⁰ ≈ 10⁶⁰ folds is an intractable number. If one enforces the native secondary structure as an external constraint, this model has no more than about 200,000 folds for each protein. This is a manageable number of folds that evenly and broadly sample phase space while providing candidates that take folding into the endgame.

Providing such candidates in itself is not useful unless an energy function can successfully distinguish the native-like folds from the entire set of folds generated. This issue presents two questions: (a) Can energy functions distinguish between the near-native folds and those that are grossly misfolded? (b) Can the energy functions distinguish between the native structure and the near-native folds? The first question depends as much on the quality of the representation as on the effectiveness of the energy function; i.e. energy functions are useful only in the context of a representation capable of generating suitably near-native structures. The second question asks whether or not the energy function can tell a true native fold from the best near-native decoys; it thus assesses the resolution of the function and indicates whether further minimization of the function can in principle drive the conformation towards a more native-like state.

Park & Levitt (229), using a basis set of six energy functions, report that the native fold can be recognized very effectively if one combines energy functions that stress complementary factors, such as nonspecific hydrophobicity (a general compacting force) and residue-specific pairings. Furthermore, the best of the native-like folds usually rank very high in the energy-sorted list. On average, the best combinations of energy functions place the native-like folds in the top 1% of the score-sorted list (229), although there are always many grossly misfolded decoys with energies more favorable than some of the near-native folds. Therefore, if one were to apply an effective energy function as a screen of the entire decoy set (for example, by taking the top half of the energy-sorted list), the concentration of the near-native folds in the high-scoring subset would increase, but the highest-scoring folds in the subset would not all be near-native. In other words, RMS deviation and energies are not highly correlated in the RMS range explored in this study (229). More encouraging is that the best energy functions typically score the native fold more favorably than all the decoys, including those within 2 Å RMS.
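One simple way to combine energy functions that are measured on different scales, and then to ask where a given fold falls in the sorted list, is sketched below. The z-score sum used here is our assumption for the sake of illustration; the text above does not specify how the functions in (229) were actually combined, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def combined_scores(energy_table):
    """Combine several energy functions into one score per decoy.

    energy_table[k][d] is the value of energy function k for decoy d, with
    lower values more favorable. Each function is converted to z-scores over
    the decoy set so that functions on different scales contribute equally
    (an assumed scheme, not the published one), and the z-scores are summed.
    """
    n_decoys = len(energy_table[0])
    total = [0.0] * n_decoys
    for values in energy_table:
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # a constant function carries no ranking information
        for d, v in enumerate(values):
            total[d] += (v - mu) / sigma
    return total


def rank_percentile(scores, index):
    """Percentage of decoys scoring strictly better (lower) than decoy `index`."""
    better = sum(1 for s in scores if s < scores[index])
    return 100.0 * better / len(scores)
```

A near-native fold with a rank percentile below 1 would sit in the top 1% of the score-sorted list in the sense used above.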

Energy Minimization and Search Strategies
Methods that use energy minimization to move through phase space have shown promise in folding to near-native conformations. Recent work by Mumenthaler & Braun (230) describes a self-correcting distance geometry method for predicting the tertiary arrangement of small globular helical proteins. This method, like the one by Park & Levitt (228, 229), assumes that the helical segments are known in advance; only the (φ, ψ) dihedral angles of loop residues are adjustable (though constrained to combinations that are commonly observed in the database for each residue type). First, the method predicts whether each residue is solvent-exposed ("outside") or buried ("inside"), using an algorithm that exploits multiple sequence alignment information (231). Upper limits for the distances between the three types of residue pairings (inside-inside, outside-outside, and inside-outside) are calculated as a function of the size of the protein. The minimization engine then applies these distance constraints in a clever algorithm that dynamically adjusts constraints over each iteration of the structure generation cycle. Thus, rather than having an energy function per se, the method relies on a "target function" that depends on the predicted constraints. The structures with the fewest constraint violations tend to cluster within 3 Å RMS of the experimentally determined structure, although only the helical residues were included in the RMS calculation. The final predicted structure, taken as the average structure in the low-violations cluster, can be accurate to 2.3 Å RMS of the native structure. Because the constraints are adjusted to the structures during the procedure, there is no path-independent energy function available for further minimization. Overall, six out of eight test proteins converged to near-native predictions (≤ 3 Å RMS error), but none were within 2 Å. Nevertheless, this method can be a useful tool for taking folding into the endgame, assuming that secondary structure prediction methods continue to improve.

A similar minimization approach was developed by Sun et al (232). Like the two procedures discussed above, this method also begins with the known secondary structure elements in order to reduce the conformational space to be searched. Their conformational search engine is two tiered and is powered by a genetic algorithm that operates on a string of paired (φ, ψ) dihedral angles describing the conformation of the protein. First, mutation and crossover operations are performed at randomly chosen rotatable residues (i.e. those not in secondary structure). Mutations are random selections from a set of dihedral angle pairs derived from the structure database. The second step refines the search by perturbing randomly chosen unconstrained torsion angles slightly in order to probe the local energy landscape for minima. The selection method is an energy function that models the hydrophobic interaction (1) and is an extension of the simple hydrophobic-polar models of Dill and coworkers (233). The results were encouraging. Out of ten test cases, four of the lowest-energy models were within 4 Å RMS error, but none of the minimized structures achieved 2-Å accuracy. Moreover, many of the native structures had energies much worse than the minimized structures, thus limiting the utility of their highly simple energy function in the endgame.
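The two genetic operators described above can be sketched on a chain represented as a list of (φ, ψ) pairs. This is an illustrative reading of the description only, not the code of Sun et al: the function names, the single-point crossover, and the small angle library in the usage note are all assumptions.

```python
import random


def mutate(chain, rotatable, allowed_pairs, rng):
    """Replace the (phi, psi) pair at one randomly chosen rotatable residue.

    chain         : list of (phi, psi) tuples, one per residue
    rotatable     : indices of residues outside fixed secondary structure
    allowed_pairs : library of (phi, psi) pairs to draw from (hypothetical here)
    """
    child = list(chain)
    position = rng.choice(rotatable)
    child[position] = rng.choice(allowed_pairs)
    return child


def crossover(parent_a, parent_b, rotatable, rng):
    """Single-point crossover with the cut placed at a rotatable residue."""
    cut = rng.choice(rotatable)
    return parent_a[:cut] + parent_b[cut:]
```

For example, with `rng = random.Random(0)`, two parents that share the same helical residues and differ only at loop positions 3 and 4 yield children whose helical angles are untouched, since both operators act only at rotatable positions.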

Let us summarize the strengths and shortcomings of the ab initio methods discussed above. The results of Park & Levitt (228, 229) suggest that an effective energy function (of which there are several) yoked with the proper search strategy can drive near-native folds towards the native fold. However, the same function cannot reliably recognize near-native folds, even the best ones, from the entire set of decoys. For near-native structure generation, the minimization method of either Mumenthaler & Braun (230) or Sun et al (232) might be a better alternative. However, these methods are not fail-safe, for they do not always converge near the native structure.

In the ab initio methods discussed above, folds were generated either exhaustively (229) or from random tertiary arrangements (230, 232). As close as these methods can get to the native fold, their accuracy is hampered by the reduced complexity of the model, the energy functions that drive the folding of the chain, or both. In the next section, we address these concerns. Energy functions are challenged to recognize native folds from all-atom representations very close in conformation to the native fold.

Discriminating Native from Near-Native Conformations
A key requirement of an energy function able to drive the search towards the endgame is that the native conformation have a lower energy than the near-native conformations. Such sets of near-native conformations can be generated by deforming the experimentally determined structures using methods such as MC and MD simulations. Energy functions are then applied to these test sets in order to assay their discrimination power.
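The discrimination test described here amounts to counting the decoys that score better than the native structure. A minimal sketch, with a hypothetical function name and data layout:

```python
def false_positives(native_energy, decoys):
    """Return the decoys that score more favorably than the native structure.

    decoys is an iterable of (rms_to_native, energy) pairs, with lower energy
    more favorable. Each returned pair is a "misrecognized" decoy.
    """
    return [(rms, energy) for rms, energy in decoys if energy < native_energy]
```

Keeping the RMS deviation alongside each energy makes it easy to ask the follow-up question of whether the misrecognized decoys are close to the native structure or grossly misfolded.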

The method developed by Wang et al (234, 235) was the first attempt at recognizing the native fold from large decoy sets of near-native and compact structures. This method is based on the atomic solvation potential of Eisenberg & McLachlan (236), grouping atoms into 17 chemically related "molecular fragment types," each with its associated solvation parameter. These parameters were obtained by a training algorithm that maximizes the solvation energy difference between the native and a large set of compact nonnative structures generated by MC and MD simulations (235 and references therein). The solvation parameters were then used to score the native structures of a separate test set against decoy structures generated by MC and MD. The MC-generated structures were selected to be compact (the radius of gyration did not exceed that of the native structure plus 5%) and within a predetermined RMS deviation from the native structure (up to 5 Å maximum). The MD simulations were carried out at room temperature (300 K) and high temperature (500 K); the average RMS errors for the 300-K and 500-K simulations were 4.1 Å and 8.0 Å, respectively. More than 8200 nonnative MC and MD decoys were furnished for each of 11 test proteins, of which only 7 on average were misrecognized as native (having a more favorable energy score than the experimentally determined structure). The solvation energy roughly correlated with the RMS deviation between the native and decoy structures. All of the misrecognized decoys, or false positives, were structures very close to the native (<1 Å RMS). Wang et al (234) also demonstrated that their method compared favorably against a battery of standard energy functions: MD force fields, statistically derived contact potentials, three-dimensional profile methods, knowledge-based potentials of mean force, and others (188, 210, 237, 238, 239, 240, 241, 242).
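
The two selection criteria applied to the MC-generated decoys (radius of gyration within 5% of the native, RMS deviation of at most 5 Å) can be sketched as follows. This is an illustration only: the function names are hypothetical, the radius of gyration is not mass-weighted, and the RMS deviation assumes the two coordinate sets have already been superimposed.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    total = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                for x, y, z in coords)
    return math.sqrt(total / n)


def rmsd(a, b):
    """RMS deviation between two equally long, pre-superimposed point lists."""
    if len(a) != len(b):
        raise ValueError("coordinate sets must have the same length")
    total = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2
                for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def is_acceptable_decoy(decoy, native, rg_tolerance=0.05, max_rmsd=5.0):
    """Apply the compactness and RMS-deviation criteria described above."""
    compact = radius_of_gyration(decoy) <= (1.0 + rg_tolerance) * radius_of_gyration(native)
    return compact and rmsd(decoy, native) <= max_rmsd
```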

In a related study, Huang et al (243) explored the ability of a very simple hydrophobic contact function (244) to recognize near-native decoys generated by MD simulation in solution at room (298 K) and high (498 K) temperatures. Five small proteins formed the test set. Overall, the average RMS deviations from the native structure were 1.5 Å (at 298 K) and 4.1 Å (at 498 K). As in the earlier studies (234, 235, 243), native structures were readily identified from the sets of decoy structures: There were only 330 false positives out of 10,000 (combined room and high temperature runs for the five proteins). Likewise, the energy function is strongly dependent on the extent to which the structures are deformed: Only one false positive exhibited an RMS deviation more than 2 Å from the native structure (243).
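
To make the idea of a hydrophobic contact function concrete, the sketch below scores a structure by counting close pairs of hydrophobic residues. It is not the function of reference 244: the set of hydrophobic residues, the 7.0 Å cutoff, the use of one point per residue, and the exclusion of sequence neighbors are all assumptions chosen for illustration.

```python
import math

# Assumed set of hydrophobic residues (one-letter codes); illustrative only.
HYDROPHOBIC = set("AVLIMFWC")


def hydrophobic_contact_energy(sequence, centroids, cutoff=7.0, min_separation=2):
    """Return minus the number of hydrophobic residue pairs within the cutoff.

    sequence  -- one-letter amino acid codes
    centroids -- one (x, y, z) point per residue, in angstroms
    """
    if len(sequence) != len(centroids):
        raise ValueError("sequence and centroids must have the same length")
    contacts = 0
    for i in range(len(sequence)):
        if sequence[i] not in HYDROPHOBIC:
            continue
        # Skip residues adjacent in sequence, which are always close in space.
        for j in range(i + min_separation, len(sequence)):
            if sequence[j] in HYDROPHOBIC and math.dist(centroids[i], centroids[j]) < cutoff:
                contacts += 1
    return -contacts
```

A function of this kind rewards any compaction of the hydrophobic residues, so it responds to small expansions of the structure as well as to changes in the fold.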

What is the impact of these two studies on how the endgame is played? Both appear to be successful at identifying native folds from compact, near-native folds, a quality that other functions apparently lack (234). Huang et al (243) note that the decoy set used in their study is perhaps a more rigorous test, given the lower RMS deviations produced from MD simulations. Indeed, demonstrating that simple energy functions can discriminate native from near-native structures in this RMS range (0–2 Å) is important. Given that ab initio methods can provide folds that are quite close to the native (around 2 Å), it is important to use methods such as MC and MD simulations to probe the relationship between energy and molecular conformation within 2 Å RMS from the native. However, even more challenging near-native test sets are needed to assess the true discrimination power of existing potentials. High-temperature MD simulations (234, 235, 243) and the MC simulations of Wang et al (235) compromise the integrity of the secondary structure and loosen the packing of the tertiary structure. Even the 298-K MD simulations in solvent by Huang et al (243), which depart from the native by an average of only 1.5 Å RMS, undergo a 2–3% increase in the radius of gyration. A function that stresses hydrophobicity (i.e. nonspecific compacting force), such as the one by Huang et al (243), is sensitive to minute changes of this type. Corroborating evidence is seen in recent work by Levitt and coworkers, who have tested the performance of 18 energy functions on this set of MD structures (245). This study indicated that other energy functions emphasizing hydrophobicity also excelled at native fold discrimination.

Although RMS deviation imperfectly serves as a coordinate along the folding trajectory, it is encouraging nonetheless to confirm its strong correlation with energy functions (243). Although neither study attempted to minimize its energy function using near-native structures as starting points, we challenge future studies to progress along these lines.
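
One simple way to quantify the relationship between RMS deviation and energy over a decoy set is the Pearson correlation coefficient. The sketch below is an assumption about how such a check might be carried out, not the analysis used in the studies reviewed; the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Given the RMS deviation and energy of each decoy, a coefficient near +1 indicates that energy rises steadily as structures depart from the native.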

CONCLUSIONS

Proteins are close-packed both in the solid state and in solution. In fact, they are probably the most tightly packed form of organic matter. This close-packing is related to function in that it provides a rigid core on which to arrange catalytic side-chains in enzymes. Loose-packing is often associated with flexible hinges and conformational changes, whereas tight packing correlates with better stability. How such tight-packing arises in protein folding is still unclear, although there has been enormous progress in characterizing the packing of partially folded intermediates. However it arises, this close-packing limits the number of possible arrangements of the side-chains, which has led to methods capable of predicting side-chain packing on a known, rigid main-chain. These same methods are applicable to homology modeling provided the main-chain "borrowed" from the related structure is close enough (within 2 Å). If no homologous structure is known, other methods can sometimes generate main-chains that are almost close enough (<3.5 Å RMS). It is crucial to have an energy function that can recognize the folds that are closer to the native structure.

Throughout the review, we have argued that packing forms a strong constraint on protein structure, severely restricting the number of possible structures. However, in the earlier stages of the folding process, particularly those relating to the formation of the overall fold, it is believed that packing is much less important. This theory has been borne out in experimental studies demonstrating how tolerant a fold is to many random mutations (89, 246, 247, 248). It has also been substantiated in theoretical studies that show how surprisingly easy it is for a protein of random sequence (a "random heteropolymer") to close-pack in an approximate sense (249, 250, 251).

A number of challenges lie ahead. Perhaps the greatest is to understand how a protein close-packs its residues during the latter stages of folding. The early stages are generally considered to be dominated by nonspecific hydrophobic interactions. Another challenge is to understand how packing affects function: If loose-packing is essential for function, it should be possible to design proteins that are too stable to function as catalysts. In the area of computer simulations, we expect progress in the consideration of main-chain flexibility, derivation of strongly discriminating energy functions, and generation of diverse sets of decoy folds. For structure prediction, the problem of packing side-chains using a near-native backbone seems almost completely solved. The challenge now is to generate main-chains sufficiently close to the native backbone to allow packing algorithms to be successful. It also seems likely that designing small helical proteins will be easiest, and their detailed structure could be predicted over the next five years!
