Quite Interesting PDB Structures


Phaser - a "stunning" method for solving crystal structures


Prof. Randy Read, University of Cambridge, UK.
Phaser (ref. 1, ref. 2) is a program used by crystallographers to solve the so-called “phase problem”, a crucial step in determining the structure of a biomacromolecule. Phaser was developed in the laboratory of Prof. Randy Read in Cambridge, UK, was first released in 2003, and has since been incorporated into two major crystallographic software packages, CCP4 and Phenix. Combining sophisticated algorithms with rapid automated search calculations, Phaser solves structures by “recycling” previously solved ones, a process known as Molecular Replacement (MR). Phaser’s success is illustrated by the fact that, at the time of writing, more than 12% of all crystal structures in the Protein Data Bank (PDB) have been solved with it.

In this episode of Quips, we look back at the first novel structure that was solved with Phaser: PDB entry 1xg5 - the structure of a human dehydrogenase enzyme determined by the Oxford Structural Genomics Consortium (SGC) in 2004. The structure revealed that each subunit of this enzyme has a classic Rossmann fold (view-1), consisting of a core of parallel β-strands flanked on either side by antiparallel α-helices. We take a look at how this structure was solved and why methods like those used in Phaser will be increasingly important in the future.

Facing the phase problem

Nowadays, X-ray diffraction data can be collected from macromolecular crystals at mind-boggling speeds. CCD detectors with high read-out rates together with high-brilliance synchrotron sources make it possible to collect hundreds of diffraction images and thousands of intensity data points determined from them in minutes.

The crucial information missing from the X-ray data, however, is the set of relative phase relationships among the diffracted waves. Without these phases it is not possible to calculate an interpretable 3D image of the electron density of the molecules in the crystal.
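
To make the role of the phases concrete, here is a minimal sketch using a toy one-dimensional "crystal" with two atoms. It is an illustration only: the grid size, atom positions and function names are assumptions made for this example and are not taken from any crystallographic package. The density is transformed into structure factors, the phases are discarded as they are in the experiment, and the density is then rebuilt once with the true phases and once with all phases set to zero.

```python
import cmath
import math

def structure_factors(density):
    """Discrete Fourier transform of a 1D density sampled on a grid."""
    n = len(density)
    return [
        sum(density[x] * cmath.exp(2j * math.pi * h * x / n) for x in range(n))
        for h in range(n)
    ]

def density_from(amplitudes, phases):
    """Inverse transform: rebuild the density from amplitudes and phases."""
    n = len(amplitudes)
    factors = [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]
    return [
        sum(factors[h] * cmath.exp(-2j * math.pi * h * x / n) for h in range(n)).real / n
        for x in range(n)
    ]

# Toy "crystal": a heavier and a lighter atom in a cell of 16 grid points.
true_density = [0.0] * 16
true_density[3] = 2.0
true_density[10] = 1.0

factors = structure_factors(true_density)
amplitudes = [abs(f) for f in factors]       # what the experiment measures
phases = [cmath.phase(f) for f in factors]   # what the experiment loses

with_phases = density_from(amplitudes, phases)           # matches true_density
without_phases = density_from(amplitudes, [0.0] * 16)    # largest peak moves to x = 0
```

With the true phases the two atoms reappear at grid points 3 and 10. With the phases set to zero, the largest peak sits at the origin instead, so the map no longer shows where the atoms are.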

Phases can be calculated once the majority of atoms in a structure have been positioned accurately. But how do we get the atoms positioned to start with? Rough estimates of phases can be obtained from clever calculations involving small numbers of heavy atoms. Reactive metals such as mercury, gold and platinum provide the heavy atoms for phasing methods that used to be the workhorses of X-ray structure determination. The initial phase estimates can be refined as more of the lighter atoms in the structure are put in place.

MR methods such as those used in Phaser take a different approach. They attempt to position a fairly large number of lighter atoms all in one go - typically the conserved core atoms of the macromolecule. But how can we get a reasonable idea of how these might be arranged?

He ain't heavy - he's my... structural... brother

Proteins are folded chains of amino-acid residues while nucleic acids contain helical arrangements of paired nucleotide units. It turns out that macromolecules that are evolutionarily related tend to have very similar structures - their folds are conserved.

The Rossmann fold, found in many dehydrogenase enzymes (view-1), was in fact one of the first cases where this phenomenon was observed for proteins. Often, whether or not two proteins are related by evolution can be inferred from the fact that their sequences display similarities. Thus, to identify proteins with a potentially similar structure to the protein we are studying, we can do a sequence-similarity search of the PDB.

If a protein of known structure with a sequence similar to that of our target molecule can be identified, it is likely that their core structures are similar as well. Indeed, it has been found that overall structure (or fold) is conserved even as sequences evolve and diverge. This is illustrated in view-2, which compares the enzyme structure of PDB entry 1xg5 with several similar structures from the PDB. Each has a core Rossmann fold, but with a sequence that is increasingly different from that of the 1xg5 enzyme. You can see the family resemblance in the core structures, which are conserved despite increasingly divergent sequences and larger changes in surface loops. As noted by Chothia and Lesk over 25 years ago (ref. 3), decreasing sequence identity (SI) is accompanied by increasing structural difference (expressed as RMSD for the set of core atoms).
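
The two quantities mentioned above can be sketched in a few lines. This is a simplified illustration, not the procedure used by any particular server: it assumes the sequences have already been aligned (gaps written as "-") and the structures already superposed, and it takes aligned, gap-free columns as the denominator for sequence identity, which is only one of several conventions in use. The function names and example data are made up for this sketch.

```python
import math

def sequence_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

# Made-up example: 4 identical residues out of 6 aligned, gap-free columns.
identity = sequence_identity("MKV-LAG", "MRVTLSG")

# Made-up example: two atom pairs, one of which differs by 2 units along z.
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
```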

The crystallographers who were trying to solve the structure of 1xg5 selected PDB entry 1edo for their MR calculations with Phaser. This is the structure of a plant dehydrogenase that has about 32% sequence identity to the human enzyme. You can see that in retrospect this was a reasonable choice, as it has a similar fold to the human dehydrogenase (view-2).


Obviously, view-2 represents 20/20 hindsight. The 1edo structure is no help in solving the human dehydrogenase structure until it has been placed accurately over each copy in the unit cell of the crystal of the human enzyme for which the data were collected. The unit cell is the smallest part of a crystal from which the rest of the crystal can be constructed purely by translations, like the tiles of a bathroom wall. In most cases, the unit cell contains two or more identical subunits (called asymmetric units) that are related to each other by so-called crystallographic symmetry (consisting of rotations and translations, in the case of biomacromolecules). An asymmetric unit, in turn, may contain one or more copies of the molecules that were crystallised. (If there is more than one copy, this is called non-crystallographic symmetry, or NCS, a term which you may encounter in papers describing crystal structures.) To solve a structure, one needs to locate the atoms in one asymmetric unit.
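
As a small illustration of how the contents of a unit cell follow from one asymmetric unit, the sketch below applies crystallographic symmetry operators to atoms given in fractional coordinates. The two operators are those of space group P2₁, chosen purely as an example and not meant to describe the 1xg5 crystal; the function name and the sample atom are likewise assumptions of this sketch.

```python
# Each operator maps fractional coordinates (x, y, z) to a symmetry-related position.
# These two are the operators of space group P2_1 (unique axis b), used as an example.
P21_OPERATORS = [
    lambda x, y, z: (x, y, z),           # identity
    lambda x, y, z: (-x, y + 0.5, -z),   # two-fold screw axis along b
]

def fill_unit_cell(asymmetric_unit, operators):
    """Apply every operator to every atom and wrap the result back into the cell."""
    cell_contents = []
    for operator in operators:
        for atom in asymmetric_unit:
            cell_contents.append(tuple(c % 1.0 for c in operator(*atom)))
    return cell_contents

# One made-up atom in the asymmetric unit gives two atoms in the unit cell.
atoms = fill_unit_cell([(0.1, 0.2, 0.3)], P21_OPERATORS)
```

Because the symmetry-related copies are generated in this way, only the atoms of one asymmetric unit have to be located.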

View-3 illustrates the problem. The 1edo plant dehydrogenase is shown in its own unit cell. It will only be useful for phase calculation for 1xg5 once it is a properly positioned stand-in for the human dehydrogenase in the 1xg5 unit cell. The 1edo structure needs to be spun around to the right orientation (a rotation) and then shifted through space (a translation) to take the place of the unknown 1xg5 human dehydrogenase in its unit cell. Only then will its core atoms be close enough to the actual ones to be of use in calculating a set of initial phases for the 1xg5 human dehydrogenase data. Finding the correct rotations and translations is the problem that MR programs attempt to solve. The structure that is used as a substitute for the correct structure is called the search model.
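
The "spin, then shift" operation can be written down directly. The sketch below is a simplified illustration with hypothetical function names: it rotates a set of atomic coordinates about a single axis and then translates them, whereas a general placement needs three rotation angles and three translation components for each copy of the search model.

```python
import math

def rotation_about_z(angle_degrees):
    """3x3 matrix for a rotation about the z axis (one of three angles in general)."""
    t = math.radians(angle_degrees)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def place_model(coords, rotation, translation):
    """Rotate every atom of the search model, then shift it by the translation."""
    placed = []
    for x, y, z in coords:
        rotated = [row[0] * x + row[1] * y + row[2] * z for row in rotation]
        placed.append(tuple(r + t for r, t in zip(rotated, translation)))
    return placed

# Made-up example: an atom at (1, 0, 0) is turned by 90 degrees about z
# and then moved 5 units along z, ending up close to (0, 1, 5).
placed = place_model([(1.0, 0.0, 0.0)], rotation_about_z(90.0), (0.0, 0.0, 5.0))
```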

As view-3 makes clear, things are even more complicated in the case of 1xg5 as there are actually four distinctly different rotations and translations needed to make a complete tetrameric assembly of the human dehydrogenase. Once one orientation and position have been identified correctly, it will make the search for others easier. However, the other three are contributing noise during the calculations of the first solution. Worse still, if the first solution is incorrect, it can sabotage all subsequent attempts to find the other three.

Phaser's box of tricks

Computer programs for MR have been around for several decades. Phaser's strength is that it applies a series of sensitive yet rapid search methods to the problem and uses a Maximum Likelihood approach. Maximum Likelihood uses a careful assessment of the probability of obtaining the experimental data given the atomic model, taking into account experimental errors and errors in the model (due to the incompleteness and incorrectness of the search model). Thanks to the speed of modern computers, Phaser can perform a rapid search in which it scans a large number of trial orientations and positions of the core atoms spaced through the new unit cell. Each trial is scored for the probability that it gave rise to the observed data, which helps in identifying the most likely rotations.
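
The idea of scoring every trial placement against the measured data can be shown with a toy search. The sketch below is not Phaser's algorithm: it works in one dimension, searches over a translation only, and uses a plain Gaussian log-likelihood with an assumed error width `sigma`, which is far simpler than the likelihood targets used in real MR software. The cell length, the two-atom search model and all function names are made up for the example. The cell has one symmetry mate (an inversion through x = 0), because without symmetry the amplitudes would not depend on where the model sits.

```python
import cmath
import math

CELL = 23  # length of the toy 1D unit cell, arbitrary units

def amplitudes(atoms, max_index=11):
    """Amplitudes |F(h)| for atoms given as (position, weight) pairs."""
    result = []
    for h in range(1, max_index + 1):
        f = sum(w * cmath.exp(2j * math.pi * h * x / CELL) for x, w in atoms)
        result.append(abs(f))
    return result

def with_symmetry(model, shift):
    """Place the model at `shift` and add its mate related by inversion through x = 0."""
    placed = [((offset + shift) % CELL, w) for offset, w in model]
    mates = [((-x) % CELL, w) for x, w in placed]
    return placed + mates

def log_likelihood(observed, calculated, sigma=1.0):
    """Gaussian log-likelihood up to a constant; sigma is an assumed error width."""
    return -sum((o - c) ** 2 for o, c in zip(observed, calculated)) / (2 * sigma ** 2)

search_model = [(0, 2.0), (3, 1.0)]   # (offset, weight) of two atoms
true_shift = 7                        # hidden from the search, used to simulate data
observed = amplitudes(with_symmetry(search_model, true_shift))

# Score every trial shift on the grid and keep the most likely one.
scores = {
    shift: log_likelihood(observed, amplitudes(with_symmetry(search_model, shift)))
    for shift in range(CELL)
}
best_shift = max(scores, key=scores.get)
```

In this noise-free toy case the search recovers the shift that was used to simulate the data. With real data, the model is incomplete and imperfect, which is why a careful treatment of errors matters.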

When multiple positions have to be found, Phaser's automatic search will try many different options, allowing it to backtrack out of any blind alleys, in a way that would be difficult to do manually.

This automated backtracking was important in the case of the 1xg5 structure, as there were four copies of the enzyme, which turned out to be subunits of a tightly packed tetrameric assembly. One of the interfaces in the tetramer involves interdigitation of α-helices, while the other shows β-strands bridging the Rossmann-fold β-sheets of two subunits (view-4). The β-strands at the interface are not part of the classic Rossmann fold (shown in the figure) but additional strands that allow the interfaces to form.

The classic dehydrogenase Rossmann fold. The order is 1-A-2-B-3-C-4-D-5-E-6-F, where numbers are β-strands and letters are α-helices.

At the end of a successful MR calculation, Phaser gives estimates of phases and likely initial positions for the core atoms. The final step is for the crystallographer to check for new features in the calculated electron-density map that are consistent with the known contents of the crystal (e.g. sidechains, loops or ligands that were not present in the search model). Such new features serve to validate the solution and can be included in the model to provide better phases and a new map which will hopefully reveal even more structural details.

Molecular Replacement becomes more powerful as the PDB grows

It is generally believed that biological macromolecules can have only a finite number of stably folded conformations (probably in the thousands), which developed early in evolutionary history. Structural biologists discover more and more of these unique folds every year. In the case of RNA, for example, many new motifs were discovered when the first high-resolution ribosome structures were determined.

More known folds and better MR software mean that more and more macromolecular structures can be solved by the MR method. Already there are structure-solution services available that use representative examples from the PDB archive for exhaustive MR calculations to solve new crystal structures.

Further exploration

You can use the PDBeFold server to explore structural similarities between protein structures in the PDB archive. This mini-tutorial shows how you can compare multiple structures just as in view-2.

If you are trying to solve a new protein structure and you want to investigate if there are any structures in the PDB that might have a similar fold, you can use the amino-acid sequence to carry out a search against the entire PDB using the PDBeXplore sequence browser.

You can find more about the Oxford SGC by visiting their website. Many of the SGC structures can be studied interactively.