If you look it up in a dictionary, "validation" is defined as:
- to declare or make legally valid
- to mark with an indication of official sanction
- to substantiate or verify
Many statistics, methods, and programs were developed from the 1990s onward to help identify errors in protein models. These methods generally fall into two classes:
- methods in which only coordinates are considered (such methods often entail comparison of a model to information derived from structural databases), and
- methods in which both the model and the crystallographic data are taken into account.
Alternatively, one can distinguish between:
- "weak" methods that essentially measure how well the refinement program has succeeded in imposing restraints (e.g., deviations from ideal geometry, conventional R-value), and
- "strong" methods that assess aspects of the model that are "orthogonal" to the information used in refinement (e.g., free R-value, patterns of non-bonded interactions, conformational torsion-angle distributions).
An additional distinction can be made between:
- methods that provide overall (global) statistics for a model (such methods are suitable for monitoring the progress of the refinement and rebuilding process, or for assessing the overall quality of a model), and
- methods that provide information at the level of individual residues, small molecules or atoms (such methods are more useful for detecting local problems in a model, or for assessing the quality of specific parts of it, e.g. catalytic residues, ligand-binding sites, or interfaces).
It is important to realise that almost all coordinate-based validation methods detect outliers (i.e., atoms, residues or ligands with unusual properties). To assess whether an outlier is an error in the model or a genuine, but unusual, feature of the structure, one must inspect the (preferably unbiased) electron-density maps! If an outlier is most likely an error in your model, you will probably want to try to fix it before depositing the model and submitting your paper. If, on the other hand, it appears to be a genuine feature of your structure, convincingly supported by the experimental data, you may even want to mention it in your paper.
If you are primarily interested in assessing the overall quality of a model (e.g., to decide if it's good enough to use as a starting point for comparative modelling), strong and global quality indicators are the most useful. Examples of such criteria are:
- Free R-value
- Packing or clash score
- Ramachandran plot
If, on the other hand, you are interested in identifying local issues (e.g., to decide whether the active site of a protein has been modelled reliably enough to use it for the design of ligands), strong and local methods are most suitable. Examples of these are:
- Real-space fit
- Main-chain torsion-angle combinations (Ramachandran)
- Side-chain torsion-angle combinations (rotamers)
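Both Ramachandran and rotamer analysis rest on the same geometric primitive: a torsion angle computed from four atomic positions. A minimal sketch of that calculation (pure Python, with hypothetical coordinates; real validation tools read these from the model, and the residue names here are merely illustrative):

```python
import math

def dihedral(p1, p2, p3, p4):
    # Torsion angle (degrees) defined by four points, e.g. the backbone
    # atoms C(i-1), N, CA, C for phi, or N, CA, C, N(i+1) for psi.
    def sub(a, b):   return [a[i] - b[i] for i in range(3)]
    def dot(a, b):   return sum(x * y for x, y in zip(a, b))
    def cross(a, b): return [a[1]*b[2] - a[2]*b[1],
                             a[2]*b[0] - a[0]*b[2],
                             a[0]*b[1] - a[1]*b[0]]
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    b2_len = math.sqrt(dot(b2, b2))
    m = cross(n1, [x / b2_len for x in b2])
    return math.degrees(math.atan2(dot(m, n2), dot(n1, n2)))

# Hypothetical coordinates forming a perfectly planar trans arrangement.
print(round(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)), 1))  # 180.0
```

Validation programs compute (phi, psi) or chi angles this way for every residue and compare them against the distributions observed in high-quality reference structures.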
Unfortunately, many (especially older) papers describing macromolecular crystal structures quote "quality criteria" that provide little or no indication of the actual quality of the model. Examples are:
- Conventional R-value
- Geometry, i.e. RMS deviation of bond lengths and angles from "ideal" values
- Average temperature factor (or B-factor) of the atoms in the model
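To see why such numbers say little: the RMS deviation from ideal bond lengths mostly measures how tightly the refinement program enforced its geometric restraints, not whether the model fits the data. A minimal sketch with hypothetical bond lengths:

```python
import math

def rms_deviation(observed, ideal):
    # RMS deviation of model bond lengths (in Angstrom) from target values.
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(observed, ideal))
                     / len(observed))

# Hypothetical bond lengths versus their restraint targets.
observed = [1.530, 1.515, 1.334, 1.522]
ideal    = [1.525, 1.525, 1.329, 1.525]

# A value near 0.006 A looks reassuring, but a badly wrong model refined
# with tight restraints produces an equally "good" number.
print(round(rms_deviation(observed, ideal), 4))
```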
It is also important to realise that passing any single quality check is a necessary, but not a sufficient, indication of a model's correctness: a good model makes sense in just about every respect.
Another important maxim is that extraordinary claims require extraordinary evidence. For instance, claims about distortions of a ligand or of catalytic residues, or about unexpected features such as cis-peptide bonds or D-amino acids, are more credible when they are based on careful analysis of 1.5 Å data than when they are "backed up" by a 3.5 Å dataset (see the resolution movie on the previous page).
Fortunately, nowadays there are helpful validation reports available for all models in the PDB to help you assess their quality (we will come back to these reports later in the practical).