An Automated Approach For Clustering An Ensemble Of NMR-Derived Protein Structures Into Conformationally-Related Subfamilies.
Lawrence A. Kelley1, Stephen P. Gardner2 and Michael J. Sutcliffe1
1Department of Chemistry, University of Leicester, Leicester, LE1 7RH, UK.
2Oxford Molecular Ltd., The Medawar Centre, Oxford, OX4 4GA, UK.
Key words: NMR spectroscopy/protein structure/cluster analysis/multiple conformations/automated clustering
Corresponding Author:
Michael J Sutcliffe
Department of Chemistry , University of Leicester,
Leicester, LE1 7RH, UK.
Phone: (+44) [0]116 252 3054
Fax: (+44) [0]116 252 3789
Running Title:
Automated clustering of NMR ensembles.
[Protein Engineering 9, 1063-1065(1996)]
Introduction
Unlike structures determined by X-ray crystallography, which are deposited in
the Brookhaven Protein Data Bank (Abola et al., 1987)
as a single structure,
each NMR-derived structure is often deposited as an ensemble containing many
structures, each consistent with the restraint set used. However, there is often
a need to select a single "representative" structure, or a "representative"
subset of structures, from such an ensemble. This is useful, for example, in
the case of homology modelling or when compiling a relational database of
protein structures. It has been shown that cluster analysis, based on
overall fold, followed by selection of the structure closest to the centroid
of the largest cluster is likely to identify a structure more representative
of the ensemble than the commonly used minimised average structure
(Sutcliffe, 1993).
Two approaches to the problem of clustering ensembles of NMR-derived structures
have been described. One of these
(Adzhubei et al., 1995) performs pairwise
superposition of all structures using C( atoms to generate a set of RMS
distances. After cluster analysis based on these distances, a user-defined
cut-off is required to determine the final membership of clusters, and
therefore the representative structures. The other approach
(Diamond, 1995)
uses collective superpositions and rigid body transformations. Again, the
position at which to draw a cut-off based on the particular clustering pattern
was not addressed.
Whenever fixed values are used for the cut-off in clustering, there is a danger
of missing "true" clusters under the threshold imposed by the rigid cut-off
value. Considering the highly diverse nature of NMR-derived ensembles of
proteins, it would seem most appropriate to avoid the use of pre-defined
values for determining clusters. In fact, of the 302 ensembles we have
studied, the average pairwise RMS across an ensemble varied from 0.29Å to
11.3Å (mean value = 3.0±1.9Å). In this communication, we present an automated
method for cut-off determination which avoids the dangers of using fixed
values for this purpose.
We have developed a computer program which automatically, systematically
and rapidly (1) clusters an ensemble of structures into a set of
conformationally-related subfamilies, and (2) selects a representative
structure from each cluster. The program uses the method of average
linkage to define how clusters are built up, followed by application
of a penalty function which seeks to simultaneously minimise (1) the
number of clusters and (2) the spread across each cluster. This program,
known as NMRCLUST.
Materials and Methods
An overview of the method is given in Figure 1.
Figure 1.Flow chart illustrating the progress of the NMRCLUST algorithm.
Step 1: Distance Determination
Clustering requires a set of "distances" between members of an ensemble.
When a PDB file containing an ensemble of structures is used as input,
NMRCLUST derives these distances by superposing each member of the ensemble
onto each of the other members of the ensemble in a pairwise manner
(McLachlan,
1982), and the corresponding RMS value determined. This superposition is
carried out, by default, on all non-hydrogen atoms, or alternatively on a
user-defined set of atoms (See Flexibility of Input section).
For an ensemble with N members, this results in an NxN matrix of RMS values.
NMRCLUST can also accept a pre-determined matrix of "distances" as input.
This is useful in cases where objects other than protein structures are to be clustered.
Step 2: Clustering
This distance matrix is used with the average linkage algorithm for hierarchical
cluster analysis. The method of average linkage takes the distance between two clusters
m and n to be:

where cluster m contains X members, and cluster n contains Y
members; dist(i,j) is the RMS distance between the two members,
i and j, of clusters m and n, respectively,
after superposition. At each stage of the clustering algorithm, a search
is performed for the two nearest clusters; these are merged to form a single
cluster.
At each stage of clustering, the "spread" of each cluster is calculated.
The spread of a cluster m containing N members is given by:

where i and
k are members of cluster m. The Average Spread is then given by:

where cnumi is the number of clusters at stage i of the
clustering (excluding outlying points, i.e. clusters which contain only one
member).
Step 3: Normalisation of Average Spread
Once clustering is complete, the set of AvSpi values are
normalised to lie within the range 1 to N-1, where N is
the number of structures in the original data set. Normalisation is performed
to give equal weight in the penalty function (Step 4) to the number of clusters
and the average spread (a choice of relative weights which appears to work well).
where Max(AvSp)
and Min(AvSp) are the maximum and minimum values respectively of AvSp
in the set {AvSp1,AvSp2,...AvSpN-1}.
Step 4: Penalty Function
For each stage of clustering, i, a penalty value , Pi ,
can now be calculated as:

where nclusi is the total number of clusters at step i of the
clustering (including outlying points).
Step 5: Defining Cut-off
The minimum penalty value in the set {P1, P2, ...PN-1 }
is chosen as the cut-off level.

Thus, the stage icut represents a state where the clusters are as highly
populated as possible, whilst simultaneously maintaining the smallest spread.
The smaller the spread of the clusters, the more similar the conformations of its
members; the greater the population of a cluster, the less likely the chance
of excluding a member of similar conformation.
Step 6: Representative Structures
Once a cut-off in the clustering has been determined in this way, eigen analysis
(Sutcliffe, 1993) is performed on each cluster at stage icut. This allows the
determination of the structure within each cluster that is closest to the
centroid of that cluster.
Example Application
To illustrate the performance of the program, we present its application to
hirudin (Folkers et al., 1989); (Protein Data Bank accession number 4HIR).
In this example all non-hydrogen atoms were used for the superposition.
The penalty function arrives at a unique minimum value (Figure 2) which is
chosen as the cut-off point for the clustering.

Figure 2. Progress of the penalty function during clustering of ensemble 4HIR (Folkers et al., 1989).
The minimum value of the penalty function is chosen as the clustering cut-off point as indicated.
It is interesting to note the correlation between the clusters and the conformation of hirudin (Figure 3).
Figure 3. The superimposed
backbones of twenty-eight hirudin (4HIR) structures. This illustrates the
different conformations in the variable loop region from residues Ser 32 to
Glu 35 (encircled). The application of NMRCLUST to this ensemble, with
superposition on all non-hydrogen atoms, resulted in the four clusters shown,
and two outlying structures (not included for purposes of clarity). Different
colours indicate different cluster membership. [Note: Model 18 has been omitted from this analysis due to missing sidechain atoms on Gln 49.]
The four major clusters (i.e. excluding the two clusters containing only one
member) correspond to different conformations of the region of the structure
between residues Ser 32 and Glu 35. This observed lack of conformational order
is consistent with the absence of any long-range NOEs between this exposed
"finger" and the core region of the protein
(Folkers et al., 1989), as well as
the alternative hydrogen bonding patterns known to exist in this region
(Guntert et al., 1995).
Flexibility of Input
In addition to the automatically selected cut-off point, the program is able to
accept a user-definable value for the minimum distance between representative
structures. Also, NMRCLUST can superpose the structures within the ensemble on
the basis of a user-defined set of atoms. These are specified by the
residue-residue:atom,atom syntax. For example, to superpose on all carbon
atoms from residues 1 to 31, and residues 36 to 49, the syntax would be:
" 1-31, 36-49:C* ". The atoms to be used for superposition may be determined,
for instance, by using PROCHECK-NMR, the NMR version of the
PROCHECK program
(Laskowski et al., 1993). This will allow the user to determine those
residues that seem to be best defined in terms of backbone RMS, side-chain
RMS and/or dihedral angle variability. Alternatively the user could use the
technique described by Billeter (Billeter, 1992) which uses both backbone
RMS and all heavy atom RMS. (We have developed an automatic method
for determining the optimum set of atoms to be used for superposition,
NMRCORE.)
There is also the capability of performing the cluster analysis on a different,
user-defined, set of atoms to those used for superposition
(e.g. Modi et al., 1996).
Discussion
In this study, the decision to use the average linkage algorithm was based on
an assessment of the value of Min(P) produced on 196 NMR derived ensembles
(available November 1994) using three different clustering algorithms: single linkage, complete linkage, and average linkage. Of these three methods, average linkage performed best, producing the lowest average penalty value over the 196 ensembles. Another clustering algorithm commonly used with protein structures is the Jarvis-Patrick method (Allen and Doyle, 1991). However, this technique was not used in our studies as it requires a high level of user intervention: user-defined values for both the number of shared neighbours that two objects must possess in order to be in the same cluster (the commonality threshold, CJP), and the number of nearest neighbours being considered for each cluster (KJP) .
A criticism has been raised against the technique described herein -
the use of pairwise superposition followed by eigen analysis can lead
to negative eigenvalues and hence information loss or distortion
(Diamond, 1995). However, after running NMRCLUST on all 302 NMR-derived
ensembles available in November 1995, no distortion of information
above 10-6Å (by comparing every distance in N-1 dimensions to the
original distance matrix) was found. Consequently negative eigenvalues
do not seem in practice to be of particular concern. However, should
a distortion of information occur that exceeds 10-5Å, the program
warns the user, and instead of determining the structure closest to the centroid of the cluster, the program selects the structure with the minimum average RMS from all other cluster members (Adzhubei et al.,1995). (The results of applying NMRCLUST to the 302 NMR-
derived ensembles will be presented separately in a future paper.)
Conclusion
This method can be used to automatically cluster any data set (e.g. an ensemble
of NMR-derived structures or an ensemble of homology models) rapidly and
consistently, without the need for subjectively defined cut-offs. NMRCLUST
will take a file in PDB format containing an ensemble of structures, and
output the most representative structure from each of the resulting clusters.
These representative structures can subsequently be used, for example,
in homology modelling. Alternatively, NMRCLUST can take a pre-determined
matrix of "distances" and automatically output the resulting clusters and
their representative members.
Acknowledgements
We thank Roman Laskowski and Janet Thornton for useful discussions.
LAK is supported by a BBSRC CASE studentship, sponsored by
Oxford Molecular Ltd.
MJS is a Royal Society University Research Fellow.