An Automated Approach For Clustering An Ensemble Of NMR-Derived Protein Structures Into Conformationally-Related Subfamilies.

Lawrence A. Kelley1, Stephen P. Gardner2 and Michael J. Sutcliffe1

1Department of Chemistry, University of Leicester, Leicester, LE1 7RH, UK. 2Oxford Molecular Ltd., The Medawar Centre, Oxford, OX4 4GA, UK.

Key words: NMR spectroscopy/protein structure/cluster analysis/multiple conformations/automated clustering

Corresponding Author:
Michael J Sutcliffe
Department of Chemistry , University of Leicester,
Leicester, LE1 7RH, UK.
Phone: (+44) [0]116 252 3054
Fax: (+44) [0]116 252 3789

Running Title: Automated clustering of NMR ensembles.

[Protein Engineering 9, 1063-1065(1996)]


Introduction

Unlike structures determined by X-ray crystallography, which are deposited in the Brookhaven Protein Data Bank (Abola et al., 1987) as a single structure, each NMR-derived structure is often deposited as an ensemble containing many structures, each consistent with the restraint set used. However, there is often a need to select a single "representative" structure, or a "representative" subset of structures, from such an ensemble. This is useful, for example, in the case of homology modelling or when compiling a relational database of protein structures. It has been shown that cluster analysis, based on overall fold, followed by selection of the structure closest to the centroid of the largest cluster is likely to identify a structure more representative of the ensemble than the commonly used minimised average structure (Sutcliffe, 1993).

Two approaches to the problem of clustering ensembles of NMR-derived structures have been described. One of these (Adzhubei et al., 1995) performs pairwise superposition of all structures using C( atoms to generate a set of RMS distances. After cluster analysis based on these distances, a user-defined cut-off is required to determine the final membership of clusters, and therefore the representative structures. The other approach (Diamond, 1995) uses collective superpositions and rigid body transformations. Again, the position at which to draw a cut-off based on the particular clustering pattern was not addressed.

Whenever fixed values are used for the cut-off in clustering, there is a danger of missing "true" clusters under the threshold imposed by the rigid cut-off value. Considering the highly diverse nature of NMR-derived ensembles of proteins, it would seem most appropriate to avoid the use of pre-defined values for determining clusters. In fact, of the 302 ensembles we have studied, the average pairwise RMS across an ensemble varied from 0.29Å to 11.3Å (mean value = 3.0±1.9Å). In this communication, we present an automated method for cut-off determination which avoids the dangers of using fixed values for this purpose.

We have developed a computer program which automatically, systematically and rapidly (1) clusters an ensemble of structures into a set of conformationally-related subfamilies, and (2) selects a representative structure from each cluster. The program uses the method of average linkage to define how clusters are built up, followed by application of a penalty function which seeks to simultaneously minimise (1) the number of clusters and (2) the spread across each cluster. This program, known as NMRCLUST.


Materials and Methods

An overview of the method is given in Figure 1.

Figure 1.Flow chart illustrating the progress of the NMRCLUST algorithm.

Step 1: Distance Determination

Clustering requires a set of "distances" between members of an ensemble. When a PDB file containing an ensemble of structures is used as input, NMRCLUST derives these distances by superposing each member of the ensemble onto each of the other members of the ensemble in a pairwise manner (McLachlan, 1982), and the corresponding RMS value determined. This superposition is carried out, by default, on all non-hydrogen atoms, or alternatively on a user-defined set of atoms (See Flexibility of Input section). For an ensemble with N members, this results in an NxN matrix of RMS values. NMRCLUST can also accept a pre-determined matrix of "distances" as input. This is useful in cases where objects other than protein structures are to be clustered.

Step 2: Clustering

This distance matrix is used with the average linkage algorithm for hierarchical cluster analysis. The method of average linkage takes the distance between two clusters m and n to be:

where cluster m contains X members, and cluster n contains Y members; dist(i,j) is the RMS distance between the two members, i and j, of clusters m and n, respectively, after superposition. At each stage of the clustering algorithm, a search is performed for the two nearest clusters; these are merged to form a single cluster.

At each stage of clustering, the "spread" of each cluster is calculated. The spread of a cluster m containing N members is given by:



where i and k are members of cluster m. The Average Spread is then given by:

where cnumi is the number of clusters at stage i of the clustering (excluding outlying points, i.e. clusters which contain only one member).

Step 3: Normalisation of Average Spread

Once clustering is complete, the set of AvSpi values are normalised to lie within the range 1 to N-1, where N is the number of structures in the original data set. Normalisation is performed to give equal weight in the penalty function (Step 4) to the number of clusters and the average spread (a choice of relative weights which appears to work well).
where Max(AvSp) and Min(AvSp) are the maximum and minimum values respectively of AvSp in the set {AvSp1,AvSp2,...AvSpN-1}.

Step 4: Penalty Function

For each stage of clustering, i, a penalty value , Pi , can now be calculated as:

where nclusi is the total number of clusters at step i of the clustering (including outlying points).

Step 5: Defining Cut-off

The minimum penalty value in the set {P1, P2, ...PN-1 } is chosen as the cut-off level.

Thus, the stage icut represents a state where the clusters are as highly populated as possible, whilst simultaneously maintaining the smallest spread. The smaller the spread of the clusters, the more similar the conformations of its members; the greater the population of a cluster, the less likely the chance of excluding a member of similar conformation.

Step 6: Representative Structures

Once a cut-off in the clustering has been determined in this way, eigen analysis (Sutcliffe, 1993) is performed on each cluster at stage icut. This allows the determination of the structure within each cluster that is closest to the centroid of that cluster.

Example Application

To illustrate the performance of the program, we present its application to hirudin (Folkers et al., 1989); (Protein Data Bank accession number 4HIR). In this example all non-hydrogen atoms were used for the superposition. The penalty function arrives at a unique minimum value (Figure 2) which is chosen as the cut-off point for the clustering.

Figure 2. Progress of the penalty function during clustering of ensemble 4HIR (Folkers et al., 1989). The minimum value of the penalty function is chosen as the clustering cut-off point as indicated.


It is interesting to note the correlation between the clusters and the conformation of hirudin (Figure 3).


Figure 3. The superimposed backbones of twenty-eight hirudin (4HIR) structures. This illustrates the different conformations in the variable loop region from residues Ser 32 to Glu 35 (encircled). The application of NMRCLUST to this ensemble, with superposition on all non-hydrogen atoms, resulted in the four clusters shown, and two outlying structures (not included for purposes of clarity). Different colours indicate different cluster membership. [Note: Model 18 has been omitted from this analysis due to missing sidechain atoms on Gln 49.]

The four major clusters (i.e. excluding the two clusters containing only one member) correspond to different conformations of the region of the structure between residues Ser 32 and Glu 35. This observed lack of conformational order is consistent with the absence of any long-range NOEs between this exposed "finger" and the core region of the protein (Folkers et al., 1989), as well as the alternative hydrogen bonding patterns known to exist in this region (Guntert et al., 1995).

Flexibility of Input

In addition to the automatically selected cut-off point, the program is able to accept a user-definable value for the minimum distance between representative structures. Also, NMRCLUST can superpose the structures within the ensemble on the basis of a user-defined set of atoms. These are specified by the residue-residue:atom,atom syntax. For example, to superpose on all carbon atoms from residues 1 to 31, and residues 36 to 49, the syntax would be: " 1-31, 36-49:C* ". The atoms to be used for superposition may be determined, for instance, by using PROCHECK-NMR, the NMR version of the PROCHECK program (Laskowski et al., 1993). This will allow the user to determine those residues that seem to be best defined in terms of backbone RMS, side-chain RMS and/or dihedral angle variability. Alternatively the user could use the technique described by Billeter (Billeter, 1992) which uses both backbone RMS and all heavy atom RMS. (We have developed an automatic method for determining the optimum set of atoms to be used for superposition, NMRCORE.) There is also the capability of performing the cluster analysis on a different, user-defined, set of atoms to those used for superposition (e.g. Modi et al., 1996).

Discussion

In this study, the decision to use the average linkage algorithm was based on an assessment of the value of Min(P) produced on 196 NMR derived ensembles (available November 1994) using three different clustering algorithms: single linkage, complete linkage, and average linkage. Of these three methods, average linkage performed best, producing the lowest average penalty value over the 196 ensembles. Another clustering algorithm commonly used with protein structures is the Jarvis-Patrick method (Allen and Doyle, 1991). However, this technique was not used in our studies as it requires a high level of user intervention: user-defined values for both the number of shared neighbours that two objects must possess in order to be in the same cluster (the commonality threshold, CJP), and the number of nearest neighbours being considered for each cluster (KJP) . A criticism has been raised against the technique described herein - the use of pairwise superposition followed by eigen analysis can lead to negative eigenvalues and hence information loss or distortion (Diamond, 1995). However, after running NMRCLUST on all 302 NMR-derived ensembles available in November 1995, no distortion of information above 10-6Å (by comparing every distance in N-1 dimensions to the original distance matrix) was found. Consequently negative eigenvalues do not seem in practice to be of particular concern. However, should a distortion of information occur that exceeds 10-5Å, the program warns the user, and instead of determining the structure closest to the centroid of the cluster, the program selects the structure with the minimum average RMS from all other cluster members (Adzhubei et al.,1995). (The results of applying NMRCLUST to the 302 NMR- derived ensembles will be presented separately in a future paper.)

Conclusion

This method can be used to automatically cluster any data set (e.g. an ensemble of NMR-derived structures or an ensemble of homology models) rapidly and consistently, without the need for subjectively defined cut-offs. NMRCLUST will take a file in PDB format containing an ensemble of structures, and output the most representative structure from each of the resulting clusters. These representative structures can subsequently be used, for example, in homology modelling. Alternatively, NMRCLUST can take a pre-determined matrix of "distances" and automatically output the resulting clusters and their representative members.

Acknowledgements

We thank Roman Laskowski and Janet Thornton for useful discussions. LAK is supported by a BBSRC CASE studentship, sponsored by Oxford Molecular Ltd. MJS is a Royal Society University Research Fellow.