The CluSTr database offers an automatic classification of
UniProt Knowledgebase proteins into groups of related proteins.
UniProt is a central database of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

CluSTr Documentation

Publications

1. Petryszak R., Kretschmann E., Wieser D., Apweiler R. (2005)
"The predictive power of the CluSTr database."
Bioinformatics. 2005 Jun 16

2. Kriventseva E.V., Fleischmann W., Zdobnov E.M., Apweiler R. (2001)
"CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins."
Nucleic Acids Res. 2001 Jan 1;29(1):33-36

3. Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E.V., Mittard V., Mulder N., Phan I., Zdobnov E. (2001)
"Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes."
Nucleic Acids Res. 2001 Jan 1;29(1):44-48

4. Kriventseva E.V., Biswas M., Apweiler R. (2001)
"Clustering and analysis of protein families."
Curr Opin Struct Biol. 2001 Jun;11(3):334-339. Review.

Methodology

The clustering approach is based on two steps.
First, a similarity matrix of "all-against-all" protein sequence comparisons is built. The similarity matrix is computed using the Smith-Waterman algorithm, which returns two measures of similarity: the Smith-Waterman score and an E-value. The E-value is calculated for a protein database size of 10 million sequences. The statistical significance measure used in CluSTr is calculated using the following formula: Statistical Significance = -1 * log10(E-value).
For example, a statistical significance of 10 corresponds to an E-value of 1e-10. Thus, higher values of statistical significance correspond to stronger similarities. Note that a statistical significance of 10000 is used to represent an E-value of 0, for which the logarithm is undefined.
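
The conversion from E-value to statistical significance can be illustrated with a minimal Python sketch. The function name and the constant name are assumptions made for this example; only the formula and the special value of 10000 come from the description above.

```python
import math

# Value reported for an E-value of 0, as stated in the text above.
MAX_SIGNIFICANCE = 10000.0

def statistical_significance(e_value: float) -> float:
    """Return -log10(E-value), or 10000 when the E-value is 0."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    if e_value == 0:
        # log10(0) is undefined, so the fixed value is returned instead.
        return MAX_SIGNIFICANCE
    return -math.log10(e_value)
```

With this sketch, an E-value of 1e-10 yields a significance of about 10, and an E-value of 0 yields 10000.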

Second, clusters are built using a single-linkage algorithm at different levels of protein similarity. Only clusters that contain more than one protein are presented in the database. The fast ParAlign implementation of Smith-Waterman is run to obtain the initial similarity matrix, whereas the statistical analysis is performed inside CluSTr's Oracle database. An in-house implementation of single-linkage clustering is then used for the clustering stage.
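
The in-house clustering code is not described here, so the following is only a generic sketch of single-linkage clustering at one similarity level, written in Python with a union-find structure. The function name, the tuple layout of the similarity list and the use of a simple numeric threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def single_linkage_clusters(proteins, similarities, threshold):
    """Group proteins connected by any chain of pairs scoring >= threshold.

    similarities: iterable of (protein_a, protein_b, score) tuples.
    Returns a list of sets; clusters with a single member are omitted.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the root, shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, score in similarities:
        if score >= threshold:
            # One sufficiently similar pair is enough to merge two clusters.
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    return [members for members in groups.values() if len(members) > 1]
```

Running the function repeatedly with decreasing thresholds gives progressively larger clusters, which is one way to obtain clusters at different levels of similarity.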

The CluSTr data is stored in a relational database (Oracle). This allows us to handle large amounts of data and facilitates comprehensive data updates. Multiple users have direct access to the database via Java servlets. The main building blocks of the schema are Proteins, Groups, Similarities and Clusters. The Proteins table describes UniProt Knowledgebase entries; Groups describes the protein sets for which clusters were built and the history of comparison runs; Similarities contains the pairwise scores between proteins; and Clusters holds information about the clusters and the relationships between them.
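
To make the roles of the four building blocks concrete, here is a rough Python sketch of one record per table. Only the four entity names come from the description above; every field name and type is hypothetical and does not reflect the actual Oracle schema.

```python
from dataclasses import dataclass
from typing import Optional

# All field names below are hypothetical; only the four entity names
# (Proteins, Groups, Similarities, Clusters) come from the text.

@dataclass(frozen=True)
class Protein:
    accession: str            # UniProt Knowledgebase accession

@dataclass(frozen=True)
class Group:
    group_id: int
    name: str                 # e.g. "Human and Mouse"

@dataclass(frozen=True)
class Similarity:
    protein_a: str            # accession of the first protein
    protein_b: str            # accession of the second protein
    score: int                # Smith-Waterman score
    significance: float       # -log10(E-value)

@dataclass(frozen=True)
class ClusterRow:
    cluster_id: int
    group_id: int             # group the cluster was built for
    parent_id: Optional[int]  # None for a cluster with no parent
```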

Updating the data is another major challenge in the design and implementation of the CluSTr database. Our aim is to update CluSTr data incrementally, in synchrony with the bi-weekly updates of the UniProt Knowledgebase. Additional Oracle tables are used for loading new sequences and their UniProt and IPI annotation.

The set of new sequences is determined by cross-referencing the new accession numbers within the set of proteomes covered in CluSTr against UniParc. The non-redundancy of UniParc ensures that any sequence is only ever considered once. Once the new set of sequences is identified, 'new against new' and 'new against current' similarity calculations are performed.
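
The incremental comparison scheme can be sketched in Python as follows. The function name and the use of plain string identifiers are assumptions for this example; in CluSTr the identification of new sequences relies on UniParc rather than on a simple set difference.

```python
from itertools import combinations, product

def comparison_pairs(current, new):
    """Return the 'new against new' and 'new against current' pairs."""
    current = sorted(set(current))
    # Drop anything already present, so no sequence is compared twice.
    new = sorted(set(new) - set(current))
    new_vs_new = list(combinations(new, 2))
    new_vs_current = list(product(new, current))
    return new_vs_new, new_vs_current
```

The point of the design is that pairs among current sequences are never recomputed; only pairs involving at least one new sequence are added to the similarity matrix.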

The next step of the update process is the single-linkage clustering run, which also uses dedicated additional Oracle tables. Finally, all live clustering- and annotation-related tables are updated in a single step, thus maintaining the internal consistency of the CluSTr database.
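
One common way to update live tables in a single step is to copy the staged rows inside one database transaction. The sketch below shows that pattern in Python using SQLite rather than Oracle; the table names, column names and rows are invented for the example and are not taken from CluSTr.

```python
import sqlite3

def publish_staged_clusters(conn: sqlite3.Connection) -> None:
    """Replace the live table's rows with the staged rows in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM clusters_live")
        conn.execute("INSERT INTO clusters_live SELECT * FROM clusters_staging")

# Tiny in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters_live (cluster_id INTEGER, protein TEXT)")
conn.execute("CREATE TABLE clusters_staging (cluster_id INTEGER, protein TEXT)")
conn.execute("INSERT INTO clusters_live VALUES (1, 'OLD1')")
conn.executemany("INSERT INTO clusters_staging VALUES (?, ?)",
                 [(1, 'NEW1'), (1, 'NEW2')])
conn.commit()
publish_staged_clusters(conn)
live_rows = conn.execute(
    "SELECT protein FROM clusters_live ORDER BY protein").fetchall()
```

Because both statements run in the same transaction, readers see either the old contents or the new contents of the live table, never a mixture.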

 

Available Clusters and Groups

Note that, for reasons of practicality, only a subset of all clusters in CluSTr is available via the web interface. We refer to this subset as CluSTr Slim. The rules for selecting clusters for inclusion in CluSTr Slim were decided after consultation with internal users of CluSTr within EBI. The most recent definition of which clusters are excluded from CluSTr Slim is as follows:

  • All singletons (these are not outliers in a biological sense, but artefacts of the clustering process).
  • All clusters whose member set forms 90% or more of the member set of their respective parent.
    The rationale for this rule is that such clusters are too similar to their parent to add anything new, from a biological point of view, to the information already provided by their parent.
  • All ultimate predecessors (i.e. clusters with no parents) whose size is greater than 1000.
    The rationale for this rule is that clusters of such large sizes are unlikely to be specific enough to be worth considering. Viewing such clusters via the CluSTr web interface is also not practicable.
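
The three exclusion rules above can be expressed as a small Python predicate. This is a sketch under stated assumptions: the class and field names are hypothetical, and a child cluster's members are assumed to be a subset of its parent's members, so that the 90% rule reduces to a comparison of sizes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cluster:
    # Hypothetical minimal representation of a cluster.
    size: int                          # number of member proteins
    parent_size: Optional[int] = None  # None for a cluster with no parent

def excluded_from_slim(c: Cluster) -> bool:
    """Return True if any of the three exclusion rules applies."""
    if c.size == 1:
        return True  # singleton
    if c.parent_size is not None and 10 * c.size >= 9 * c.parent_size:
        return True  # covers 90% or more of its parent (integer arithmetic)
    if c.parent_size is None and c.size > 1000:
        return True  # top-level cluster that is too large
    return False
```

For instance, a cluster of 9 proteins whose parent has 10 members is excluded, while a cluster of 8 proteins with the same parent is kept.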

CluSTr contains clusters for the multi-species groups 'Human and Mouse' and 'All against All'. In addition, clusters for all organisms with completely sequenced genomes are available. For the full list of genomes, see Integr8.


User support and feedback

We welcome feedback; in particular, please let us know if you find errors or omissions. If you need information or help, or have any comments or suggestions on the CluSTr database, please contact us at EBI Support.