CluSTr Documentation
Publications
"The predictive power of the CluSTr database."
Bioinformatics. 2005 Jun 16
"CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins."
Nucleic Acids Res 2001 Jan 1;29(1):33-36
"Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes."
Nucleic Acids Res. 29(1):44-48
"Clustering and analysis of protein families."
Curr Opin Struct Biol. 2001 Jun;11(3):334-339. Review.
Methodology
The clustering approach is based on two steps.
Secondly, clusters are built using a single-linkage algorithm for different levels of protein similarity. Only clusters that contain more than one protein are presented in the database.
A fast ParAlign implementation of Smith-Waterman is run to obtain the initial similarity matrix, whereas the statistical analysis is performed inside CluSTr's Oracle database. An in-house implementation of single-linkage clustering is then used for the clustering stage.
The CluSTr data are stored in a relational database (Oracle). This allows us to handle large amounts of data and facilitates comprehensive data updates. Multiple users have direct access to the database via Java servlets. The main building blocks of the schema are Proteins, Groups, Similarities and Clusters. The Proteins table describes UniProt Knowledgebase entries; Groups describes the protein sets for which clusters were built and the history of comparison runs; Similarities contains the pairwise scores between proteins; and Clusters holds information about the clusters and the relationships between them.
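The single-linkage step described above can be illustrated with a small sketch. This is not CluSTr's in-house implementation: the function name, the protein identifiers and the scores are hypothetical, and the sketch assumes that a higher score means greater similarity. Two proteins end up in the same cluster whenever a chain of pairwise scores at or above the chosen level connects them, and singleton clusters are dropped, as in the database.

```python
from collections import defaultdict


def single_linkage_clusters(proteins, similarities, threshold):
    """Return clusters of proteins linked by scores >= threshold.

    `similarities` maps (protein_a, protein_b) pairs to a score.
    Assumption for this sketch: higher scores mean more similar.
    """
    # Union-find structure: every protein starts as its own root.
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the root, halving the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Single linkage: one sufficiently strong pair merges two clusters.
    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)

    # Keep only clusters with more than one protein.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical example data, clustered at two similarity levels.
proteins = ["P1", "P2", "P3", "P4"]
scores = {("P1", "P2"): 12.0, ("P2", "P3"): 7.0, ("P3", "P4"): 3.0}
for level in (5.0, 10.0):
    print(level, single_linkage_clusters(proteins, scores, level))
```

Running the clustering at several thresholds over the same similarity data yields nested clusters: raising the level can only split a cluster, never merge two.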
Keeping the data up to date is another major challenge in the design and implementation of the CluSTr database. Our aim is to update CluSTr data incrementally, in step with the bi-weekly updates of the UniProt Knowledgebase. Additional Oracle tables are used for loading new sequences and their UniProt and IPI annotation. The set of new sequences is determined by cross-referencing new accession numbers, within the set of proteomes covered in CluSTr, against UniParc. The non-redundancy of UniParc ensures that any sequence is only ever considered once. Once the new set of sequences is identified, 'new against new' and 'new against current' similarity calculations are performed. The next step of the update process is the single-linkage clustering run (also using dedicated additional Oracle tables). Finally, all live clustering- and annotation-related tables are updated in a single step, thus maintaining the internal consistency of the CluSTr database.
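The comparison planning in the incremental update can be sketched as follows. This is only an illustration under stated assumptions: the function name and the sequence identifiers are hypothetical, sequences are represented by plain string identifiers, and the actual similarity calculations and Oracle loading are not shown. The sketch identifies the sequences not yet covered, then lists the 'new against new' and 'new against current' pairs that would need a similarity calculation.

```python
from itertools import combinations, product


def plan_incremental_comparisons(current_ids, incoming_ids):
    """Return (new_ids, new_vs_new_pairs, new_vs_current_pairs).

    `current_ids` are sequence identifiers already compared;
    `incoming_ids` are identifiers seen in the latest release.
    Both are hypothetical stand-ins for non-redundant sequence IDs.
    """
    current = set(current_ids)
    # Using sets means each new sequence is considered only once,
    # even if it appears several times in the incoming list.
    new = sorted(set(incoming_ids) - current)

    # Every unordered pair among the new sequences.
    new_vs_new = list(combinations(new, 2))
    # Every new sequence against every existing sequence.
    new_vs_current = list(product(new, sorted(current)))
    return new, new_vs_new, new_vs_current


# Hypothetical example: two sequences already present, two new ones.
new, nn, nc = plan_incremental_comparisons(
    ["U1", "U2"], ["U2", "U3", "U4", "U3"]
)
print(new)
print(nn)
print(nc)
```

Pairs among the current sequences are never recomputed, which is what keeps each update proportional to the number of new sequences rather than to the size of the whole database.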
Available Clusters and Groups
Note that, for reasons of practicality, only a subset of all clusters in CluSTr is available via the web interface. We refer to this subset as CluSTr Slim. The rules for selecting clusters for inclusion in CluSTr Slim were decided after consultation with internal customers of CluSTr within EBI. The most recent definition of which clusters are excluded from CluSTr Slim is as follows:
CluSTr contains clusters for the multi-species groups 'Human and Mouse' and 'All against All'. In addition, clusters for all organisms with completely sequenced genomes are available. For the full list of genomes, see Integr8.
User support and feedback
We welcome feedback; if you find errors or omissions, please let us know. If you need information or help, or have any comments or suggestions on the CluSTr database, please contact us at EBI Support.