
IPI - Algorithm

Construction of IPI

Introduction

IPI is produced automatically by mapping between the different datasets on the basis of protein similarity. The set is assembled by a combination of 1:1 reciprocal best matching between entries from different databases, and 1:m sub-fragment matching between (and possibly also within) different databases. The key questions are thus:
  • How to perform matching between a pair of databases and

  • How to combine the results of each set of mappings into one set.

Various approaches have been tried out and the one described below appears to produce the best results and has been used to create IPI.

Outline Approach

  1. Sequence sets are downloaded for all proteins described by Ensembl and RefSeq for each genome in IPI. A third set is made by taking all entries from this species in UniProtKB, and expanding the set of sequences explicitly displayed in these by using the information in the entries' feature tables to generate sequences for each known isoform (see this document for more details).

  2. Pairwise inter-database similarity searches are performed.

  3. Match length is determined by the maximal aggregation of non-overlapping regions of sequence identity, subject to the requirement that the actually matching regions must represent at least 95% of the match span.

  4. For each database pair, all reciprocally best-matching protein pairs are identified (i.e. all X-1's such that X-1 is itself the best match in database X of Y-2, where Y-2 is X-1's best match in database Y).

  5. Reciprocally best-matching pairs are combined into clusters of mapped proteins.

  6. Orphan proteins (that have not reciprocally best matched to any other protein) are added to the appropriate clusters if they are sub-fragments of entries that have reciprocally best matched.

  7. An entry is considered a sub-fragment of another entry if over 95% of its length is covered by the span of its optimal aggregation of matches to the other sequence.

  8. At both of these stages conflict resolution is performed where necessary to ensure that each set of matched proteins contains no more than one entry from UniProtKB/Swiss-Prot or RefSeq. Multiple UniProtKB/TrEMBL and H-InvDB entries are allowed in a single set. Multiple Ensembl entries are allowed to map to 1 IPI entry, but only after sub-fragment matching.

  9. All UniProtKB/Swiss-Prot, RefSeq or Ensembl proteins that are still unmatched are then added as singletons to the list of protein clusters.

  10. UniProtKB/TrEMBL is known to contain some redundancy. Therefore, UniProtKB/TrEMBL entries are clustered further (based on additional sequence similarity comparisons) before those not matched to entries from other databases are added to the list of all protein clusters.

  11. Each cluster of mapped entries corresponds to a single entry in IPI.

  12. A sequence is attached to each IPI entry, taken from one of the constituent databases (according to the hierarchy: UniProtKB/Swiss-Prot, RefSeq, UniProtKB/TrEMBL, Ensembl, with the proviso that the sequence of an identified sub-fragment is never used instead of a non-sub-fragment sequence).
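Steps 4, 5 and 7 above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build IPI: the input format (a dictionary of match lengths keyed by query and target identifiers), the function names and the tie-breaking behaviour (first match seen wins) are all assumptions, and the conflict resolution of step 8 is left out.

```python
from collections import defaultdict

def best_matches(match_lengths):
    """Map each query to its best target.

    `match_lengths` maps (query_id, target_id) -> match length.
    This input format is hypothetical; ties keep the first match seen.
    """
    best = {}
    for (query, target), length in match_lengths.items():
        if query not in best or length > best[query][1]:
            best[query] = (target, length)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_pairs(x_vs_y, y_vs_x):
    """Step 4: keep (x, y) only if y is x's best match and x is y's best match."""
    best_xy = best_matches(x_vs_y)
    best_yx = best_matches(y_vs_x)
    return sorted((x, y) for x, y in best_xy.items() if best_yx.get(y) == x)

def is_subfragment(entry_length, match_span, threshold=0.95):
    """Step 7: more than 95% of the entry's length is covered by the match span."""
    return match_span / entry_length > threshold

def cluster_pairs(pairs):
    """Step 5: merge reciprocal pairs into clusters with a small union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path compression
            item = parent[item]
        return item

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for item in list(parent):
        clusters[find(item)].add(item)
    return sorted(sorted(members) for members in clusters.values())
```

For example, if X1's best match in database Y is Y2 and Y2's best match in database X is X1, the pair (X1, Y2) is kept; X2 is dropped when its best match Y1 prefers X1.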

Identifier propagation

Identifiers are propagated between releases of IPI according to the following rules:

  1. If the master source database entry defining IPI entry X' in a new release was also the master of an IPI entry X in the previous release, then X' is assigned the identifier of X.

  2. Failing this, if the master source database entry defining IPI entry X' in a new release was referred to by an IPI entry Y in the previous release, but was not the master of Y, X' can still be assigned the identifier of Y in certain circumstances. All source database entries referred to by Y are ranked (master, 2nd choice, 3rd choice, etc.); and have claim on the identifier of Y in rank order (i.e. if Y's master is also the master of a new IPI entry Y', then Y' will be preferentially assigned the identifier of Y; but if Y's master is not a master in the new IPI release, and the 2nd ranked entry referred to by Y is the master of IPI entry X', then X' can be assigned the identifier of Y).

  3. If the master source database entry defining X' was not included in the previous IPI release, or if it was attached to IPI entry Y in the previous release but does not have first claim on the identifier of Y, X' may be assigned the identifier of another IPI entry from the previous release, Z, in some circumstances. If none of the IPI entries in the new release have claimed the identifier of Z according to either of the two rules above, and if X', an IPI entry in the new release, has not been assigned an identifier at all according to the same rules, then the identifier of Z is assigned to X' if their sequences are identical.

  4. The third rule was introduced for the IPI releases of May 2003, in order to avoid the recurrence of certain problems with identifier stability that were manifest in the releases of April 2003.

  5. Note that in some circumstances it is still possible for an IPI entry in the new release to have the same sequence as an IPI entry in a previous release, but a different identifier. For example, suppose two source database entries A and B were both referred to by IPI entry X (in the old IPI release), and A was the master. In the new release, if the sequence of A changes, and A and B now represent different IPI entries (and both are masters), the new IPI entry referring to A will retain the identifier of X (and the sequence version of this identifier will be incremented accordingly); while the new IPI entry referring to B will be assigned a new identifier. However, this is a relatively rare event (affecting 42 sequences, for example, in the transition from IPI HUMAN v2.18 to IPI HUMAN v2.19).
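The three propagation rules can be illustrated with a minimal Python sketch. The data layout here is an assumption made for the example (each old IPI entry is a ranked list of source entries plus a sequence, and each new entry is a master source entry plus a sequence), as are the function name and the placeholder identifiers; sequence version numbers are not modelled. The sketch also assumes that each source entry was referred to by at most one IPI entry in the previous release.

```python
def propagate_identifiers(old_entries, new_entries, fresh_ids):
    """Assign identifiers to the entries of a new release.

    old_entries: {ipi_id: (ranked_sources, sequence)}, master first (hypothetical layout).
    new_entries: list of (master_source_id, sequence).
    fresh_ids:   iterator yielding unused identifiers.
    Returns a list of identifiers parallel to new_entries.
    """
    new_by_master = {master: i for i, (master, _) in enumerate(new_entries)}
    assigned = [None] * len(new_entries)
    claimed = set()

    # Rules 1 and 2: the highest-ranked source entry of an old IPI entry
    # that is a master in the new release claims the old identifier.
    for ipi_id, (ranked_sources, _) in old_entries.items():
        for source in ranked_sources:
            i = new_by_master.get(source)
            if i is not None and assigned[i] is None:
                assigned[i] = ipi_id
                claimed.add(ipi_id)
                break

    # Rule 3: an unclaimed old identifier goes to a still-unassigned new
    # entry with an identical sequence; otherwise a new identifier is issued.
    unclaimed_by_sequence = {}
    for ipi_id, (_, sequence) in old_entries.items():
        if ipi_id not in claimed:
            unclaimed_by_sequence.setdefault(sequence, ipi_id)
    for i, (_, sequence) in enumerate(new_entries):
        if assigned[i] is None:
            reused = unclaimed_by_sequence.pop(sequence, None)
            assigned[i] = reused if reused is not None else next(fresh_ids)
    return assigned
```

Applied to the example in the note above (A and B both referred to by X, with A as master, then split into two entries after A's sequence changes), the new entry for A keeps the identifier of X and the new entry for B receives a fresh identifier, even though its sequence matches that of the old X.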

Secondary Identifiers

  • In IPI, we try to achieve identifier stability. However, on occasion IPI identifiers are lost although an entry from a source database continues to exist. For example, if source database entries X and Y are originally assigned to different IPI entries, but after a sequence change to entry X, X and Y are thereafter assigned to the same IPI entry, then necessarily, one of the IPI IDs previously used has become redundant.

  • Therefore, we have introduced secondary identifiers into IPI. Secondary identifiers can now be found in the AC line of IPI entries if downloaded in UniProt format. The first number on this line is the current IPI identifier; the other numbers are obsolete identifiers that now map to the current primary ID. Each obsolete identifier will appear in at most one valid entry in a single IPI release.

  • Secondary identifiers are assigned as follows:

    1. If an entry is deleted from IPI, but the identifier of one of the source database entries that mapped to it is still in a subsequent release, then the deleted ID becomes secondary to the ID of the IPI entry now associated with that source database entry.

    2. If more than one source entry that previously mapped to an obsolete IPI entry is referenced in the next release of IPI, an entry is selected to determine the new primary ID according to a preference hierarchy of the source databases (UniProtKB/Swiss-Prot, RefSeq, UniProtKB/TrEMBL, Ensembl). If there are two entries from the same database that both meet these criteria, the entry with the longest sequence is preferred.

    3. If two UniProtKB entries are merged, one UniProtKB identifier becomes secondary in UniProtKB. This identifier will no longer appear in IPI. However, if this secondary UniProtKB identifier was previously mapped to a different IPI entry to that of its new master in UniProtKB, then this IPI identifier (now obsolete) can be identified as secondary to the IPI entry now associated with the master UniProtKB entry.

    4. Additional sequence checks are carried out where one UniProtKB entry is represented several times in IPI (once per isoform) to ensure that identifiers are propagated to the correct entry only.
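Rules 1 and 2 for secondary identifiers amount to a small selection problem, sketched below in Python. The tuple layout, the function names and the `source_to_current_ipi` lookup are assumptions made for illustration; the sketch does not cover the UniProtKB merge case (rule 3) or the per-isoform sequence checks (rule 4).

```python
# Preference hierarchy of the source databases, as listed in rule 2.
DB_PREFERENCE = ["UniProtKB/Swiss-Prot", "RefSeq", "UniProtKB/TrEMBL", "Ensembl"]

def choose_determining_entry(candidates):
    """Pick the source entry that decides the new primary ID.

    candidates: list of (source_id, database, sequence) tuples for source
    entries that mapped to the obsolete IPI entry and still exist
    (hypothetical layout). The preferred database wins; within one
    database the longest sequence wins.
    """
    return min(
        candidates,
        key=lambda c: (DB_PREFERENCE.index(c[1]), -len(c[2])),
    )

def secondary_mapping(deleted_id, candidates, source_to_current_ipi):
    """Return {obsolete IPI ID: current primary IPI ID} for one deleted entry."""
    chosen_source = choose_determining_entry(candidates)[0]
    return {deleted_id: source_to_current_ipi[chosen_source]}
```

Encoding the hierarchy as a sort key keeps the database preference and the sequence-length tie-break in one place.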

Curation

IPI is produced by automatic mapping. However, if you inform us that certain associations are incorrect, we will take this into account when preparing future versions of IPI. Please contact us with any information.

Algorithm modifications for IPI v3.x


The algorithm for IPI v3.x is essentially still as described in Proteomics 4. With the release of v3.x, some minor changes have been made to the algorithm. These are summarised below:

  • New rule introduced to restrict protein merging (1): two proteins with non-identical sequences can only be merged if there is no evidence (from UniProtKB, Ensembl, Entrez Gene or the model organism databases) that they are encoded by different genes.

  • New rule introduced to restrict protein merging (2): two proteins with non-identical sequences cannot be merged if they come from the same non-redundant source database. A non-redundant source database is classified as such either because it is curated to a high standard, or because it represents a non-redundant set of predictions from a complete genome assembly.

  • During the clustering process, CCDS IDs are used to disallow the merging of proteins that have different sequences and cross-reference different CCDS IDs.

These changes have been introduced to support the aims of IPI (to provide protein sets containing 1 protein per non-identical isoform), and to enable us to move towards providing complementary gene sets.
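The three merge restrictions can be read as a single predicate over a pair of proteins. The following Python sketch is an illustration under stated assumptions: the `Protein` record, its field names and the `non_redundant_dbs` argument are hypothetical, and the gene evidence from UniProtKB, Ensembl, Entrez Gene and the model organism databases is reduced to a single optional gene label per protein.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Protein:
    accession: str
    database: str
    sequence: str
    gene: Optional[str] = None  # gene assignment, if any evidence exists
    ccds: Optional[str] = None  # cross-referenced CCDS ID, if any

def merge_allowed(a, b, non_redundant_dbs):
    """Return False if any of the v3.x rules forbids merging a and b.

    True only means that these rules do not block the merge.
    """
    if a.sequence == b.sequence:
        return True  # the restrictions apply to non-identical sequences only
    if a.gene and b.gene and a.gene != b.gene:
        return False  # rule 1: evidence of different genes
    if a.database == b.database and a.database in non_redundant_dbs:
        return False  # rule 2: same non-redundant source database
    if a.ccds and b.ccds and a.ccds != b.ccds:
        return False  # CCDS rule: different CCDS cross-references
    return True
```

Which databases count as non-redundant is passed in by the caller, since the text defines the category by criteria rather than by a fixed list.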

Supplementary data sets

With the releases of May 2006, the way supplementary data sets are filtered has been slightly modified to fit the changes in some of IPI's data sources. Clusters whose members are hypothetical proteins from a supplementary set, and for which there is no clear support for validity, are now filtered out of IPI.

A hypothetical protein is considered to be unsupported when:

  • it is not believed to be similar to any known protein (from the same species or not)
  • no InterPro domain was found in its sequence
  • it is a pseudogene candidate.

The new rules for classifying a data set as supplementary include the criterion that more than 50% of the set cannot be supported by any other data set used in IPI. These are now:

With the releases of February 2005, the concept of the supplementary data set has been introduced into IPI. A supplementary data set is a data set whose entries are cross-referenced by IPI only if sequence similarities or gene mappings have been found between them and entries from the primary data sets used to build IPI (either protein data sets, UniProtKB, Ensembl, RefSeq,... or gene data sets, Entrez Gene, HUGO, MGI,...).

Data sets are classified as supplementary to avoid inflating the size of IPI, especially with data that might now be out-of-date. A typical supplementary data set might be a data set produced by a one-off annotation project.

Supplementary data sets currently used by IPI:

  • The H-Invitational Database (H-InvDB) provides a protein data set based on clusters of cDNAs mapped back to the human genome. The resulting proteins can be regarded as alternative predictions of transcripts of identified genes in the human genome.














