IPI - Algorithm
Construction of IPI
Introduction
IPI is produced automatically by mapping between the different
datasets on the basis of protein similarity. The set is assembled by a
combination of 1:1 reciprocal best matching between entries from
different databases, and 1:m sub-fragment matching between (and
possibly also within) different databases. The key questions are thus:
- How to perform matching between a pair of databases and
- How to combine the results of each set of mappings into one
set.
Various approaches have been tried out and the one described below
appears to produce the best results and has been used to create IPI.
Outline Approach
- Sequence sets are downloaded for all proteins described by
Ensembl and RefSeq for each genome in IPI. A third set is made by
taking all entries from this species in UniProtKB, and expanding the
set of sequences explicitly displayed in these by using the
information in the entries' feature tables to generate sequences for
each known isoform (see this document for more details).
- Pairwise inter-database similarity searches are performed.
- Match length is determined by the maximal aggregation of
non-overlapping regions of sequence identity, subject to the
requirement that the actually matching regions must represent at
least 95% of the match span.
- For each database pair, all reciprocally best-matching
protein pairs are identified (i.e. all pairs X-1, Y-2 such that X-1
is the best match in database X of Y-2, and Y-2 is the best match in
database Y of X-1).
- Reciprocally best-matching pairs are combined into clusters
of mapped proteins.
- Orphan proteins (that have not reciprocally best matched to
any other protein) are added to the appropriate clusters if they are
sub-fragments of entries that have reciprocally best matched.
- An entry is considered a sub-fragment of another entry if
over 95% of its length is covered by the span of its optimal
aggregation of matches to the other sequence.
- At both of these stages conflict resolution is performed
where necessary to ensure that each set of matched proteins contains
no more than one entry from UniProtKB/Swiss-Prot or RefSeq. Multiple
UniProtKB/TrEMBL and H-InvDB entries are allowed in a single set.
Multiple Ensembl entries are allowed to map to 1 IPI entry, but only
after sub-fragment matching.
- All UniProtKB/Swiss-Prot, RefSeq or Ensembl proteins that
are still unmatched are then added as singletons to the list of
protein clusters.
- UniProtKB/TrEMBL is known to contain some redundancy.
Therefore, UniProtKB/TrEMBL entries not matched to entries from other
databases are first clustered among themselves (based on further
sequence similarity comparisons), and only then added to the list of
all protein clusters.
- Each cluster of mapped entries corresponds to a single entry
in IPI.
- A sequence is attached to each IPI entry, taken from one of
the constituent databases (according to the hierarchy:
UniProtKB/Swiss-Prot, RefSeq, UniProtKB/TrEMBL, Ensembl, with the
proviso that the sequence of identified sub fragments is never used
instead of a non-sub fragment sequence).
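
The matching and clustering steps above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not
the code used to build IPI: the score dictionaries, the function
names and the union-find clustering are hypothetical choices, and
conflict resolution (at most one UniProtKB/Swiss-Prot or RefSeq entry
per cluster) is left out.

```python
from collections import defaultdict

def best_matches(scores):
    """Map each query to its best-scoring target.

    scores: {(query, target): score} for one search direction
    (an assumed input format; ties keep the first match seen).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def reciprocal_best_pairs(x_vs_y, y_vs_x):
    """Pairs (x, y) where y is x's best match and x is y's best match."""
    best_xy = best_matches(x_vs_y)
    best_yx = best_matches(y_vs_x)
    return {(x, y) for x, y in best_xy.items() if best_yx.get(y) == x}

def is_sub_fragment(entry_length, match_span, threshold=0.95):
    """True if the match span covers over 95% of the entry's length."""
    return match_span / entry_length > threshold

def cluster(pairs):
    """Combine matched pairs into clusters with a simple union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path compression
            item = parent[item]
        return item

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return list(groups.values())
```

For example, if X-1 and Y-2 are each other's best matches while X-4
also hits Y-2 with a lower score, only (X-1, Y-2) is returned as a
reciprocal pair; X-4 remains an orphan that could later join the
cluster through the sub-fragment test.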
Identifier propagation
Identifiers are propagated between releases of IPI according to
the following rules:
- If the master source database entry defining IPI entry X' in
a new release was also the master of an IPI entry X in the previous
release, then X' is assigned the identifier of X.
- Failing this, if the master source database entry defining
IPI entry X' in a new release was referred to by an IPI entry Y in
the previous release, but was not the master of Y, X' can still be
assigned the identifier of Y in certain circumstances. All source
database entries referred to by Y are ranked (master, 2nd choice, 3rd
choice, etc.); and have claim on the identifier of Y in rank order
(i.e. if Y's master is also the master of a new IPI entry Y', then Y'
will be preferentially assigned the identifier of Y; but if Y's
master is not a master in the new IPI release, and the 2nd ranked
entry referred to by Y is the master of IPI entry X', then X' can be
assigned the identifier of Y).
- If the master source database entry defining X' was not
included in the previous IPI release, or if it was attached to IPI
entry Y in the previous release but does not have first claim on the
identifier of Y, X' may be assigned the identifier of another IPI
entry from the previous release, Z, in some circumstances. If none of
the IPI entries in the new release have claimed the identifier of Z
according to either of the two rules above, and if X', an IPI entry
in the new release, has not been assigned an identifier at all
according to the same rules, then the identifier of Z is assigned to
X' if their sequences are identical.
- The third rule was introduced with the IPI releases of May
2003, in order to avoid a recurrence of the identifier stability
problems seen in the releases of April 2003.
- Note that in some circumstances it is still possible for an
IPI entry in the new release to have the same sequence as an IPI
entry in a previous release, but a different identifier. For example,
suppose two source database entries A and B were both referred to by
IPI entry X (in the old IPI release), and A was the master. In the
new release, if the sequence of A changes, and A and B now represent
different IPI entries (and both are masters), the new IPI entry
referring to A will retain the identifier of X (and the sequence
version of this identifier will be incremented accordingly); while
the new IPI entry referring to B will be assigned a new identifier.
However, this is a relatively rare event (affecting 42 sequences, for
example, in the transition from IPI HUMAN v2.18 to IPI HUMAN v2.19).
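
The three propagation rules can be sketched as below. This is a
simplified illustration, not the IPI implementation: the dictionary
layout and the name `propagate_identifiers` are hypothetical, each
source entry is assumed to belong to only one old IPI entry, and the
unspecified "certain circumstances" of the second and third rules are
ignored.

```python
def propagate_identifiers(old_entries, new_entries):
    """Assign identifiers from the previous release to new entries.

    old_entries: {ipi_id: {"ranked_members": [source ids, master first],
                           "sequence": str}}
    new_entries: {key: {"master": source id, "sequence": str}}
    Returns {key: ipi_id}; keys absent from the result need a new identifier.
    """
    master_to_new = {entry["master"]: key for key, entry in new_entries.items()}
    assigned = {}
    claimed = set()

    # Rules 1 and 2: members of each old entry claim its identifier in
    # rank order (master first, then 2nd choice, and so on).
    for old_id, old in old_entries.items():
        for source_id in old["ranked_members"]:
            new_key = master_to_new.get(source_id)
            if new_key is not None and new_key not in assigned:
                assigned[new_key] = old_id
                claimed.add(old_id)
                break

    # Rule 3: an unclaimed old identifier goes to a still-unassigned
    # new entry whose sequence is identical.
    unclaimed_by_sequence = {}
    for old_id, old in old_entries.items():
        if old_id not in claimed:
            unclaimed_by_sequence.setdefault(old["sequence"], old_id)
    for new_key, new in new_entries.items():
        if new_key not in assigned:
            old_id = unclaimed_by_sequence.pop(new["sequence"], None)
            if old_id is not None:
                assigned[new_key] = old_id
    return assigned
```

Applied to the example in the note above (old entry X refers to A as
master and to B; in the new release A and B are both masters and A's
sequence has changed), the entry for A keeps the identifier of X and
the entry for B is left without an inherited identifier, even though
its sequence equals the old sequence of X.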
Secondary Identifiers
- In IPI, we try to achieve identifier stability. However, on
occasion IPI identifiers are lost although an entry from a source
database continues to exist. For example, if source database entries
X and Y are originally assigned to different IPI entries, but after a
sequence change to entry X, X and Y are thereafter assigned to the
same IPI entry, then necessarily, one of the IPI IDs previously used
has become redundant.
- Therefore, we have introduced secondary identifiers into
IPI. Secondary identifiers can now be found in the AC line of IPI
entries if downloaded in UniProt format. The first number on this
line is the current IPI identifier; the other numbers are obsolete
identifiers that now map to the current primary ID. Each obsolete
identifier appears in at most one valid entry in a single IPI
release.
- Secondary identifiers are assigned as follows:
- If an entry is deleted from IPI, but the identifier of one
of the source database entries that mapped to it is still in a
subsequent release, then the deleted ID becomes secondary to the ID
of the IPI entry now associated with that source database entry.
- If more than one source entry that previously mapped to an
obsolete IPI entry is referenced in the next release of IPI, an
entry is selected to determine the new primary ID according to a
preference hierarchy of the source databases (UniProtKB/Swiss-Prot,
RefSeq, UniProtKB/TrEMBL, Ensembl). If there are two entries from
the same database that both meet this criterion, the entry with the
longest sequence is preferred.
- If two UniProtKB entries are merged, one UniProtKB
identifier becomes secondary in UniProtKB. This identifier will no
longer appear in IPI. However, if this secondary UniProtKB
identifier was previously mapped to a different IPI entry to that of
its new master in UniProtKB, then this IPI identifier (now obsolete)
can be identified as secondary to the IPI entry now associated with
the master UniProtKB entry.
- Additional sequence checks are carried out where one
UniProtKB entry is represented several times in IPI (once per
isoform) to ensure that identifiers are propagated to the correct
entry only.
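
As a rough illustration of the preference hierarchy used when several
source entries from an obsolete IPI entry survive, the choice of the
new primary ID might look like the sketch below. The tuple layout,
the function name and the identifiers in the usage example are all
hypothetical; only the database order and the longest-sequence
tie-break come from the rules above.

```python
# Preference order of the source databases, as listed in the rules above.
HIERARCHY = ["UniProtKB/Swiss-Prot", "RefSeq", "UniProtKB/TrEMBL", "Ensembl"]

def choose_primary_id(candidates):
    """Pick the current IPI ID that an obsolete ID becomes secondary to.

    candidates: list of (source_db, sequence_length, current_ipi_id)
    tuples, one per surviving source entry (an assumed layout).
    The preferred database wins; within one database the longest
    sequence wins.
    """
    best = min(candidates, key=lambda c: (HIERARCHY.index(c[0]), -c[1]))
    return best[2]
```

For instance, given one Ensembl entry of length 900 and two RefSeq
entries of lengths 400 and 450, the RefSeq entry of length 450
determines the new primary ID.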
Curation
IPI is produced by automatic mapping.
However, if you wish to inform us that certain associations are
incorrect we will take this into account in preparing future versions
of IPI.
Please contact us with
any information.
Algorithm modifications for IPI v3.x
The algorithm for IPI v3.x is essentially still as described in Proteomics 4. With the release of v3.x, some
minor changes have been made to the algorithm. These are summarised
below:
- New rule introduced to restrict protein merging (1): two
proteins with non-identical sequences can only be merged if there is
no evidence (from UniProtKB, Ensembl, Entrez Gene or the model
organism databases) that they are encoded by different genes.
- New rule introduced to restrict protein merging (2): two
proteins with non-identical sequences cannot be merged if they come
from the same non-redundant source database. A non-redundant source
database is classified as such either because it is curated to a high
standard, or because it represents a non-redundant set of predictions
from a complete genome assembly.
- During the clustering process, CCDS IDs are used to prevent
merging: two proteins are not merged if they have different sequences
and cross-reference different CCDS IDs.
The changes have been introduced to support the aims of IPI (to
provide protein sets containing 1 protein per non-identical isoform),
and to enable us to move towards providing complementary gene sets.
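
The three v3.x merge restrictions can be read as a single predicate,
sketched here under assumptions: the field names (`sequence`, `gene`,
`db`, `ccds`) and the `non_redundant_dbs` parameter are hypothetical,
and the text does not list which source databases count as
non-redundant, so the caller has to supply that set.

```python
def merge_allowed(a, b, non_redundant_dbs):
    """Return False if any v3.x restriction forbids merging a and b.

    a, b: dicts with keys "sequence", "gene", "db" and "ccds"
    ("gene" and "ccds" may be None when no annotation is available).
    """
    # The restrictions only concern non-identical sequences.
    if a["sequence"] == b["sequence"]:
        return True
    # (1) Evidence that the proteins are encoded by different genes.
    if a["gene"] and b["gene"] and a["gene"] != b["gene"]:
        return False
    # (2) Both come from the same non-redundant source database.
    if a["db"] == b["db"] and a["db"] in non_redundant_dbs:
        return False
    # (3) They cross-reference different CCDS IDs.
    if a["ccds"] and b["ccds"] and a["ccds"] != b["ccds"]:
        return False
    return True
```

Returning True here only means that none of these three rules blocks
the merge; whether two proteins are actually merged still depends on
the matching steps described earlier.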
Supplementary data sets
With the releases of May 2006, the way supplementary data sets
are filtered has been slightly modified to fit the changes in some of
IPI's data sources. Clusters whose members are hypothetical proteins
from a supplementary set, and for which there is no clear support for
validity, are now filtered out of IPI.
A hypothetical protein is considered to be unsupported when:
- it is not believed to be similar to any known protein (from
the same species or not)
- no InterPro domain was found from its sequence
- it is a pseudogene candidate.
The rules for classifying a data set as supplementary now include
the criterion that more than 50% of the set cannot be supported by any
other data set used in IPI. The supplementary data sets are now:
- The H-Invitational Database (H-InvDB) representative set.
- The Reference Sequence Project (RefSeq) ab initio predictions
identified by the curation status MODEL.
With the releases of February 2005, the concept of the
supplementary data set was introduced into IPI. A supplementary data
set is a data set whose entries are cross-referenced by IPI only if
sequence similarities or gene mappings have been found between them
and entries from the primary data sets used to build IPI (either
protein data sets, UniProtKB, Ensembl, RefSeq,... or gene data sets,
Entrez Gene, HUGO, MGI,...).
Data sets are classified as supplementary to avoid inflating
the size of IPI, especially with data that might now be out-of-date. A
typical supplementary data set might be a data set produced by a
one-off annotation project.
Supplementary data sets currently used by IPI:
- The H-Invitational Database (H-InvDB)
provides a protein data set based on clusters of cDNAs mapped back to
the human genome. The resulting proteins can be regarded as
alternative predictions of transcripts of identified genes in the
human genome.