IPI - International Protein Index - IPI History File Format
We make every effort to maintain stable IPI identifiers and to propagate these between releases.
However, IPI is built
from multiple data sources, many of which are themselves unstable: this instability is partially
reflected in IPI.
IPI history files (e.g. ipi.HUMAN.history.gz) provide information about the creation and deletion
of IPI IDs;
they also provide successor IDs for entries that have become secondary, and the reasons for the
deletion of IDs that have become invalid.
IPI history files can be downloaded for the current release from the
IPI FTP site.
Each line in the history file represents one IPI ID. Lines are ordered with the most recently created
IDs first.
The file is tab-delimited and consists of the following fields:
- IPI ID
- Release version when ID was created
- Release version when ID was deleted, if available or '-' if not
- Successor ID, if available or '-' if not
- Comments, if available or '-' if not. These comments can be of the following types:
- Propagated means that the deleted ID has been propagated to another IPI entry
(defined in field
#4) as a secondary accession number. For more details on IPI identifier propagation see
here.
- Master (P) defunct means that the master source database entry (identified by
its accession number
P) of the IPI entry with the deleted ID was deleted in the source database, and that this IPI ID
could not be
propagated to any successor entry in the next IPI release. This can happen at high frequency
when the gene
prediction algorithms used by source databases to predict protein sequences are significantly
revised.
- Master (P) now invalid means that master P of the former IPI cluster still exists in
the source
database but is no longer used in the construction of IPI. This usually happens as a consequence
of an
annotation update (e.g. if a UniProt curator realizes that an entry is wrongly assigned to Human
and changes
its species to some kind of virus).
- Source entry (P) defunct is applied when an IPI entry whose
master was from a
supplementary database previously also mapped to an entry from a non-supplementary database,
but no longer
maps to such an entry in the latest release. (Entries from some source databases, considered
'supplementary',
are considered for inclusion in IPI only if they match an entry from another source
database, or if
they map to a known gene. These entries can be chosen as the masters of IPI entries. However,
such an IPI
entry will be deleted, even if the supplementary entry continues to exist, if the supplementary
entry no
longer meets these criteria for inclusion in a subsequent IPI release.) Click
here for more details about supplementary data sets.
- Mapping to known gene now invalid was used when an IPI entry was previously created
although it was linked only to entries from supplementary databases, because it could also be mapped to a
known gene. However, the link to a known gene has not been confirmed in a subsequent release, leading to the
deletion of the IPI entry. Following changes in the way supplementary data sets are dealt with (see
here for details), this comment will not apply to IPI
entries which are
removed from the May 2006 releases onwards. Instead, the following, more appropriate comment will be used:
- Unsupported hypothetical protein is used when an IPI entry was previously created
although it was linked only to entries from supplementary databases, because the master entry
had support
for its validity (as explained here). However, this support
has not been
confirmed in a subsequent release, leading to the deletion of the IPI entry.
- Source entry (P) now invalid is applied in cases where an entry has been dropped
from IPI because
the entries from non-supplementary source databases previously mapped to this IPI entry are no
longer used in
the construction of IPI (as in case 4).
- Identified as putative MHC allele means that an IPI entry from a previous release is
now
identified as an MHC allele and excluded from the final IPI data sets.
- Master (P) rejected as short fragmentory sequence means that an IPI entry from a
previous release
is now identified as a fragmentary sequence shorter than 100 AA and thus excluded from the final IPI
data sets.
- Master (P) lost #, under investigation or Under investigation simply means
that these cases
are not yet supported and are being investigated. We hope to be able to provide more
information about the
fate of these IPI entries in subsequent releases.
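
The five-field layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official IPI tool. It assumes that the file has no header line, that every record holds the five tab-separated fields in the order listed, and that '-' marks a missing value. The names HistoryRecord, parse_history and final_successor are hypothetical, and the IDs and release versions in SAMPLE are invented purely for illustration. Following a chain of successors to its end is likewise an assumption about how the successor field might be used, not a documented IPI procedure.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class HistoryRecord(NamedTuple):
    """One line of a history file (hypothetical container, not part of IPI)."""
    ipi_id: str
    created: str              # release version when the ID was created
    deleted: Optional[str]    # release version when deleted, None if '-'
    successor: Optional[str]  # successor ID, None if '-'
    comment: Optional[str]    # comment text, None if '-'


def _dash_to_none(value: str) -> Optional[str]:
    """Map the '-' placeholder (or an empty field) to None."""
    value = value.strip()
    return None if value in ("", "-") else value


def parse_history(lines: Iterable[str]) -> Iterator[HistoryRecord]:
    """Yield one HistoryRecord per non-empty, tab-delimited line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        # Assumption: pad short lines so that missing trailing fields read as '-'.
        fields += ["-"] * (5 - len(fields))
        ipi_id, created, deleted, successor, comment = fields[:5]
        yield HistoryRecord(
            ipi_id.strip(),
            created.strip(),
            _dash_to_none(deleted),
            _dash_to_none(successor),
            _dash_to_none(comment),
        )


def final_successor(ipi_id: str, records: Dict[str, HistoryRecord]) -> str:
    """Follow successor IDs until an ID without a successor is reached."""
    seen = set()
    current = ipi_id
    while (
        current in records
        and records[current].successor
        and current not in seen  # guard against accidental cycles
    ):
        seen.add(current)
        current = records[current].successor
    return current


# Invented sample data, most recently created ID first.
SAMPLE = (
    "IPI00000003\t3.10\t-\t-\t-\n"
    "IPI00000002\t3.05\t3.10\tIPI00000003\tPropagated\n"
    "IPI00000001\t3.01\t3.05\tIPI00000002\tPropagated\n"
)

records = {rec.ipi_id: rec for rec in parse_history(SAMPLE.splitlines())}
live_ids = [rec.ipi_id for rec in records.values() if rec.deleted is None]
```

To read a downloaded file rather than the sample string, the same function can be fed an open text handle, for example `with gzip.open("ipi.HUMAN.history.gz", "rt") as handle: records = {r.ipi_id: r for r in parse_history(handle)}` after `import gzip`. Comment strings are kept verbatim because several of them embed the accession number P of the master or source entry.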