IPI - International Protein Index - Frequently Asked Questions
How can I identify the source of an IPI entry?
The source of each IPI entry can be identified by downloading the data set in UniProt (Swiss-Prot) format (download
the files ending .dat from the FTP server). In the cross-references section of each entry, the
cross-reference to the database entry that provides the sequence of the IPI entry is marked by the presence of the
letter 'M' in the fourth field.
For example:
ID IPI00177321.1 IPI; PRT; 316 AA.
DR RefSeq/predicted; XP_168060; GI:22060273; M.
DR ENSEMBL; ENSP00000343431; ENSG00000189070; -.
In this entry, an Ensembl and a RefSeq entry have been merged into one IPI entry, and the RefSeq sequence has
been used.
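As a minimal sketch of this rule, the following Python function scans the DR lines of a single entry and returns the cross-reference whose fourth field is 'M'. The function name and the simple semicolon splitting are assumptions made for illustration; they are not part of any IPI tool.

```python
def find_master_xref(entry_text):
    """Return (database, primary_id) of the DR line flagged 'M', or None."""
    for line in entry_text.splitlines():
        if not line.startswith("DR"):
            continue
        # Drop the 'DR' line code and the trailing full stop, then split on ';'.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if len(fields) >= 4 and fields[3] == "M":
            return fields[0], fields[1]
    return None


example_entry = """\
ID IPI00177321.1 IPI; PRT; 316 AA.
DR RefSeq/predicted; XP_168060; GI:22060273; M.
DR ENSEMBL; ENSP00000343431; ENSG00000189070; -.
"""

master = find_master_xref(example_entry)
print(master)
```

For the entry shown above, the function picks out the RefSeq cross-reference, matching the explanation that the RefSeq sequence was used.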
What is the difference between IPI and UniProtKB?
UniProt (Universal Protein Resource) is a comprehensive catalog of information on proteins. It is a central
repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and
PIR.
UniProt comprises three components, each optimised for different uses. The UniProt Knowledgebase
(UniProtKB) is the central access point for extensive curated protein information, including function,
classification, and cross-references. The UniProt Non-redundant Reference (UniRef) databases combine closely related
sequences into a single record to speed searches. The UniProt Archive (UniParc) is a comprehensive repository,
reflecting the history of all protein sequences.
The UniProt Knowledgebase contains protein data from all species where it is available. This data includes
protein sequences determined by direct experiment and derived from the sequencing of individual DNA clones or RNA
molecules. It does not, however, necessarily include predictions of protein sequences derived from the complete
genome sequence of every organism where this has been determined. This is particularly an issue in higher
eukaryotes. Methods for protein prediction in these species are still undergoing improvement and the predictions of
groups (such as Ensembl and RefSeq) derived using these methods therefore manifest some instability. Additionally,
some years before the sequence of an organism is completed, a preliminary assembly of its genome may become
available, from which it is possible to make provisional protein predictions that will subsequently need revision.
For these reasons, protein predictions in these species are often not submitted to the EMBL/GenBank/DDBJ nucleotide
sequence databases, and do not appear in the UniProt Knowledgebase.
IPI protein sets are made for a limited number of higher eukaryotic species whose genomic sequence has been
completely determined but where there are a large number of predicted protein sequences that are not yet in UniProt.
IPI takes data from UniProt and also from sources of such predictions, and combines them non-redundantly into a
comprehensive proteome set for each species.
What is the difference between IPI and UniParc?
UniParc (the UniProt Archive) is a database of protein sequences. Every UniParc Identifier (UPI) is unique and
stable for a particular sequence.
IPI is a database of annotated proteins. Thus if the sequence associated with a particular IPI entry changes,
the IPI ID associated with it will usually remain the same. Conversely, it is possible for the same sequence to have
two different IPI IDs, if that sequence is associated with different source database entries in different releases.
The following example illustrates what would happen if, between two IPI releases, the sequence of source
database entry A1 changed from AAA to PPP and the sequence of source database entry A2
changed from MMM to AAA. IPI IDs would remain stably associated with A1 and A2;
A2 would acquire the UPI previously assigned to A1; and A1 would get a new UPI.
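This scenario can be sketched in Python as follows. The sketch is a deliberate simplification: it assumes that UniParc-style identifiers are keyed purely on the sequence and IPI-style identifiers purely on the source entry. The labels it prints (IPI-1, UPI-1, and so on) are invented for illustration and are not real identifiers.

```python
class IdRegistry:
    """Hands out one label per distinct key, in order of first appearance."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.labels = {}

    def get(self, key):
        if key not in self.labels:
            self.labels[key] = f"{self.prefix}-{len(self.labels) + 1}"
        return self.labels[key]


upi = IdRegistry("UPI")  # keyed on the sequence itself (UniParc-like)
ipi = IdRegistry("IPI")  # keyed on the source entry (IPI-like)


def snapshot(release):
    """Map each source entry to its (IPI label, UPI label) in this release."""
    return {entry: (ipi.get(entry), upi.get(seq)) for entry, seq in release.items()}


before = snapshot({"A1": "AAA", "A2": "MMM"})
after = snapshot({"A1": "PPP", "A2": "AAA"})

for entry in ("A1", "A2"):
    print(entry, before[entry], "->", after[entry])
```

In the output, the IPI label of each source entry is unchanged between the two releases, while the UPI label follows the sequence: the label first given to AAA moves from A1 to A2, and PPP receives a label not seen before.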
IPI contains cross-references to UniParc (in the .dat and .xrefs files), and the two resources
can be used in conjunction.
What is the difference between IPI and UniRef 100?
IPI is included in NRef100. NRef100 also includes certain other sequences
(from the UniProtKB, in particular, and also from some other sources)
that are distinct from sequences in IPI, but which the IPI process has
identified as being alternative versions of the same sequence
(for example, truncated sequences, variant sequences, etc.).
Why do the sizes of Ensembl and IPI data sets differ so much?
IPI is built in order to provide maximum coverage of the major publicly available protein
(and gene) databases, yet also to minimize the redundancy of this large body of data
(more than 200,000 source database entries are reduced to 56,000 entries in IPI human v3.12).
This is done by merging data from different data source entries into a single IPI entry when
there is evidence that these source entries represent the same protein (i.e. a particular
gene product).
But while we would like to reduce IPI to the minimum possible size, there are a number of ways in
which the source data (as presented) is insufficiently consistent to allow us to merge data:
- Some entries with similar protein sequences are not merged in IPI because they have
cross-references to different entries in a gene database (e.g. the HGNC or Entrez Gene
databases), suggesting they are the products of different genes.
- An inflation in the set size may be an inevitable consequence of scaling up the methods of
pairwise comparison used to identify matching entries to increasing numbers of data
sources. Consider the following situation:
Entry pA1 from database A is the best reciprocal match of entry pB1 in database B, and pB1
is the best reciprocal match of pC1 in database C, and pC1 is the best reciprocal match
of pA2 in database A. If database A is supposed to be internally non-redundant, this
implies that there are at least 2 different proteins represented by these 4 entries: if
pB1 and pC1 have the weakest match, we would suggest 2 IPI entries, one mapping to pA1
and pB1, and one mapping to pA2 and pC1. But if one considered databases B and C alone,
one might map pB1 to pC1 and identify them as a single protein product.
The data set created by the IPI process is therefore liable to be larger than the data set that
would be produced if, for example, one were to take all the Ensembl human sequences, and add
additional sequences from other data sources whose sequence similarity with an Ensembl entry lies
below the thresholds used in IPI. However, all cross-references in an IPI entry are mutually
compatible, and the size of the IPI set reflects the reported diversity in sequence and annotation
represented in the data sources.
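The chained best-match situation described above can be sketched in Python. This is not the IPI algorithm: it is a toy greedy procedure, assumed here for illustration only, that merges the strongest matches first and refuses any merge that would put two entries from the same database into one cluster. The match scores are invented.

```python
def cluster(matches, database_of):
    """Greedily merge matched entries, strongest match first.

    matches: list of (entry_a, entry_b, score) best reciprocal matches.
    database_of: maps each entry name to its source database.
    """
    cluster_of = {}
    for a, b, _score in sorted(matches, key=lambda m: m[2], reverse=True):
        ca = cluster_of.setdefault(a, {a})
        cb = cluster_of.setdefault(b, {b})
        if ca is cb:
            continue
        dbs_a = {database_of[e] for e in ca}
        dbs_b = {database_of[e] for e in cb}
        if dbs_a & dbs_b:
            # Merging would place two entries from one database together.
            continue
        merged = ca | cb
        for e in merged:
            cluster_of[e] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


database_of = {"pA1": "A", "pA2": "A", "pB1": "B", "pC1": "C"}

# Hypothetical scores; the pB1-pC1 match is the weakest of the three.
all_matches = [("pA1", "pB1", 0.99), ("pB1", "pC1", 0.90), ("pC1", "pA2", 0.98)]

three_databases = cluster(all_matches, database_of)
two_databases = cluster([("pB1", "pC1", 0.90)], database_of)
print(three_databases)
print(two_databases)
```

With all three databases, the sketch yields two clusters (pA1 with pB1, and pA2 with pC1); with databases B and C alone, pB1 and pC1 fall into a single cluster. Adding a data source has therefore increased the number of entries representing these sequences.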
Why do IPI identifiers change?
Every effort is made to maintain stable IPI identifiers. When identifiers disappear from source databases,
attempts are made to propagate the corresponding IPI identifiers onto the IPI entries representing their successors.
But often there is no clear successor for a disappeared entry, or two entries from one source database (each
previously represented by a separate IPI entry) are merged into a single entry (so one IPI identifier becomes
redundant).
A recent development has been the introduction of secondary ACs into IPI so that redundant IPI identifiers can
be tracked to their successors. Click here for details.
For a full description of how IPI identifiers are propagated, click here.
How can I track the history of a deleted or secondary IPI identifier?
For each species, an IPI history file is released. This file details the releases for which each IPI ID was a valid
primary identifier; in the case of entries that have become secondary, it also details the primary identifier to
which they have become secondary; for entries that have been deleted, the reason for deletion is given.
The IPI history files can be searched dynamically via the Quick Search box on the IPI homepage, or via the EBI SRS server.
To use the SRS server, click on the "Library page" tab and select "IPI history" under "Other
protein sequence databases." You can search against "IPI" and "IPI history" simultaneously
if you do not know if your ID of interest is included in the current release.
Why do protein names and sequences in IPI differ from those in the UniProt Knowledgebase?
Each IPI entry maps to one or more source database entries, one of which (either the one with the longest
sequence, or the "best annotated") is chosen as the master entry. The master entry provides the IPI entry
with its name and sequence.
Does IPI contain data from UniProtKB/TrEMBLnew?
No. IPI has never contained data derived from UniProtKB/TrEMBLnew, the update database for the UniProt
Knowledgebase. Entries were added to IPI once the data in UniProtKB/TrEMBLnew was promoted to UniProtKB/TrEMBL in
the subsequent TrEMBL release.
UniProtKB/TrEMBLnew has subsequently been discontinued. The last release was made on 22nd June 2004. All
protein translations submitted to the GenBank/EMBL/DDBJ nucleotide sequence databases are now automatically
incorporated into the next incremental UniProt Knowledgebase release (these occur fortnightly) and subsequently into
IPI.
Do sequences in IPI contain initiator methionines?
Each IPI entry contains the longest sequence described in a matching source database entry. Thus, if there is
a choice of source database entries, one of whose sequences contains the methionine and one not, the sequence
containing the methionine will be preferred.
Of the source databases currently in use in IPI, UniProtKB/TrEMBL, Ensembl, and RefSeq sequences generally
contain initiator methionines. UniProtKB/Swiss-Prot sequences also contain the initiator methionine, unless this
methionine is believed not to be present in the mature protein (due to proteolytic cleavage). In this case, the
methionine was not included in that sequence prior to UniProtKB release 9.5. After this release,
UniProtKB/Swiss-Prot will be changing its representation of sequences to include the initiator methionine in all
cases, similar to the other data sources.
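The longest-sequence rule can be illustrated with a short Python sketch. The source names and the two sequences below are hypothetical, and the sketch ignores the "best annotated" criterion and any tie-breaking, which this FAQ does not specify.

```python
def pick_longest(candidates):
    """Return the (source, sequence) pair with the longest sequence."""
    return max(candidates.items(), key=lambda item: len(item[1]))


# Hypothetical matching source entries for one protein: the same sequence
# with and without the initiator methionine.
candidates = {
    "source_with_met": "MKTAYIAK",
    "source_without_met": "KTAYIAK",
}

chosen_source, chosen_sequence = pick_longest(candidates)
print(chosen_source, chosen_sequence)
```

Because the sequence that starts with the methionine is one residue longer, it is the one selected.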
Are splice variants included in IPI as separate entries?
All annotated splice variants are included in IPI as separate entries (unless their protein sequences are
identical). To this end, IPI uses UniProt isoform identifiers to explicitly cross-reference individual isoforms
described in UniProt Knowledgebase entries (e.g. P13746-1). Unannotated splice variants may be represented as
independent entities or as an aggregated cluster, depending on the degree of sequence similarity (as there is no a
priori way of distinguishing between sequencing errors and genuine splice events from sequence alone).
What files are available for download?
For each species in IPI, the following files are available for download (sample file names are given for Homo
sapiens):
The format of each type of file is specified in separate help pages: see the left hand sidebar for links.
How can I retrieve IPI entries using DBfetch?
DBfetch is a web (and web-service) based tool for automatically retrieving database entries from the EBI. It
can be used to retrieve IPI entries by calling an appropriate URL, for example:
For more information, please see this page.
How can I see many IPI entries in a single page?
Again, you can do this using dbfetch, for example
For more information, please see this page.
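The example URLs from the original page are not reproduced here. As a hedged sketch, the Python function below only builds a query string in the style used by dbfetch; it does not contact any server. The endpoint is a placeholder, and the parameter names (db, id, format, style), the database name "ipi", and the comma-separated list of identifiers are assumptions that should be checked against the dbfetch documentation.

```python
from urllib.parse import urlencode


def build_dbfetch_url(base_url, ids, db="ipi", fmt="default", style="raw"):
    """Build a dbfetch-style query URL for one or more identifiers."""
    query = {"db": db, "id": ",".join(ids), "format": fmt, "style": style}
    # safe="," keeps the comma that separates multiple identifiers readable.
    return base_url + "?" + urlencode(query, safe=",")


BASE = "https://example.org/dbfetch"  # placeholder endpoint, not the real one

single = build_dbfetch_url(BASE, ["IPI00177321"])
# The second identifier is made up for illustration.
several = build_dbfetch_url(BASE, ["IPI00177321", "IPI00000001"])
print(single)
print(several)
```

Passing a list with several identifiers corresponds to the case of viewing many entries in a single page.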
Can I merge IPI MySQL species specific database dumps into one MySQL database?
IPI MySQL dumps are species specific in the same way as other IPI distributed files, and they are intended
to be loaded into different databases as explained in this page. However, the schema and table definitions
have been changed (since the July 2006 release) to allow different species to be loaded into a
single MySQL database. Here are some guidelines on how to achieve this:
- Load the first species dump normally, without editing it, e.g.
mysql -h host_name -u username -ppassword IPI < ipi.HUMAN.mysql
- Then edit each further species dump that you want to load into the same MySQL database to remove the lines
starting with "DROP TABLE IF EXISTS", e.g.
DROP TABLE IF EXISTS `organism`;
DROP TABLE IF EXISTS `release`;
DROP TABLE IF EXISTS `data_source`;
etc.
- Finally, load the edited files using the MySQL '--force' and '--disable-named-commands' options, e.g.
mysql -h host_name -u username -ppassword --force --disable-named-commands IPI < ipi.MOUSE.mysql.edited
mysql -h host_name -u username -ppassword --force --disable-named-commands IPI < ipi.RAT.mysql.edited
etc.
On Linux you can combine steps 2 and 3 into a single command line, e.g.
sed 's/^DROP TABLE IF EXISTS.*//' ipi.MOUSE.mysql | mysql -h host_name -u username -ppassword --force --disable-named-commands IPI
sed 's/^DROP TABLE IF EXISTS.*//' ipi.RAT.mysql | mysql -h host_name -u username -ppassword --force --disable-named-commands IPI
etc.
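Where sed is not available, the editing in step 2 can be sketched in Python instead. The helper names below are invented for this example, and the file names simply follow the pattern used above. Unlike the sed command, which blanks the matching lines, this sketch drops them entirely; the effect on loading should be the same.

```python
def strip_drop_table(lines):
    """Yield every line except those starting with 'DROP TABLE IF EXISTS'."""
    for line in lines:
        if not line.startswith("DROP TABLE IF EXISTS"):
            yield line


def edit_dump(in_path, out_path):
    """Write a copy of a dump file without its DROP TABLE statements."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(strip_drop_table(src))


# Small in-memory demonstration with made-up dump content.
sample = [
    "DROP TABLE IF EXISTS `organism`;\n",
    "CREATE TABLE `organism` (id INT);\n",
    "INSERT INTO `organism` VALUES (1);\n",
]
kept = list(strip_drop_table(sample))
print("".join(kept), end="")
```

A call such as edit_dump("ipi.MOUSE.mysql", "ipi.MOUSE.mysql.edited") would then produce the edited file used in step 3.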
Once the data have been loaded, you can check that the cumulative number of rows actually loaded
corresponds to the statistics available in table `release` (i.e. protein_entry_count, protein_count, etc.).