
IPI - International Protein Index - Frequently Asked Questions



How can I identify the source of an IPI entry?

The source of each IPI entry can be identified by downloading the data set in UniProt (Swiss-Prot) format (download the files ending .dat from the FTP server). In the cross-references section of each entry, the cross-reference to the database entry that provides the sequence of the IPI entry is marked by the presence of the letter 'M' in the fourth field.
For example:

        ID IPI00177321.1 IPI; PRT; 316 AA.
        DR RefSeq/predicted; XP_168060; GI:22060273; M.
        DR ENSEMBL; ENSP00000343431; ENSG00000189070; -.
    

In this entry, an Ensembl and a RefSeq entry have been merged into one IPI entry, and the RefSeq sequence has been used.
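
As a minimal sketch of the check described above, the following Python function scans the DR lines of a single entry and returns the cross-reference flagged with 'M'. It assumes that DR fields are separated by semicolons and that each line ends with a full stop, as in the example shown; the function name `find_master_xref` is illustrative and not part of any IPI tool.

```python
def find_master_xref(entry_text):
    """Return (database, primary_id) for the DR line flagged 'M', or None."""
    for line in entry_text.splitlines():
        line = line.strip()
        if not line.startswith("DR "):
            continue
        # Drop the "DR" line code and the trailing full stop, then split on ';'.
        # Assumption: the line layout matches the example in this FAQ.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if len(fields) >= 4 and fields[3] == "M":
            return fields[0], fields[1]
    return None


entry = """\
ID IPI00177321.1 IPI; PRT; 316 AA.
DR RefSeq/predicted; XP_168060; GI:22060273; M.
DR ENSEMBL; ENSP00000343431; ENSG00000189070; -.
"""
master = find_master_xref(entry)
print(master)
```

For the entry above, the function picks the RefSeq cross-reference, in agreement with the explanation given.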




What is the difference between IPI and UniProtKB?

UniProt (Universal Protein Resource) is a comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
UniProt comprises three components, each optimised for different uses. The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches. The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
The UniProt Knowledgebase contains protein data from all species where it is available. This data includes protein sequences determined by direct experiment and derived from the sequencing of individual DNA clones or RNA molecules. It does not, however, necessarily include predictions of protein sequences derived from the complete genome sequence of every organism where this has been determined. This is particularly an issue in higher eukaryotes. Methods for protein prediction in these species are still undergoing improvement, and the predictions of groups (such as Ensembl and RefSeq) derived using these methods therefore manifest some instability. Additionally, some years before the sequence of an organism is completed, a preliminary assembly of its genome may become available, from which it is possible to make provisional protein predictions that will subsequently need revision. For these reasons, protein predictions in these species are often not submitted to the EMBL/Genbank/DDBJ nucleotide sequence databases, and do not appear in the UniProt Knowledgebase.
IPI protein sets are made for a limited number of higher eukaryotic species whose genomic sequence has been completely determined but where there are a large number of predicted protein sequences that are not yet in UniProt. IPI takes data from UniProt and also from sources of such predictions, and combines them non-redundantly into a comprehensive proteome set for each species.




What is the difference between IPI and UniParc?
UniParc (the UniProt archive) is a database of protein sequences. Every UniParc Identifier (UPI) is unique and stable for a particular sequence.
IPI is a database of annotated proteins. Thus if the sequence associated with a particular IPI entry changes, the IPI ID associated with it will usually remain the same. Conversely, it is possible for the same sequence to have two different IPI IDs, if that sequence is associated with different source database entries in different releases.
The following example illustrates what would happen if, between two IPI releases, the sequence of source database entry A1 changed from AAA to PPP and the sequence of source database entry A2 changed from MMM to AAA. IPI IDs would remain stably associated with A1 and A2; A2 would acquire the UPI previously assigned to A1; and A1 would get a new UPI:

    IPI Release   Source Database Entry ID (version)   Sequence   UniParc ID   IPI ID
    1             A1 (v1)                              AAA        UPI 1        IPI 1
                  A2 (v1)                              MMM        UPI 2        IPI 2
    2             A1 (v2)                              PPP        UPI 3        IPI 1
                  A2 (v2)                              AAA        UPI 1        IPI 2


IPI contains cross-references to UniParc (in the .dat and .xrefs files), and the two resources can be used in conjunction.
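
The scenario above can be replayed with a toy model in Python. This is only an illustration of the two identifier policies (UniParc IDs follow the sequence, IPI IDs follow the source entry), not a description of how either resource actually assigns identifiers; the function name `assign_ids` and the ID numbering are assumptions made for the example.

```python
def assign_ids(releases):
    """Replay releases and return rows of (release, source, sequence, UPI, IPI)."""
    upi_by_sequence = {}  # UniParc-style: one stable ID per distinct sequence
    ipi_by_source = {}    # IPI-style: the ID stays with the tracked source entry
    rows = []
    for release_no, entries in enumerate(releases, start=1):
        for source_id, sequence in entries:
            upi = upi_by_sequence.setdefault(
                sequence, f"UPI {len(upi_by_sequence) + 1}")
            ipi = ipi_by_source.setdefault(
                source_id, f"IPI {len(ipi_by_source) + 1}")
            rows.append((release_no, source_id, sequence, upi, ipi))
    return rows


releases = [
    [("A1", "AAA"), ("A2", "MMM")],  # release 1
    [("A1", "PPP"), ("A2", "AAA")],  # release 2: both sequences have changed
]
rows = assign_ids(releases)
for row in rows:
    print(row)
```

In the second release, A2 picks up the UPI that A1 held in the first release, while both IPI IDs stay with their source entries.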




What is the difference between IPI and UniRef 100?
IPI is included in NRef100. NRef100 also includes certain other sequences (from the UniProtKB, in particular, and also from some other sources) that are distinct from sequences in IPI, but which the IPI process has identified as being alternative versions of the same sequence (for example, truncated sequences, variant sequences, etc.).




Why do the sizes of Ensembl and IPI data sets differ so much?
IPI is built to provide maximum coverage of the major publicly available protein (and gene) databases, while also minimizing the redundancy of such a large body of data (more than 200,000 source database entries are reduced to 56,000 entries in IPI human v3.12). This is done by merging data from different source database entries into a single IPI entry when there is evidence that these source entries represent the same protein (i.e. a particular gene product).
But while we would like to reduce IPI to the minimum possible size, there are a number of ways in which the source data (as presented) is insufficiently consistent to allow us to merge data:

  • Some entries with similar protein sequences are not merged in IPI because they have cross-references to different entries in a gene database (e.g. the HGNC or Entrez Gene databases), suggesting they are the products of different genes.

  • An inflation in the set size may be an inevitable consequence of scaling up the methods of pairwise comparison used to identify matching entries to increasing numbers of data sources. Consider the following situation:
    Entry pA1 from database A is the best reciprocal match of entry pB1 in database B, and pB1 is the best reciprocal match of pC1 in database C, and pC1 is the best reciprocal match of pA2 in database A. If database A is supposed to be internally non-redundant, this implies that there are at least 2 different proteins represented by these 4 entries: if pB1 and pC1 have the weakest match, we would suggest 2 IPI entries, one mapping to pA1 and pB1, and one mapping to pA2 and pC1. But if one considered databases B and C alone, one might map pB1 to pC1 and identify them as a single protein product.
The data set created by the IPI process is therefore liable to be larger than the data set that would be produced if, for example, one were to take all the Ensembl human sequences, and add additional sequences from other data sources whose sequence similarity with an Ensembl entry lies below the thresholds used in IPI. However, all cross-references in an IPI entry are mutually compatible, and the size of the IPI set reflects the reported diversity in sequence and annotation represented in the data sources.
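
The pA1/pB1/pC1/pA2 situation described above can be sketched in Python. The sketch merges the strongest matches first and refuses any merge that would put two entries from the same database into one cluster. This is an illustrative reading of the example, not the actual IPI procedure; the match scores and the function name `cluster_matches` are invented for the illustration.

```python
def cluster_matches(entries, matches):
    """Cluster entries linked by reciprocal best matches.

    entries: dict mapping entry ID to its source database.
    matches: list of (score, entry_id_1, entry_id_2) tuples.
    Databases are assumed to be internally non-redundant, so a cluster
    may hold at most one entry per database.
    """
    cluster_of = {e: frozenset([e]) for e in entries}
    for _score, a, b in sorted(matches, reverse=True):  # strongest first
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        if {entries[e] for e in ca} & {entries[e] for e in cb}:
            continue  # merging would combine two entries from one database
        merged = ca | cb
        for e in merged:
            cluster_of[e] = merged
    return sorted(sorted(c) for c in set(cluster_of.values()))


# Hypothetical scores: the pB1/pC1 match is the weakest of the three.
entries = {"pA1": "A", "pA2": "A", "pB1": "B", "pC1": "C"}
matches = [(0.95, "pA1", "pB1"), (0.80, "pB1", "pC1"), (0.90, "pC1", "pA2")]
three_sources = cluster_matches(entries, matches)
two_sources = cluster_matches({"pB1": "B", "pC1": "C"}, [(0.80, "pB1", "pC1")])
print(three_sources)
print(two_sources)
```

With all three databases the four entries fall into two clusters, whereas databases B and C considered alone yield a single cluster, which is the inflation effect described above.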




Why do IPI identifiers change?
Every effort is made to maintain stable IPI identifiers. When identifiers disappear from source databases, attempts are made to propagate the corresponding IPI identifiers onto the IPI entries representing their successors. But often there is no clear successor for a disappeared entry, or two entries from one source database (each previously represented by a separate IPI entry) are merged into a single entry (so one IPI identifier becomes redundant).
A recent development has been the introduction of secondary ACs into IPI so that redundant IPI identifiers can be tracked to their successors. Click here for details.
For a full description of how IPI identifiers are propagated, click here.




How can I track the history of a deleted or secondary IPI identifier?
For each species, an IPI history file is released. This file details the releases for which each IPI ID was a valid primary identifier; in the case of entries that have become secondary, it also details the primary identifier to which they have become secondary; for entries that have been deleted, the reason for deletion is given.

The IPI history files can be searched dynamically via the Quick Search box on the IPI homepage, or via the EBI SRS server. To use the SRS server, click on the "Library page" tab and select "IPI history" under "Other protein sequence databases." You can search against "IPI" and "IPI history" simultaneously if you do not know if your ID of interest is included in the current release.




Why do protein names and sequences in IPI differ from those in the UniProt Knowledgebase?
Each IPI entry maps to one or many source database entries, one of which (either the one with the longest sequence, or the "best annotated") is chosen as the master entry. The master entry provides the IPI entry with its name and sequence.
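
A minimal sketch of the "longest sequence" rule in Python is shown below. The entry names and sequences are invented, the "best annotated" criterion is not modelled, and `choose_master` is an illustrative name rather than part of the IPI pipeline.

```python
def choose_master(source_entries):
    """Return the source entry with the longest sequence (first wins ties)."""
    return max(source_entries, key=lambda e: len(e["sequence"]))


# Hypothetical source entries that have been merged into one IPI entry.
source_entries = [
    {"db": "UniProtKB/Swiss-Prot", "name": "Protein X", "sequence": "ASTKLV"},
    {"db": "Ensembl", "name": "Predicted protein X", "sequence": "MASTKLV"},
]
master = choose_master(source_entries)
print(master["db"], master["name"], master["sequence"])
```

Here the Ensembl entry becomes the master because its sequence is longer, so the IPI entry would carry a name and sequence that differ from those of the matching UniProtKB entry.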




Does IPI contain data from UniProtKB/TrEMBLnew?
No. IPI has never contained data derived from UniProtKB/TrEMBLnew, the update database for the UniProt Knowledgebase. Entries were added to IPI once the data in UniProtKB/TrEMBLnew was promoted to UniProtKB/TrEMBL in the subsequent TrEMBL release.
UniProtKB/TrEMBLnew has subsequently been discontinued. The last release was made on 22nd June 2004. All protein translations submitted to the Genbank/EMBL/DDBJ nucleotide sequence databases are now automatically incorporated into the next incremental UniProt Knowledgebase release (these occur fortnightly) and subsequently into IPI.




Do sequences in IPI contain initiator methionines?
Each IPI entry contains the longest sequence described in a matching source database entry. Thus, if there is a choice of source database entries, one whose sequence contains the methionine and one whose sequence does not, the sequence containing the methionine will be preferred.
Of the source databases currently in use in IPI, UniProtKB/TrEMBL, Ensembl, and RefSeq sequences generally contain initiator methionines. UniProtKB/Swiss-Prot sequences also contain the initiator methionine, unless this methionine is believed not to be present in the mature protein (due to proteolytic cleavage). In this case, the methionine was not included in that sequence prior to UniProtKB release 9.5. After this release, UniProtKB/Swiss-Prot will be changing its representation of sequences to include the initiator methionine in all cases, similar to the other data sources.




Are splice variants included in IPI as separate entries?
All annotated splice variants are included in IPI as separate entries (unless their protein sequences are identical). To this end, IPI uses UniProt isoform identifiers to explicitly cross-reference individual isoforms described in UniProt Knowledgebase entries (e.g. P13746-1). Unannotated splice variants may be represented as independent entities or as an aggregated cluster, depending on the degree of sequence similarity, as there is no a priori way of distinguishing between sequencing errors and genuine splice events from sequence alone.




What files are available for download?
For each species in IPI, the following files are available for download (sample file names are given for Homo sapiens):

  • Fasta file (ipi.HUMAN.fasta): Header plus sequence. The header contains the IPI ID, cross-references to major source databases, and some basic annotation (e.g. protein description).

  • UniProt (Swiss-Prot) format file (ipi.HUMAN.dat): Annotated sequence database entry with multiple cross-references, taxonomy, versioning information, etc.

  • Cross-references file (ipi.HUMAN.xrefs): Tab-delimited file containing cross-references for each protein in IPI (the file does not contain sequences).

  • Gene cross-references file (ipi.genes.HUMAN.xrefs): Tab-delimited file containing chromosomal location information for all genes encoding UniProtKB and Ensembl proteins, derived through the IPI process.

  • InterPro matches file (ipi.HUMAN.IPC): Tab-delimited file containing information about protein domains, families and motifs identified by InterProScan in sequences from IPI.

  • History file (ipi.HUMAN.history): Tab-delimited file giving the history of each IPI identifier. In this file, deleted identifiers are listed (together with the reason for their deletion), and secondary identifiers are tracked to their successors.

  • GOA file (gene_association.goa_human, on the GOA FTP site): File detailing annotations made using the GO controlled vocabulary for all proteins in the IPI data sets. Available via the GOA (GO Annotation) FTP site.

The format of each type of file is specified in separate help pages: see the left hand sidebar for links.
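
As a rough starting point for working with the Fasta file, the Python sketch below reads generic FASTA records as (header, sequence) pairs. It deliberately makes no assumption about the layout of the IPI header line, which is specified in the separate help pages; the function name `read_fasta` and the sample records are invented for the illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Invented sample records; a real file would be opened with open(path).
sample = [">entry1 some description\n", "MAST\n", "KLV\n", ">entry2\n", "MK\n"]
records = list(read_fasta(sample))
print(records)
```

Because the function accepts any iterable of lines, it can be passed an open file object directly, e.g. `read_fasta(open("ipi.HUMAN.fasta"))`.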




How can I retrieve IPI entries using DBfetch?
DBfetch is a web (and web-service) based tool for automatically retrieving database entries from the EBI. It can be used to retrieve IPI entries by calling an appropriate URL.

For more information, please see this page.




How can I see many IPI entries in a single page?
Again, you can do this using DBfetch.

For more information, please see this page.




Can I merge IPI MySQL species specific database dumps into one MySQL database?
IPI MySQL dumps are species specific in the same way as other IPI distributed files, and they are intended to be loaded into different databases as explained in this page. However, the schema and table definitions have been changed (since the July 2006 release) to allow different species to be loaded into a single MySQL database. Here are some guidelines on how to achieve this:

  1. load the first species dump normally without editing it, e.g.

    mysql -h host_name -u username -ppassword IPI < ipi.HUMAN.mysql


  2. then edit each subsequent species dump that you want to load into the same MySQL database, removing the lines starting with "DROP TABLE IF EXISTS", e.g.

    DROP TABLE IF EXISTS `organism`;
    DROP TABLE IF EXISTS `release`;
    DROP TABLE IF EXISTS `data_source`;
    etc...


  3. finally, load the edited files using MySQL '--force' and '--disable-named-commands' options, e.g.

    mysql -h host_name -u username -ppassword --force --disable-named-commands IPI < ipi.MOUSE.mysql.edited
    mysql -h host_name -u username -ppassword --force --disable-named-commands IPI < ipi.RAT.mysql.edited
    etc...


On Linux you can merge steps 2 and 3 into a single command line, e.g.

    sed 's/^DROP TABLE IF EXISTS.*//' ipi.MOUSE.mysql | mysql -h host_name -u username -ppassword --force --disable-named-commands IPI
    sed 's/^DROP TABLE IF EXISTS.*//' ipi.RAT.mysql | mysql -h host_name -u username -ppassword --force --disable-named-commands IPI
    etc...


Once the data have been loaded, you can check that the cumulative number of rows actually loaded corresponds to the statistics available in the `release` table (i.e. protein_entry_count, protein_count, etc.).
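
Where sed is not available, the editing described in step 2 could be done with a short Python script along the following lines. This is a sketch only: the function names are illustrative, and unlike the sed substitution (which leaves an empty line behind) it drops the matching lines entirely.

```python
def strip_drop_statements(lines):
    """Yield every line that does not start with 'DROP TABLE IF EXISTS'."""
    for line in lines:
        if not line.startswith("DROP TABLE IF EXISTS"):
            yield line


def write_edited_dump(src_path, dst_path):
    """Copy a species dump to dst_path without its DROP TABLE statements."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_drop_statements(src))


# Small in-memory check with invented dump lines.
dump = [
    "DROP TABLE IF EXISTS `organism`;\n",
    "CREATE TABLE `organism` (id INT);\n",
    "DROP TABLE IF EXISTS `release`;\n",
]
kept = list(strip_drop_statements(dump))
print(kept)
```

For example, `write_edited_dump("ipi.MOUSE.mysql", "ipi.MOUSE.mysql.edited")` would produce the edited file used in step 3.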




















