
IPI - International Protein Index - UniProt Format

IPI is released in a pseudo-UniProt format to supplement the original FASTA format file. The UniProt format file contains extra cross-reference information linking IPI to CleanEx, EPD, HGNC, GO, InterPro, Entrez Gene, MGI, ReAlSplice, RGD, RZPD, S/MARt DB, Transfac, UniParc, UTRdb, and ZFIN, and identifies the chromosome on which the gene encoding each IPI entry is found. To avoid potential contradictions, additional cross-references are taken only from the master entry behind each IPI entry.

A sample entry is shown below:

                ID IPI00003881.5 IPI; PRT; 415 AA.
                AC IPI00003881;
                DT 01-OCT-2001 (IPI Human rel. 2.00, Created)
                DT 06-OCT-2005 (IPI Human rel. 3.11, Last sequence update)
                DE SIMILAR TO HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN H.
                OS Homo sapiens (Human).
                OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
                OX NCBI_TaxID=9606;
                CC -!- GENE_LOCATION: Chr. 10:43201071-43224620:-1.
                DR UniProtKB/Swiss-Prot; P52597; HNRPF_HUMAN; -.
                DR Vega; OTTHUMP00000019482; OTTHUMG00000018029; M.
                DR Vega; OTTHUMP00000043413; OTTHUMG00000018029; -.
                DR Vega; OTTHUMP00000043414; OTTHUMG00000018029; -.
                DR ENSEMBL_HAVANA; ENSP00000348345; ENSG00000169813; -.
                DR ENSEMBL_HAVANA; ENSP00000349573; ENSG00000169813; -.
                DR ENSEMBL_HAVANA; ENSP00000363572; ENSG00000169813; -.
                DR REFSEQ_REVIEWED; NP_004957; GI:4826760; -.
                DR UniProtKB/TrEMBL; Q5T0N2; Q5T0N2_HUMAN; -.
                DR UniProtKB/TrEMBL; Q8NI96; Q8NI96_HUMAN; -.
                DR UniProtKB/TrEMBL; Q96AU2; Q96AU2_HUMAN; -.
                DR ENSEMBL; ENSP00000338477; ENSG00000169813; -.
                DR ENSEMBL; ENSP00000348345; ENSG00000169813; -.
                DR H-InvDB; HIT000003838; HIX0008779; -.
                DR H-InvDB; HIT000030409; HIX0008779; -.
                DR H-InvDB; HIT000031821; HIX0008779; -.
                DR H-InvDB; HIT000037199; HIX0008779; -.
                DR H-InvDB; HIT000037659; HIX0008779; -.
                DR UniParc; UPI0000000C5C; -; -.
                DR HGNC; 5039; HNRPF; -.
                DR Entrez Gene; 3185; HNRPF; -.
                DR UniGene; Hs.808; -; -.
                DR CCDS; CCDS7204.1; -; -.
                DR ReAlSplice protein; SL0000062; hnRNPF; factor involved in alternative splicing.
                DR trome; HTR002991; -; -.
                DR RZPD; Hs.808; -; Clones and other research material.
                DR CleanEx; HS_HNRPF; -; -.
                DR InterPro; IPR012677; a_b_plait_nuc_bd.
                DR InterPro; IPR000504; RNP1_RNA_bd.
                DR InterPro; IPR012996; Znf_CHHC.
                DR Pfam; PF00076; RRM_1; 3.
                DR Pfam; PF08080; zf-RNPHF; 1.
                DR SMART; SM00360; RRM; 3.
                DR PROSITE; PS50102; RRM; 2.
                DR GENE3D; G3D.3.30.70.330; Nucl_bd_a/b_plat; 3.
                SQ SEQUENCE 415 AA; 45672 MW; D14E170631FB1F31 CRC64;
                MMLGPEGGEG FVVKLRGLPW SCSVEDVQNF LSDCTIHDGA AGVHFIYTRE GRQSGEAFVE
                LGSEDDVKMA LKKDRESMGH RYIEVFKSHR TEMDWVLKHS GPNSADSAND GFVRLRGLPF
                GCTKEEIVQF FSGLEIVPNG ITLPVDPEGK ITGEAFVQFA SQELAEKALG KHKERIGHRY
                IEVFKSSQEE VRSYSDPPLK FMSVQRPGPY DRPGTARRYI GIVKQAGLER MRPGAYSTGY
                GGYEEYSGLS DGYGFTTDLF GRDLSYCLSG MYDHRYGDSE FTVQSTTGHC VHMRGLPYKA
                TENDIYNFFS PLNPVRVHIE IGPDGRVTGE ADVEFATHEE AVAAMSKDRA NMQHRYIELF
                LNSTTGASNG AYSSQVMQGM GVSAAQATYS GLESQSVSGC YGAGYSGQNS MGGYD
                //

An explanation of the line types is as follows:

ID line

IPI identifier with version number, followed by the data class ('IPI'), the molecule type and the sequence length ('PRT' and '415 AA' in the sample entry)

AC line

Current IPI accession number, followed by secondary identifiers

DE line

Description line, taken from the master sequence for this IPI entry

OS, OC, OX lines

Taxonomic classification

CC (comment) lines

In IPI, comment lines give the genomic location of the gene(s) to which an IPI entry has been mapped. The location information is based on the latest Ensembl assembly build.
  • "-!- GENE_LOCATION: ", followed by a description of the genomic location at which this gene is believed to be located.
The description of a genomic location contains the chromosome, start coordinate, end coordinate and strand, structured as follows:

Chr. 10:43201071-43224620:-1
Chr. <Chromosome Location>:<start coordinate>-<end coordinate>:<strand>

  • "Chromosome Location": the name of the chromosome on which this gene is believed to be located.
  • "start coordinate": the lowest genomic location (based on latest assembly build used by Ensembl) of the different transcripts expressed by the Ensembl gene mapped to this IPI entry.
  • "end coordinate": the highest genomic location (based on latest assembly build used by Ensembl) of the different transcripts expressed by the Ensembl gene mapped to this IPI entry.
  • "strand": the strand from which these transcripts are expressed (1 for FORWARD and -1 for REVERSE).
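As an illustration of the structure described above, the following is a minimal Python sketch that splits a GENE_LOCATION comment line into its four parts. The names `GeneLocation` and `parse_gene_location` are hypothetical and not part of any IPI tooling; the pattern assumes chromosome names contain no colons or whitespace.

```python
import re
from typing import NamedTuple


class GeneLocation(NamedTuple):
    chromosome: str
    start: int
    end: int
    strand: int  # 1 for FORWARD, -1 for REVERSE


# Matches the text that follows "GENE_LOCATION:", e.g.
# "Chr. 10:43201071-43224620:-1." (the trailing full stop is optional).
_LOCATION = re.compile(
    r"Chr\.\s*(?P<chrom>[^:\s]+):(?P<start>\d+)-\s*(?P<end>\d+):(?P<strand>-?1)\.?$"
)


def parse_gene_location(cc_line: str) -> GeneLocation:
    """Parse one 'CC -!- GENE_LOCATION: ...' line."""
    _, found, payload = cc_line.partition("GENE_LOCATION:")
    if not found:
        raise ValueError(f"not a GENE_LOCATION line: {cc_line!r}")
    match = _LOCATION.match(payload.strip())
    if match is None:
        raise ValueError(f"unrecognised location: {payload.strip()!r}")
    return GeneLocation(
        chromosome=match.group("chrom"),
        start=int(match.group("start")),
        end=int(match.group("end")),
        strand=int(match.group("strand")),
    )
```

Because an entry may carry several GENE_LOCATION lines, a caller would typically apply this function to each CC line in turn and collect the results.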

DR (database cross-reference) lines

Cross references in IPI can be to any of the constituent databases. The master entry of each IPI entry (the entry which supplies the IPI entry with its sequence and description line) is indicated by the presence of an 'M' in the fourth field of its cross-reference. Additional cross-references are added to a number of other databases (usually by inference from the master entry).
In more detail, the individual cross reference types are:
Database | Fields | Notes
CleanEx | Database name; Entry ID; nothing; nothing. |
Entrez Gene | Database name; Gene ID; Default gene symbol; nothing. |
ENSEMBL | Database name; Peptide ID; Gene ID; 'M' if master. |
ENSEMBL_HAVANA | Database name; Peptide ID; Gene ID; 'M' if master. | Corresponds to Havana curated subset of Ensembl
EPD | Database name; Entry AC; Entry ID; keyword. |
GENE3D | Database name; Method AC; Method name; nothing. | Now available for all IPI entries
GO | Database name; GO ID; GO term; nothing. | Available shortly for all entries with UniProtKB or Ensembl master sequences
HGNC | Database name; HGNC number; HGNC official gene symbol; nothing. |
H-InvDB | Database name; H-Inv cDNA ID; H-Inv cluster ID; 'M' if master. |
InterPro | Database name; Entry AC; Entry name; nothing. | Now available for all IPI entries
MGI | Database name; MGI ID; Gene symbol; nothing. |
PathoSign | Database name; Genotype AC; Mutated Molecule AC; Phenotype. |
Pfam | Database name; Method AC; Method name; Number of hits. | Now available for all IPI entries
PRINTS | Database name; Method AC; Method name; nothing. | Now available for all IPI entries
ProDom | Database name; Method AC; Method name; Number of hits. | Now available for all IPI entries
PROSITE | Database name; Method AC; Method name; Number of hits. | Now available for all IPI entries
ReAlSplice | Database name; Gene ID; Gene Name; nothing. |
REFSEQ | Database name plus entry revision status; Entry AC; GI number; 'M' if master. | REFSEQ_UNKNOWN_STATUS is used when no status was found for the corresponding RefSeq entry
RGD | Database name; RGD ID; Gene symbol; nothing. |
RZPD | Database name; Entry ID; Entry Name; keyword. |
S/MARt DB | Database name; Gene ID; Gene Name; nothing. |
Smart | Database name; Method AC; Method name; Number of hits. | Now available for all IPI entries
TAIR Gene | Database name; TAIR locus ID; TAIR gene symbol or gene alias if no symbol or locus ID if no alias; nothing. |
TAIR Protein | Database name; TAIR protein isoform ID; TAIR gene model ID; 'M' if master. |
Transfac | Database name; Entry ID; nothing; nothing. | IDs look like T00001 for factors, G000001 for genes and R00001 for sites
trome | Database name; Entry ID; Entry AC; molecule code. |
UniParc | Database name; UniParc ID; nothing; nothing. |
UniProtKB/Swiss-Prot | Database name; Entry AC; Entry ID; 'M' if master. | The entry AC will be replaced by a specific Isoform Id where, as in this entry, several alternative isoforms are identified within a single entry
UniProtKB/TrEMBL | Database name; Entry AC; Entry ID; 'M' if master. |
UTRdb and UTRsite | Database name; UTR ID; UTR site ID; position. |
Vega | Database name; Peptide ID; Gene ID; 'M' if master. |
ZFIN | Database name; ZFIN ID; Gene symbol; nothing. |
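To make the field layout concrete, here is a small Python sketch that splits DR lines on semicolons and looks for the master cross-reference, i.e. the one with 'M' in its fourth field. The function names and the `MASTER_CAPABLE` set are illustrative assumptions; the set simply lists the databases whose row above says 'M' if master, with RefSeq handled by its `REFSEQ_` prefix.

```python
from typing import Iterable, List, Optional

# Databases whose fourth field is documented as "'M' if master".
MASTER_CAPABLE = {
    "ENSEMBL",
    "ENSEMBL_HAVANA",
    "H-InvDB",
    "TAIR Protein",
    "UniProtKB/Swiss-Prot",
    "UniProtKB/TrEMBL",
    "Vega",
}


def parse_dr_line(line: str) -> List[str]:
    """Split one DR line into its fields, database name first."""
    body = line.strip()
    if not body.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = body[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop
    return [field.strip() for field in body.split(";")]


def is_master(fields: List[str]) -> bool:
    """True if these fields describe the master cross-reference."""
    database = fields[0]
    can_be_master = database in MASTER_CAPABLE or database.startswith("REFSEQ_")
    return can_be_master and len(fields) >= 4 and fields[3] == "M"


def find_master(dr_lines: Iterable[str]) -> Optional[List[str]]:
    """Return the fields of the first master cross-reference, if any."""
    for line in dr_lines:
        fields = parse_dr_line(line)
        if is_master(fields):
            return fields
    return None
```

Applied to the DR lines of the sample entry, `find_master` would return the Vega cross-reference for OTTHUMP00000019482, the only one marked 'M'. Restricting the check to the listed databases avoids mistaking a keyword or molecule code in the fourth field for the master flag.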

SQ lines

The sequence of the IPI record, taken from its master entry
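The sketch below shows one way the sequence block could be read back in Python, assuming the layout of the sample entry: an SQ header giving the length in AA, residue lines grouped in blocks separated by spaces, and a terminating `//` line. The function name `read_sequence` is hypothetical.

```python
import re
from typing import Iterable, Tuple

# Header such as "SQ SEQUENCE 415 AA; 45672 MW; D14E170631FB1F31 CRC64;"
_SQ_HEADER = re.compile(r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;")


def read_sequence(entry_lines: Iterable[str]) -> Tuple[int, str]:
    """Return (declared length, sequence) for the lines of one entry."""
    declared = None
    residues = []
    for raw in entry_lines:
        line = raw.strip()
        if declared is None:
            header = _SQ_HEADER.match(line)
            if header:
                declared = int(header.group(1))
            continue  # skip everything before the SQ header
        if line == "//":
            break  # end of entry
        residues.append("".join(line.split()))  # remove block spacing
    if declared is None:
        raise ValueError("no SQ header found")
    return declared, "".join(residues)
```

Comparing the declared length with `len(sequence)` gives a cheap sanity check that no residue lines were lost.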


Format change notices

From 7 August 2007 onwards, which corresponds to the human 3.32, mouse 3.32, rat 3.32, zebrafish 3.31, arabidopsis 3.30, chicken 3.26 and cow 3.18 releases, CC (comment) lines have been changed to allow more than one location to be given in an entry (for cases where a single protein maps to more than one gene).

e.g.
Before:

                CC -!- CHROMOSOME: 6.
                CC -!- START CO-ORDINATE: 31853274.
                CC -!- END CO-ORDINATE: 31871565.
                CC -!- STRAND: -1.
Now:
                CC   -!- GENE_LOCATION: Chr. 6:31892491-31896096:-1.
                CC   -!- GENE_LOCATION: Chr. 6:31853274-31871565:-1.
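For files from older releases, the four separate comment lines can be folded into the newer single-line form. The following Python sketch shows one possible conversion; the function name `legacy_cc_to_gene_location` is an illustrative assumption, and the input is taken to be the four CC lines of one location.

```python
from typing import Iterable

_LEGACY_KEYS = ("CHROMOSOME", "START CO-ORDINATE", "END CO-ORDINATE", "STRAND")


def legacy_cc_to_gene_location(cc_lines: Iterable[str]) -> str:
    """Combine the four pre-August-2007 CC lines into one GENE_LOCATION line."""
    values = {}
    for raw in cc_lines:
        _, found, payload = raw.partition("-!-")
        if not found:
            continue
        key, found, value = payload.partition(":")
        if not found:
            continue
        key = key.strip()
        if key in _LEGACY_KEYS:
            # Strip surrounding spaces and the terminating full stop.
            values[key] = value.strip().rstrip(".").strip()
    missing = [key for key in _LEGACY_KEYS if key not in values]
    if missing:
        raise ValueError(f"missing legacy CC fields: {missing}")
    return "CC   -!- GENE_LOCATION: Chr. {}:{}-{}:{}.".format(
        values["CHROMOSOME"],
        values["START CO-ORDINATE"],
        values["END CO-ORDINATE"],
        values["STRAND"],
    )
```

Given the four "Before" lines shown above, this produces the second of the two "Now" lines.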



From 22 February 2006 onwards, which corresponds to the human/mouse/rat 3.15, zebrafish 3.14, arabidopsis 3.13, chicken 3.09 and cow 3.01 releases, the REFSEQ_NP and REFSEQ_XP database codes have been replaced by REFSEQ_STATUS, where 'STATUS' represents the RefSeq entry revision status (or UNKNOWN_STATUS if no status is available).
e.g.
Previously:
                            DR REFSEQ_NP; NP_061183; GI:52351208; -.
                            DR REFSEQ_NP; NP_001324; GI:23110960; -.
                            DR REFSEQ_NP; NP_005617; GI:21361282; -.
                            DR REFSEQ_NP; NP_001034191; GI:84993245; -.
                            DR REFSEQ_XP; XP_114618; GI:41148435; -.
                            DR REFSEQ_NP; NP_001025036; GI:71274178; -.
Revised format:
                            DR REFSEQ_VALIDATED; NP_061183; GI:52351208; -.
                            DR REFSEQ_REVIEWED; NP_001324; GI:23110960; -.
                            DR REFSEQ_PROVISIONAL; NP_005617; GI:21361282; -.
                            DR REFSEQ_PREDICTED; NP_001034191; GI:84993245; -.
                            DR REFSEQ_MODEL; XP_114618; GI:41148435; -.
                            DR REFSEQ_INFERRED; NP_001025036; GI:71274178; -.

                            DR   REFSEQ_UNKNOWN_STATUS; AP_000639; GI:58615663; -.
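A parser that has to cope with files from both sides of this change could separate the revision status from the database code roughly as follows. This is a Python sketch only; `refseq_status` and `LEGACY_REFSEQ_CODES` are hypothetical names, and the status is returned exactly as written in the file.

```python
from typing import Optional

# Codes used before the February 2006 change; they carry no revision status.
LEGACY_REFSEQ_CODES = {"REFSEQ_NP", "REFSEQ_XP"}


def refseq_status(database_code: str) -> Optional[str]:
    """Return the revision status encoded in a REFSEQ_<STATUS> code.

    Returns None for legacy codes and for non-RefSeq databases.
    """
    prefix = "REFSEQ_"
    if not database_code.startswith(prefix):
        return None
    if database_code in LEGACY_REFSEQ_CODES:
        return None
    return database_code[len(prefix):]
```

For example, `refseq_status("REFSEQ_REVIEWED")` yields `"REVIEWED"`, while `refseq_status("REFSEQ_NP")` yields `None`.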


From April 2004 onwards, which corresponds to the human 2.31, mouse 1.24 and rat 1.14 releases, secondary IPI numbers have been added (after the current accession number) to the AC lines of the UniProt format files. The entry version number has been moved to the ID line.
e.g.
Before:
                            ID IPI00013881 IPI; PRT; 449 AA.
                            AC IPI00013881.4;
Now:
                            ID IPI00013881.4 IPI; PRT; 449 AA.
                            AC IPI00013881; IPI00155062; IPI00334833;
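The two layouts can be read with the same code if the version number is looked for on both lines. The Python sketch below is one way to do that; `parse_id_ac` and `_split_version` are illustrative names, not part of any IPI software.

```python
from typing import List, Optional, Tuple


def _split_version(token: str) -> Tuple[str, Optional[int]]:
    """Split 'IPI00013881.4' into ('IPI00013881', 4); version may be absent."""
    accession, found, version = token.partition(".")
    return accession, (int(version) if found else None)


def parse_id_ac(id_line: str, ac_line: str) -> Tuple[str, Optional[int], List[str]]:
    """Return (accession, version, secondary accessions) for either layout."""
    # Second token of the ID line is the identifier, with or without version.
    accession, version = _split_version(id_line.split()[1])

    secondary = []
    for token in ac_line.strip()[2:].split(";"):
        token = token.strip()
        if not token:
            continue  # text after the final semicolon is empty
        ac_accession, ac_version = _split_version(token)
        if ac_accession == accession:
            # Old layout: the version sits on the AC line instead.
            if version is None:
                version = ac_version
        else:
            secondary.append(ac_accession)
    return accession, version, secondary
```

With the "Now" lines above this gives the accession IPI00013881, version 4 and two secondary accessions; with the "Before" lines it gives the same accession and version and an empty list.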