
User Manual

To cite InterPro please use:

Sarah Hunter; Philip Jones; Alex Mitchell; Rolf Apweiler; Teresa K. Attwood; Alex Bateman; Thomas Bernard; David Binns; Peer Bork; Sarah Burge; Edouard de Castro; Penny Coggill; Matthew Corbett; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D. Finn; Matthew Fraser; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Conor McMenamin; Huaiyu Mi; Prudence Mutowo-Muellenet; Nicola Mulder; Darren Natale; Christine Orengo; Sebastien Pesseat; Marco Punta; Antony F. Quinn; Catherine Rivoire; Amaia Sangrador-Vegas; Jeremy D. Selengut; Christian J. A. Sigrist; Maxim Scheremetjew; John Tate; Manjulapramila Thimmajanarthanan; Paul D. Thomas; Cathy H. Wu; Corin Yeats; Siew-Yit Yong
InterPro in 2011: new developments in the family and domain prediction database (2011).
Nucleic Acids Research 2011; doi: 10.1093/nar/gkr948

InterPro Funding

Current InterPro Funding:

InterPro is currently funded by grant number 213037 from the European Union under the program "FP7 capacities: Scientific Data Repositories". The working title for the project is IMproving Protein Annotation and Co-ordination using Technology (IMPACT).

InterPro is also funded by grant BB/F010508/1 from the BBSRC Bioinformatics and Biological Resources Fund.

Previous InterPro Funding:

InterPro was funded by the award of grant number QLRI-CT-2000-00517 and in part by grant number QLRI-CT-2001000015 from the European Union under the RTD program "Quality of Life and Management of Living Resources". InterPro was also part of the MRC-funded eFamily project.

What is InterPro?

InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

The member databases use a number of approaches:

  1. ProDom: provides sequence clusters built from UniProtKB using PSI-BLAST.
  2. PROSITE patterns: provide simple regular expressions.
  3. PROSITE and HAMAP profiles: provide sequence matrices.
  4. PRINTS: provides fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
  5. PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: provide hidden Markov models (HMMs).

Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pin-point specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions when gathering family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.

Integration into InterPro

Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references and links to the relevant member database(s). All UniProtKB protein sequences that have matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views.

Content

The content of the current release (statistics, lists of the entries by Type, and matches to member database methods, both integrated and unintegrated) is provided in the Release Notes.

Previous releases are listed in the Documentation.

How is it useful?

Protein signature databases have become vital tools for identifying distant relationships in novel sequences and hence are used for the classification of protein sequences and for inferring their function. InterPro streamlines the analysis of newly determined sequences for the individual user and makes a significant contribution to the demanding task of automatic annotation of predicted proteins from genome sequencing projects.

InterPro provides internal consistency checks and deeper coverage, making it more efficient and reliable than using each of the pattern databases separately. This unified approach improves both the utility and the coverage of pattern databases, pin-pointing weaknesses and facilitating their further development.

Structure of an InterPro Entry

The information fields (or sections) in bold are the core fields, while those in italics are only displayed when applicable; all information field headers are linked to the relevant section in the User Manual. The fields in an entry are: Header: Accession number and Name, UniProtKB match count and match views, Accession: accession number and short name, Secondary accession number, Type, Signatures, Parent, Children, Found in, Contains, Gene Ontology Terms, Abstract, Structural Links, Database Links, IntAct Links, Taxonomy coverage, Overlapping InterPro entries, Example proteins, Publications and Additional Reading.

Entry Header: Accession Number and Name

Every InterPro entry has an accession number of the form IPRXXXXXX, where X is a digit. Accession numbers are stable and therefore allow unambiguous citation of database entries.
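As an illustration, the accession format described above can be checked with a short Python sketch. The function name is_interpro_accession is hypothetical and is not part of any InterPro tool; the pattern simply encodes "IPR followed by six digits" as stated in this section.

```python
import re

# "IPR" followed by exactly six ASCII digits, as described in the manual.
_IPR_PATTERN = re.compile(r"IPR[0-9]{6}")


def is_interpro_accession(text: str) -> bool:
    """Return True if text has the form IPRXXXXXX, where each X is a digit."""
    return _IPR_PATTERN.fullmatch(text) is not None
```

The check is case-sensitive and rejects surrounding whitespace; whether a particular InterPro service is more lenient is not stated here.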

The InterPro entry name describes the InterPro entry and should give an idea of the type of protein matches for that entry.

UniProtKB Matches and Match Views

UniProtKB match counts are calculated for all UniProtKB proteins, which are a combination of information from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and PIR. For more information go to the UniProtKB home page.

Match lists give a number of different views of the signature matches on the sequences in each InterPro entry. These include:

  • Overview or Compact view: sorted by accession, by name, of known structure or proteins with splice variants.
  • Detailed view: sorted by accession, by name, of known structure or proteins with splice variants.
  • Table view: for all matching proteins or those of known structure.
  • Architectures: view of matching proteins with domain structures.
  • Accession list: provides an unpaginated list of all protein accessions for the entry.

Match information includes the protein sequence accession number, the accession number of the signature and the position of the signature on the protein sequence.

Match Status

For all member database methods, with the exception of PROSITE patterns, matches to UniProtKB proteins are considered to be TRUE if the score is above the individual threshold(s) given by the member database and are thus flagged as T and are displayed. Matches falling below the set threshold are not displayed.
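The threshold rule above (for methods other than PROSITE patterns) might be sketched in Python as follows. This is a minimal illustration, not InterPro's implementation: the Match record, the thresholds dictionary and all accessions in the usage are hypothetical, and the sketch assumes a single threshold per signature where a higher score is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record holding the match information described in the manual."""
    protein_accession: str
    signature_accession: str
    start: int
    end: int
    score: float


def displayed_matches(matches, thresholds):
    """Keep only matches scoring above their signature's threshold.

    thresholds maps a signature accession to its cut-off (an assumption:
    one cut-off per signature, higher scores being better).
    """
    return [m for m in matches if m.score > thresholds[m.signature_accession]]
```

A match that scores exactly at the threshold is dropped in this sketch, because the manual says "above" and does not say how ties are treated.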

For PROSITE pattern matches to UniProtKB/Swiss-Prot sequences where the match was of status TRUE, the sequences surrounding the pattern match were used, in addition to the pattern, to construct 'miniprofiles'. If a match to the miniprofile is above its set threshold, the match to the PROSITE pattern is considered to be TRUE and is displayed; otherwise the match is not displayed, irrespective of its manual match status in UniProtKB/Swiss-Prot. For a full description of the PROSITE manual statuses see the UniProt Knowledgebase user manual, and for a full description of the use of miniprofiles in evaluating PROSITE pattern matches see Nucleic Acids Res. 38 (Database Issue):D245-D249.

Protein Match Views

The protein matches for InterPro entries can be viewed in a number of different ways. For each match display there are various options:

  • Select: Select protein(s) by InterPro accession, method accession, architecture code, all or protein accession(s).
  • Refine: Select splice variants or proteins with known structure or both. Select taxonomy by Tax ID or non-redundant proteome.
  • Display: Select the format of the display - compact, detailed, architectures, table, protein, FASTA or UniProtKB accession list.
  • Sort: Select the order of proteins in the display by UniProtKB accession (AC) or by name (ID).

More details on each option are listed below.

Select

  • By InterPro Accession: selects proteins having a match to an InterPro entry. This is restricted to true matches for the compact view.
  • By Signature Accession: this displays all proteins having a match to the supplied member database method accession, whether the method accession is integrated or unintegrated.
  • By InterPro Architecture Code: selects proteins using the Architecture code.
  • All: this displays all protein matches to the current InterPro entry.
  • By UniProtKB Accession(s): displays matches to a supplied UniProtKB accession. For displaying more than one, UniProtKB accessions can be supplied as a comma separated list.

Refine

These options refine the selection.

  • Proteins having Splice Variants: displays splice variants of proteins where they exist. Splice variants associated with a UniProtKB accession that are cross referenced and have a different UniProtKB accession are not displayed. Only available with compact and detailed displays.
  • Proteins of known structure: restricts the selection to proteins of known structure.
  • By Taxonomy - NCBI tax ID(s): restricts the selection of proteins to the taxonomy based on the supplied tax ID.
  • By Non-redundant proteome: restricts the selection of proteins to the non-redundant proteome set based on the supplied Swiss-Prot Organism Species code (OS code): a five uppercase character mnemonic, e.g. HUMAN. A non-redundant proteome only exists for completely sequenced genomes.

Display

  • Compact: shows matches to InterPro Entries.
  • Detailed: shows matches to InterPro member database methods.
  • Architectures: shows InterPro Architectures, count of, example and architecture code.
  • Table: textual representation of matches to methods in the selected entry only.
  • Protein: provides the single protein view of the selection: the architecture (if it exists) and the table of method information, which includes method positions.
  • FASTA: provides a FASTA file of the selection.
  • Accession list: provides a list of all the UniProtKB accession numbers of the selection.

Sort

Proteins are sorted by database category (UniProtKB/Swiss-Prot followed by UniProtKB/TrEMBL) and then alphabetically by UniProtKB accession or ID.
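The sort order described above can be expressed as a two-part sort key. The following Python sketch assumes each protein is a dictionary with hypothetical keys 'database' ('S' for Swiss-Prot, 'T' for TrEMBL), 'accession' and 'id'; neither the function nor the record layout comes from InterPro itself.

```python
def sort_proteins(proteins, by="accession"):
    """Sort Swiss-Prot before TrEMBL, then alphabetically by accession or ID.

    proteins: list of dicts with keys 'database' ('S' or 'T'),
    'accession' and 'id' (a hypothetical layout for this sketch).
    by: 'accession' or 'id'.
    """
    category_rank = {"S": 0, "T": 1}
    return sorted(proteins, key=lambda p: (category_rank[p["database"]], p[by]))
```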

Taxonomy Lineage

A collapsed taxonomy lineage is displayed for the proteins associated with the InterPro entry, sorted alphabetically by taxonomic group and then by species name. The taxonomy lineages are 'clickable': clicking on a particular lineage returns only the protein matches for the selected taxonomy and its underlying phylogeny, with species sorted and displayed alphabetically.

Index

UniProtKB:
S=Swiss-Prot
T=TrEMBL
If there are more than 25 proteins on one page they are split into groups, using the sort order. The index is shown on the left side of the page. Click on each section to view subsets of the selected proteins.
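The grouping described above amounts to cutting the sorted protein list into consecutive pages. A minimal Python sketch, assuming the list has already been sorted and using the page size of 25 quoted in the text (split_into_pages is a hypothetical name):

```python
def split_into_pages(proteins, page_size=25):
    """Split an already sorted list into consecutive groups of page_size items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [proteins[i:i + page_size] for i in range(0, len(proteins), page_size)]
```

For example, 60 proteins would give two full groups of 25 and a final group of 10.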

Display: Compact

Each protein is represented as a scaled horizontal line, the protein match line, along which vertical lines are drawn at 10, 20, 50, 100, 200 or 500 amino acid intervals, depending on the length of the protein. The scale is shown to the left of the match graphics.
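The manual lists the possible intervals but does not state how one is chosen for a given protein length. The Python sketch below is therefore only one plausible rule, stated as an assumption: pick the smallest listed interval that yields no more than a hypothetical maximum of 20 vertical lines, falling back to 500 for very long proteins.

```python
# Intervals listed in the manual, in amino acids.
SCALE_INTERVALS = (10, 20, 50, 100, 200, 500)


def choose_scale_interval(protein_length, max_lines=20):
    """Pick the smallest interval giving at most max_lines vertical lines.

    The max_lines cap is an assumption made for this sketch; the manual
    only says the interval depends on the length of the protein.
    """
    for interval in SCALE_INTERVALS:
        if protein_length // interval <= max_lines:
            return interval
    return SCALE_INTERVALS[-1]


def tick_positions(protein_length, interval):
    """Residue positions at which vertical lines would be drawn."""
    return list(range(interval, protein_length + 1, interval))
```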

Coloured bars are displayed along the protein match line to indicate where in a protein matches were found among the InterPro entries. The bar is coloured according to which InterPro entry matched that region of the protein. If multiple InterPro entries match at the same point on a protein match line, their match line boxes are displayed one above the other; the vertical position of a match line box and its colour have no significance. Hovering the mouse over a coloured bar will show the InterPro entry accession (linked to that entry), the entry name and the position of the match on the protein. In addition to matches to InterPro entries, matches to curated structural data (CATH, SCOP and PDB) and to non-curated predicted structural elements defined by SWISS-MODEL and MODBASE are also displayed. The matches to these structural models have fixed colours with white striped lines.

The key near the bottom of the page identifies which colours correspond to which InterPro entries or structural features.

Matches to unintegrated methods are not displayed.

Display: Detailed

Each protein is represented as a scaled horizontal line, the protein match line, along which vertical lines are drawn at 10, 20, 50, 100, 200 or 500 amino acid intervals, depending on the length of the protein. The scale is shown to the left of the match graphics.

For single protein views, the table of match positions and the InterPro Architecture (coloured lozenges) are also displayed.

Each member database signature with a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will show the method database accession number and the residues corresponding to the position of the match on the protein. Where a signature matches a protein more than once and the matches overlap, a single pop-up box provides details of the overlapping matches, including the residues corresponding to the position of each match on the protein. The mid-point of each overlap is indicated by a notch on the match bar.
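The notch position described above can be illustrated by computing the mid-point of each overlap between neighbouring matches. This Python sketch assumes matches are given as inclusive (start, end) residue positions for one signature on one protein and only compares matches that are adjacent after sorting; how the viewer itself handles more complex overlaps is not stated in this manual.

```python
def overlap_midpoints(matches):
    """Return the mid-point of each overlap between neighbouring matches.

    matches: list of (start, end) tuples with inclusive residue positions.
    Only matches adjacent in sorted order are compared (an assumption).
    """
    ordered = sorted(matches)
    midpoints = []
    for (start1, end1), (start2, end2) in zip(ordered, ordered[1:]):
        overlap_start = max(start1, start2)
        overlap_end = min(end1, end2)
        if overlap_start <= overlap_end:
            midpoints.append((overlap_start + overlap_end) / 2)
    return midpoints
```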

The member database accession number is linked to the member database summary page for that signature and the name of the signature is given in the far right column. Each member database signature is identified by a specific colour:

  • Gene3D - purple
  • HAMAP - aqua
  • PANTHER - brown
  • Pfam-A - dark blue
  • Pfam-B - blue
  • PIRSF - pink
  • PRINTS - green
  • ProDom - light blue
  • PROSITE patterns - yellow
  • PROSITE profiles - orange
  • SMART - red
  • SUPERFAMILY - black
  • TIGRFAMs - teal

The key table near the bottom of the page identifies which colours correspond to which member database method or structural feature.



Unintegrated Signatures and Signature Pages

Each member database method not integrated into InterPro that has a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will show the method database accession number and the residues corresponding to the position of the match on the protein.

The accession number is linked to the relevant page of the member database. 'Unintegrated' is linked to the Unintegrated Signature Page. The Unintegrated Signature Page provides the signature accession number and its name, which defaults to the accession number in the absence of a name and the number of proteins the signature matches, with a link to the detailed view of protein matches.

Signatures that are under review are withdrawn from InterPro. Where the removal of the signature would lead to the loss of the InterPro entry, the entry is flagged 'Not for release' and although the entry accession number is provided in the signature lists, no link to the entry page is available. The protein matches to these signatures are not currently computed and a 0 value is returned and displayed on the signature pages.

Structural Features

Structural information is presented for those proteins with a structure in the PDB and for those whose structure is predicted from automated homology-modelling. It is represented on sequences as coloured white striped bars:

  • SCOP, black and white
  • CATH, purple and white
  • PDB, green and white
  • SWISS-MODEL, red and white
  • MODBASE, yellow and white

In the Detailed Graphical View the structural matches are listed at the bottom of the view. Entries are sorted by UniProtKB accession number. The UniProtKB accession number links to the UniProtKB entry record. The GO! link returns the associated GO terms for the protein based on mappings to GO via all sources (see QuickGo). The 'Structure' link returns the PDBe protein entry page for that PDB entry. Links are provided to the SCOP and CATH classification hierarchies, and the domain ID is a link to the domain itself (this ID contains the PDB identifier). For proteins with a predicted structure feature based on either SWISS-MODEL or MODBASE, a link to the SWISS-MODEL Repository or MODBASE is provided through the protein accession number.

WARNING: SWISS-MODEL and MODBASE models are theoretically calculated structures, not experimentally determined structures. Therefore the models may contain significant errors.

For those proteins with a known structure, clicking on the Astex icon in the second column loads the AstexViewer(tm) Java applet page displaying the PDB structure, with the residues included in the CATH or SCOP domain definition highlighted on the PDB chain. The view can be rotated by holding down the left mouse button and moving the cursor over the image; clicking the right mouse button over the image opens a menu to perform various functions, such as adding a ligand to the view. This viewer requires your browser to be Java-enabled. The software should run on most operating systems and in most Internet browsers (so far we are aware that the viewer does not work on Internet Explorer 4). The license agreement must be read and accepted before using the viewer. Should any problems occur with the display, please contact us at: EBI Support.

Display: Table

The Table view displays InterPro methods of this entry, and structural features. The matches for proteins (listed down the left hand side of the table, sorted by UniProtKB accession number) are shown for the match methods (listed along the top of the table). The numbers represent positional coordinates of the match on the protein sequence. Matches can be filtered to all proteins or only proteins with known structure in the bulleted list. Both views are paginated at 25 proteins per page.

Display: Proteins

Key to symbols which may appear in the protein label on the left hand side of the match display for Compact, Detailed and Table views.

UniProtKB: Links to UniProtKB record for that protein.
Accession: Proteins are linked (click on the accession number) to the single-protein detailed view.
Scale: The scale indicates how many amino-acids per vertical line in the match line.
ID: The UniProtKB protein ID is shown.
Structure links to the known protein structure when available.
Fragment indicates that a protein is a fragment*.
Variants link to the protein splice variants when existing.
ADAN displays information on protein-protein interactions of modular domains when existing.
Dasty2 is a protein DAS client, that displays protein annotated features and sequence feature information from other servers when existing.
SPICE displays annotations from PDB, UniProtKB and Ensembl Peptides using the DAS protocol when existing.
GO! links to GO annotation for the protein.
Species name links to detailed view for all proteins for that species; the protein count is provided in the Taxonomy Lineage display.

* UniProtKB fragments with FT NON_CONS and FT NON_TER features.

  • FT NON_TER: The residue at an extremity of the sequence is not the terminal residue. If applied to position 1, this signifies that the first position is not the N-terminus of the complete molecule. If applied to the last position, it means that this position is not the C-terminus of the complete molecule. There is no description field for this key. Examples of NON_TER key feature lines:
    FT NON_TER 1 1
    FT NON_TER 29 29
  • FT NON_CONS: Non-consecutive residues. Indicates that two residues in a sequence are not consecutive and that there are a number of unreported or missing residues between them. Example of a NON_CONS key feature line:
    FT NON_CONS 1683 1684

NON_CONS fragments are not indicated as non-consecutive in InterPro. Because the residues either side of the break are not truly consecutive, a match to a method may be incorrect if the method spans the 'break'.
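The feature lines shown above can be read with a few lines of Python, and the result used to flag matches that span a NON_CONS break. This is a sketch under stated assumptions: it expects whitespace-separated lines exactly as in the examples in this section, and both function names are hypothetical rather than part of InterPro or UniProtKB tooling.

```python
def parse_fragment_features(lines):
    """Extract (key, start, end) tuples from FT NON_TER / FT NON_CONS lines.

    Assumes whitespace-separated fields as in the manual's examples,
    e.g. "FT NON_CONS 1683 1684". Other lines are ignored.
    """
    features = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4 and fields[0] == "FT" and fields[1] in ("NON_TER", "NON_CONS"):
            features.append((fields[1], int(fields[2]), int(fields[3])))
    return features


def match_spans_break(match_start, match_end, features):
    """Return True if a match covers both residues flanking a NON_CONS break."""
    return any(
        key == "NON_CONS" and match_start <= left and right <= match_end
        for key, left, right in features
    )
```

A match flagged by match_spans_break is one whose reported position may be unreliable for the reason given above.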

Display: Architectures

The InterPro Domain Architecture (IDA) Viewer is a graphical representation of protein domain architecture, where the domain architecture of a protein sequence is displayed as a series of non-overlapping domains. The domain architecture is derived from member database signatures that do not overlap on a protein sequence. Domains are depicted as coloured lozenges; the lozenge name derives from the InterPro entry short name and, if the domain is repeated, the multiplication factor follows the short name e.g. - x2. In the example below there is one PAN domain, four Kringle domains followed by one Peptidase S1/S6 domain. In some instances the architectural domains can be nested, one inside the other, for example IDA1254 and IDA719 (1254[719]).

ida1 diagram




Entries are grouped by domain architecture. For each group, a count of the number of UniProtKB proteins with the same architecture, an example protein and the IDA code are given in the left-hand column. The example is linked to the single-protein detailed view. The IDA code is linked to the compact view for all entries with this domain architecture. There is no relationship between the length of the architecture display and the actual protein length; neither is there any relationship between the length of the lozenge and the length of the member database signatures that they depict.

The key table near the bottom of the page identifies which architecture lozenge colours correspond to which InterPro entry.
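The way repeated domains are collapsed into a single lozenge with a multiplication factor can be sketched in Python. The function below is hypothetical and assumes the input is the ordered list of InterPro short names along the protein; the exact label text used by the viewer is not specified here, so the "name xN" format is an assumption.

```python
from itertools import groupby


def architecture_labels(domains):
    """Collapse consecutive repeats of a domain into one label with a factor.

    domains: ordered list of short names along the protein.
    The "name xN" label format is an assumption for this sketch.
    """
    labels = []
    for name, run in groupby(domains):
        count = sum(1 for _ in run)
        labels.append(name if count == 1 else f"{name} x{count}")
    return labels
```

Applied to the example in this section (one PAN domain, four Kringle domains and one Peptidase S1/S6 domain), the sketch yields three labels, with the Kringle label carrying the factor 4. Nested architectures are not handled.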

Accession Number and Short Name

Every InterPro entry has an accession number of the form IPRXXXXXX, where X is a digit. Accession numbers are stable and therefore allow unambiguous citation of database entries.

The short name is a short, concise name unique to each InterPro entry.

Secondary

Accession numbers provide a stable way of identifying InterPro entries from release to release. When the signatures in an InterPro entry are split or merged to give new or modified entries, then the accession number of the original InterPro entry becomes the secondary accession number in the new or modified InterPro entry.

In a recent change, accession numbers are now linked to methods, so any accession number that has been associated with a method will become a secondary accession number in the entry in which the method currently appears. In this way it will be possible to trace the movement of methods through the splitting and merging of entries.
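The bookkeeping for a merge might look like the following Python sketch. The Entry class, the merge_into function and the accessions in any usage are hypothetical illustrations of the rule described above, not InterPro's data model; splitting is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical, simplified InterPro entry for this sketch."""
    accession: str
    signatures: set = field(default_factory=set)
    secondary_accessions: set = field(default_factory=set)


def merge_into(target, absorbed):
    """Merge absorbed into target.

    The absorbed entry's accession, and any secondary accessions it already
    carried, become secondary accessions of the target entry.
    """
    target.signatures |= absorbed.signatures
    target.secondary_accessions |= {absorbed.accession} | absorbed.secondary_accessions
    return target
```

Carrying over the absorbed entry's own secondary accessions is what allows a signature's history to be traced through several successive merges.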

Type

Type defines the entry as a Family, Domain, Region, Repeat or Site. Sites are sub-classified into either Conserved Sites (includes Motifs), Active Sites, Binding Sites or PTMs (Post-translational Modifications).

InterPro has introduced a new Type, Region, and new entry classification rules that affect the typing of entries:

  • Entries typed Repeat or Site remain the same.
  • Entries typed Family or Domain follow stricter criteria to ensure they conform more closely to current biological concepts:
    • Entries typed Family contain signatures that cover all domains in the matching proteins and span >80% of the protein length with no adjacent signatures of type Domain or Region.
    • Entries typed Domain identify biological units with defined boundaries, which includes structural and functional domains as well as defined sub-domains.
    • Region is a new InterPro signature type. It defines signatures that cannot be typed as either Family or Domain.
  • New relationship rules have been introduced that affect how different entries are related to one another. Parent/Child and Contains/Found in relationships will continue within InterPro with their existing definitions, but the following changes have been introduced:
    • Entry types are no longer taken into consideration for relationships between entries. Instead, only the sequence covered by the signatures of an entry will be taken into consideration when forming relationships.
    • Parent/Child relationships are permitted between entries of different types: Families, Domains and Regions.
    • Parent/Child relationships are permitted for Repeats and for Sites but NOT to other entry types.
    • All Contains/Found In relationships are displayed in the Relationships section of an entry (previously, only the most specific relationships were displayed).
    • InterPro entries that contain signatures defining Repeats or Sites as their only signature(s) can only be Found In other InterPro entries.
    • Repeats and Sites do NOT affect the typing or relationships of other InterPro entries.
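The length criterion for Family entries in the list above (signatures spanning more than 80% of the protein length) can be illustrated with a small Python check. The function is hypothetical and covers only the length criterion; it does not test whether all domains are covered or whether adjacent Domain or Region signatures exist.

```python
def spans_family_length(signature_start, signature_end, protein_length, min_fraction=0.8):
    """Return True if the signature spans more than min_fraction of the protein.

    Positions are inclusive residue numbers. Only the ">80% of the protein
    length" criterion is checked; the other Family rules are not.
    """
    if protein_length < 1:
        raise ValueError("protein_length must be positive")
    covered = signature_end - signature_start + 1
    return covered / protein_length > min_fraction
```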

Features of Sequences: Repeats and Sites

InterPro Repeat

A repeat is a region that is not expected to fold into a globular domain on its own. For example 6-8 copies of the WD40 repeat are needed to form a single globular domain. There are also many other short repeat motifs that probably do not form a globular fold.

InterPro Sites:

  • A Conserved Site, otherwise known as 'Motif', is any short sequence pattern that may contain one or more unique residues and cannot be defined as an Active Site, Binding Site or Post-translational Modification (PTM).
  • Active sites are best known as the catalytic pockets of enzymes where a substrate is bound and converted to a product, which is then released. Distant parts of a protein's primary structure may be involved in the formation of the catalytic pocket. Therefore, to describe an active site, one or more signatures will be needed to cover all the active site residues. To be classed as an Active Site the amino acids involved in the reaction must be described and mutational inactivation studies reported.
  • Binding sites bind chemical compounds, which themselves are not substrates for a reaction. The compound, which is bound, may be a required co-factor for a chemical reaction, be involved in electron transport or be involved in protein structure modification. The binding is reversible and to be classed as a Binding Site the amino acids involved in the reaction must be described and mutational inactivation studies reported.
  • A Post-translational Modification modifies the primary protein structure. This modification may be necessary for activation or de-activation of function. Examples include glycosylation, phosphorylation, sulphation and splicing etc. The process of modification may be permanent or reversible and the process may be required for functional activation or deactivation. To be recognised in InterPro the sequence signature must be described.

Lists of entry types:

Family; Domain; Region; Repeat; Conserved Site; Active Site; Binding Site; PTMs.

Signatures

The Signatures field lists the protein signature matches. For each protein signature the Member database, the signature ID, signature name and number of proteins it matches are given. The member database names are linked to their respective home page and the signature IDs are linked to the corresponding entry information page.

Gene Ontology Assignments

Functional classification of the entry is given by listing associated GO terms. The Gene Ontology project (GO) http://www.geneontology.org/ is a dynamic controlled vocabulary defined in three ontologies: molecular function, biological process and cellular component.

  • Molecular function is the action characteristic of a gene product.
  • Biological process describes a phenomenon marked by changes that lead to a particular result, mediated by one or more gene products.
  • Cellular component is the part of a cell of which a gene product is a component; GO includes the extracellular environment of cells; a gene product may be a component of one or more parts of a cell.

For each associated term the name of the term and GO accession number is given. The assignment of GO terms to InterPro entries was done manually by reading the abstract of the entries and the annotation of proteins in the protein match table for each entry. An appropriate GO term for an entry is one that applies to the whole protein. The GO terms associated with an InterPro entry apply to all proteins with true hits to the signatures in that entry. The assignments are incomplete and are ongoing due to the dynamic nature of the GO project. Some entries could be mapped to very low level (specific) GO terms, while entries describing wider families or common domains were mapped to higher level terms or could not be mapped at all. The GO terms and mappings can be found using the EBI QuickGo browser.

It is important to remember that these mappings provide useful predictions of GO assignments to the corresponding proteins; however, biological exceptions such as inactivated enzymes may occur. GO annotation is an ongoing process as new entries are incorporated and old entries are updated.

Abstract

The Abstract describes the signatures in the entry, the protein matches and provides references. Where possible a functional inference is made.

Database Links

Database links include cross-references to:

  • The BLOCKS database; it contains multiple alignments of conserved regions in protein families
  • The IntEnz database; EC numbers, systematic and common name, synonyms, function and links to other databases e.g. BRENDA, EXPASY, and KEGG. The link to IntEnz is provided when >80% of the Swiss-Prot records map to a common EC class (the common stem is reported) or to a specific EC number. TrEMBL records are not considered, unlike PRIAM mapping (see below).
  • The PRIAM database; a provider of enzyme-specific profiles for metabolic pathway prediction. A link is created when a PRIAM profile matches >80% of the proteins in the InterPro entry.
  • PROSITE documents; e.g. PDOC00020
  • The Carbohydrate-Active EnZymes database; CAZy describes families of related catalytic and carbohydrate-binding modules of enzymes that act on glycosidic bonds
  • The IUPHAR Receptor Database, the International Union of Pharmacology database of GPCR receptors.
  • COMe database, a bioinorganic motif database
  • MEROPS, a database of peptidases and peptidase inhibitors
  • PANDIT, a database of multiple sequence alignments and phylogenetic trees based on Pfam signatures.
  • PDBeMotif, an integrated resource that provides information about ligands, sequence and structure motifs, their relative position and the neighbour environment. The details are derived from the PDB together with a mapping to other motif and active-site sources. Search criteria combine sequence motifs, structure motifs, protein sequence, 3D properties, secondary structure elements, 3D associations of small motifs, protein side-chain and main-chain bonds and protein-ligand interactions.
  • Pfam Clan, provides a pop-up display of the Pfam clan accession, clan name, clan description and a list of all the Pfam clan members with the corresponding InterPro entry and InterPro name. The Pfam clan accession is linked to the Pfam clan page and the InterPro entries link to the respective InterPro entry. These InterPro entries will not necessarily be related to each other through PARENT/CHILD or CONTAINS/FOUND IN relationships.
  • Genome Properties, from TIGRFAMs. The GenProp IDs, when referenced, link to the Genome Properties system, which consists of a suite of Properties. These properties are carefully defined attributes of prokaryotic organisms whose status can be described by numerical values or controlled vocabulary terms for completely sequenced genomes. The Genome Properties system has been designed to capture the widest possible range of attributes and currently encompasses taxonomic terms, genometric calculations, metabolic pathways, systems of interacting macromolecular components, and quantitative and descriptive experimental observations (phenotypes) from the literature.
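
The >80% rule described for the IntEnz link in the list above can be illustrated with a short sketch. This is not the InterPro implementation: the function name, the input format (a list of fully specified four-level EC numbers, one per Swiss-Prot record) and the search from the most specific level to the least specific are all assumptions made for illustration.

```python
from collections import Counter

def common_ec_stem(ec_numbers, threshold=0.8):
    """Return the most specific EC stem shared by more than `threshold`
    of the records, or None if no stem qualifies.

    Assumption: every item is a fully specified EC number such as "1.1.1.1".
    """
    total = len(ec_numbers)
    if total == 0:
        return None
    # Try the full EC number first, then progressively shorter stems.
    for depth in (4, 3, 2, 1):
        prefixes = Counter(tuple(ec.split(".")[:depth]) for ec in ec_numbers)
        stem, count = prefixes.most_common(1)[0]
        if count / total > threshold:
            # Pad the missing levels with "-" to report an EC class.
            return ".".join(stem + ("-",) * (4 - depth))
    return None

# Hypothetical records: 9 of 10 share the stem 1.1.1, but no single EC number dominates.
records = ["1.1.1.1"] * 5 + ["1.1.1.2"] * 4 + ["2.7.1.1"]
print(common_ec_stem(records))
```

With these hypothetical records no specific EC number exceeds the threshold, so the sketch falls back to the common stem, mirroring the "common stem is reported" behaviour described above.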

Interactions

Links from InterPro to IntAct are provided at the level of individual UniProtKB accessions, and only for manually curated protein:protein interactions. A limited set of 20 examples is provided. Each example links through to the IntAct page that describes the interaction: the UniProtKB accession numbers of the interacting proteins, the binding domains, experimental features and the InterPro entries of the interacting proteins.

Taxonomy Coverage

The Taxonomy Coverage aims to provide an 'at a glance' view of the taxonomic range of the sequences associated with each InterPro entry and the number of sequences associated with each lineage. The taxonomic lineages are 'clickable' and provide a pop-up, which displays the tax-ID, the taxonomy and taxonomic subgroup(s)/species having matches to proteins, the protein match counts and a FASTA link. Clicking on the taxonomy or taxonomic subgroup(s)/species links to the protein overview matches for the selected taxonomy. Clicking on the FASTA box will download the complete set of FASTA sequences for the selected taxonomy of the entry.

The lineages were carefully selected to provide a view of the major groups of organisms. The circular display has the taxonomy-tree root as its centre. Selected model organisms populate the outermost circle. Nodes of the taxonomy-tree are placed on the inner circles. Radial lines lead to the description for each node. No significance is attached to the position of a node on a particular inner circle, other than convenience, though some attempt has been made to group nodes. The nodes themselves are either true taxonomy nodes with an NCBI taxonomy number or one of three artificial nodes created for this display: Unclassified, Other Eukaryotes (Non-Metazoa) and the Plastid Group.

Artificial Taxon: Unclassified contains the following NCBI taxon groups:

  • Taxonomy:12884 Viroids
  • Taxonomy:12908 unclassified sequences
  • Taxonomy:28384 other sequences

The Eukaryota (TAXONOMY:2759) comprises 29 taxa, which have been grouped into two artificial taxa and one existing taxon:

Fungi/Metazoa (TAXONOMY:33154); Node Metazoa

Artificial Taxon: Plastid Group contains the following NCBI taxon groups:

  • Taxonomy: 2763 Rhodophyta
  • Taxonomy: 2830 Haptophyceae
  • Taxonomy: 3027 Cryptophyta
  • Taxonomy: 33090 Viridiplantae
  • Taxonomy: 33630 Alveolata
  • Taxonomy: 33634 stramenopiles
  • Taxonomy: 33682 Euglenozoa
  • Taxonomy: 38254 Glaucocystophyceae
  • Taxonomy: 339960 Katablepharidophyta

Each taxonomic group within this artificial taxon contains organisms that have a plastid.

Artificial Taxon: Other Eukaryotes (Non-Metazoa) comprises the following NCBI taxon groups:

  • Taxonomy: 5719 Parabasalidea
  • Taxonomy: 5752 Heterolobosea
  • Taxonomy: 66288 Oxymonadida
  • Taxonomy: 136087 Malawimonadidae
  • Taxonomy: 154966 Nucleariidae
  • Taxonomy: 193537 Centroheliozoa
  • Taxonomy: 207245 Diplomonadida group
  • Taxonomy: 543769 Rhizaria
  • Taxonomy: 554296 Apusozoa
  • Taxonomy: 554915 Amoebozoa
  • Taxonomy: 556282 Jakobida

The taxonomic groups within this artificial taxon are the remaining groups of NCBI taxon 2759 that are neither in the Plastid Group nor in Fungi/Metazoa (TAXONOMY:33154).
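
The grouping above amounts to a lookup from NCBI taxonomy IDs to display nodes. The following sketch uses the taxon IDs listed in this section. The function name and the idea of passing in a lineage as a list of tax-IDs ordered from root to leaf are assumptions for illustration, not a description of how the InterPro display is built.

```python
# Display nodes and the NCBI taxonomy IDs assigned to them in this section.
DISPLAY_NODES = {
    "Unclassified": {12884, 12908, 28384},
    "Plastid Group": {2763, 2830, 3027, 33090, 33630, 33634, 33682, 38254, 339960},
    "Other Eukaryotes (Non-Metazoa)": {
        5719, 5752, 66288, 136087, 154966, 193537,
        207245, 543769, 554296, 554915, 556282,
    },
    # Existing NCBI taxon, kept as a true taxonomy node.
    "Fungi/Metazoa": {33154},
}

def display_node(lineage):
    """Return the display node for a lineage (tax-IDs from root to leaf),
    or None if no tax-ID in the lineage belongs to one of the groups above."""
    for tax_id in lineage:
        for node, members in DISPLAY_NODES.items():
            if tax_id in members:
                return node
    return None

# Hypothetical lineage passing through Viridiplantae (33090).
print(display_node([1, 131567, 2759, 33090, 3702]))
```

Lineages that fall outside these groups, for example bacterial ones, return None in this sketch, since the section does not list the remaining display nodes.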

Overlapping InterPro Entries

This section displays entries that share more than 70% of their proteins. Such overlaps define PARENT/CHILD and CONTAINS/FOUND IN relationships between InterPro entries.

[Related diagram for IPR009007: numbers of overlapping proteins (left); average numbers of overlapping amino acids (right)]


In the above example, InterPro entry IPR011969 contains proteins which are also found in IPR009007 as a result of the protein signatures of the two entries overlapping.

The two entries have been compared firstly by counting the number of proteins which are common to both, the results of which are displayed in the Venn diagram on the left, and secondly by calculating the average overlap of the protein signatures, in amino acids, with the results displayed in the bar diagram on the right.

Venn diagram display of the overlap of proteins common to both entries:

  • The purple intersection contains the number of overlapping proteins common to both IPR009007 and IPR011969, which is 31 in this case.
  • The pink section on the left is the number of proteins found in IPR009007 but not IPR011969, which is 35378.
  • The blue section on the right is the number of proteins found in IPR011969 but not IPR009007, which is 0; i.e. all proteins associated with IPR011969 occur in IPR009007.

Bar diagram display of the average amino acid overlap between the protein signatures:

The average number of amino acids overlapping in the sequences of the 31 proteins common to both entries is then calculated, with the results displayed in the bar diagram on the right. The bar diagram display is only shown for 'Domain - Domain' relationships.

  • The purple segment in the middle shows the average number of amino acids overlapping between IPR009007 and IPR011969 for the 31 proteins, in this case 104.
  • The pink segment shows the average number of amino acids found in IPR009007, but not IPR011969, for the 31 proteins, which is 0.
  • The blue segment shows the average number of amino acids found in IPR011969, but not IPR009007, for the 31 proteins, which is 15.

The results of these comparisons are used to calculate the percentage overlap score, with all scores greater than 70% displayed on the InterPro pages. In this example, since all proteins found in IPR011969 are also found in IPR009007, and all the amino acids from IPR009007 overlap with those from IPR011969, the percentage overlap score is 100%.
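
The three numbers in the Venn diagram follow directly from set operations on the protein accessions of the two entries. The sketch below reproduces the counts of the example with synthetic accessions. The exact formula of the percentage overlap score is not given here, so the score function is an assumption: it expresses the shared proteins as a percentage of the smaller entry, which yields 100% for the example, and it ignores the amino acid component.

```python
def venn_counts(entry_a, entry_b):
    """Return (only in A, in both, only in B) for two collections of accessions."""
    a, b = set(entry_a), set(entry_b)
    return len(a - b), len(a & b), len(b - a)

def protein_overlap_percent(entry_a, entry_b):
    """Shared proteins as a percentage of the smaller entry (assumed definition)."""
    a, b = set(entry_a), set(entry_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(a & b) / smaller

# Synthetic accessions standing in for the proteins of the two entries.
ipr011969 = {f"SHARED{i}" for i in range(31)}
ipr009007 = ipr011969 | {f"ONLY{i}" for i in range(35378)}

print(venn_counts(ipr009007, ipr011969))              # (35378, 31, 0)
print(protein_overlap_percent(ipr009007, ipr011969))  # 100.0
```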

Examples

The protein entries in the examples illustrate as far as possible the diversity in taxonomy, structure and function of the proteins in the InterPro entry. For each example protein the accession number, UniProtKB name and a compact view of the matches are given.

Publications

The Publications field provides a list of references associated with each InterPro entry abstract. The list is originally derived from the reference lists of the member databases. Manual curation of abstracts often adds additional references.

PubMed Identifiers in the Publications and Additional Reading fields (below) now link to CiteXplore. The EBI's CiteXplore combines literature searching with text mining tools for biology. Search results are cross referenced to EBI applications based on publication identifiers. Links to full text versions are provided where available.

Additional Reading

The Additional Reading field provides a list of publications derived from the references provided by the member databases for the methods associated with each InterPro entry which are not referenced in the abstract. Additionally, a maximum of 5 references per entry are taken from the PDB where one or more of the proteins in the entry has had its structure determined.

Conventions Used in the Databank

General structure

InterPro is managed within a relational database system. Lists of protein matches for member database signatures are stored in the InterPro Oracle tables.

The publicly available subset of InterPro Oracle tables is prepared in two formats: Oracle and MySQL. In addition, two tables from the GO relational database (INTERPRO2GO and TERMS) are included in the export files to provide functional compatibility with the InterPro web server software. The files can be downloaded from the InterPro ftp site:
ftp://ftp.ebi.ac.uk/pub/databases/interpro/database.
Oracle version 9.2.0.8.0 was used to prepare the export files, and the parameter file import.par for importing the data files is available from the ftp site.

InterPro information is publicly released as ASCII (text) flat files, written in XML: interpro.xml, match_complete.xml, feature.xml, uniparc_match.tar.gz and unimes_match.tar.gz. Match_complete.xml contains the matches for all UniProtKB sequences, including matches to member database methods that have not yet been integrated into InterPro. Due to the large size of UniParc and UniMES, their data has been divided into chunks, and the latest updates are provided in these files at each InterPro release.
Feature.xml provides curated SCOP, CATH and PDB matches to UniProtKB sequences in InterPro.

The database excluding matches is dumped to disk. From this stage on, the information flow is strictly one way, so any modifications we make beyond this point are not reflected in Oracle.

The XML files excluding matches will validate against the schema. For those using parsers that do not support schema-based validation, a derived DTD is included as well. The schema diagram is available in Adobe PDF format from the documentation page.

The XML files may be downloaded from the EBI anonymous-ftp server:

ftp://ftp.ebi.ac.uk/pub/databases/interpro.
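
Because the XML files can be large, a streaming parser is usually more practical than loading a whole file into memory. The sketch below reads entries from an interpro.xml-style document with the Python standard library. The element and attribute names (an `interpro` element with `id` and `type` attributes and a `name` child) are assumptions that should be checked against the schema or DTD shipped with the release, and the sample document, including the entry IPR999999, is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_entries(source):
    """Yield (accession, type, name) for each entry in an interpro.xml-style file.

    `source` is a file name or a binary file object. Element names are assumed.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "interpro":
            yield elem.get("id"), elem.get("type"), elem.findtext("name")
            elem.clear()  # release the children of entries already processed

# Made-up sample standing in for a downloaded file.
SAMPLE = b"""<interprodb>
  <interpro id="IPR000001" type="Domain"><name>Kringle</name></interpro>
  <interpro id="IPR999999" type="Family"><name>Example family</name></interpro>
</interprodb>"""

entries = list(iter_entries(io.BytesIO(SAMPLE)))
for accession, entry_type, name in entries:
    print(accession, entry_type, name)
```

To process a real release, pass the path of the downloaded and decompressed file to `iter_entries` instead of the in-memory sample.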

InterPro is accessible for interactive use via the EBI Web server:

http://www.ebi.ac.uk/interpro

An Oracle distribution of the InterPro database is now available from the ftp site. All the necessary files and information can be found in the oracle directory.

Update Procedures

Member databases are released monthly or quarterly, while UniProtKB is updated every three weeks. As a result, the matches provided by the member databases may be outdated and incomplete since they would have used older versions of UniProtKB. Matches for new and changed protein sequences are calculated against member databases after each release of UniProtKB. Changed protein sequences are recognised by a change of the CRC64 (cyclic redundancy checksum), which provides a unique identifier for a given protein sequence.
Major updates to InterPro occur when member databases have new releases. In this case new methods are integrated into InterPro and run over all proteins in UniProtKB.
Annotation updates occur regularly and are ongoing. All changes in InterPro are made on the production database and become publicly visible within a maximum of 3 months. A new, updated XML file is produced at the same time to keep the InterPro data in SRS, which is based on the XML file, in sync with the database.
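
The way a checksum flags changed sequences can be sketched as follows. The text above only states that a CRC64 is used. The polynomial chosen here (the ISO 3309 variant, bit-reflected, with a zero initial value) is an assumption, as are the function names, so the values produced by this sketch should be compared with checksums from real records before relying on them.

```python
# Assumed parameters: reflected ISO 3309 polynomial, initial value 0, no final XOR.
_POLY = 0xD800000000000000

def _build_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _build_table()

def crc64(sequence):
    """Return the CRC64 of an ASCII protein sequence as an integer."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = (crc >> 8) ^ _TABLE[(crc ^ byte) & 0xFF]
    return crc

def sequence_changed(old_sequence, new_sequence):
    """Flag a sequence as changed when its checksum differs."""
    return crc64(old_sequence) != crc64(new_sequence)

print(f"{crc64('MKVLAAGIVG'):016X}")              # checksum of a made-up sequence
print(sequence_changed("MKVLAAGIVG", "MKVLAAGIVA"))  # True: one residue differs
```

A differing checksum proves that the sequence changed. An identical checksum is strong but not absolute evidence that it did not, since distinct sequences can in principle share a 64-bit checksum.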

InterPro cross-references in UniProtKB are shown in the DR lines of the UniProtKB flat-file entries. These are updated regularly in UniProtKB, so there may be a lag between the update of matches in InterPro and those in the UniProtKB flat files.
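
As an illustration, the InterPro cross-references can be pulled out of a flat-file entry with a regular expression. The sketch assumes DR lines of the form "DR   InterPro; IPR000001; Kringle." and uses a made-up entry fragment. The function name is hypothetical.

```python
import re

# Assumed line layout: "DR   InterPro; <accession>; <entry name>."
_DR_INTERPRO = re.compile(r"^DR\s+InterPro;\s*(IPR\d{6});\s*(.*?)\.?\s*$")

def interpro_xrefs(flat_file_text):
    """Return (accession, name) pairs from the InterPro DR lines of an entry."""
    xrefs = []
    for line in flat_file_text.splitlines():
        match = _DR_INTERPRO.match(line)
        if match:
            xrefs.append((match.group(1), match.group(2)))
    return xrefs

# Made-up fragment of a flat-file entry.
SAMPLE_ENTRY = """\
DR   InterPro; IPR000001; Kringle.
DR   Pfam; PF00051; Kringle; 1.
DR   InterPro; IPR013806; Kringle-like.
"""

print(interpro_xrefs(SAMPLE_ENTRY))
```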

Appendix - PROSITE

PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. Profiles and patterns are constructed from manually edited seed alignments. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

Please cite:-

Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de Castro E, Lachaize C, Langendijk-Genevaux PS, Sigrist CJA. (2008)
The 20 years of PROSITE.
Nucleic Acids Res. 36, 245-249.

For more information see:- http://www.expasy.ch/prosite

Appendix - HAMAP

Members of HAMAP families are identified using PROSITE profile collections. HAMAP profiles are manually created by expert curators and identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. The aim of HAMAP is to propagate manually generated annotation to all members of a given protein family in an automated and controlled way using very strict criteria.

Please cite:-

Tania Lima, Andrea H. Auchincloss, Elisabeth Coudert, Guillaume Keller, Karine Michoud, Catherine Rivoire, Virginie Bulliard, Edouard de Castro, Corinne Lachaize, Delphine Baratin, Isabelle Phan, Lydie Bougueleret and Amos Bairoch (2009)
HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot
Nucleic Acids Res. 37, Database issue D471-D478

Appendix - Pfam

Pfam is a collection of protein family alignments which were constructed semi-automatically using hidden Markov models (HMMs). Sequences that were not covered by Pfam were clustered and aligned automatically, and are released as Pfam-B. Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated.

Please cite:-

Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L.L. (2002)
The Pfam Protein Families Database.
Nucleic Acids Res. 30, 276-280.

For more information see:-
http://pfam.sanger.ac.uk/

Appendix - PRINTS

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.

Please cite:-

Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A., Paine, K., Taylor, P., Uddin, A., Zygouri, C. (2003)
PRINTS and its automatic supplement, prePRINTS.
Nucleic Acids Res. 31, 400-402.

For more information see:- http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

Appendix - ProDom

The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. The ProDom database has been designed as a tool to help analyze domain arrangements of proteins and protein families. Strong emphasis has been put on the graphical user interface which allows for interactive analysis of protein homology relationships.

Please cite:-

Corpet F, Servant F, Gouzy J, Kahn D (2000)
ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons,
Nucleic Acids Res. 28:267-269.

For more information see:- http://www.toulouse.inra.fr/prodom.html

Appendix - SMART

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

Please cite:-

Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P., Bork, P. (2002)
Recent improvements to the SMART domain-based sequence annotation resource.
Nucleic Acids Res. 30, 242-244.

For more information see:- http://smart.embl-heidelberg.de/

Appendix - TIGRFAMs

TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.

Please cite:-

Haft, D.H., Selengut, J.D., White, O. (2003)
The TIGRFAMs database of protein families.
Nucleic Acids Res. 31, 371-373.

For more information see:- http://www.jcvi.org/cms/research/projects/tigrfams/overview/

Appendix - PIRSF

The PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture).

Please cite:-

Wu CH, Yeh LL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Ledley RS, Kourtesis P, Suzek BE, Vinayaka CR, Zhang J, Barker WC. (2003)
The Protein Information Resource.
Nucleic Acids Res. 31, 345-347.

For more information see:- http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml

Appendix - SUPERFAMILY

SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the SUPERFAMILY website.

Please cite:-

Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001)
Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure.
J. Mol. Biol., 313, 903-919.

For more information see:- http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/

Appendix - Gene3D

Gene3D is a library of hidden Markov models that represent all proteins of known structure. The seed alignments for the models are derived from the proteins found within the homologous superfamily (H-level) classification level in CATH, which groups together domains that are thought to share a common ancestor. In CATH, similarities at H-level are identified first by sequence comparisons and subsequently by structure comparisons using SSAP. Gene3D has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the Gene3D website.

With InterPro release 12.2 the Gene3D method accession number has changed to G3DSA: (note the colon). G3DSA is a specific subset of G3D data that is relevant to InterPro and permits direct links to the relevant pages in CATH.
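
A small sketch of how such an accession might be handled in code follows. It assumes that the part after the colon is a four-level CATH superfamily code (class, architecture, topology, homologous superfamily), as in G3DSA:3.40.50.300. The function name is hypothetical.

```python
def parse_gene3d_accession(accession):
    """Split a 'G3DSA:' accession into its four CATH levels.

    Assumption: the text after the colon is a dotted, four-level CATH code.
    """
    prefix, separator, code = accession.partition(":")
    if prefix != "G3DSA" or not separator:
        raise ValueError(f"not a G3DSA accession: {accession!r}")
    levels = code.split(".")
    if len(levels) != 4 or not all(level.isdigit() for level in levels):
        raise ValueError(f"unexpected CATH code: {code!r}")
    names = ("class", "architecture", "topology", "homologous_superfamily")
    return dict(zip(names, (int(level) for level in levels)))

print(parse_gene3d_accession("G3DSA:3.40.50.300"))
```

Accessions written without the colon are rejected by this sketch, in line with the note on the accession format above.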

Please cite:-

Frances Pearl, Annabel Todd, Ian Sillitoe, Mark Dibley, Oliver Redfern, Tony Lewis, Christopher Bennett, Russell Marsden, Alistair Grant, David Lee, Adrian Akpor, Michael Maibaum, Andrew Harrison, Timothy Dallman, Gabrielle Reeves, Ilhem Diboun, Sarah Addou, Stefano Lise, Caroline Johnston, Antonio Sillero, Janet Thornton, and Christine Orengo (2005)
The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.
Nucleic Acids Res. 33, Database Issue:D247-D251.

For more information see:- http://www.cathdb.info

Appendix - PANTHER

PANTHER HMMs define protein families, and subfamilies modelled on the divergence of specific functions within the families; this permits more accurate association with function based on ontology terms and pathways, as well as inference of amino acids important for functional specificity.

Please cite:-

Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish Kejariwal, Jody Vandergriff, Steven Rabkin, Nan Guo, Anushya Muruganujan, Olivier Doremieux, Michael J. Campbell, Hiroaki Kitano, and Paul D. Thomas (2005)
The PANTHER database of protein families, subfamilies, functions and pathways.
Nucleic Acids Res. 33, Database Issue:D284-288.

For more information see:- http://www.pantherdb.org

Appendix - HLRN

A large proportion of HMMER-based calculations are performed on the IBM p690 supercomputer at HLRN. We would like to thank Dr. Steffen Schulze-Kremer and the HLRN staff for their continued and valuable assistance.

InterPro 35.0