InterPro data

InterPro has 13 member databases, each of which uses a different method to classify proteins.

InterPro curators manually integrate protein signatures from member databases, merging signatures that represent the same protein family, domain or site into single InterPro entries. Where possible, they also trace biological relationships between entries. They check the biological accuracy of the individual signatures and add pertinent information, including consistent entry names, descriptive abstracts, links to the biomedical literature and Gene Ontology terms. Links are also made to various other databases and bioinformatic tools, such as UniProt, AlphaFold, Foldseek, ENZYME and PDBe.

Figure 1 provides an overview of the data sources used to construct InterPro.

Figure 1. An overview of the data sources used to construct InterPro.

Member databases

The following databases contribute data to InterPro:

  • CDD at NCBI, Bethesda, USA
  • PANTHER at University of Southern California, CA, USA
  • PIRSF at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, USA
  • Pfam at the EMBL-EBI, Hinxton, UK (hosted on the InterPro website)
  • PRINTS at the University of Manchester, UK (retired, now hosted on the InterPro website)
  • PROSITE and HAMAP at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
  • SFLD at the University of California, San Francisco, USA (retired, now hosted on the InterPro website)
  • SMART at EMBL, Heidelberg, Germany
  • NCBIfam (contains TIGRFAM) at the National Center for Biotechnology Information, Bethesda, MD, USA

Signatures from all of the databases listed above are represented in InterPro by entries of type family, domain, repeat or site.

CATH-Gene3D at University College London, UK

The CATH (Class, Architecture, Topology, Homologous superfamily) database provides a hierarchical domain classification for the 3D structures of proteins deposited in the PDB. CATH uses a semi-automated procedure to classify protein domains into four hierarchical levels: Class (C-level), Architecture (A-level), Topology or fold group (T-level) and Homologous superfamily (H-level). Gene3D provides comprehensive structural domain assignments and functional annotation for protein sequences from major sequence databases such as UniProt, RefSeq, Integr8 and Ensembl. It builds a library of profile hidden Markov models (profile HMMs) from CATH domain sequences using HMMER3 and scans it against various protein sequence databases.
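
The four levels are conventionally written as a dotted identifier read from left to right (C.A.T.H). The sketch below is illustrative only and is not part of any CATH or InterPro tooling: the names `CathNode` and `parse_cath_id` are hypothetical, and it assumes a superfamily identifier of the form 3.40.50.300, optionally carrying the G3DSA: prefix used on Gene3D signature accessions.

```python
from typing import NamedTuple


class CathNode(NamedTuple):
    """One position in the CATH hierarchy, read left to right (hypothetical helper type)."""
    class_: int                  # C-level
    architecture: int            # A-level
    topology: int                # T-level (fold group)
    homologous_superfamily: int  # H-level


def parse_cath_id(identifier: str) -> CathNode:
    """Split a dotted superfamily identifier into its C, A, T and H levels."""
    # Drop an optional prefix such as "G3DSA:"; identifiers without a
    # colon pass through unchanged.
    _, _, dotted = identifier.rpartition(":")
    parts = dotted.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {identifier!r}")
    return CathNode(*(int(p) for p in parts))


node = parse_cath_id("G3DSA:3.40.50.300")
print(node.class_, node.architecture, node.topology, node.homologous_superfamily)
```

Two domains that share the first three numbers belong to the same fold group, and they belong to the same homologous superfamily only if all four numbers agree.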

SUPERFAMILY at the University of Cambridge, UK

The SUPERFAMILY database provides SCOP structural domain annotation of protein sequences at the superfamily level using a library of hidden Markov models (HMMs). The superfamily level in SCOP groups together the most distantly related protein domains that share a common evolutionary ancestor, which makes it useful for remote homology detection. In 2018, the SUPERFAMILY HMM library 2.0 was built by expanding the HMM library 1.75 to include domain sequences taken from the structural domain databases SCOPe, CATH and ECOD, and from full-length PDB sequences. However, this last update has not been propagated to InterPro.

Homologous superfamily entries in InterPro contain signatures exclusively from the CATH-Gene3D and SUPERFAMILY resources. Both databases use collections of profile hidden Markov models (HMMs) to represent diverse structural families. As a result, their signatures often match wider sets of proteins, which makes them difficult to integrate with other member database signatures.