Content
The content of the current release (statistics, lists of the entries by Type, and matches to member database methods, both integrated and unintegrated) is provided in the Release Notes.
Previous releases are listed in the Documentation.
User Manual
To cite InterPro please use:-
Sarah Hunter; Philip Jones; Alex Mitchell; Rolf Apweiler; Teresa K. Attwood; Alex Bateman; Thomas
Bernard; David Binns; Peer Bork; Sarah Burge; Edouard de Castro; Penny Coggill; Matthew Corbett;
Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D. Finn; Matthew Fraser; Julian Gough;
Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Ivica Letunic; David Lonsdale; Rodrigo
Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Conor McMenamin; Huaiyu Mi;
Prudence Mutowo-Muellenet; Nicola Mulder; Darren Natale; Christine Orengo; Sebastien Pesseat; Marco
Punta; Antony F. Quinn; Catherine Rivoire; Amaia Sangrador-Vegas; Jeremy D. Selengut; Christian J.
A. Sigrist; Maxim Scheremetjew; John Tate; Manjulapramila Thimmajanarthanan; Paul D. Thomas; Cathy
H. Wu; Corin Yeats; Siew-Yit Yong
InterPro Funding
Current InterPro Funding:
InterPro is currently funded by grant number 213037 from the European Union under the program "FP7 capacities: Scientific Data Repositories". The working title for the project is IMproving Protein Annotation and Co-ordination using Technology (IMPACT). InterPro is also funded by grant BB/F010508/1 from the BBSRC Bioinformatics and Biological Resources Fund.
Previous InterPro Funding:
InterPro was funded by the award of grant number QLRI-CT-2000-00517 and in part by grant number QLRI-CT-2001000015 from the European Union under the RTD program "Quality of Life and Management of Living Resources". InterPro was also part of the MRC-funded eFamily project.
What is InterPro?
InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan). The member databases use a number of approaches:
Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pin-point specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, and PIRSF always produces HMMs over the full length of a protein and uses protein length restrictions to gather family members. HAMAP profiles are manually created by expert curators, and they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.
Integration into InterPro
Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references and links to the relevant member database(s). All UniProtKB protein sequences that have matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views.
How is it useful?
Protein signature databases have become vital tools for identifying distant relationships in novel sequences and hence are used for the classification of protein sequences and for inferring their function. InterPro streamlines the analysis of newly determined sequences for the individual user and makes a significant contribution to the demanding task of automatic annotation of predicted proteins from genome sequencing projects. InterPro provides internal consistency checks and deeper coverage, making it more efficient and reliable than using each of the pattern databases separately. This unified approach improves both the utility and the coverage of pattern databases, pin-pointing weaknesses and facilitating their further development.
Structure of an InterPro Entry
The information fields (or sections) in bold are the core fields, while those in italics are only displayed when applicable; all information field headers are linked to the relevant section in the User Manual. The fields in an entry are: Header: Accession number and Name, UniProtKB match count and match views, Accession: accession number and short name, Secondary accession number, Type, Signatures, Parent, Children, Found in, Contains, Gene Ontology Terms, Abstract, Structural Links, Database Links, IntAct Links, Taxonomy coverage, Overlapping InterPro entries, Example proteins, Publications and Additional Reading.
Entry Header: Accession Number and Name
Every InterPro entry has an accession number of the form IPRXXXXXX, where X is a digit. The accession number provides a stable way of identifying InterPro entries and therefore allows unambiguous citation of database entries. The InterPro entry name describes the InterPro entry and should give an idea of the type of protein matches for that entry.
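As an illustration of the accession format, the following minimal Python sketch checks whether a string has the IPRXXXXXX shape and extracts such accessions from free text. It assumes exactly six digits after the upper-case prefix "IPR", as the form above suggests; the function names are hypothetical and are not part of any InterPro software.

```python
import re

# Assumption: an accession is the upper-case prefix "IPR" followed by exactly
# six ASCII digits, matching the IPRXXXXXX form described in the manual.
ACCESSION_PATTERN = re.compile(r"IPR[0-9]{6}")


def is_interpro_accession(text: str) -> bool:
    """Return True if the whole string has the IPRXXXXXX shape."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


def find_interpro_accessions(text: str) -> list[str]:
    """Return every IPRXXXXXX-shaped token found in free text, in order."""
    return re.findall(r"\bIPR[0-9]{6}\b", text)


if __name__ == "__main__":
    print(is_interpro_accession("IPR011969"))
    print(find_interpro_accessions("IPR011969 overlaps with IPR009007"))
```

A shape check of this kind only confirms that a string looks like an accession; it does not confirm that the entry exists in a given release.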
UniProtKB Matches and Match Views
UniProtKB match counts are calculated for all UniProtKB proteins, which are a combination of information from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and PIR. For more information go to the UniProtKB home page. Match lists give a number of different views of the signature matches on the sequences in each InterPro entry. These include:
Match information includes the protein sequence accession number, the accession number of the signature and the position of the signature on the protein sequence.
Match Status
For all member database methods, with the exception of PROSITE patterns, matches to UniProtKB proteins are considered to be TRUE if the score is above the individual threshold(s) given by the member database; such matches are flagged as T and are displayed. Matches falling below the set threshold are not displayed. For PROSITE pattern matches to UniProtKB/Swiss-Prot sequences where the match was of status TRUE, the sequences surrounding the pattern match were used, in addition to the pattern, to construct 'miniprofiles'. If a match to the miniprofile is above its set threshold then the match to the PROSITE pattern is considered to be TRUE and the match is displayed; otherwise matches are not displayed, irrespective of their manual match status in UniProtKB/Swiss-Prot. For a full description of the PROSITE manual statuses see the UniProt Knowledgebase user manual, and for a full description of the use of miniprofiles in evaluating PROSITE pattern matches see Nucleic Acids Res. 38 (Database Issue):D245-D249.
Protein Match Views
The protein matches for InterPro entries can be viewed in a number of different ways. For each match display there are various options:
More details on each option are listed below.
Select
Refine
These options refine the selection.
Display
Sort
Proteins are sorted by database category, UniProtKB/Swiss-Prot followed by UniProtKB/TrEMBL, and then alphabetically by UniProtKB accession or ID.
Taxonomy Lineage
A collapsed taxonomy lineage, sorted alphabetically by taxonomic group and then by species name, is displayed for the proteins associated with the InterPro entry. The taxonomy lineages are 'clickable'; clicking on a particular lineage returns only the protein matches for the selected taxonomy and its underlying phylogeny, with species sorted and displayed alphabetically.
Index
UniProtKB: S=Swiss-Prot T=TrEMBL
If there are more than 25 proteins on one page they are split into groups, using the sort order. The index is shown on the left side of the page. Click on each section to view subsets of the selected proteins.
Display: Compact
Each protein is represented as a scaled horizontal line, the protein match line, along which vertical lines are drawn at 10, 20, 50, 100, 200 or 500 amino acid intervals, depending on the length of the protein. The scale is shown to the left of the match graphics. Coloured bars are displayed along the protein match line to indicate where in a protein matches were found among the InterPro entries. The bar is coloured according to which InterPro entry matched that region of the protein. If multiple InterPro entries match at the same point on a protein match line, their match line boxes are displayed one above the other; the vertical position of a match line box and its colour have no significance. Hovering the mouse over a coloured bar will show the InterPro entry accession (linked to that entry), the entry name and the position of the match on the protein. In addition to matches to InterPro entries, matches to curated structural data (CATH, SCOP and PDB) and to non-curated predicted structural elements defined by SWISS-MODEL and MODBASE are also displayed. The matches to these structural models have fixed colours with white striped lines.
The key near the bottom of the page identifies which colours correspond to which InterPro entries or structural features. Matches to unintegrated methods are not displayed.
Display: Detailed
Each protein is represented as a scaled horizontal line, the protein match line, along which vertical lines are drawn at 10, 20, 50, 100, 200 or 500 amino acid intervals, depending on the length of the protein. The scale is shown to the left of the match graphics. For single protein views, the table of match positions and the InterPro Architecture (coloured lozenges) are also displayed. Each member database signature with a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will show the method database accession number and the residues corresponding to the position of the match on the protein. Where a signature matches a protein more than once and the matches overlap, a single pop-up box provides details of the overlapping matches, including the residues corresponding to the position of each match on the protein. The mid-point of each overlap is indicated by a notch on the match bar. The member database accession number is linked to the member database summary page for that signature and the name of the signature is given in the far right column. Each member database signature is identified by a specific colour:
The key table near the bottom of the page identifies which colours correspond to which member database method or structural feature.
Unintegrated Signatures and Signature Pages
Each member database method not integrated into InterPro that has a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will show the method database accession number and the residues corresponding to the position of the match on the protein. The accession number is linked to the relevant page of the member database. 'Unintegrated' is linked to the Unintegrated Signature Page. The Unintegrated Signature Page provides the signature accession number and its name, which defaults to the accession number in the absence of a name, and the number of proteins the signature matches, with a link to the detailed view of protein matches. Signatures that are under review are withdrawn from InterPro. Where the removal of the signature would lead to the loss of the InterPro entry, the entry is flagged 'Not for release' and although the entry accession number is provided in the signature lists, no link to the entry page is available. The protein matches to these signatures are not currently computed and a 0 value is returned and displayed on the signature pages.
Structural Features
Structural information is presented for those proteins with a structure in the PDB and for those whose structure is predicted from automated homology-modelling. It is represented on sequences as coloured white striped bars:
In the Detailed Graphical View the structural matches are listed at the bottom of the view. Entries are sorted by UniProtKB accession number. The UniProtKB accession number links to the UniProtKB entry record. The GO! link returns the associated GO terms for the protein based on mappings to GO via all sources (see QuickGo). The 'Structure' link returns the PDBe protein entry page for that PDB entry. Links are provided to the SCOP and CATH classification hierarchies, and the domain ID is a link to the domain itself (this ID contains the PDB identifier). For proteins with a predicted structure feature based on either SWISS-MODEL or MODBASE, a link to the SWISS-MODEL Repository or MODBASE is provided through the protein accession number.
WARNING: SWISS-MODEL and MODBASE models are theoretically calculated structures, not experimentally determined structures. Therefore the models may contain significant errors.
For those proteins with a known structure clicking on the Astex icon, [
Display: Table
The Table view displays InterPro methods of this entry, and structural features. The matches for proteins (listed down the left hand side of the table, sorted by UniProtKB accession number) are shown for the match methods (listed along the top of the table). The numbers represent positional coordinates of the match on the protein sequence. Matches can be filtered to all proteins or only proteins with known structure in the bulleted list. Both views are paginated at 25 proteins per page.
Display: Proteins
Key to symbols which may appear in the protein label on the left hand side of the match display for Compact, Detailed and Table views.
UniProtKB: Links to UniProtKB record for that protein.
* UniProtKB fragments with FT NON_CONS and FT NON_TER features.
NON_CONS fragments are not indicated as non-consecutive in InterPro and, because they are non-consecutive, the match to methods may be incorrect if the method spans the 'break'.
Display: Architectures
The InterPro Domain Architecture (IDA) Viewer is a graphical representation of protein domain architecture, where the domain architecture of a protein sequence is displayed as a series of non-overlapping domains. The domain architecture is derived from member database signatures that do not overlap on a protein sequence. They are depicted as coloured lozenges; the lozenge name derives from the InterPro entry short name and, if the domain is repeated, the multiplication factor follows the short name, e.g. - x2. In the example below there is one PAN domain, four Kringle domains followed by one Peptidase S1/S6 domain. In some instances the architectural domains can be nested, one inside the other, for example IDA1254 and IDA719 (1254[719]).
Entries are grouped by domain architecture. A count of the number of UniProtKB proteins with the same architecture is provided; an example and the IDA code for each group are also given in the left-hand column. The example is linked to the single-protein detailed view. The IDA code is linked to the compact view for all entries with this domain architecture. There is no relationship between the length of the architecture display and the actual protein length; neither is there any relationship between the length of the lozenge and the length of the member database signatures that they depict. The key table near the bottom of the page identifies which architecture lozenge colours correspond to which InterPro entry.
Accession Number and Short Name
Every InterPro entry has an accession number of the form IPRXXXXXX, where X is a digit. The accession number provides a stable way of identifying InterPro entries and therefore allows unambiguous citation of database entries.
The short name is a short, concise name unique to each InterPro entry.
Secondary
Accession numbers provide a stable way of identifying InterPro entries from release to release. When the signatures in an InterPro entry are split or merged to give new or modified entries, the accession number of the original InterPro entry becomes the secondary accession number in the new or modified InterPro entry. In a recent change, accession numbers are now linked to methods, so any accession number that has been associated with a method will become a secondary accession number in the entry in which the method currently appears. In this way it will be possible to trace movement of methods through splitting and merging of entries.
Type
Type defines the entry as a Family, Domain, Region, Repeat or Site. Sites are sub-classified into either Conserved Sites (includes Motifs), Active Sites, Binding Sites or PTMs (Post-translational Modifications). InterPro has introduced a new Type, Region, and new entry classification rules that affect the typing of entries:
Features of Sequences: Repeats and Sites
InterPro Repeat
A repeat is a region that is not expected to fold into a globular domain on its own. For example, 6-8 copies of the WD40 repeat are needed to form a single globular domain. There are also many other short repeat motifs that probably do not form a globular fold.
InterPro Sites:
Lists of entry types: Family; Domain; Region; Repeat; Conserved Site; Active Site; Binding Site; PTMs.
Signatures
The Signatures field lists the protein signature matches. For each protein signature the member database, the signature ID, the signature name and the number of proteins it matches are given. The member database names are linked to their respective home page and the signature IDs are linked to the corresponding entry information page.
Gene Ontology Assignments
Functional classification of the entry is given by listing associated GO terms. The Gene Ontology project (GO) http://www.geneontology.org/ is a dynamic controlled vocabulary defined in three ontologies: molecular function, biological process and cellular component.
For each associated term the name of the term and GO accession number is given. The assignment of GO terms to InterPro entries was done manually by reading the abstract of the entries and the annotation of proteins in the protein match table for each entry. An appropriate GO term for an entry is one that applies to the whole protein. The GO terms associated with an InterPro entry apply to all proteins with true hits to the signatures in that entry. The assignments are incomplete and are ongoing due to the dynamic nature of the GO project. Some entries could be mapped to very low level (specific) GO terms, while entries describing wider families or common domains were mapped to higher level terms or could not be mapped at all. The GO terms and mappings can be found using the EBI QuickGo browser. It is important to remember that these mappings provide useful predictions of GO assignments to the corresponding proteins; however, biological exceptions such as inactivated enzymes may occur. GO annotation is an ongoing process as new entries are incorporated and old entries are updated.
Abstract
The Abstract describes the signatures in the entry and the protein matches, and provides references. Where possible a functional inference is made.
Structural Links
Links to CATH, SCOP and PDB are displayed when proteins in the entry have known structures. The PDB links are displayed within a pop-up box. Structural links are generated automatically to the CATH and SCOP databases through residue-by-residue mappings with UniProtKB proteins. These databases describe the structural architecture of proteins, placing them within hierarchical classification schemes; InterPro entries are mapped to the 'Homologous Superfamily' level of CATH and to the 'Superfamily' level of SCOP.
However, only those CATH- and SCOP-defined domains that are based on structural data describing single UniProtKB entries (as opposed to chimeric domains composed of multiple UniProtKB entries), and which overlap with InterPro signatures, are integrated into InterPro. Structural links are also provided to the Protein Data Bank (PDB), the repository for crystallographic and NMR structures, via the Protein Data Bank in Europe (PDBe). These links are to all the PDBe entries for proteins that match the InterPro entry, provided they cover the signatures within the entry. It is important to stress that the structural links field contains curated mappings of SCOP and CATH domains to UniProtKB protein sequences. These, together with the PDBe data, are displayed as separate lines in the graphical view of protein matches, and should not be confused with the predictions of these domains as represented by the signatures.
Database Links
Database links include cross-references to:
Interactions
Links from InterPro to IntAct are provided at the level of individual UniProtKB accessions, only for manually curated protein:protein interactions. A limited set of 20 examples is provided. Each example links through to the IntAct page that describes the interaction: the UniProtKB accession numbers of the interacting proteins, the binding domains, experimental features and the InterPro entries of the interacting proteins.
Taxonomy Coverage
The Taxonomy Coverage aims to provide an 'at a glance' view of the taxonomic range of the sequences associated with each InterPro entry and the number of sequences associated with each lineage. The taxonomic lineages are 'clickable' and provide a pop-up, which displays the tax-ID, the taxonomy and taxonomic subgroup(s)/species having matches to proteins, the protein match counts and a FASTA link. Clicking on the taxonomy or taxonomic subgroup(s)/species links to the protein overview matches for the selected taxonomy. Clicking on the FASTA box will download the complete set of FASTA sequences for the selected taxonomy of the entry. The lineages were carefully selected to provide a view of the major groups of organisms. The circular display has the taxonomy-tree root as its centre. Selected model organisms populate the outermost circle. Nodes of the taxonomy-tree are placed on the inner circles. Radial lines lead to the description for each node. No significance is attached to the position of the node on a particular inner circle, other than convenience, though some attempt has been made to group nodes. The nodes themselves are either true taxonomy nodes, which have an NCBI taxonomy number, or artificial nodes created for this display, of which there are three: Unclassified, Other Eukaryotes (Non-Metazoa) and the Plastid Group.
Artificial Taxon: Unclassified contains the following NCBI taxon groups:
The Eukaryota (TAXONOMY:2759) comprises 29 taxa, which have been grouped into two artificial taxa and one existing taxon: Fungi/Metazoa (TAXONOMY:33154); Node Metazoa.
Artificial Taxon: Plastid Group contains the following NCBI taxon groups:
Each taxonomic group within this artificial taxon contains organisms that have a plastid.
Artificial Taxon: Other Eukaryotes (Non-Metazoa) comprises the following NCBI taxon groups:
The taxonomic groups within this artificial taxon are the remaining taxonomic groups of NCBI taxon 2759 that are not in the Plastid Group and are not Fungi/Metazoa (TAXONOMY:33154).
Overlapping InterPro Entries
This section displays entries that share more than 70% of their proteins. Such overlaps define PARENT/CHILD and CONTAINS/FOUND IN relationships between InterPro entries.
In the above example, InterPro entry IPR011969 contains proteins which are also found in IPR009007 as a result of the protein signatures of the two entries overlapping. The two entries have been compared firstly by counting the number of proteins which are common to both, the results of which are displayed in the Venn diagram on the left, and secondly by calculating the average overlap of the protein signatures, in amino acids, with the results displayed in the bar diagram on the right. Venn diagram display of the overlap of proteins common to both entries:
Bar diagram display of the average amino acid overlap between the protein signatures: The average number of amino acids overlapping in the sequences of the 31 proteins common to both entries is then calculated, with the results displayed in the bar diagram on the right. The bar diagram display is only shown for 'Domain - Domain' relationships.
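The protein-level part of this comparison can be sketched in Python as follows. This is only an illustration under stated assumptions: the manual does not give the exact formula, so the percentage here is taken relative to the protein set of the entry being viewed, and the function names and protein identifiers are hypothetical.

```python
# Assumption: the shared-protein percentage is measured relative to the entry
# being viewed. The manual states only that entries sharing more than 70% of
# their proteins are displayed; the exact formula is not given.
OVERLAP_THRESHOLD = 70.0


def shared_protein_percentage(entry_proteins: set[str], other_proteins: set[str]) -> float:
    """Percentage of entry_proteins that are also matched by the other entry."""
    if not entry_proteins:
        return 0.0
    shared = entry_proteins & other_proteins
    return 100.0 * len(shared) / len(entry_proteins)


def is_reported_overlap(entry_proteins: set[str], other_proteins: set[str]) -> bool:
    """True when the shared percentage is strictly greater than the threshold."""
    return shared_protein_percentage(entry_proteins, other_proteins) > OVERLAP_THRESHOLD


if __name__ == "__main__":
    # Hypothetical protein identifiers, used only to exercise the functions.
    smaller_entry = {"PROT_A", "PROT_B", "PROT_C"}
    larger_entry = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}
    print(shared_protein_percentage(smaller_entry, larger_entry))
    print(is_reported_overlap(larger_entry, smaller_entry))
```

Under this assumption, an entry whose proteins are all found in a second entry scores 100% against it, which is consistent with the IPR011969 and IPR009007 example. The amino acid overlap shown in the bar diagram is a separate calculation and is not modelled here.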
The results of these comparisons are used to calculate the percentage overlap score, with all scores greater than 70% displayed on the InterPro pages. In this example, since all proteins found in IPR011969 are also found in IPR009007, and all the amino acids from IPR009007 overlap with those from IPR011969, the percentage overlap score is 100%.
Examples
The protein entries in the examples illustrate as far as possible the diversity in taxonomy, structure and function of the proteins in the InterPro entry. For each example protein the accession number, UniProtKB name and a compact view of the matches are given.
Publications
The Publications field provides a list of references associated with each InterPro entry abstract. The list is originally derived from the reference lists of the member databases. Manual curation of abstracts often adds further references. PubMed Identifiers in the Publications and Additional Reading fields (below) now link to CiteXplore. The EBI's CiteXplore combines literature searching with text mining tools for biology. Search results are cross referenced to EBI applications based on publication identifiers. Links to full text versions are provided where available.
Additional Reading
The Additional Reading field provides a list of publications derived from the references provided by the member databases for the methods associated with each InterPro entry which are not referenced in the abstract. Additionally, a maximum of 5 references per entry are taken from the PDB where one or more of the proteins in the entry has had its structure determined.
Conventions Used in the Databank
General structure
InterPro is managed within a relational database system. Lists of protein matches for member database signatures are stored in the InterPro Oracle tables. The subset of InterPro Oracle tables that is publicly available is prepared in two formats: Oracle and MySQL.
In addition, the two tables from the GO relational database (INTERPRO2GO and TERMS) are included in the export files to provide functional compatibility with InterPro web server software. The files can be downloaded from the InterPro ftp site:
InterPro information is publicly released as ASCII (text) flat files, written in XML: interpro.xml, match_complete.xml, feature.xml, uniparc_match.tar.gz and unimes_match.tar.gz.
Match_complete.xml contains all UniProtKB sequences and those member database methods that match UniProtKB sequences that have not yet been integrated into InterPro.
Due to the large size of UniParc and UniMES the data has been divided into chunks and the latest updates are provided in these files at each InterPro release.
The database excluding matches is dumped to a disk. From this stage on, the information flow is strictly one way, so any modifications we make beyond this point are not reflected in Oracle. The XML files excluding matches will validate against the schema. For those using parsers that do not support schema based validation, there is a derived DTD included as well. The schema diagram is available in Adobe pdf format from the documentation page. The XML files may be downloaded from the EBI anonymous-ftp server: ftp://ftp.ebi.ac.uk/pub/databases/interpro.
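Because the XML flat files can be large, a streaming parser is one reasonable way to read them. The Python sketch below uses the standard library's incremental parser on a small inline sample. The element and attribute names (interprodb, interpro, id, type, short_name, name) and the sample values are assumptions made for illustration and should be checked against the published schema or DTD before use.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative sample only. The element and attribute names and the values
# below are assumptions and placeholders, not taken from the manual.
SAMPLE_XML = """<interprodb>
  <interpro id="IPR011969" type="Domain" short_name="example_a">
    <name>Example entry A</name>
  </interpro>
  <interpro id="IPR009007" type="Domain" short_name="example_b">
    <name>Example entry B</name>
  </interpro>
</interprodb>
"""


def iter_entries(source):
    """Yield one dict per <interpro> element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "interpro":
            yield {
                "id": elem.get("id"),
                "type": elem.get("type"),
                "short_name": elem.get("short_name"),
                "name": elem.findtext("name"),
            }
            elem.clear()  # release the children of the processed entry


if __name__ == "__main__":
    for entry in iter_entries(io.BytesIO(SAMPLE_XML.encode("utf-8"))):
        print(entry["id"], entry["type"], entry["name"])
```

For a downloaded file, an open binary file handle could be passed to iter_entries in place of the in-memory sample.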
InterPro is accessible for interactive use via the EBI Web server:
An Oracle distribution of the InterPro database is now available from the ftp site. All the necessary files and information can be found in the oracle directory.
Text- and Sequence-based Searches
InterPro search
The search box is made available on the front page and at the top of all other web pages; inputs include:
Searches are case sensitive. The fields searched are: name, short name, abstract, signature name, cross references and GO terms. In addition to the results of the search, the statistics for the current release are returned. A perfect match to the search query is highlighted in yellow.
Advanced search
The Advanced Search option, linked to the 'Search' term, is available from the left hand menu bar. It includes the following options:
Search entries
The Advanced Search permits word pairs or phrase searches to be used, e.g. GELATIN-BINDING REGION. All searches are scored with the best match being returned first.
InterPro accession: returns the InterPro entry given an InterPro accession number.
Protein accession/ID(s): returns the single protein view for a single supplied UniProtKB accession. Multiple protein accessions or IDs can be provided as a comma separated list.
Entry of type: returns the list of all entries of the selected type.
Web Services
Retrieval and analysis of InterPro data is now available through the EBI's Web Services. Web Services is an integration technology. To ensure that software from various sources works well together, this technology is built on open standards such as Simple Object Access Protocol (SOAP), a messaging protocol for transporting information, and Web Services Description Language (WSDL), a standard method of describing Web Services and their capabilities. For the transport layer itself, Web Services utilise most of the commonly available network protocols, especially Hypertext Transfer Protocol (HTTP). The link dbfetch explains how to retrieve the data. Currently we support entry based abstract, PDB links, abstract, match and GO mapping. Both Perl and Java clients can be used at the command line to retrieve the data.
Querying InterPro using InterProScan
Sequence-based queries are performed using InterProScan, a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource. InterProScan is Perl-based and can be run via the EBI web interface or via EBI Web Services; alternatively, a stand-alone version can be downloaded from the ftp server and installed locally. The EBI web InterProScan currently has the ability to analyse 1 nucleotide coding sequence, which is translated in 6 frames. Protein sequence submissions should be limited to one sequence when querying through this interface.
Please contact InterProScan support (interhelp@ebi.ac.uk) for help in submitting multiple sequences or, alternatively, download the stand-alone version. Sequences can either be cut and pasted into the large text window or uploaded as a file.
Partially formatted sequences will not be accepted. Copying and Pasting directly from word processors may yield unpredictable results as hidden/control characters may be present. Adding a return to the end of the sequence may help certain applications understand the input. To speed up analysis, a CRC64 check is performed. From the graphical display of the results, links go to the method signature pages and to the InterPro view of the matches. New features of InterProScan are listed separately in the program's release notes. The Release Notes contain the latest information about new features and usually consist of extension of the package to include new member database signatures or alterations to how results are processed. Go to 'InterProscan' FTP site for further information. Querying InterPro using BioMArtFor full details please see the BioMart usermanual.Update Procedures
Member databases are released monthly or quarterly, while UniProtKB is updated every three weeks. As a result,
the matches provided by the member databases may be outdated and incomplete, since they would have been
calculated against older versions of UniProtKB. Matches for new and changed protein sequences are therefore
calculated against the member databases after each release of UniProtKB.
Changed protein sequences are recognised by a change in the CRC64 (a 64-bit cyclic redundancy checksum), which
serves as a compact, effectively unique identifier for a given protein sequence.
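The checksum comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the InterPro update pipeline: it assumes the CRC-64 variant based on the ISO 3309 polynomial (x^64 + x^4 + x^3 + x + 1), which is our understanding of the checksum printed in UniProtKB flat files, and the function names, accessions and sequences are hypothetical.

```python
# Reflected form of the assumed ISO 3309 polynomial x^64 + x^4 + x^3 + x + 1.
CRC64_POLY = 0xD800000000000000


def _build_table():
    """Precompute the CRC value contributed by each possible input byte."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC64_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence):
    """Return the CRC64 of an ASCII protein sequence as 16 hex digits."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return f"{crc:016X}"


def sequences_to_recalculate(previous_checksums, current_sequences):
    """Return accessions that are new or whose checksum has changed.

    previous_checksums: accession -> CRC64 stored at the previous release
    current_sequences:  accession -> sequence in the current release
    """
    return sorted(
        accession
        for accession, sequence in current_sequences.items()
        if previous_checksums.get(accession) != crc64(sequence)
    )


if __name__ == "__main__":
    # Hypothetical accessions and sequences, for illustration only.
    previous = {"P00001": crc64("MKTAYIAK"), "P00002": crc64("MAAGLV")}
    current = {"P00001": "MKTAYIAK", "P00002": "MAAGLI", "P00003": "MGSSHH"}
    print(sequences_to_recalculate(previous, current))
```

Only the checksums of the previous release need to be stored, so unchanged sequences can be skipped without comparing full sequences.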
InterPro cross-references in UniProtKB are shown in the DR lines of the UniProtKB flat-file entries. These are updated regularly in UniProtKB, so there may be a lag between the update of matches in InterPro and those in the UniProtKB flat files.

Appendix - PROSITE

PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. Profiles and patterns are constructed from manually edited seed alignments. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. Please cite:- Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de Castro E, Lachaize C, Langendijk-Genevaux PS, Sigrist CJA. (2008)
For more information see:- http://www.expasy.ch/prosite

Appendix - HAMAP

Members of HAMAP families are identified using PROSITE profile collections. HAMAP profiles are manually created by expert curators and they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. The aim of HAMAP is to propagate manually generated annotation to all members of a given protein family in an automated and controlled way using very strict criteria. Please cite:- Tania Lima, Andrea H. Auchincloss, Elisabeth Coudert, Guillaume Keller, Karine Michoud, Catherine Rivoire, Virginie Bulliard,
Edouard de Castro, Corinne Lachaize, Delphine Baratin, Isabelle Phan, Lydie Bougueleret and Amos Bairoch (2009)
Appendix - Pfam

Pfam is a collection of protein family alignments which were constructed semi-automatically using hidden Markov models (HMMs). Sequences that were not covered by Pfam were clustered and aligned automatically, and are released as Pfam-B. Pfam families have permanent accession numbers and contain functional annotation and cross-references to other databases, while Pfam-B families are re-generated at each release and are unannotated. Please cite:-
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L.,
Marshall, M., Sonnhammer, E.L.L. (2002)
For more information see:-
Appendix - PRINTS

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE. Please cite:-
Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A.,
Paine, K., Taylor, P., Uddin, A., Zygouri, C. (2003)
For more information see:- http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/

Appendix - ProDom

The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches. The ProDom database has been designed as a tool to help analyze domain arrangements of proteins and protein families. Strong emphasis has been put on the graphical user interface which allows for interactive analysis of protein homology relationships. Please cite:-
Corpet F, Servant F, Gouzy J, Kahn D (2000)
For more information see:- http://www.toulouse.inra.fr/prodom.html

Appendix - SMART

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 500 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa. Please cite:-
Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R.,
Ponting, C.P., Bork, P. (2002)
For more information see:- http://smart.embl-heidelberg.de/

Appendix - TIGRFAMs

TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function. Please cite:-
Haft, D.H., Selengut, J.D., White, O. (2003)
For more information see:- http://www.jcvi.org/cms/research/projects/tigrfams/overview/

Appendix - PIRSF

The PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). Please cite:- Wu CH, Yeh LL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Ledley RS, Kourtesis P, Suzek BE, Vinayaka CR,
Zhang J, Barker WC. (2003)
For more information see:- http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml

Appendix - SUPERFAMILY

SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the SUPERFAMILY website. Please cite:-
Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001)
For more information see:- http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/

Appendix - Gene3D

Gene3D is a library of hidden Markov models that represent all proteins of known structure. The seed alignments for the models are derived from the proteins found within the homologous superfamily (H-level) classification level in CATH, which groups together domains that are thought to share a common ancestor. In CATH, similarities at H-level are identified first by sequence comparisons and subsequently by structure comparisons using SSAP. Gene3D has been used to carry out structural assignments to all completely sequenced genomes. The results and analysis are available from the Gene3D website. With InterPro release 12.2 the Gene3D method accession number has changed to G3DSA: (note the colon). G3DSA is a specific subset of G3D data that is relevant to InterPro and permits direct links to the relevant pages in CATH. Please cite:-
Frances Pearl, Annabel Todd, Ian Sillitoe, Mark Dibley, Oliver Redfern, Tony Lewis, Christopher Bennett, Russell
Marsden, Alistair Grant, David Lee, Adrian Akpor, Michael Maibaum, Andrew Harrison, Timothy Dallman, Gabrielle
Reeves, Ilhem Diboun, Sarah Addou, Stefano Lise, Caroline Johnston, Antonio Sillero, Janet Thornton, and
Christine Orengo (2005)
For more information see:- http://www.cathdb.info

Appendix - PANTHER

PANTHER HMMs define protein families, and subfamilies modelled on the divergence of specific functions within the families; this permits more accurate association with function based on ontology terms and pathways, as well as inference of amino acids important for functional specificity. Please cite:-
Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish Kejariwal, Jody Vandergriff, Steven Rabkin, Nan Guo,
Anushya Muruganujan, Olivier Doremieux, Michael J. Campbell, Hiroaki Kitano, and Paul D. Thomas (2005)
For more information see:- http://www.pantherdb.org

Appendix - HLRN

A large proportion of HMMER-based calculations are performed on the IBM p690 supercomputer at HLRN. We would like to thank Dr. Steffen Schulze-Kremer and the HLRN staff for their continued and valuable assistance.

Copyright Notice

InterPro - Integrated Resource Of Protein Families, Domains And Functional Sites. Copyright (C) 2001 The InterPro Consortium. This manual and the accompanying database may be copied and redistributed freely, without advance permission, provided that this Copyright statement is reproduced with each copy.

What does this Copyright Notice mean?

The InterPro member databases agreed that all data in InterPro and on the InterPro ftp server is freely distributable and no license agreements are required for use. For the databases that normally do require licenses for commercial use, all data from those databases that are distributed with the InterProScan package may be considered free and public and not subject to license agreements. These databases may have additional information not distributed by InterPro which is then subject to licensing.
InterPro 35.0