Tutorial

Introduction to the Tutorial

The aim of this tutorial is to provide information on InterPro, how to extract information from InterPro and how to use InterPro to analyse and annotate sequences using the web interface. Full InterPro release information and documentation are available from the InterPro Home page.

InterPro is a searchable database providing information on sequence function and annotation. Sequences are grouped based on protein signatures or 'methods'. These groups represent superfamilies, families or sub-families of sequences or groups of sequences that have one or more sequence features in common. The groups may be defined (or typed) as FAMILIES, DOMAINS, REGIONS, REPEATS OR SITES. The function of sequences within any group may be confined to a single biological process, a diverse range of functions or the group may be functionally uncharacterised. All entries have an abstract and references are provided where possible. It is well worth browsing the database and going through the InterPro FAQs before proceeding.

Introduction to InterPro

What is InterPro?

InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

The member databases use a number of approaches:

  1. ProDom: provides sequence clusters built from UniProtKB using PSI-BLAST.
  2. PROSITE patterns: provide simple regular expressions.
  3. PROSITE profiles and HAMAP: provide sequence matrices.
  4. PRINTS: provides fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
  5. PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: provide hidden Markov models (HMMs).

Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pin-point specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins. PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions to gather family members. HAMAP profiles are manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.

Understanding how InterPro Entries are created

InterPro records (IPR) or entries are created from new protein signatures provided by the participating member databases. The new protein signatures are compared to all the current entries in UniProtKB. Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references, and links to the relevant member database(s). All UniProtKB protein sequences that have matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views.

  • If a new protein signature is unique, in other words it identifies a group of sequences which are not currently in InterPro, it will be assigned a new InterPro accession number and a new entry will be created.
  • If the new protein signature identifies sequences already in InterPro but does not overlap with the adjacent protein signature(s), a new InterPro entry will be created.

If the new protein signature is found to overlap with or associate with one or more existing signatures it will either:

  • be merged with an existing entry when the new protein signature overlaps with an existing signature in more than 75% of the sequences (this does not create a new InterPro entry).
  • become a PARENT or CHILD of an existing entry (creates new entry).
  • become CONTAINS or FOUND IN existing entries (creates new entry).
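
The decision rules above can be summarised as a small function. This is only a sketch: the function name, its parameters and the returned labels are hypothetical, and in practice the choice between the relationship types is made by curators rather than by a formula.

```python
def classify_new_signature(hits_known_sequences: bool,
                           overlaps_existing: bool,
                           overlap_fraction: float = 0.0) -> str:
    """Return the outcome for a new member database signature.

    overlap_fraction is the fraction (0.0 to 1.0) of sequences in which the
    new signature overlaps an existing signature.
    """
    # A signature that identifies sequences not yet in InterPro is unique.
    if not hits_known_sequences:
        return "new entry"
    # Known sequences, but no overlap with the adjacent signature(s).
    if not overlaps_existing:
        return "new entry"
    # Overlap in more than 75% of the sequences: merge, no new entry.
    if overlap_fraction > 0.75:
        return "merge into existing entry"
    # Otherwise a new entry is created and linked by PARENT/CHILD or
    # CONTAINS/FOUND IN.
    return "new entry with relationship"


print(classify_new_signature(True, True, 0.9))
```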

Understanding InterPro TYPES

Type defines the entry as a Family, Domain, Region, Repeat or Site. Sites are sub-classified into either Conserved Sites, Active Sites, Binding Sites or PTMs (Post-translational Modifications).

Families, Regions and Domains

With release 19.0 InterPro introduced a new Type, Region, and new entry classification rules that affect the typing of entries:

  • Entries typed Family or Domain follow stricter criteria to ensure they conform more closely to current biological concepts:
    • Entries typed Family contain signatures that cover all domains in the matching proteins and span >80% of the protein length with no adjacent signatures of type Domain or Region in >90% of the entry protein set.
    • Entries typed Domain identify biological units with defined boundaries, which includes structural and functional domains as well as defined sub-domains.
    • Region is the new InterPro signature type. It defines signatures that cannot be typed as either Family or Domain. In general terms it does not cover all domains or sequence features and, as with domains, there may be one or more non-overlapping Regions mapping to the same proteins in the entry.
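
As an illustration of the Family criteria, the check below reads the rule as "in more than 90% of the proteins in the entry, the signature spans more than 80% of the protein and has no adjacent Domain or Region signature". That reading, the ProteinMatch class and the function name are assumptions made for this sketch, not part of InterPro.

```python
from dataclasses import dataclass


@dataclass
class ProteinMatch:
    protein_length: int   # length of the protein in residues
    match_start: int      # 1-based, inclusive
    match_end: int        # 1-based, inclusive
    has_adjacent_domain_or_region: bool


def looks_like_family(matches, min_coverage=0.80, min_fraction=0.90):
    """Return True if the signature behaves like a Family across the entry."""
    if not matches:
        return False
    qualifying = 0
    for m in matches:
        coverage = (m.match_end - m.match_start + 1) / m.protein_length
        if coverage > min_coverage and not m.has_adjacent_domain_or_region:
            qualifying += 1
    return qualifying / len(matches) > min_fraction


# Ten hypothetical proteins of 100 residues, each matched over residues 1-90.
entry = [ProteinMatch(100, 1, 90, False) for _ in range(10)]
print(looks_like_family(entry))
```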

Repeats and Sites are FEATURES of sequences

REPEATS and SITES (CONSERVED SITES, which include MOTIFS, BINDING SITES, ACTIVE SITES and POST-TRANSLATIONAL MODIFICATIONS (PTMs)) are features of sequences. REPEATS can only ever be FOUND IN FAMILIES, REGIONS and DOMAINS; SITES can additionally be FOUND IN REPEATS. SITES cannot themselves CONTAIN other signatures.

Description of InterPro Repeats:

In entries typed Repeat the signature is generally <50aa in length and can be repeated many times within a single sequence e.g. Armadillo repeat, Tetratricopeptide TPR_1.

Description of InterPro Sites:

InterPro sites are sub-classified into four separate types:

  1. A Conserved Site (includes Motifs) is any short sequence pattern that may contain one or more unique residues and cannot be defined as an Active Site, Binding Site or Post-translational Modification (PTM). It includes features that can be described as a 'motif'.
  2. Active sites are best known as the catalytic pockets of enzymes where a substrate is bound and converted to a product, which is then released. Distant parts of a protein's primary structure may be involved in the formation of the catalytic pocket. Therefore, to describe an active site, one or more signatures will be needed to cover all the active site residues. To be classed as an Active Site the amino acids involved in the reaction must be described and mutational inactivation studies reported.
  3. Binding sites bind chemical compounds, which themselves are not substrates for a reaction. The bound compound may be a required co-factor for a chemical reaction, be involved in electron transport or be involved in protein structure modification. The binding is reversible and to be classed as a Binding Site the amino acids involved in the reaction must be described and mutational inactivation studies reported.
  4. A Post-translational Modification modifies the primary protein structure. Examples include glycosylation, phosphorylation, sulphation and splicing. The modification may be permanent or reversible, and may be required for activation or deactivation of function. To be recognised in InterPro the sequence signature must be described.

Understanding Relationships between InterPro Entries

Relationships are also described in the user manual.

InterPro allows relationships between entries; these are 'PARENT/CHILD' and 'CONTAINS/FOUND IN'.

PARENT/CHILD relationships are used to indicate protein 'SUPERFAMILY/FAMILY/SUBFAMILY' correlations. PARENT/CHILD relationships are permitted between entries of different types: Families, Domains and Regions. Each InterPro entry contains a collection of related protein sequences defined by one or more (overlapping) protein signatures from one or more of the member databases. A CHILD will represent a specific subset of these sequences defined by one or more protein signatures other than the signature(s) which define the PARENT family. The PARENT and CHILD signatures must overlap significantly (>50%) and both must be present in more than 75% of the sequences of the CHILD. It therefore follows that a single protein sequence cannot exist in another subset (a SIBLING set of sequences) of the same PARENT. A CHILD should have no adjacent signatures that are themselves overlapped by the PARENT signature; except for signatures that are defined as REPEAT or SITE, which are sequence features and do not affect relationships. The PARENT signature must be found in a greater number of proteins than any of its CHILDREN. Consequently, the sum of the proteins of all the CHILDREN may be equal to or in a few cases marginally greater than the number of proteins of the parent.

Parent/Child relationships are permitted for Repeats and for Sites but NOT to other entry types.
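
The PARENT/CHILD thresholds described above can be sketched as follows. The function names and the input layout (a dictionary mapping a protein accession to the start and end of the signature match) are hypothetical, and measuring the overlap relative to the CHILD match is an assumption, since the text does not say which signature the 50% refers to.

```python
def overlap_fraction(child_span, parent_span):
    """Fraction of the child match (start, end) covered by the parent match."""
    c_start, c_end = child_span
    p_start, p_end = parent_span
    shared = min(c_end, p_end) - max(c_start, p_start) + 1
    return max(shared, 0) / (c_end - c_start + 1)


def may_be_child(child_matches, parent_matches,
                 min_overlap=0.50, min_fraction=0.75):
    """Check the numeric PARENT/CHILD criteria for two sets of matches."""
    # The PARENT must be found in more proteins than the CHILD.
    if not child_matches or len(parent_matches) <= len(child_matches):
        return False
    # Count CHILD proteins where both signatures are present and overlap.
    supported = sum(
        1
        for accession, span in child_matches.items()
        if accession in parent_matches
        and overlap_fraction(span, parent_matches[accession]) > min_overlap
    )
    return supported / len(child_matches) > min_fraction


# Hypothetical accessions and coordinates, for illustration only.
parent = {"A1": (1, 200), "A2": (1, 200), "A3": (1, 200), "A4": (1, 200), "A5": (1, 200)}
child = {"A1": (10, 150), "A2": (10, 150), "A3": (20, 160), "A4": (5, 100)}
print(may_be_child(child, parent))
```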

CONTAINS/FOUND IN relationships can be applied to REGIONS, DOMAINS, REPEATS and SITES. They describe the composition or structure of the protein sequences in the InterPro entry. For such a relationship to be recorded in an InterPro entry, an arbitrary minimum of 40% of the sequences associated with that entry must CONTAIN or be FOUND IN the signature, and the relationship must also make biological sense. If the sequences identified by the protein signature constitute <40% yet appear to be a subset of the FAMILY, the CONTAINS/FOUND IN rule will be applied. In other words, the 40% rule applies in both directions.

The CONTAINS/FOUND IN rule states that the longer signature is always the container, while the shorter signature is the component (i.e. the larger signature contains the shorter signature). This rule is valid even if the shorter signature (the component) has a greater number of protein matches than the larger one (the container).

In summary:

  • FAMILIES cannot CONTAIN or be FOUND IN FAMILIES
  • FAMILIES cannot be FOUND IN REGIONS, DOMAINS, REPEATS or SITES
  • FAMILIES can CONTAIN REGIONS, DOMAINS, REPEATS and SITES
  • REGIONS cannot CONTAIN FAMILIES
  • REGIONS can CONTAIN REGIONS, DOMAINS, REPEATS and SITES
  • DOMAINS can CONTAIN REGIONS, DOMAINS, REPEATS and SITES
  • REPEATS can CONTAIN SITES
  • REPEATS can only be FOUND IN FAMILIES, REGIONS and DOMAINS
  • SITES can only be FOUND IN FAMILIES, REGIONS, DOMAINS and REPEATS

All other CONTAINS/FOUND IN relationships are forbidden.
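
The summary above amounts to a lookup table of which entry type may contain which. The sketch below encodes it directly; the names ALLOWED_CONTENTS, can_contain and can_be_found_in are hypothetical and exist only for this example.

```python
# For each container type, the entry types it is allowed to CONTAIN.
ALLOWED_CONTENTS = {
    "FAMILY": {"REGION", "DOMAIN", "REPEAT", "SITE"},
    "REGION": {"REGION", "DOMAIN", "REPEAT", "SITE"},
    "DOMAIN": {"REGION", "DOMAIN", "REPEAT", "SITE"},
    "REPEAT": {"SITE"},
    "SITE": set(),  # SITES cannot CONTAIN other signatures
}


def can_contain(container_type: str, component_type: str) -> bool:
    """True if an entry of container_type may CONTAIN component_type."""
    allowed = ALLOWED_CONTENTS.get(container_type.upper(), set())
    return component_type.upper() in allowed


def can_be_found_in(component_type: str, container_type: str) -> bool:
    """FOUND IN is the inverse of CONTAINS."""
    return can_contain(container_type, component_type)


print(can_contain("Family", "Domain"))
print(can_be_found_in("Family", "Region"))
```

Any pair not listed in the table is treated as forbidden, which mirrors the closing rule of the summary.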

For further details on InterPro Types see the User Manual

Colour scheme for graphical views

In the examples of protein matches associated with each entry and in the graphical overview, the matches are displayed as coloured bars above and/or below the protein sequence line. The colours selected are random, other than those associated with structural features (see below), and not related to the colours associated with the signatures for the member databases. A key table near the bottom of the page links InterPro entries to colours and identifies the structural features associated with protein matches, e.g. see IPR006141.

In the Detailed Graphical View each member database method with a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will activate a pop-up box displaying the method database, accession number and the residues corresponding to the position of the match on the protein.

The member database accession number is linked to the member database summary page for that signature and the name of the signature is given in the far right column. In the Detailed Graphical View, the protein signatures of the member databases are given specific colours:

  • Gene3D - purple
  • HAMAP - aqua
  • PANTHER - brown
  • Pfam-A - dark blue
  • Pfam-B - blue
  • PIRSF - pink
  • PRINTS - green
  • ProDom - light blue
  • PROSITE patterns - yellow
  • PROSITE profiles - orange
  • SMART - red
  • SUPERFAMILY - black
  • TIGRFAMs - teal

All PRINTS, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D, PANTHER, and PROSITE and HAMAP profile matches against UniProtKB are flagged as TRUE (T), and displayed if the score is above the individual threshold(s) given by the member databases.

For PROSITE pattern matches to UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences, a pattern match is tested using a miniprofile or a related PROSITE profile (see Nucleic Acids Res. 38 (Database Issue): D245-D249 for full details). When a PROSITE pattern match is confirmed, the match is tagged as status = TRUE (T) and displayed; otherwise it is not displayed.

The key table near the bottom of the page identifies which colours correspond to which member database method or structural feature.

Curated structural information is available only for those proteins with a structure in the PDB. SWISS-MODEL and MODBASE represent non-experimental structural models. They are represented on sequences as coloured and white striped lines:

  • SCOP - black and white
  • CATH - purple and white
  • PDB - green and white
  • SWISS-MODEL - red and white
  • MODBASE - yellow and white

Warning: SWISS-MODEL and MODBASE structures are predicted and MUST be interpreted with caution as they could contain errors.

In the Detailed Graphical View the structural matches are listed at the bottom of the view. The classification ID is a link into the SCOP and CATH classification hierarchies, and the domain ID is a link to the domain itself (this ID contains the PDB identifier). For PDB entries the links go to the Protein Databank in Europe (PDBe) (Protein Structure Database) page of the appropriate PDB entries. For SWISS-MODEL the link goes to the appropriate page of the SWISS-MODEL Repository.

InterPro Information Mining

There are a number of ways one can query the InterPro database using text queries.

Advanced text search

The Advanced Search option allows separate fields of the entry to be searched:

  • Search entries: searches names, abstracts, publications, cross references, GO terms and the GO term hierarchy.
  • InterPro accession: returns the InterPro entry with the given accession number.
  • Protein accession/ID(s): returns the single protein view for a supplied UniProtKB accession. Multiple protein accessions or IDs can be provided as a comma-separated list.
  • Entry of type: returns the list of all entries of the selected type.

Suggested Queries:

  • Type in or paste an InterPro accession into the box to return the InterPro entry with the given accession number. Accession numbers are of the form IPRxxxxxx, where the x's are digits.
  • Find an entry that has a database cross reference to MEROPS but does not include the term Peptidase in either the name or short name.
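
Before pasting accessions into the search box it can help to check their format. The sketch below validates the IPRxxxxxx form (IPR followed by six digits) and splits a comma-separated list of protein accessions; the function names are hypothetical, and the second protein accession in the example is a placeholder.

```python
import re

# "IPR" followed by exactly six digits, e.g. IPR006141.
IPR_PATTERN = re.compile(r"IPR\d{6}")


def is_interpro_accession(text: str) -> bool:
    """True if text has the form IPRxxxxxx, where each x is a digit."""
    return IPR_PATTERN.fullmatch(text.strip()) is not None


def split_accessions(text: str) -> list:
    """Split a comma-separated list of accessions, dropping empty items."""
    return [part.strip() for part in text.split(",") if part.strip()]


print(is_interpro_accession("IPR006141"))
print(split_accessions("P04709, Q00000"))
```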

Have a look at all the InterPro accessions associated with P04709, consider the FAMILY tree and identify the most specific InterPro entry for P04709. Compare a couple of proteins that are related to P04709 as siblings, and some related at the superfamily (Grandparent) level.

InterPro through SRS

As InterPro is indexed in SRS, it can be searched directly or indirectly as a database linked to other databases, permitting complex queries to be defined. The output format can either be one of the preset formats or a user defined format - termed a VIEW.

To use the standard SRS interface go to SRS on the main EBI page and either start a temporary or permanent project. A temporary project lasts only until you close the SRS browser window. A permanent project will last until you choose to delete it. Any of the SRS linked databases, including InterPro, can be searched using 'Quick Search' or with the 'Standard' or 'Extended Query forms'. The expandable button in the Protein function databases section on the Search page takes you to links to InterPro and related databases. Selecting InterPro and clicking 'standard query form' will take you directly to the InterPro expanded search page. Complex queries can be performed to search the fields in the XML file of an InterPro entry by making use of the 'Combine search terms' buttons on the left. Searching InterPro in SRS returns the SRS view of an InterPro record, which differs from the InterPro view.

Notes on InterPro Entry Formats

The SRS view of an InterPro entry differs from the InterPro view in a number of important respects:

  • In the SRS view the Table and Graphical matches are to be found at the bottom of the record.
  • There is no FAMILY Tree display in the SRS view of an InterPro record. Instead, clicking on the blue triangle (CHILD) displays all the records of the CHILDREN, but not the GRANDCHILDREN, as a flatfile. Clicking on the red triangle (PARENT) of any CHILD displays only the immediate PARENT of that CHILD and not the GRANDPARENT. The IPR accession number and name are shown, and each IPR accession number is linked to the entry record.

The format of an InterPro entry is fully described in the user manual and an example entry is provided in the Documentation pages.

It is advisable for users unfamiliar with SRS to read the SRS help pages before commencing text searches. Queries are used to search the fields selected in a query form. Results can be returned either in Table or List format depending on which is selected. One can select a different view on the results page.

Creating Views from the SRS View Manager Pages

It is advisable to familiarise yourself with SRS 'Views' using the SRS documentation pages.

To create a view to display the results of a search:

  1. click on 'Views' at the top of the SRS page
  2. at the top left of the View page fill in the 'Create View Options', create a view name (e.g. NewView)
  3. select 'Table' or 'List' and 'All Fields in database' or 'Common Fields only'
  4. select a database to define the view (e.g. UniProtKB)
  5. select databanks to be linked to displayed entry (e.g. InterPro)
  6. click on 'Create New View'
  7. choose which parts of the databases you would like to view e.g. ID and EntryName for UniProtKB and AccNumber and full_Name for InterPro
  8. then click 'Save New View'
  9. go to the SRS 'Top Page' and choose UniProtKB
  10. select either 'standard' or 'extended search' for a complex query if required
  11. Choose a query
  12. Submit query
  13. Once the search results are retrieved, go to 'Results' link from top of SRS page
  14. Select (tick) the query just performed and in the 'Results Display Options' box select the view you have created from the 'View results using' scrollbar; choose the number of records to display and then click 'rerun query'.

This should now display the results of the search in the view that was defined.

There are a number of default views that can be used, try these to see their formats.

Suggested Queries:

  • Try the query above to get all human androgen receptors or all amylases.
  • In the 'Results' page it is also possible to view the sequences in FASTA format and launch a BLAST search or additional searches, or even align all the sequences recovered using CLUSTALW.
  • In the 'Results' page you may also submit a link to other databases, by clicking on 'Link' and choosing a database. Try finding the links from the results to the Mutation database, MIM or to 3D structure databases.

InterPro Sequence Analysis: InterProScan

The web-based InterProScan server is a useful tool for the individual researcher to analyse and characterise limited numbers of unknown sequences. See the README file and FAQs file for general information, and read InterProScan HELP and the 2can Tutorial for general advice and guidance.

Protein sequence submission is limited to one sequence only. Please contact support for help in submitting multiple sequences. The sequence can either be cut and pasted into the text window or uploaded as a file.

  • For inputting a protein sequence(s) use: free text/Raw, FASTA or UniProtKB formats.

Partially formatted sequences are not accepted. Copying and pasting directly from word processors may yield unpredictable results as hidden/control characters may be present. Adding a return to the end of the sequence may help certain applications understand the input.
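
One way to avoid hidden characters is to clean the sequence before pasting it. The helper below is a sketch only: the name to_fasta and the default header are hypothetical, and it assumes that everything other than letters and '*' (whitespace, digits, control characters) can safely be discarded from the sequence lines.

```python
import re


def to_fasta(raw: str, name: str = "query") -> str:
    """Return a single clean FASTA record ending with a newline."""
    # Normalise Windows and old Mac line endings.
    lines = raw.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    header = f">{name}"
    body = []
    for line in lines:
        if line.startswith(">"):
            # Keep an existing FASTA header (the last one wins).
            header = line.strip()
        else:
            body.append(line)
    # Drop whitespace, digits and hidden/control characters.
    residues = re.sub(r"[^A-Za-z*]", "", "".join(body)).upper()
    # Wrap at 60 residues per line.
    wrapped = "\n".join(residues[i:i + 60] for i in range(0, len(residues), 60))
    return f"{header}\n{wrapped}\n"


print(to_fasta("MK\x0bV L\r\nAA"), end="")
```

Because only one protein sequence can be submitted at a time, the helper deliberately produces a single record.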

Characterisation and annotation of sequences

Users can either choose to perform an interactive run where the results are returned to their screen or choose to have the results sent to an email address. The latter may be more convenient in some circumstances, as some analyses take several minutes (or more), depending on the load on the server; sequences are queued in the order they are received.

InterProScan initially calculates the CRC64 of the submitted sequence and compares it to the InterPro XML file. If a match to the CRC64 is found, InterProScan returns the matches for that sequence. Otherwise it takes the sequence and analyses it against a number of different databases, each of which represents one of the member databases and has preconfigured cut-off thresholds. Following analysis, the results are combined, and the InterPro entries and the sequence signatures are returned to the submitter. The results are presented as a graphical view, which, depending on the file size, is kept for a period of time on the EBI server. There is also the option to view the results in a table or in XML format.
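
The checksum lookup described above can be sketched as follows. This is not the InterProScan code: the CRC-64 variant (reflected ISO 3309 polynomial, initial value zero) is an assumption, and the function names, the precomputed dictionary and the placeholder accession are hypothetical.

```python
# Assumed variant: reflected ISO 3309 polynomial, initial value 0.
POLY = 0xD800000000000000


def crc64(sequence: str) -> str:
    """Return the CRC-64 of a sequence as 16 upper-case hex digits."""
    crc = 0
    for byte in sequence.upper().encode("ascii"):
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY
            else:
                crc >>= 1
    return f"{crc:016X}"


def analyse(sequence, precomputed, run_full_analysis):
    """Return stored matches if the checksum is known, else run the scan."""
    key = crc64(sequence)
    if key in precomputed:
        return precomputed[key]
    return run_full_analysis(sequence)


# Hypothetical store of precomputed matches keyed by checksum.
precomputed = {crc64("MKV"): ["IPR999999"]}
print(analyse("MKV", precomputed, lambda seq: []))
```

Looking up a checksum is far cheaper than running every member database method, which is why a precomputed result is tried first.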

InterPro is a 'one-stop shop' for protein sequence analysis, but in some instances as we will see in one of the examples below, it may be necessary to further analyse sequences using one of InterPro's member databases to resolve conflicts or retrieve detailed statistics. The links to the individual database sequence submission forms are:

Pfam, PRINTS, ProDom, PROSITE, HAMAP, TIGRFAMs, SMART, PIRSF, SUPERFAMILY, Gene3D, PANTHER.

Sequence Analysis

This section aims to familiarise users with using InterProScan as a tool to characterise and annotate sequences.

Open an InterProScan window

Here are a number of unidentified sequences: Sequence-1, Sequence-2, Sequence-3, Sequence-4, Sequence-8_isoform-1, Sequence-8_isoform-2 to analyse:

Sequences 1 and 3 are for DOMAIN matches, sequence-2 is a FAMILY analysis, sequence-4 gives a number of hits to PRINTS methods raising the question as to which of them are true positive matches, and which can be considered as false positives. Use the PRINTS website to determine the TRUE from the FALSE matches. Sequence-1 and Sequence-8, which has two isoforms, are for individuals more interested in plants.

Protein Sequence Analysis

There are two projects:

Project A: Follow the InterProScan tutorial provided in the 2can Introduction to Protein and Proteomic Analysis.

Open an InterProScan window.

Copy the sequence provided into the InterProScan sequence search form or use Sequence-8_Plant (if you prefer a plant sequence); select the applications to run, select 'interactive run' and submit.

Follow the outlined procedure and answer the questions.

Project B: Follow the procedure outlined below to analyse the sequences:

Open an InterProScan window

  1. Click on one of the sequence files provided.
  2. Cut and paste the sequence into the sequence search form, add your email address, select the applications to run, select a reading frame size, select 'interactive run' and submit
  3. OR cut and paste the sequence into the sequence search form, add your email address, select the applications to run, select a reading frame size, select 'by mail server' and submit.
  4. The result is returned either as the 'Picture View' directly to your browser or as a hyperlink to the 'Picture View' in the email.
  5. Using the tool bar buttons look at the various output formats: Table View, Raw Output, and XML output.
  6. Mouse-over the matching methods; a pop-up window appears that provides the boundaries of the method, the E-value of the match and the name of the method.
  7. From the Picture View identify what the InterPro entries represent: FAMILIES, DOMAINS, REPEATS etc.
  8. Utilising the links identify which provides a link to the SRS or InterPro view of the entry and note the differences
  9. Are there any PARENT/CHILD or CONTAINS/FOUND IN relationships associated with the returned InterPro entries?
  10. In the SRS view of the InterPro entry click on the PARENT/CHILD red/blue triangles to see the related entries.
  11. Determine the relationships and domain structure (if any) between the InterPro entries in the Picture View.
  12. Use the FAMILY relationship to classify the protein at the superfamily, family and subfamily levels.
  13. Which is the most specific match?
  14. Go to UniProtKB or SRS and pull out all UniProtKB entries, do they have some annotation (DE line) in common?
  15. Use the InterPro view of the InterPro entry to identify the taxonomic range of the protein FAMILY or DOMAIN. If the entry contains a SMART or Pfam signature, go to their home pages and compare the taxonomy information to the InterPro taxonomy view.
  16. Are there any Gene Ontology terms linked to the InterPro entries? If so, view the terms in QuickGO using the link from the entries.
  17. Use the GO ID in the text search to list all InterPro entries associated with this GO term. Also search for the term in QuickGO, the view of the term will show any InterPro entries linked to the term.
  18. From the InterProScan results in SRS, link to the appropriate GO terms by choosing 'Link', then ticking GO and 'Submit link'. This is another way of retrieving the associated GO terms and all other InterPro entries these terms are associated with.
  19. Use BLAST to identify the sequence with the best homology. Does the BLAST result support the InterProScan result?

For help hints and guidance analyse Sequence-2 first and use the Seq-2 Help file.

--------------------------------------

For Sequence-1, determine if there are any InterPro relationships for Sequence-1.

  • Why is IPR005630 considered to be a DOMAIN rather than a FAMILY?
  • What are the relationships between the domains?

Check the table view to confirm your result.

  • Why is IPR005630 considered to be a CHILD of IPR008949 rather than CONTAINING IPR008949?
  • What annotation could be associated with this sequence?

Sequence-1 Result

--------------------------------------

For Sequence-2, what is the relationship between these InterPro entries? Which is the most specific and what annotation would you attach to this sequence?

Sequence-2 Result

This file illustrates the analysis of Sequence-2: Sequence-2 analysis result

--------------------------------------

For Sequence-3, identify the different DOMAINS in this sequence.

Consider why the InterPro entry IPR000719 has both 'FOUND IN' and 'CONTAINS' relationships. Follow the links and find other related proteins. Submit this protein to the PROSITE sequence scan. Do the results support the analysis provided by InterPro?

Sequence-3 Result

--------------------------------------

For Sequence-4, consider the result and determine the relationships of the InterPro entries. If the result is taken as being representative, are there any relationships missing?

Run the sequence through fingerPRINTScan on the PRINTS website to resolve the false positive from the true hit:

fingerPRINTScan

Use the PRINTS results to identify which is the correct relationship, and read the PRINTS information to determine why it is a false positive.

Sequence-4 Result

--------------------------------------

DNA/RNA Sequence Analysis

InterProScan now has the ability to analyse DNA sequences, though at present only ONE sequence per analysis is permitted. The sequence is translated in all six frames and the translation products are queried against InterPro. The user can set the minimum reading frame length (20 nucleotides to 150 nucleotides).

ESTs and Genes

The program will analyse both genes with introns and ESTs, provided that an open reading frame is equal to or greater than the selected minimum frame size. It will return information provided that a partial or full match to an InterPro signature is present. It is therefore advisable, when analysing sequences that may contain sequencing errors or introns, to set the reading frame size to 20 in the first instance.
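
To see what a six-frame translation with a minimum frame size produces, the sketch below translates a DNA sequence in all six frames using the standard genetic code and keeps the stretches between stop codons that are long enough. It is an illustration, not the InterProScan implementation: the function names are hypothetical, and treating a reading frame as any stop-to-stop stretch is an assumption.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Return the translations of the three forward and three reverse frames."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


def open_frames(dna: str, min_nucleotides: int = 20) -> list:
    """List (frame, peptide) pairs at least min_nucleotides long."""
    results = []
    for frame, protein in six_frames(dna).items():
        for segment in protein.split("*"):
            if len(segment) * 3 >= min_nucleotides:
                results.append((frame, segment))
    return results


print(six_frames("ATGGCC"))
```

A small minimum such as 20 nucleotides keeps short fragments on either side of a frameshift or intron, which is why a low setting is suggested for sequences that may contain errors.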

Tasks:

  • Sequence-5, and Sequence-6, are coding sequence and gene respectively; analyse each separately and view the outputs
  • Modify sequence 5 by either shifting the reading frame or introducing frame shifting bases into the coding sequence and resubmit the sequence for analysis
  • Check your results by submitting Sequence-7, for analysis

Sequence-5 InterProScan output

Sequence-6 InterProScan output

Sequence-7 Result

-------------------------------------------------------------------------

If you have any suggestion or questions please contact us at: EBI Support.

InterPro 35.0