Tutorial

Introduction to the Tutorial

The aim of this tutorial is to provide information on InterPro, how to extract information from InterPro and how to use InterPro to analyse and annotate sequences using the web interface. Full InterPro release information and documentation are available from the InterPro Home page.

InterPro is a searchable database providing information on sequence function and annotation. Sequences are grouped based on protein signatures or 'methods'. These groups represent superfamilies, families or sub-families of sequences or groups of sequences that have one or more sequence features in common. The groups may be defined (or typed) as FAMILIES, DOMAINS, REGIONS, REPEATS or SITES. The function of sequences within any group may be confined to a single biological process, a diverse range of functions or the group may be functionally uncharacterised. All entries have an abstract and references are provided where possible.

It is well worth browsing the database and going through the InterPro FAQs before proceeding.

Introduction to InterPro

What is InterPro?

InterPro is an integrated documentation resource for protein families, domains, regions and sites. InterPro combines a number of databases (referred to as member databases) that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan). The member databases use a number of approaches:
Diagnostically, these resources have different areas of optimum application owing to the different underlying analysis methods. In terms of family coverage, the protein signature databases are similar in size but differ in content. While all of the methods share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from superfamily down to subfamily levels in order to pin-point specific functions (e.g., PRINTS). TIGRFAMs focuses on building HMMs for functionally equivalent proteins, while PIRSF always produces HMMs over the full length of a protein and applies protein length restrictions when gathering family members. HAMAP profiles are manually created by expert curators; they identify proteins that belong to well-conserved bacterial, archaeal and plastid-encoded protein families or subfamilies. PANTHER builds HMMs based on the divergence of function within families. SUPERFAMILY and Gene3D are based on structure, using the SCOP and CATH superfamilies, respectively, as a basis for building HMMs.

Understanding how InterPro Entries are created

InterPro records (IPR) or entries are created from new protein signatures provided by the participating member databases. The new protein signatures are compared to all the current entries in UniProtKB. Signatures describing the same protein family or domain are grouped into unique InterPro entries. Each combined InterPro entry has a unique accession number, an abstract describing the features of proteins associated with the entry, literature references, and links to the relevant member database(s). All UniProtKB protein sequences that have matches to a particular InterPro entry are listed in the Match Table associated with that entry. There are also links to the InterPro graphical views.
If the new protein signature is found to overlap with or associate with one or more existing signatures it will either:
Understanding InterPro TYPES

Type defines the entry as a Family, Domain, Region, Repeat or Site. Sites are sub-classified into either Conserved Sites, Active Sites, Binding Sites or PTMs (Post-translational Modifications).

Families, Regions and Domains

With release 19.0 InterPro introduced a new Type, Region, and new entry classification rules that affect the typing of entries:
Repeats and Sites are FEATURES of sequences

REPEATS and SITES - CONSERVED SITES (includes MOTIFS), BINDING SITES, ACTIVE SITES or POST-TRANSLATIONAL MODIFICATIONS (PTMs) are features of sequences and can only ever be FOUND IN: FAMILIES, REGIONS, DOMAINS; and SITES. SITES cannot themselves CONTAIN other signatures.

Description of InterPro Repeats:

In entries typed Repeat the signature is generally <50 aa in length and can be repeated many times within a single sequence, e.g. Armadillo repeat, Tetratricopeptide TPR_1.

Description of InterPro Sites:

InterPro sites are sub-classified into four separate types:
Understanding Relationships between InterPro Entries

Relationships are also described in the user manual. InterPro allows relationships between entries; these are 'PARENT/CHILD' and 'CONTAINS/FOUND IN'.

PARENT/CHILD relationships are used to indicate protein 'SUPERFAMILY/FAMILY/SUBFAMILY' correlations. PARENT/CHILD relationships are permitted between entries of different types: Families, Domains and Regions. Each InterPro entry contains a collection of related protein sequences defined by one or more (overlapping) protein signatures from one or more of the member databases. A CHILD will represent a specific subset of these sequences defined by one or more protein signatures other than the signature(s) which define the PARENT family. The PARENT and CHILD signatures must overlap significantly (>50%) and both must be present in more than 75% of the sequences of the CHILD. It therefore follows that a single protein sequence cannot exist in another subset (a SIBLING set of sequences) of the same PARENT. A CHILD should have no adjacent signatures that are themselves overlapped by the PARENT signature, except for signatures that are defined as REPEAT or SITE, which are sequence features and do not affect relationships. The PARENT signature must be found in a greater number of proteins than any of its CHILDREN. Consequently, the sum of the proteins of all the CHILDREN may be equal to or in a few cases marginally greater than the number of proteins of the parent. Parent/Child relationships are permitted for Repeats and for Sites but NOT to other entry types.

CONTAINS/FOUND IN relationships can be applied to REGIONS, DOMAINS, REPEATS and SITES. These relationships describe the composition or structure of protein sequences in the InterPro entry.
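The PARENT/CHILD criteria described above (signature overlap above 50%, both signatures present in more than 75% of the CHILD's sequences, and the PARENT matching more proteins than the CHILD) can be illustrated with a minimal Python sketch. This is not InterPro's curation code: the `Match` class, the function names and the sample accessions are hypothetical, and two points the text leaves open are filled in as assumptions, namely that overlap is measured relative to the shorter of the two matches and that each signature matches a protein at most once.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Match:
    """Position of one signature match on a protein (1-based, inclusive)."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_fraction(a: Match, b: Match) -> float:
    """Overlap length divided by the shorter match (an assumed definition)."""
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    return overlap / min(a.length, b.length)


def is_plausible_child(parent: Dict[str, Match], child: Dict[str, Match]) -> bool:
    """Apply the three numeric PARENT/CHILD criteria from the text.

    Both arguments map a protein accession to that signature's match.
    """
    if not child:
        return False
    # The PARENT must be found in more proteins than the CHILD.
    if len(parent) <= len(child):
        return False
    # Count CHILD proteins where both signatures are present and overlap >50%.
    supported = sum(
        1
        for accession, child_match in child.items()
        if accession in parent
        and overlap_fraction(parent[accession], child_match) > 0.5
    )
    # Both signatures must co-occur in more than 75% of the CHILD's proteins.
    return supported / len(child) > 0.75


# Hypothetical example data.
parent = {"P1": Match(10, 200), "P2": Match(5, 180), "P3": Match(20, 210),
          "P4": Match(1, 150), "P5": Match(30, 220)}
child = {"P1": Match(50, 150), "P2": Match(40, 120),
         "P3": Match(60, 160), "P4": Match(10, 100)}
print(is_plausible_child(parent, child))
```

The check covers only the numeric thresholds; the rule about adjacent signatures and the requirement that a relationship makes biological sense still need a curator's judgement.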
For such a relationship to be recorded in an InterPro entry an arbitrary minimum of 40% of the sequences associated with that InterPro entry have to CONTAIN or be FOUND IN the signature and additionally the relationship has to have biological sense or meaning. If the sequences identified by the protein signature constitute <40% yet appear to be a subset of the FAMILY the CONTAINS/FOUND IN rule will be applied. In other words the 40% rule applies in both directions. The CONTAINS/FOUND IN rule states that the longer signature is always the container, while the shorter signature is the component (i.e. the larger signature contains the shorter signature). This rule is valid even if the shorter signature (the component) has a greater number of protein matches than the larger one (the container). In summary:
All other CONTAINS/FOUND IN relationships are forbidden. For further details on InterPro Types see the User Manual.

Colour scheme for graphical views

In the examples of protein matches associated with each entry and in the graphical overview, the matches are displayed as coloured bars above and/or below the protein sequence line. The colours selected are random, other than those associated with structural features (see below), and not related to the colours associated with the signatures for the member databases. A key table near the bottom of the page links InterPro entries to colours and identifies the structural features associated with protein matches, e.g. see IPR006141.

In the Detailed Graphical View each member database method with a match to the protein sequence is displayed on a horizontal line. The position of the match reflects its position on the sequence. Hovering the mouse over a coloured bar will activate a pop-up box displaying the method database, accession number and the residues corresponding to the position of the match on the protein. The member database accession number is linked to the member database summary page for that signature and the name of the signature is given in the far right column. In the Detailed Graphical View, the protein signatures of the member databases are given specific colours:
All PRINTS, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER matches, and all PROSITE and HAMAP profile matches, against UniProtKB are flagged as TRUE (T) and displayed if the score is above the individual threshold(s) given by the member databases. For PROSITE pattern matches to UniProtKB/Swiss-Prot and UniProtKB/TrEMBL sequences, a pattern match is tested using a miniprofile or a related PROSITE profile (see Nucleic Acids Res. 38 (Database Issue):D245-D249 for full details). When a PROSITE pattern match is confirmed, the match is tagged as status = TRUE (T) and displayed; otherwise it is not displayed. The key table near the bottom of the page identifies which colours correspond to which member database method or structural feature.

Curated structural information is available only for those proteins with a structure in the PDB. SWISS-MODEL and MODBASE represent non-experimental structural models. They are represented on sequences as coloured and white striped lines:
Warning: SWISS-MODEL and MODBASE structures are predicted and MUST be interpreted with caution as they could contain errors.

In the Detailed Graphical View the structural matches are listed at the bottom of the view. The classification ID is a link into the SCOP and CATH classification hierarchies, and the domain ID is a link to the domain itself (this ID contains the PDB identifier). For PDB entries the links go to the Protein Data Bank in Europe (PDBe) page of the appropriate PDB entries. For SWISS-MODEL the link goes to the appropriate page of the SWISS-MODEL Repository.

InterPro Information Mining

There are a number of ways one can query the InterPro database using text queries.

Simple text search

A Search InterPro box is found at the top of each InterPro page. Any word or phrase can be typed or pasted into the search box. The fields searched are: name, short name, abstract, signature name, cross references, publications and GO terms. A list of InterPro entries containing the search term(s) is returned. Suggested Queries:
Advanced text search

The Advanced Search option allows separate fields of the entry to be searched:
Suggested Queries:
Have a look at all the InterPro accessions associated with P04709, consider the FAMILY tree and identify the most specific InterPro entry for P04709. Compare a couple of proteins that are related to P04709 as siblings, and some related at the superfamily (Grandparent) level.

InterPro through SRS

As InterPro is indexed in SRS, it can be searched directly or indirectly as a database linked to other databases, permitting complex queries to be defined. The output format can either be one of the preset formats or a user-defined format, termed a VIEW.

To use the standard SRS interface go to SRS on the main EBI page and start either a temporary or a permanent project. A temporary project lasts only until you close the SRS browser window. A permanent project will last until you choose to delete it. Any of the SRS linked databases, including InterPro, can be searched using 'Quick Search' or with the 'Standard' or 'Extended Query forms'. The expandable button in the Protein function databases section on the Search page takes you to links to InterPro and related databases. Selecting InterPro and clicking 'standard query form' will take you directly to the InterPro expanded search page. Complex queries can be performed to search the fields in the XML file of an InterPro entry by making use of the 'Combine search terms' buttons on the left. Searching InterPro in SRS returns the SRS view of an InterPro record, which differs from the InterPro view.

Notes on InterPro Entry Formats

The SRS view of an InterPro entry differs from the InterPro view in a number of important respects:
The format of an InterPro entry is fully described in the user manual and an example entry is provided in the Documentation pages. It is advisable for users unfamiliar with SRS to read the SRS help pages before commencing text searches. Queries are used to search the fields selected in a query form. Results can be returned either in Table or List format depending on which is selected. One can select a different view on the results page.

Creating Views from the SRS View Manager Pages

It is advisable to familiarise yourself with SRS 'Views'. The documentation pages are:

To create a view to display the results of a search:
This should now display the results of the search in the view that was defined. There are a number of default views that can be used, try these to see their formats. Suggested Queries:
InterPro Sequence Analysis: InterProScan

The web-based server of InterProScan is a useful tool for the individual researcher to analyse and characterise limited numbers of unknown sequences. See the README file and FAQs file for general information, and read InterProScan HELP and the 2can Tutorial for general advice and guidance. Protein sequence submission is limited to one sequence only. Please contact support for help in submitting multiple sequences. The sequence can either be cut and pasted into the text window or uploaded as a file.
Partially formatted sequences are not accepted. Copying and pasting directly from word processors may yield unpredictable results as hidden/control characters may be present. Adding a return to the end of the sequence may help certain applications understand the input.

Characterisation and annotation of sequences

Users can either choose to perform an interactive run where the results are returned to their screen or choose to have the results sent to an email address. The latter may be more convenient in some circumstances, as some analyses take several minutes (or more) and depend on the load on the server; sequences are queued in the order they are received.

InterProScan initially calculates the CRC64 of the submitted sequence and compares it to the InterPro XML file. If a match to the CRC64 is found, InterProScan returns the matches for that sequence. Otherwise it takes the sequence and analyses it against a number of different databases, each of which represents one of the member databases and which have preconfigured cut-off thresholds. Following analysis each result is returned and combined, and then the InterPro entries and the sequence signatures are returned to the submitter. The results are presented as a graphical view which, depending on the file size, is kept for a period of time on the EBI server. There is also the option to view the results in a table or in XML format.

InterPro is a 'one-stop shop' for protein sequence analysis, but in some instances, as we will see in one of the examples below, it may be necessary to further analyse sequences using one of InterPro's member databases to resolve conflicts or retrieve detailed statistics. The links to the individual database sequence submission forms are: Pfam, PRINTS, PRODOM, PROSITE, HAMAP, TIGRFAMs, SMART, PIRSF, SUPERFAMILY, Gene3D, PANTHER.

Sequence Analysis

This section aims to familiarise users with using InterProScan as a tool to characterise and annotate sequences.
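The CRC64 lookup that InterProScan performs before running any analysis, described under 'Characterisation and annotation of sequences', can be sketched in a few lines of Python. This is an illustration only: the table-driven checksum below uses the reflected ISO 3309 polynomial with a zero initial value, and treating this as the exact variant used by InterProScan is an assumption. The `precomputed` dictionary stands in for the InterPro XML file, and `lookup_or_scan` and the `scan` callback are hypothetical names, not part of any InterProScan API.

```python
from typing import Callable, Dict, List

# Reflected ISO 3309 polynomial (assumed variant; see the note above).
_POLY = 0xD800000000000000


def _build_table() -> List[int]:
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64(sequence: str) -> str:
    """Return a 16-digit hexadecimal checksum of an upper-cased sequence."""
    crc = 0
    for byte in sequence.upper().encode("ascii"):
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return f"{crc:016X}"


def lookup_or_scan(
    sequence: str,
    precomputed: Dict[str, List[str]],
    scan: Callable[[str], List[str]],
) -> List[str]:
    """Return stored matches when the checksum is known, otherwise scan."""
    key = crc64(sequence)
    if key in precomputed:
        return precomputed[key]
    # Unknown sequence: fall back to the (slow) member-database analysis.
    return scan(sequence)


# Hypothetical usage: one sequence is already known, the other is not.
known = "MKVLAAGIVGLLLAQ"
precomputed = {crc64(known): ["IPR000001"]}  # placeholder accession
print(lookup_or_scan(known, precomputed, scan=lambda s: []))
print(lookup_or_scan("MKT", precomputed, scan=lambda s: ["scanned"]))
```

The point of the design is that the checksum comparison is cheap, so sequences already present in UniProtKB get their stored matches back without queuing a full analysis.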
Open an InterProScan window.

Here are a number of unidentified sequences to analyse: Sequence-1, Sequence-2, Sequence-3, Sequence-4, Sequence-8_isoform-1, Sequence-8_isoform-2. Sequences 1 and 3 are for DOMAIN matches, Sequence-2 is a FAMILY analysis, and Sequence-4 gives a number of hits to PRINTS methods, raising the question of which are true positive matches and which can be considered false positives. Use the PRINTS website to determine the TRUE from the FALSE matches. Sequence-1 and Sequence-8, which has two isoforms, are for individuals more interested in plants.

Protein Sequence Analysis

There are two projects:

Project A: Follow the InterProScan tutorial provided in the 2can Introduction to Protein and Proteomic Analysis. Open an InterProScan window. Copy the sequence provided into the InterProScan sequence search form or use Sequence-8_Plant (if you prefer a plant sequence); select the applications to run, select 'interactive run' and submit. Follow the outlined procedure and answer the questions.

Project B: Follow the procedure outlined below to analyse the sequences: Open an InterProScan window
For help, hints and guidance, analyse Sequence-2 first and use the Seq-2 Help file.

--------------------------------------

For Sequence-1, determine if there are any InterPro relationships.
Check the table view to confirm your result.
--------------------------------------

For Sequence-2, what is the relationship between these InterPro entries? Which is the most specific and what annotation would you attach to this sequence? This file illustrates the analysis of Sequence-2: Sequence-2 analysis result

--------------------------------------

For Sequence-3, identify the different DOMAINS in this sequence. Consider why the InterPro entry IPR000719 has both 'FOUND IN' and 'CONTAINS' relationships. Follow the links and find other related proteins. Submit this protein to the PROSITE sequence scan. Do the results support the analysis provided by InterPro?

--------------------------------------

For Sequence-4, consider the result and determine the relationships of the InterPro entries. If the result is taken as being representative, are there any relationships missing? Run the sequence through fingerPRINTScan on the PRINTS website to resolve the false positive from the true hit. Use the PRINTS results to identify which is the correct relationship, and read the PRINTS information to determine why the other is a false positive.

--------------------------------------

DNA/RNA Sequence Analysis

InterProScan now has the ability to analyse DNA sequences, though at present only ONE sequence per analysis is permitted. The sequence is translated in all six frames and the translation products are queried against InterPro. The user can set the minimum reading frame length (20 nucleotides to 150 nucleotides).

ESTs and Genes

The program will analyse both genes with introns and ESTs, provided that an open reading frame is equal to or greater than the selected minimum frame size. It will return information provided that a partial or full match to an InterPro signature is present. It is therefore advisable, when analysing sequences that may contain sequencing errors or introns, to set the reading frame size to 20 in the first instance.

Tasks:
Sequence-5 InterProScan output

-------------------------------------------------------------------------

If you have any suggestions or questions please contact us at: EBI Support.
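As a supplement to the DNA/RNA Sequence Analysis section, the six-frame translation and minimum reading frame filter it describes can be sketched in Python. This is an illustrative sketch rather than InterProScan's implementation: the function names are hypothetical, the standard genetic code is assumed, and an 'open reading frame' is taken here to mean any stop-free stretch of a translated frame, which is an assumption the text does not spell out.

```python
from typing import Dict, Iterator, Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODONS = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unrecognised codons become 'X'."""
    return "".join(
        _CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(sequence: str) -> Dict[str, str]:
    """Translate a DNA or RNA sequence in frames +1..+3 and -1..-3."""
    dna = sequence.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def open_frames(sequence: str, min_nt: int = 20) -> Iterator[Tuple[str, str]]:
    """Yield (frame, peptide) for stop-free stretches of at least min_nt."""
    for frame, protein in six_frames(sequence).items():
        for peptide in protein.split("*"):
            if len(peptide) * 3 >= min_nt:
                yield frame, peptide


# Hypothetical 27-nucleotide example: ATG, seven GCC codons, then a stop.
example = "ATG" + "GCC" * 7 + "TAA"
for frame, peptide in open_frames(example, min_nt=20):
    print(frame, peptide)
```

Lowering `min_nt` towards 20 keeps short stop-free fragments, which is why the tutorial recommends that setting for sequences that may contain introns or sequencing errors.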
InterPro 35.0