User FAQ for InterPro

WHAT IS INTERPRO:

What is the history of the InterPro project, who established it and when?

The InterPro database was established in 1999 when the InterPro Consortium was formed between the SWISS-PROT group at EBI and SIB, and the founding member databases Prints, PROSITE, Pfam and ProDom. The first release was later that year. There are several publications on InterPro, please see:

R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research vol 29(1):37-40 (2001).

and

Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., Copley R., Courcelle E., Durbin R., Falquet L., Fleischmann W., Gouzy J., Griffith-Jones S., Haft D., Hermjakob H., Hulo N., Kahn D., Kanapin A., Krestyaninova M., Lopez R., Letunic I., Pagni M., Peyruc D., Ponting C.P., Servant F. and Sigrist C.J.A. InterPro - An integrated documentation resource for protein families, domains and functional sites. Briefings in Bioinformatics 3(3):285-295 (2002).

That issue contains all the papers related to InterPro.

What is InterPro? What is the difference between member databases and matches?

InterPro is a consortium of member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs). Each member database devises methods that can be applied computationally to assign a score to a protein according to how well it matches a given signature. For some types of methods, the classification is binary (i.e. hit or miss); in other cases a numerical value is produced and a cut-off point is chosen to separate hits from misses.
Different member databases create methods/signatures in different ways: some groups build them from alignments studied manually, others use automatic processes with some human input and correction, while ProDom uses an entirely automatic method. See the publications of the different member databases for details.

LICENSING ISSUES:

What are the licensing issues for InterPro? PROSITE is copyrighted and commercial companies have to pay to use SMART. Does this mean that we must pay and sign license agreements with PROSITE (and SWISS-PROT) and SMART if we use InterPro?

The InterPro member databases agreed that all data in InterPro and on the InterPro ftp server is freely distributable and no license agreements are necessary with the databases. For databases that normally require licenses, such as SMART and PROSITE, no license is required in InterPro if you use only the files distributed with InterProScan, e.g. PROSITE.dat, which contains the patterns and profiles and is part of InterProScan. PROSITE.doc and ProRule files are not distributed as part of InterPro, so commercial users need a PROSITE license if they want to install PROSITE fully with all the files (such as annotation). In the case of SMART, individual family-specific thresholds for HMMs are not distributed freely and do require a license.

Since InterPro is distributed under GNU, can LION distribute it to its customers as provided by EBI?

Any user can distribute everything on the InterPro ftp site (ftp.ebi.ac.uk/pub/databases/InterPro). The database and annotation are copyrighted but freely distributable.

RETRIEVING DATA:

I would like to use InterPro with an automated data extraction tool using a script, and I want to connect to your URL via a TCP protocol. Is this possible, how do I get the data?

All our data is available for download in the following forms:
*InterPro entry data and protein match lists in XML format.
*Lists of InterPro matches to proteins and GO Terms in space-delimited list format.
*InterProScan: A tool unifying all the match methods which allows you to match your protein against all the InterPro match methods.

You can find documentation at our homepage: http://www.ebi.ac.uk/InterPro/ and at our ftp site: ftp://ftp.ebi.ac.uk/pub/databases/InterPro We also have database dumps which we can provide you with if necessary. We cannot encourage you to run scripts to extract data from the web server: this causes excessive load on our web servers, to the detriment of our many other users, and the servers will refuse to provide the data to you.

Can you recommend any tools for working with the XML file? There are various XML editors and syntax checkers, but I have not found anything that can extract a specific set of fields from a file of records.

The most flexible way to extract this kind of information from the XML files is using Perl. The XML::Parser module (available for download from CPAN) is relatively easy to use, fast, and efficient.

Is there a SQL database behind your InterPro web interface? If yes, can I access it? It seems the 'official' data distribution for InterPro is the XML file. Parsing XML is extraordinarily complex when compared to parsing tables, are the tables available for distribution?

InterPro is implemented in an Oracle relational database and is thus available internally for querying using SQL, but because large queries could block the database, external access is denied. An Oracle distribution of the database is available on request and eventually we hope to have a distribution for MySQL too.

Do you have a mirror site and would you be interested in us setting one up?

Currently we are planning mirror sites for InterPro in San Diego and Canada. There are many issues to consider before one can be established.
For example, whether your site has not only an Oracle license but also a license to serve out of an Oracle database to the public. The Consortium would need to agree on the distribution of the data, the site will need to be stable and reliable, and there are many synchronisation issues too. We are striving to provide the public with a single, synchronised source of data, and hope to maintain this policy. Please inform us about your position on these issues and I will approach the rest of the InterPro Consortium to get their views on this.

We run bioinformatics training courses for researchers in which we visit various important and useful web sites and carry out hands-on exercises. InterPro is one such resource, where we query it using both the 'text search' and 'protein search' pages. Do you have a resource which could give a faster response to specific queries during the training classes?

To improve the response time, you might consider asking your students to use protein sequences from UniProt, as these should run much faster. InterPro contains not only annotation and scanning methods, but also a complete set of precomputed scans for the publicly available protein sequences. So, if you use a well-known sequence, the "protein search" page detects this and returns the precomputed values. This works only if the sequence is completely identical. The "text search" itself should be rather quick. If you experience any delays, please send some details and we will look into it (example queries, the URL of the page you used, and some indication of the response time). We can't offer any additional resources right now, as our hardware-accelerated machine is a serial device. It runs very fast, but if, say, 20 users submit queries at the same time, as is common for courses, they are simply processed one after the other.
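
For the earlier question about pulling a specific set of fields out of the XML file, the answer above recommends Perl's XML::Parser. The same streaming idea can be sketched in Python with the standard library. This is only an illustrative sketch: the function name `extract_fields` is made up here, and the element and attribute names in the sample (`interpro`, `id`, `type`, `name`) are assumptions about the file layout that should be checked against the actual XML on the ftp site.

```python
import io
import xml.etree.ElementTree as ET


def extract_fields(source, record_tag, attributes=(), children=()):
    """Yield one dict per <record_tag> element, holding only the requested fields.

    source: a filename or a binary file-like object containing the XML.
    attributes: names of attributes to copy from the record element.
    children: tags of direct child elements whose text should be copied.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        row = {name: elem.get(name) for name in attributes}
        for tag in children:
            row[tag] = elem.findtext(tag)
        yield row
        elem.clear()  # drop the record's subtree so large files stay cheap


# Hypothetical sample; the real file's element and attribute names may differ.
SAMPLE = b"""<interprodb>
  <interpro id="IPR000001" type="Domain">
    <name>Kringle</name>
    <abstract>Not needed here.</abstract>
  </interpro>
  <interpro id="IPR000003" type="Family">
    <name>Retinoid X receptor</name>
  </interpro>
</interprodb>"""

rows = list(extract_fields(io.BytesIO(SAMPLE), "interpro", ["id", "type"], ["name"]))
for row in rows:
    # One tab-separated line per record, ready to load into a table.
    print(row["id"], row["type"], row["name"], sep="\t")
```

Because records are processed one at a time and cleared afterwards, the same function can be pointed at a large downloaded file by passing its path instead of the in-memory sample.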

UPDATES:

How frequently is InterPro updated, are the XML file and live database in sync?

We have been aiming for more regular releases, and at the moment a release occurs approximately every two to three months. However, this depends on many factors, for example when we receive new data from member databases and when they have releases, so it is not always possible to predict exactly when the next release will be. The database is the most up to date; the XML file is updated only with each point release.

I saw that InterPro has updated to version 5.0. Is there some technical reason why you have not used the latest version of e.g. TIGRFAMs? Are you always in sync with the latest member database releases?

There are generally no technical reasons for not having the latest version of member databases, only issues regarding curator time. All integration is done manually by curator judgement, and when many member database releases occur at once it is not always possible to integrate all new methods immediately. We do however try to stay in sync with member database releases wherever possible.

METHODS AND ANNOTATION:

I've read through your papers and I don't understand how the signatures and relationship fields are created. For example, do you have an automatic algorithm that groups together signatures from multiple databases into a family? Or is it that you manually curate the signatures and relationships?

We receive the signatures from the member databases, and the process of grouping them into families or domains is entirely manually curated. We use the matches of the new signatures against UniProt proteins to determine whether the matches overlap in position on the sequence and whether the signatures match the same set of proteins; this is the criterion for grouping signatures into the same InterPro entry. Where one signature matches a subset of another, we curate relationships between the entries.
There are some automatic methods for producing the matches and overlap files, but the ultimate decision for grouping and for determining relationships comes from a biologist.

Is there a hierarchy within Pfam domains or with other methods in InterPro?

Pfam itself does not have a hierarchy, since no two Pfam HMMs should ever overlap on a sequence unless one is a nested domain. However, there may be a Prints or other signature which defines a subset/subfamily of a Pfam family. In InterPro we found many such cases, so we have a hierarchy of parent/child relationships. We also have a contains/found in relationship, which is used when a domain is found in a family of proteins. In this way there is some sort of hierarchy of Pfam in relation to signatures from other signature databases.

What is your constructive criterion for the 'InterPro the domain'?

InterPro domains are "found" because that region on the protein sequence is recognised (hit) by a signature that describes that domain, with a score above the threshold set by the member database for that domain signature. All this is set by the member databases, some of which use structural information to set domain boundaries. Others, by the nature of the signature method, do not.

I would like to get a list of all InterPro entries with the name of each protein family. Where can I find this list?

The list of InterPro entry names can be downloaded from the ftp site: ftp://ftp.ebi.ac.uk/pub/databases/interpro/names.dat, linked to from the InterPro homepage.

How many 'distinct' domains/active sites in InterPro have the same InterPro ID?

By definition, if two signatures are believed to predict the same domain, they are assigned to the same InterPro ID. InterPro does not specifically predict active sites (except by implication, in that certain domains always contain certain active sites).

What kind of errors/ambiguities are more often found in InterPro?

In the past some methods have led to the identification of false positives (i.e.
entries that match signatures but which do not possess the function normally associated with the presence of this signature). Where we have observed this, the methods have been revised and/or the offending entries flagged and removed.

How do you define your entry types and domain boundaries? Are the positions important?

Yes, they are important but not vital. It all depends on the nature of the methods. For example, a PROSITE pattern generally spans just a few amino acids and so is always short, but it is often used to represent a family and thus notionally covers the whole protein sequence. FingerPrints are similar: the motifs are chosen not for their position or the length of sequence they span, but for small areas where one subfamily differs from another. In the case of HMMs there should be less variation, but the criteria used for building a signature may be different for each database. Pfam uses structural information to get domain boundaries correct, but TIGRFAMs may not necessarily do so. Therefore, while size/length is important, it is necessary to take the above into consideration and look at individual signatures to see what the member database was intending to achieve. In the case of TIGRFAMs, if you look at the entry in TIGRFAMs you will see an Isology Type, e.g. hypoth_equivalog. This means they intended it to be a family; they usually specify if it must be a domain.

Is there a text file showing the InterPro trees with subsequent subfamilies?
We don't have a text file with the tree, but there are three ways to get
the desired data:
1) on the web display, the [tree] labels are clickable, leading to, for
instance, http://www.ebi.ac.uk/InterPro/ITreeDisplay?ipr=IPR000401
2) With the XML file (on the ftp site), just extract the <InterPro> and
<contains> tags.
3) In case you need an ASCII file, we can generate a one-off report.

Why is it that not all the ProDom entities are present in InterPro?

ProDom is a very special case among InterPro member databases: it is the only database which is based on an automatic method, while the others are manually curated. Thus, we decided to include in InterPro only ProDom families that have been manually inspected by a curator. This process is slow, which explains the low number of ProDom families in InterPro at the moment. To cope with this, we have decided to distribute the entire ProDom database in the InterProScan package very soon. All the ProDom entries integrated into InterPro will be mapped to their InterPro entries, and all the non-integrated ProDom entries will not. This should allow users to see at a glance which matches are due to manually curated ProDom entries and which are not. Please note that non-curated ProDom entries are not necessarily bad. They have just not been inspected by a curator yet.

PROTEIN ANALYSIS:

I am trying to analyse a large protein, about 400 kDa. Doing some Blast searches I have identified a region of about 200 amino acids that corresponds to the known Laminin G-like domain (InterPro accession number IPR001791). LamG domains bear only little sequence identity amongst themselves (~25%), yet in the literature the domain appears in the structural description of many proteins, so from sequences alone, how do I characterise a sequence of ~170-200 amino acids as being a LamG domain?

The best thing for you to do is to run your sequence through InterProScan: http://www.ebi.ac.uk/InterPro/scan.html If the sequence has a LamG domain in it then it will hit IPR001791. The signatures in this entry are diagnostic of such domains and are more sensitive than a simple sequence similarity search. For more information on the signatures, click on the link and you will be able to get the alignments from the Pfam domain.
If the sequence does not hit this entry then it may not have a LamG domain, or the domain sequence has diverged too much for it to be considered the same domain. In addition, at the bottom of the page for entry IPR001791 there is a link to the condensed graphical view, from which you can get all proteins sharing a common domain and then all proteins with the same domain architecture, with an alignment.

Most of the domains listed in tables of InterPro matches for proteins occur many (hundreds) times. What do these numbers mean?

Each domain can indeed occur many times in a protein. Strictly speaking, what is shown is not the domain boundary but the signature boundary, i.e. that portion of the domain used to identify its existence by one of InterPro's methods. For some methods (e.g. Pfam and PROSITE profiles), the signature is equivalent to the entire domain; for other methods (e.g. PROSITE patterns) it is not. As there are many InterPro methods (some of which identify the same domain), it is possible that a single domain in a single protein will be identified multiple times, e.g. where methods from ProDom, SMART, PROSITE and Pfam each identify the same domain. Of course, the same domain may also occur many times in one protein.

I want to screen the InterPro database using taxonomic input rather than sequence input. How do I do this?

We have just started to store taxonomic data in InterPro, retrieved from the UniProt entries. To get the proteins of interest, in the SRS-based text search it is possible to choose the "organism" from the protein entries and the InterPro entry by name or accession number. The results may be retrieved in different formats. Alternatively, you can also get the information you need directly from SRS (http://srs.ebi.ac.uk/): set up your "Views" to include InterPro matches.
Once you have saved the view, go to the query page and from the UniProt database choose the taxonomic group of interest either by name or tax ID as the query (use "Extended" search). Once you have results, in the "Results" page select the InterPro View you saved and "Combine". This will list all the proteins from the taxonomic group with the corresponding InterPro matches.

From the InterProScan email, my protein sequence seems to have one domain hit to PS50324. But the result on the website showed no match found. Can you explain this?

Your protein matched PS50324, which is a PROSITE Pre-profile for Serine-rich regions. It is not really a domain, but a compositionally biased region, and thus is not included in the InterPro database. However, PS50324 is included in InterProScan to show you compositionally biased regions.

I find InterPro very useful, but I have some difficulty using it. I want to make a dataset of a specified kind of protein, for example transcription factors, but I don't know how to do this.

If you have a protein set of interest, like your example "transcription factor", you can do a text search in InterPro at: http://www.ebi.ac.uk/InterPro/search.html This will return a list of the related InterPro entries and you can choose the one of interest to you. The entry will have a matchlist of all proteins in UniProt belonging to this family or domain. Alternatively, on the search page there is also the SRS-based search option, in which you can specify search fields in InterPro and UniProt. If you have an InterPro entry of interest you can select the accession number, and you can also specify, for example, the description line or keyword in the protein sequence entry, then choose Seq Simple or Seq Entry as the results output. When you get the results there is an option to save the UniProt entries and the sequences. This will give you the dataset you require.

When I was running the InterProScan Linux version, some query results contained NULL for the iprid and domain name. Here is an example from running GLPT_ECOLI with the .raw format:

GLPT_ECOLI 0DE70D08D40AD445 452 BlastProDom PD003807 sp_P08194_GLPT_ECOLI 125 240 T 8-Oct-2002 NULL NULL

What is the NULL for?

While we try to integrate all signatures from the member databases, some signatures are not integrated, for one of three reasons:
1) The signatures are brand new and have not yet been integrated into InterPro by our curators;
2) There are some low complexity region signatures which will not be integrated but may be useful for users of InterProScan;
3) There are around 30 000 ProDom signatures, some of which contain just 1 member of the family/domain, so only the bigger, better families are integrated but many more are available in InterProScan.

Signatures not yet integrated into InterPro have no corresponding IPR, so the value NULL is given. ProDom entries do not have their own domain short names, so when they are integrated they take on the short name of the InterPro entry they belong to, which is more biologically meaningful than the ProDom accession number. In your output the NULLs indicate that PD003807 is not yet integrated into InterPro for one of the above reasons, and therefore it has no domain short name. In the future we hope to treat each of the three cases differently, for example with other accession numbers or as UNINT rather than NULL. Currently this is not a design problem, just not very informative, so it will not necessarily be solved in the next version.

GO MAPPINGS:

How are InterPro entries mapped to GO terms?

See this extract from our user manual: The assignment of GO terms to InterPro entries was done manually by reading the abstract of the entries and the annotation of proteins in the protein match table for each entry. An appropriate GO term for an entry is one which applies to the whole protein.
The GO terms associated with an InterPro entry apply to all proteins with true hits to the signatures in that entry. The assignments are incomplete and ongoing, due to the dynamic nature of the GO project. Some entries could be mapped to very low level (specific) GO terms, while entries describing wider families or common domains were mapped to higher level terms or could not be mapped at all. If the protein matchlist is completely uncharacterised/unannotated, then no GO terms are assigned. If there are some UniProt matches but they are annotated as hypothetical because the function is not known, then they are mapped to the GO term molecular function unknown.

I am looking for a way of getting GO mapping information for ~100 UniProt entries. What is the best way to get this data?

There are a few different ways of retrieving UniProt's GO annotation. Please see the information on data retrieval at: http://www.ebi.ac.uk/GOA/ The GO Annotation project at EBI is called 'GOA'. At the moment we have released our GO annotation for our human data set. You can access GOA via SRS or download the association file from the GO home page or from the EBI and GO ftp sites.

I've been looking at the InterPro and/or Gene Ontology (GO) terms assigned to a variety of proteins, and especially at cases where a protein fails to get an InterPro or GO term assigned to it. In general, these failures make sense, but why are there some entries that are not mapped while a parent entry is mapped to GO?

The GO mapping of InterPro entries is ongoing and not complete, which may explain why some entries are not mapped when they could be. In the case of parent/child entries, the child should be mapped to the parent's GO terms if not to a more specific term; if it is not, it may simply be because a curator has not yet visited that entry.
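
The parent/child rule in the answer above can be checked mechanically against a local copy of the data. The following is a minimal sketch, assuming the parent/child relationships and the GO mappings have already been loaded into two plain dictionaries; the function name, the dictionary layout and the `IPR9…` accessions are hypothetical and chosen only for illustration.

```python
def children_missing_go(parent_of, go_terms):
    """Map each unmapped child entry to the GO terms of its mapped parent.

    parent_of: dict of child accession -> parent accession.
    go_terms: dict of accession -> set of GO identifiers (absent = unmapped).
    """
    candidates = {}
    for child, parent in parent_of.items():
        inherited = go_terms.get(parent, set())
        # Report only children with no terms at all whose parent has some.
        if inherited and not go_terms.get(child):
            candidates[child] = set(inherited)
    return candidates


# Hypothetical accessions and mappings, for illustration only.
parent_of = {"IPR900002": "IPR900001", "IPR900003": "IPR900001"}
go_terms = {"IPR900001": {"GO:0003677"}, "IPR900003": {"GO:0003700"}}

missing = children_missing_go(parent_of, go_terms)
print(missing)
```

Entries reported this way are only candidates: as the answer notes, a child may simply not have been visited by a curator yet, so the parent's terms are a starting point rather than a confirmed annotation.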

I'm in the process of updating the InterPro to GO mappings that I have stored locally, and I noticed that some InterPro motifs that previously had GO terms associated no longer do so. Is this correct?

It is certainly possible for GO terms to be removed. Reasons for terms being removed include: the InterPro accession number becomes a secondary accession number; the method in the entry is altered and this changes its scope of protein matches; correction of annotation in the entry; GO terms become obsolete, etc.
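
When refreshing a local copy of the mappings, it can help to list exactly which terms disappeared before overwriting the old data, so that each removal can be checked against the reasons given above. A minimal sketch, assuming both snapshots are held as dictionaries from InterPro accession to a set of GO identifiers; the function name and the sample accessions are hypothetical.

```python
def removed_go_terms(old, new):
    """Return {accession: GO terms present in the old snapshot but not the new}."""
    removed = {}
    for entry, old_terms in old.items():
        # An entry missing from the new snapshot has lost all of its terms.
        gone = set(old_terms) - set(new.get(entry, ()))
        if gone:
            removed[entry] = gone
    return removed


# Hypothetical snapshots, for illustration only.
old_snapshot = {
    "IPR900001": {"GO:0003677", "GO:0005634"},
    "IPR900002": {"GO:0016020"},
}
new_snapshot = {"IPR900001": {"GO:0003677"}}

for entry, terms in sorted(removed_go_terms(old_snapshot, new_snapshot).items()):
    print(entry, ", ".join(sorted(terms)))
```

An entry that vanishes entirely from the new snapshot, like the second one here, is worth checking first for a change to a secondary accession number.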
InterPro 35.0