
Differences

This shows you the differences between two versions of the page.

help:interproscan [2013/04/23 16:25]
hpm (172.22.68.212)
help:interproscan [2013/04/30 11:11] (current)
hpm (172.22.68.212)
Line 7: Line 7:
===== Batch usage =====
-Unlike the standalone version of InterProScan, the InterProScan web services ([[..:services:pfa:iprscan_rest|InterProScan (REST)]] and [[..:services:pfa:iprscan_soap|InterProScan (SOAP)]]) only accept a single sequence as input. This limitation is a results of extensive throughput testing, which has shown that the InterProScan services can process more sequences if each job contains only a single sequence. Thus to process multiple sequences they have to be submitted individually, there are a number of ways to do this:
+Unlike the standalone version of InterProScan, the InterProScan web services ([[..:services:pfa:iprscan_rest|InterProScan (REST)]] and [[..:services:pfa:iprscan_soap|InterProScan (SOAP)]]) only accept a single sequence as input. This limitation is a result of extensive throughput testing, which has shown that the InterProScan services can process more sequences if each job contains only a single sequence. Thus to process multiple sequences they have to be submitted individually, there are a number of ways to do this:
  - Use a tool such as [[http://www.blast2go.org/|Blast2GO]] which implements a mechanism for running InterProScan jobs in parallel.
-  - Most of the sample clients for the InterProScan web services have an option (''--multifasta'') which enables serial processing of a set of input sequences in [[http://www.ebi.ac.uk/2can/tutorials/formats.html#fasta|fasta sequence format]]. So a large set of sequences can either be processed serially, or broken up into sections and processed in parallel.
+  - Most of the sample clients for the InterProScan web services have an option (''--multifasta'') which enables serial processing of a set of input sequences in [[http://en.wikipedia.org/wiki/FASTA_format|fasta sequence format]]. So a large set of sequences can either be processed serially, or broken up into sections and processed in parallel.
-  - The query sequences can be submitted individually, say by breaking the set of sequences into files containing a single sequence each and then running a job for each file.
+  - The query sequences can be submitted individually, say by breaking the set of sequences into files containing a single sequence each (for example using [[http://emboss.open-bio.org/rel/rel6/apps/seqretsplit.html|EMBOSS seqretsplit]] and then running a job for each file.
  - Use a workflow system ([[tutorials/07_workflows|Workflows]]) to manage the jobs.
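
As an illustration of the "one sequence per job" approach described in this section, the following is a minimal Python sketch that splits multi-sequence FASTA text into single-sequence entries. It is not part of the InterProScan sample clients; the function names ''split_fasta'' and ''to_single_fasta'' are hypothetical, and the sketch assumes plain FASTA input where each record starts with a ''>'' header line.

```python
from typing import Iterator, List, Optional, Tuple


def split_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record in FASTA-formatted text.

    Hypothetical helper: assumes every record begins with a '>' line and
    ignores blank lines and any text before the first header.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_single_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one record as FASTA text, wrapping the sequence at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

Each string returned by ''to_single_fasta'' could then be written to its own file and used as the input for a separate job.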
Line 24: Line 24:
To process nucleotide sequences using InterProScan:
-  - **Translate your nucleotide sequence**: the standalone version of InterProScan uses [[http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/sixpack.html|EMBOSS sixpack]] to perform the translation and filter the resulting open reading frame (ORF) sequences by length. Alternative tools such as [[http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/getorf.html|EMBOSS getorf]] and [[http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/transeq.html|EMBOSS transeq]] are also available, but may require an additional filtering process to limit the ORF sequences to those above a certain length. These tools are available as part of [[:soaplab:overview|Soaplab]], in [[:services:emboss|WSEmboss]] and as part of the [[http://www.emboss.org/|EMBOSS]] package.
+  - **Translate your nucleotide sequence**: the standalone version of InterProScan uses [[http://emboss.open-bio.org/rel/rel6/apps/sixpack.html|EMBOSS sixpack]] to perform the translation and filter the resulting open reading frame (ORF) sequences by length. Alternative tools such as [[http://emboss.open-bio.org/rel/rel6/apps/getorf.html|EMBOSS getorf]] and [[http://emboss.open-bio.org/rel/rel6/apps/transeq.html|EMBOSS transeq]] are also available, but may require an additional filtering process to limit the ORF sequences to those above a certain length. These tools are available in [[:soaplab:overview|Soaplab]] and as part of the [[http://emboss.open-bio.org/|EMBOSS]] package.
-  - **Filter ORFs by sequence length**: short sequences (<80 aa) are unlikely to have any signature matches, so unless there is additional evidence that the sequence occurs, short sequences can be discarded. The [[http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/sixpack.html|EMBOSS sixpack]] and [[http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/getorf.html|EMBOSS getorf]] tools provide options to perform length filtering when performing the translation.
+  - **Filter ORFs by sequence length**: short sequences (<80 aa) are unlikely to have any signature matches, so unless there is additional evidence that the sequence occurs, short sequences can be discarded. The [[http://emboss.open-bio.org/rel/rel6/apps/sixpack.html|EMBOSS sixpack]] and [[http://emboss.open-bio.org/rel/rel6/apps/getorf.html|EMBOSS getorf]] tools provide options to perform length filtering when performing the translation.
  - **Significant hits from sequence similarity searches**: the signatures used by InterProScan are based on known protein sequences so a filtering step by performing a BLAST or FASTA sequence similarity search with the ORF translations against the UniProtKB or UniParc protein sequence databases and only keeping sequences which have hits with E-values <0.001. In the case where an exact match is found to the sequence, you can go directly to the InterPro Matches databases (available in [[http://www.ebi.ac.uk/Tools/dbfetch/|dbfetch]], [[..:services:dbfetch|WSDbfetch]] and [[http://srs.ebi.ac.uk/|SRS@EMBL-EBI]] and from the EMBL-EBI FTP site: ftp://ftp.ebi.ac.uk/pub/databases/interpro/) to get the signature matches for the sequence.
-**Note**: the standalone version of InterProScan can perform the translation and ORF length filtering as part of the submission and is recommended if you need to perform large numbers of analysis and have access to the required resources. See [[ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html|InterProScan Readme]] for details.
+**Note**: the standalone version of InterProScan can perform the translation and ORF length filtering as part of the submission and is recommended if you need to perform large numbers of analysis and have access to the required resources. See [[ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/4/README.html|InterProScan Readme]] for details.
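
Where the translation tool does not filter by length itself, the length filtering step described in this section could be applied afterwards. The following Python sketch assumes the ORF translations are already available as ''(header, sequence)'' pairs; the name ''filter_by_length'' is hypothetical, the 80 aa threshold is taken from the text, and ignoring a trailing ''*'' (a stop-codon marker emitted by some translation tools) is an assumption rather than documented InterProScan behaviour.

```python
from typing import Iterable, List, Tuple

# Threshold from the text: sequences shorter than 80 aa are unlikely
# to have any signature matches.
MIN_LENGTH = 80


def filter_by_length(
    records: Iterable[Tuple[str, str]],
    min_length: int = MIN_LENGTH,
) -> List[Tuple[str, str]]:
    """Keep only translations with at least `min_length` residues.

    Assumption: a trailing '*' marks a stop codon and is not counted
    as a residue.
    """
    kept = []
    for header, sequence in records:
        residues = sequence.rstrip("*")
        if len(residues) >= min_length:
            kept.append((header, sequence))
    return kept
```

The threshold is a parameter because short sequences may still be worth keeping when there is additional evidence that the sequence occurs.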
 
help/interproscan.1366730728.txt · Last modified: 2013/04/23 16:25 by hpm