Getting started with Exonerate
Here we provide some examples of using Exonerate to perform various types of pairwise comparison.
Remember to use -h to get a short summary of available options, or --help for a longer summary.
Further information can be found on the exonerate man page.
Simple ungapped alignment
Ungapped alignment is the default for Exonerate, so you only need to supply the query and target sequences:
- exonerate query.fasta target.fasta
It's that simple.
Gapped alignment against a database
To generate gapped alignments, we have to specify an alignment model. The model used here is an affine local alignment model, which is equivalent to the classic Smith-Waterman-Gotoh type of alignment.
- exonerate --model affine:local query.fasta target.fasta
Aligning a cDNA to a genomic sequence
To align a cDNA to genomic sequence, use the est2genome model.
- exonerate --model est2genome query.fasta target.fasta
This will allow introns to be placed in the alignment, allowing exons on the genomic sequence to align to the query.
There are many other models that can be used. See the man page for a complete list.
Aligning a protein to genomic sequence
Similarly, it is possible to align a protein sequence to the genome, (similar to GeneWise, but with heuristics).
- exonerate --model protein2genome query.fasta target.fasta
This model will allow introns in the alignment, but also allow frameshifts, and exon phase changes when a codon is split by an intron.
6-frame translated alignment
The coding2coding model aligns two DNA sequences by translated of both the query and target in all reading frames, while allowing for gaps and frameshifts.
- exonerate --model coding2coding query.fasta target.fasta
When aligning very large sequences, (chromosome-sized), using the --bigseq option allows you to generate genome alignments quickly. When doing this it is also a good idea to raise the hsp threshold to reduce the number of HSPs produced.
- exonerate -m a:l --bigseq yes --dnahspthreshold 120
This can be done with any alignment model.
Exhaustive Smith-Waterman-Gotoh alignment
By default, exonerate will use heuristics to generate alignments, however for the maximum sensitivity, exhaustive alignment should be used.
- exonerate -m a:l --exhaustive yes query.fasta target.fasta
Note this requires quadratic time, so will be slow for very long sequences (although it will only require linear memory).
Alignment including non-equivalenced regions
Another model provided by exonerate is the NER model, which generates alignments where colinear blocks are joined by non-equivalenced (unaligned) regions.
- exonerate -m ner query.fasta target.fasta
Advanced examples using Exonerate
Here we show more advanced examples of using Exonerate to perform various types of pairwise comparison.
Remember to use -h to get a short summary of available options, or --help for a longer summary.
Further information can be found on the Exonerate man page.
Sequence input options
There are several ways to supply input to exonerate. The simplest way is as shown:
- exonerate query.fasta target.fasta
These inputs can contain single or multiple sequences. Exonerate will compare all queries with all targets. You can also specify using multiple queries or targets like this:
- exonerate --query q1.fa q2.fa t2.fa --target t1.fa t2.fa t3.fa
Furthermore, if the input is a directory, exonerate will recursively read all fasta sequences contained within that directory. By default, all files with a .fa extension are assumed to be fasta files, but you can change that behaviour using --fastasuffix like this:
- exonerate -q query_dir -t target_dir --fastasuffix .fasta
By default, exonerate will try and guess which inputs are DNA and which are peptide, by making the assumption that if the first sequence contains >=85% ACGT, then the database is nucleotide. This assumption can be wrong, and the test may be slow if the first sequence is an entire chromosome, so the sequence type may be specified using the --querytype and --targettype options, such as in this example:
- exonerate -q qy.fa -t tg.fa --querytype dna --targettype protein
Using exonerate on a compute farm
To achieve simple course-grained parrallelism with exonerate on a compute farm, it is often desirable to split databases up into chunks and compare each chunk on a different farm node. Exonerate simplifies this process by performing this chunking internally. This can be done by using the arguments --querychunkid and --querychunktotal, for splitting up the query database. This means you do not need to decide on the job granularity when you are copying the data onto the farm. For example, if you wished to split a query database into 8 parts, you would need to run one of these commands on each node.
exonerate q.fa t.fa --querychunkid 1 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 2 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 3 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 4 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 5 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 6 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 7 --querychunktotal 8 exonerate q.fa t.fa --querychunkid 8 --querychunktotal 8
Picking output formats
By default, exonerate will report alignments in a human-readable format and also in 'vulgar' format. you can turn these outputs off like this:
- exonerate --showvulgar no --showalignment no ...
You can also specify other output formats, such as 'cigar' lines, or GFF output on the query or target sequences:
- exonerate --showcigar yes --showquerygff yes
In addition to these output formats, it is possible generate output in a format of your choosing by using the --ryo option.
The --ryo (roll your own) option allows you to print various fields using a printf-like syntax. For example if you wished to dump the portion of the target sequence which appears in the alignment in FASTA format, you could use an --ryo line like this:
- exonerate --ryo ">%ti (%tab - %tae)\n%tas\n"
... to produce output such as this:
>hshcf2.embl (13651 - 13802)
Using this allows you to generate easy-to-parse output tailored to you needs. A full list of the fields which can be used is in the Exonerate man page.
Applying score thresholds
There are several ways to apply score threshold in exonerate. Applying a sensible score threshold not only reduces the number of spurious alignments, but will also make the searches run faster due to the way that BSDP works.
--score : a simple score threshold
--percent : report alignment over a percentage of the maximum score attainable by each query
--bestn : report the best N matches for each query.
- exonerate --score 500
Increasing search sensitivity
There are several strategies that may be employed to increase search sensitivity. However, they all do so at the expense of speed. They are:
Decreasing the word threshold: This reduces the number of words admitted into the word neighbourhood. This can be done with the --dnawordthreshold and --proteinwordthreshold options.
Decreasing the hsp threshold: This raises the HSP score required for consideration in building of alignments. This can be done with the --dnahspthreshold and --proteinhspthreshold options.
Increasing the sizes of the SARs around HSP ends: This can be done with the options: --terminalrangeint--terminalrangeext--joingangeint--joinrangeext--spanrangeint--spanrangeext.
Gapped extension: By using -g, you can force exonerate for perform a full gapped HSP extension. This is probably the best option, but will not be fast until version 1.1.
Exhaustive alignment refinement: Exonerate can be made to perform exhaustive alignment over the region in which the heuristic alignment was found. This is done by using the option:
- exonerate --refine region
Making quick-and-dirty alignments
- exonerate -Q dna -T dna -m unagpped -w 14 --fsmmemory 512
--dnawordthreshold 0 --dnahspthreshold 140
Also, using the --bigseq option
Using Exonerate in Client-Server mode
Here we look at some examples of using Exonerate in Client-Server mode. Further information can be found on the Exonerate man page.
Normally when Exonerate is run, the query and target sequences are used as fasta files. The run time of any search at least be proportianal to the size of these input files. However, the exonerate-server allows a set of sequences and an index of words which they contain to be held in memory. The server may be used in place of the target database on the exonerate command line, which allows very fast searches to be conducted (with run time independent of the size of the database).
To run the exonerate with exonerate-server, you need to build a database file and an index file, and start the server using these files. These steps are explained below.
1. Building the database (.esd) file
Firstly, a .esd (exonerate sequence database) file must be generated from a fasta format input file. This is done using the fasta2esd utility provided with exonerate. For example:
- fasta2esd genome.fasta genome.esd
Alternatively, many fasta files may be used to generate a single .esd file as shown here:
- fasta2esd --fasta chr1.fa chr2.fa chr3.fa --output genome.esd
By default, it is assumed that the input is softmasked. You may turn off this feature (as shown below), so that masked parts of the sequence are included in the resulting indices, although this will slow searches as the indices will be much larger:
- fasta2esd --softmask no genome.fasta genome.esd
The .esd file only links to the fasta files, so will typically be very small. The source fasta files must be left in place for the indexing stage and running the server.
Typical run time for fasta2esd on the human genome: 90 seconds
2. Building the index (.esi) file
Once the .esd file is ready, the .esi (exonerate sequence index) file can be generated using the esd2esi utility. This contains the index of word positions from the input sequences:
- esd2esi genome.esd genome.esi
If the type of exonerate search to be used is translated (ie. where DNA input generates protein alignments, eg. with the protein2genome model), a translated index must be generated (an index of the words in the translated frames of the input):
- esd2esi genome.esd genome.trans.esi --translate yes
By default, the esd2esi step will use up to 1GB (1024MB) of memory for the indexing step. The more memory is available, the fewer passes of the input data are required, and the faster the indexing is completed: This default may be increased or decreased by the parameter --memorylimit:
- esd2esi genome.esd genome.esi --memorylimit 2048
It is also possible to set other parameters which affect the seed words and word neighbourhood. Parameters which may be set in esd2esi will be ignored when exonerate is run using exonerate-server. For a full list of parameters which may be set by esd2esi may be seen by using esd2esi -h For example:
- esd2esi genome.esd genome.esi --dnawordlen 11 --wordjump 3 --saturatethreshold 100
Typical run time for esd2esi on the human genome: 30 minutes (for a translated index: 85 minutes)
3. Starting the server
Once an .esi file has been prepared, the server may be started:
- exonerate-server genome.esi
By default, the server will run on port 12886, but an alternative port may be specified on the command line:
- exonerate-server genome.esi --port 12345
The server is multithreaded - once a connection has been accepted, it will be handled in a separate thread. By default, upto four connections are permitted at once, but this may be changed on the command line:
- exonerate-server genome.esi --maxconnections 10
Typical startup time for exonerate-server on the human genome: less than one minute
The memory requirements of the server are minimised as both the sequences and indices are held in memory in a compressed form. Typically, for a softmasked human genome, about 3GB of memory is required, or 4.5GB when using a translated index)
4. Running the client
Finally, once the server is running, it may be used for fast queries using Exonerate. The server is specified on the command-line in place of the target sequences in the form hostname:port :
- exonerate query.fa localhost:12886
The rest of the command line used is the same as if exonerate was running without the server. In example below, the est2genome model is used and the geneseed parameter is set to 250. Parameters which can be used by the server, such as --geneseed in this example, are passed to the server to accelerate the query. For a typical cDNA of 1.5kb, this query is answered in less than a second.
- exonerate est.fa localhost:12886 --model e2g --geneseed 250
Similarly, a protein query may search a server using a translated index. For a typical 400 residue protein, this query takes less than a second.
- exonerate protein.fa localhost:12886 --model p2g --geneseed 250
The exonerate-server may also be used for analyses using a large number of queries, such as the example below searching a genome using short (36bp) reads. Typically, about 60 reads can be queried against the genome per second, ie. < 30 minutes for 100000 example queries below (on a single CPU).
- exonerate solexa.100000.fa localhost:12886 --model a:l --seedrepeat 4 --dnahspthreshold 90 --score 120
Memory usage of the client is normally very low, as most of the data is held server-side.
5. Running exonerate-server under LSF
You need to tell both LSF and exonerate-server how many CPUs you wish the server use. eg. to run on 2 CPUs:
- bsub -n2 -R'span[hosts=1]' exonerate-server --maxconnections 2 genome.esi
If you are running exonerate-server under LSF, it is easy for the memory limits to be exceeded, which will result in your job being killed. To avoid this, the server should be started specifying the maximum ammount of memory which will be used. eg. allowing up to 4GB of memory for the server:
- bsub -M4000000 -R'select[mem>4000] rusage[mem=4000]' exonerate-server genome.esi
Once the server is running, the client jobs may be directed at the host running the server job, and submitted.
As soon as the client jobs are completed, the server job may be terminated.
(Please note that prior to version 2.1, exonerate-server did not work well with LSF.)