
PRANK: Probabilistic Alignment Kit

Introduction


PRANK is a command-line program that contains the latest features and the complete set of options. It is ideal for scripting and non-interactive work, and because it supports MSAML-formatted output, the resulting alignments can always be browsed with the graphical front-end PRANKSTER.

PRANK is developed on Linux but also works on Mac OS X and Windows.




NEW: PRANK development has moved to Google Code. The new prank-msa site contains the latest version of the program source code and allows entering comments and bug reports. The new version of PRANK available there includes bug fixes and significant speed improvements.



Using PRANK


Disclaimer

PRANK has been developed and best tested on Linux. Precompiled binaries are provided for Mac OS X and Windows; it should also compile fine on these platforms if the necessary tools are available. The author takes no responsibility for any damage that the software may cause to your computer, scientific career, family life or anything else.


Download

PRANK is written in C++. The code is © Ari Loytynoja and distributed under the GPL; the exceptions are the eigen routines and the sequence input/output functions, which come from the PAML and readseq packages and are © Ziheng Yang and Don Gilbert, respectively.

The PRANK source code and precompiled binaries can be downloaded from here.


Installation

On Linux/Unix systems, the code can be unpacked and compiled using the commands:

tar xvzf prank.src*.tgz
cd src/
make

See here for some instructions for compiling the code on Mac OS X.

The software is still under development and, in addition to lacking much error checking, may contain bugs.

If you wish to use the alignment software for your own studies, please send an email to Ari Loytynoja to be kept up to date with the bug fixes and improvements.


Using PRANK

The minimal command is 'prank filename', where the file 'filename' contains more than one sequence in a format supported by the program readseq. Type 'prank -help' to see a brief description of optional parameters.
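As a rough illustration of the requirement that the input holds more than one sequence, here is a minimal sketch of a pre-flight check. It assumes the input is in FASTA format (one of the formats readseq reads); the helper names count_fasta_records and run_prank_minimal and the file name input.fas are hypothetical, and only the command 'prank filename' comes from the text above.

```shell
#!/bin/sh
# Count FASTA records: each record starts with a '>' header line.
# (Hypothetical helper; assumes FASTA input.)
count_fasta_records() {
    grep -c '^>' "$1"
}

# Run the minimal command only when the file holds more than one sequence.
# (Hypothetical wrapper around the minimal command 'prank filename'.)
run_prank_minimal() {
    # grep exits non-zero when it finds no match or cannot read the file;
    # fall back to 0 if it printed nothing at all.
    n=$(count_fasta_records "$1") || n=${n:-0}
    if [ "$n" -gt 1 ]; then
        prank "$1"
    else
        echo "need more than one sequence, found $n" >&2
        return 1
    fi
}

# Example (uncomment to run; input.fas is a placeholder file name):
# run_prank_minimal input.fas
```

The check is deliberately cheap: it only counts header lines and does not validate the sequences themselves.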

Models

For DNA data, PRANK by default uses the HKY model with empirical base frequencies and kappa=2. With the optional command parameters, it supports TN (TN93) and the models nested within it (JC, K2P, FEL, HKY). For example, the JC model is defined as -kappa=1 -dnafreqs=0.25,0.25,0.25,0.25. WAG (WG01) is used for protein alignments but, so far, the software has not been tested much on protein data and other programs may do better. For protein-coding DNA data, one can also use the empirical codon substitution model (kindly provided by Carolin Kosiol). Translation into codons is done in the first forward frame without any error checking.
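The nesting of the nucleotide models can be made explicit. As a sketch in the standard textbook parameterization (an assumption here; the exact scaling PRANK uses internally is not stated on this page), the HKY substitution rate from base i to a different base j, with base frequencies \pi_j, is:

```latex
q_{ij} =
\begin{cases}
  \kappa\,\pi_j & \text{if } i \to j \text{ is a transition }
                  (A \leftrightarrow G,\ C \leftrightarrow T),\\
  \pi_j         & \text{if } i \to j \text{ is a transversion,}
\end{cases}
\qquad i \neq j .
```

Setting \kappa = 1 gives q_{ij} = \pi_j for every change (Felsenstein's 1981 model, assuming that is what FEL denotes); fixing all \pi_j = 1/4 with \kappa free gives K2P; doing both makes all rates equal, which is JC and matches the flags -kappa=1 -dnafreqs=0.25,0.25,0.25,0.25 quoted above.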

Simulation studies with nucleotide sequences containing high numbers of insertions and deletions showed that the option '-F' (i.e. the model "+F" as defined in the paper) gives the most accurate results and should generally be used.

Guide tree

Progressive alignment requires a guide tree, and PRANK can construct one using the Neighbor Joining algorithm and evolutionary distances estimated from fast pairwise alignments. If you don't specify a guide tree, PRANK runs the alignment twice: (1) it generates a tree from the unaligned data, (2) makes a multiple alignment, (3) generates a new guide tree based on that alignment, and (4) makes an improved multiple alignment. The alignments produced at stages (2) and (4) are named, e.g., filename.1.fas and filename.2.fas (the suffix appended to the file name depends on the chosen format). You may also export a PRANK alignment, use phylogeny software to infer a tree, import that (rooted) tree into PRANK, and realign the data. To prevent PRANK from running twice, use the flag '-once'.
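Since a run may leave one or two alignment files behind, a script often needs to pick the most refined one. The following is a minimal sketch, assuming the '.fas' suffix from the example above (other output formats use a different suffix); the helper name latest_alignment is hypothetical.

```shell
#!/bin/sh
# Print the most refined alignment file that exists for a given input name.
# Assumes the '.fas' suffix; adjust for other output formats.
# (Hypothetical helper name.)
latest_alignment() {
    base=$1
    if [ -f "$base.2.fas" ]; then
        echo "$base.2.fas"      # stage (4): alignment on the re-estimated tree
    elif [ -f "$base.1.fas" ]; then
        echo "$base.1.fas"      # stage (2): alignment on the initial tree
    else
        echo "no alignment found for $base" >&2
        return 1
    fi
}

# Example (uncomment to run; 'filename' is a placeholder):
# latest_alignment filename
```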

If you know the correct phylogeny, import the tree with branch lengths and use it for alignment. The PRANK algorithm uses insertion-deletion events as phylogenetic information and the results may be very sensitive to the given topology.

Anchoring

The standard PRANK algorithm is based on an exhaustive search for the best pairwise solution, and for long sequences this soon becomes too time-consuming. The command-line version of the algorithm includes an experimental anchoring option that may radically reduce the computation time and allow aligning sequences up to hundreds of kb long. This option uses the anchoring algorithm chaos from the lagan package and requires the program rechaos.pl to be on your execution path. (Note that some versions of the rechaos.pl program have a bug in the routine reading the input flags, which makes the anchoring fail. Try using this edited version of the program instead of the original.)

If this software is installed on your system, the option -a calls the anchoring program and breaks the search space into shorter fragments. The length of the fragments can be controlled with the parameter -mind=#, and the distance to the anchor before dropping it (to avoid constraining the alignment by potentially misaligned anchor sites) is set with -dropd=#. The default values are 200 and 50, respectively. For relatively short sequences, anchoring can be forced using flags such as -maxd=50 -mind=20 -dropd=10 -skipd=100. Aggressive anchoring significantly speeds up the alignment but may also affect the result and introduce errors.
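Because anchoring depends on rechaos.pl being on the execution path, a script can test for it before adding the anchoring options. This is a minimal sketch: the helper name anchoring_flags is hypothetical, the flags -a, -mind and -dropd are the ones described above, and 200 and 50 are simply the stated defaults written out explicitly.

```shell
#!/bin/sh
# Print anchoring flags if rechaos.pl is on the execution path, else nothing.
# (Hypothetical helper name; flag values are the defaults stated in the text.)
anchoring_flags() {
    if command -v rechaos.pl >/dev/null 2>&1; then
        echo "-a -mind=200 -dropd=50"
    else
        echo "rechaos.pl not found on PATH; aligning without anchors" >&2
    fi
}

# Example (uncomment to run; input.fas is a placeholder, and placing the
# options after the file name is an assumption):
# prank input.fas $(anchoring_flags)
```

The function writes its warning to standard error so that the command substitution stays empty when anchoring is unavailable.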

Ancestral sequences

The PRANK algorithm infers the insertion-deletion events while aligning the sequences, and this information and the inferred ancestral sequences can be written out in two formats. With the option -writeanc, PRANK outputs two additional files:

  • the file *.ancseq contains the topology of the guide tree followed by the aligned extant sequences and inferred ancestral sequences in FASTA format. If the first line of the file is removed, the sequences can be displayed e.g. using PRANKSTER.

  • the file *.ancprof contains the relative probabilities of characters for each ancestral node. The columns of the output are:

    1. event ('-' deletion, '*' insertion, '+' skipped insertion, or ' ' normal match)

    2. site

    3. ML structure state

    4. relative probabilities of different characters (or "character states") for each structure state/process.

You may notice that in the file *.ancprof, character probabilities are computed for insertions ('*' or '+') but not for deletions ('-'). That is simply because deleted characters don't leave any descendants (from which to compute the values) but any single insertion *could* be explained with multiple independent deletions and, thus, have an ancestor. In most cases, any site marked with either '-', '*' or '+' should be considered non-existing and ignored.
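Removing the first line of a *.ancseq file, as described in the list above, is easy to script. A minimal sketch follows; the helper name ancseq_to_fasta and the '.anc.fas' output suffix are hypothetical choices, not PRANK conventions.

```shell
#!/bin/sh
# Write a copy of a *.ancseq file without its first line (the guide tree),
# leaving only the FASTA records, and print the name of the new file.
# (Hypothetical helper name and output suffix.)
ancseq_to_fasta() {
    src=$1
    dst=${src%.ancseq}.anc.fas
    tail -n +2 "$src" > "$dst" && echo "$dst"
}

# Example (uncomment to run; filename.ancseq is a placeholder):
# ancseq_to_fasta filename.ancseq
```

The original file is left untouched, so the guide tree on its first line remains available.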

Speed-ups

By default, the PRANK algorithm doesn't use log-space for the likelihood calculation. This makes the program run faster but may cause underflow problems with larger datasets (>>100 rather distant DNA sequences, fewer protein sequences). With log-space enabled (flag -uselogs), PRANK has been confirmed to successfully align more than 550 DNA sequences.
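A script that aligns datasets of varying size can switch on log-space only when the input is large. The sketch below assumes FASTA input; the helper name logspace_flag is hypothetical, and the default cut-off of 100 sequences is only a rough reading of ">>100" above, not a documented threshold.

```shell
#!/bin/sh
# Print '-uselogs' when the input has more sequences than the given limit.
# Usage: logspace_flag FILE [LIMIT]   (LIMIT defaults to 100, an assumption)
# (Hypothetical helper name; assumes FASTA input.)
logspace_flag() {
    limit=${2:-100}
    # Count '>' header lines; fall back to 0 if grep printed nothing.
    n=$(grep -c '^>' "$1") || n=${n:-0}
    if [ "$n" -gt "$limit" ]; then
        echo "-uselogs"
    fi
}

# Example (uncomment to run; input.fas is a placeholder, and placing the
# option after the file name is an assumption):
# prank input.fas $(logspace_flag input.fas)
```

Sequence divergence matters as much as sequence count for underflow, so a lower limit may be sensible for distant sequences or protein data.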


Methods

For homogeneous models (default), the method corresponds to that published in (LG05). However, PRANK can also use complex models for alignment and infer sequence structure along with the alignment. You can build some models here.


References

TN93. Tamura K, Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. MBE 10:512-526.
WG01. Whelan S, Goldman N. 2001. A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a Maximum-Likelihood Approach. MBE 18:691-699.
LG05. Loytynoja A, Goldman N. 2005. An algorithm for progressive multiple alignment of sequences with insertions. PNAS 102:10557-10562.



Comments? E-mail ari@ebi.ac.uk.
