spacer

HSAML format

Introduction


Alignments generated with PRANK and PRANKSTER contain more information than can be stored using the traditional sequence alignment formats, so we have defined a new alignment format called HSAML. Both PRANK and PRANKSTER support saving data in the HSAML format, and PRANKSTER provides a convenient graphical front-end to open and browse these alignments.

The HSAML format follows the XML standards and, in addition to PRANKSTER, any program supporting the XML is able to open and manipulate the alignments. Two of the possible software packages to handle the data are R (see here), a free software environment for statistical computing and graphics, and perl (see here), a cross-platform programming language. We will not go into details of R and perl here but provide simple examples how to manipulate HSAML-formatted data using these two packages. Which ever is your choice of a programming language, it is strongly recommended that you use one of the existing XML parsers to import the HSAML alignments and will not try to read them as text files!



Definition


The XML schema gives a formal definition of the HSAML format.

An alignment generated by PRANK/PRANKSTER consist of four elements: newick, nodes, selection and model; sub-elements of nodes are either leaf or node. Colours refer to this example alignment.

The element newick defines the alignment guide tree (or phylogeny) in the 'newick' format. The tree should be complete and have a name for each node (terminal and internal) and a branch length for each branch. A node name and branch length are separated by a colon, two sister nodes separated by a comma and surrounded by a pair of brackets. A node name can be any alpha-numeric string (A-Za-z0-9 + some other characters) except that the names for internal nodes are always surrounded by hash signs (#) and the names for terminal nodes never contain hashes. It is convenient to keep the node names short and provide more information in nodes.

The element nodes contains terminal and internal nodes. A terminal node is defined in leaf and has attributes 'id', which matches a node name in the newick tree, and 'name',the full name of the sequence. The only content of a leaf is 'sequence', an aligned sequence. An internal node is defined in node and has one attribute, 'id' that matches a node name in the newick tree. The contents of a node are of the type 'probability', each having the attribute 'id' matching one of the process id's in the element model.

The element selection is optional and only appears when a user has deselected some of the alignment sites using filters based on the sitewise posterior probabilities. Selection consists of 'selected_sites', a boolean vector defining if a site is selected/deselected, and 'selection_criteria', a description of the process resulting to the current selection.

The element model has two purposes: first, it gives further information of the processes for which nodes contain posterior probabilities for (attributes 'id' and 'name'), and, second, it specifies how PRANKSTER displays this information. The meaning of different attributes becomes obvious if one plays with PRANKSTER (Settings->Plot), and the range of possible attribute values is defined in the XML schema.

Examples

R

This R script plots the posterior probability of one of the processes across the alignment sites. If you download the script here and the example alignment here, start R and type the command

source("plot_xml.rs")

you should obtain a plot similar to this:

(To reproduce the plot shown here, you should download this script.)

The plotted value is the posterior probability of the slow process across the alignment sites in the three internal nodes. In the two latter plots, the broad peaks match rather well with the known coding exons that are indicated by the black bars in the top (not included in the example script).

Note that the script requires XML package to be installed. R packages can be found from the project homepage following the links to CRAN and Packages. Consult the R manual to learn more about the package installation.


Perl

This perl script demonstrates how to select nodes and clades of an alignment tree, manipulate sequences, extract posterior probability values and work with names and id's. A detailed explanation of the use of the script and the functions implemented in the module is given in the source code file.

The script and the additional module required by the script can be downloaded from here and here, and the use of the script tested with this example dataset. Typing perl parseExample.pl should read in the alignment and print out the location and sequence of the sites for which the alignment reliability score is below 50% in the human-chimp subtree.

Note that the script requires the XML::libXML module to be installed (download it from here). Consult the perl manual to learn more about the package installation.



Acknowledgements


The XML schema was defined by Nicolas Rodriguez. Contact rodrigue@ebi.ac.uk for details.


To the PRANK front page.    Comments? E-mail ari@ebi.ac.uk.

spacer
spacer