edible.txt
EDIBLE version 1.00
17 May 2000
How to use EDIBLE
-----------------
EDIBLE ("Experimental Design and Information By Likelihood
Exploration") is a computer program written in (hopefully strict) ANSI
C to implement phylogenetic information/experimental design ideas from
Goldman (1998).
The EDIBLE WWW page is reached via
http://www.zoo.cam.ac.uk/zoostaff/goldman/info/index.html.
To make sense of these notes and of EDIBLE in general, you will need
to have a copy of Goldman (1998) to hand. If you are registered for
electronic access to Royal Society journals, you can get this online
via http://www.royalsoc.ac.uk/publish/pro_bs/rpb1407.htm. There is
also a draft version of that paper available via
http://ng-dec1.gen.cam.ac.uk/info/index.html. You might also find it
helpful to have a copy of Massingham and Goldman (2000); a draft
version is included in this distribution (PostScript format).
These notes contain a lot of 'pseudo-LaTeX' notation: all you need to
know is that '^' means 'superscript' (often indicating 'raised to the
power of'), '_' means 'subscript', and curly brackets '{ ... }' mean
'keep the contents together'. The term 'pseudo-LaTeX' is (c) Nick
Goldman, 1999.
* Publication of results obtained from EDIBLE
Please cite Goldman (1998) with respect to the theory behind EDIBLE,
and Massingham and Goldman (2000) with respect to the software.
* To compile
For UNIX systems, an (extremely simple) batch file "Compiler" is
provided that should do the deed. For completeness,
| cc -Ox -o edible *.c -lm
should do the trick (assuming that no other C files are in the current
directory), where x is whichever optimization is required (3 normally
does the trick.
We have little experience of other platforms, but we have successfully
compiled EDIBLE under Windows NT (using Microsoft Visual C++ 6.0---
executable file included in this distribution). Command lines create
something of problem for Macs, but we think there is a solution using
the Metroworks compiler. We will work on this in the future.
* Files in this distribution of EDIBLE
Source files: edible.c llh.c
matrix.c options.c
partial.c random.c
read.c tree.c
utility.c
edible.h variables.h
Notes: edible.txt (these notes)
implementation.txt (further technical notes)
history.txt (history of EDIBLE)
m.and.g.2000.ps (Massingham and Goldman, 2000,
in PostScript format)
Examples: mtdna.base.tree mtdna.out
mtdnaEx4.base.tree mtdnaEx4.out
psieta.base.tree psieta.out
psieta2.base.tree psieta2.out
Compilation: Compiler
Command line options
--------------------
Most of the functionality of EDIBLE is controlled from the command
line (more because it seemed like a good idea at the time rather than
with any intention of allowing it to be scripted).
The command line dump (i.e. from running the program with no options)
is:
| Usage: edible <-s [sample_file] sample_size sequence_length percentile>
| <-m matrix_file> <-p prob_file> <-t> <-b boot_strap_size>
| <-d> <-h pi_a pi_c pi_g pi_t kappa> <-k> <-c cache_size>
| tree_file output_file
followed by brief notes on the meaning of the various command line
options. More extensive notes follow:
* Necessary files:
"edible" is the name of the executable to be run. If you downloaded
an executable and did not rename it, you probably have "W32edible" or
"DUNIXedible" instead.
"tree_file" is the name of the file which should contains a tree in
bracket-colon format (see below). This is the base tree for
information calculations.
"output_file" is fairly self-explanatory. EDIBLE creates a file of
the given name (overwriting if necessary) and keeps going until it
finishes or the hard disk dies.
* Optional: (not all fully-supported)
"-s" estimates a "percentile" percentile point of the information
distribution (instead of the mean per-site information, which is the
default). The distribution in question is the amount of information
per site observed from aligned sequences with "sequence_length"
nucleotides (sites). Monte Carlo simulation is used (not exact
calculations), as with the -b option (see below). The estimate is
based on "sample_size" samples, with the usual implications for
variances of estimates: the greater the value of "sample_size", the
more accurate the estimate of the percentile point but the slower the
calculation. Clearly, larger values of "sample_size" are needed for
extreme values (near to 0 or 100) of "percentile". "sample_file" is
an optional parameter, dumping the sample distribution ("sample_size"
ordered values of observed information per site). An example use of
the -s option might be "-s 1000 895 95", which requests an estimate of
the 95%-point of the information distribution, assuming sequences of
length 895 nucleotides and using 1000 samples to form the estimate.
"-m" dumps the intermediate expectation matrices to
"matrix_file". This can be a lot of numbers on long runs (numerophiles
only!). Each element is separated by a comma, each row by a new-line.
"-p" dumps to "prob_file" the probabilities calculated for each
possible pattern of nucleotides. If information about only one branch
(or node in a clock-like tree) has been requested then the partial
derivatives and information with respect to that branch/node will also
be dumped.
"-t" dumps each tree used to the main output file. Although this
option was really only devised for debugging purposes, it may be
helpful in clarifying what the interactive options 2,3 and 4 (below)
are doing.
"-b" turns on Monte Carlo sampling behaviour for large (and small)
trees. This option uses Monte Carlo simulations to generate alignment
sites according to the distribution defined by the tree analyzed,
calculates the information for these, and forms an estimate of the
mean information from this sample. In contrast to the default
behaviour, which uses an exact calculation, the -b option can save
time (and memory) at the cost of accuracy. The option name is
something of a misnomer now, but originally stood for 'bootstrap' (it
is akin to a parametric bootstrap). "boot_strap_size" controls the
number of samples used. This method is unaffected by machine integer
size-related problems with large trees (see below).
"-d" returns the determinant of the information matrix ( |E(I)| )
specified by all the parameters selected (see 'Tree and parameter
specification' below). Otherwise (the default), the diagonal elements
of the information matrix ( E(I_{ii}) ) corresponding to the selected
parameters are returned. If no parameters are explicitly specified,
the determinant of the matrix for all parameters is returned
irrespective of whether -d is specified on the command line. If
exactly one parameter is specified, then the determinant form is
equivalent to the diagonal element form.
"-h" tells EDIBLE to use HKY85 model of nucleotide substitution
(Hasegawa et al. 1985) rather than the default JC69 model (Jukes and
Cantor 1969). "pi_a", "pi_c", "pi_g", "pi_t" are the obvious base
frequencies, and are treated as constants (and not as parameters
regarding which information might be calculated). "kappa" is the
instantaneous transition/transversion rate ratio (beta/alpha in the
notation of Hasegawa et al.'s 1985 paper) and _is_ treated as a
parameter regarding which information might be calculated (see option
-k below).
"-k" turns _off_ calculations for information about kappa; the
default is to include kappa as a parameter regarding which information
might be calculated (this way round is sensible, regardless of what
the boss thinks).
"-c" creates a cache of the "cache_size" most probable results
generated while sampling. Especially for small trees, this has the
potential to reduce sampling times considerably.
EDIBLE running
--------------
Once successfully started, EDIBLE presents a small menu of options:
|Options: 1. Calculate the expected information of the tree
| 2. Scale tree by a given range of factors
| 3. Scale a single branch by a given range of factors
| 4. Slide a given branch between two others by a given range of distances
| Please choose:
Option 1 performs one information calculation, on the base tree
provided in "tree_file" with the command line options specified.
Option 2 scales the entire base tree by a given range of factors.
You are prompted to provide minimum/maximum factors and the number of
points to use, and EDIBLE performs that number of calculations with
the entire tree scaled by factors uniformly distributed between the
minimum and maximum values (inclusive).
Option 3 does the same as option 2, except that only one branch
length is scaled (the rest of the tree remaining constant). From a
computational point of view, this makes perfect sense with clock-like
trees: beware!
Option 4 allows one branch to be slid between two others, while
keeping the distance between those others constant. You are prompted
to specify which branch is to slide, which branch it is sliding
towards and which branch it is sliding away from. (Anyone commenting
about overspecification is probably correct.) For example, in order
to move branch 2 (which joins taxon B to the tree at node N1 [marked
as ',']) in Tree 0 away from taxon A and nearer to node N2 (';'), to
get something like Tree 1:
C C
: :
4 4
(N1) : (N1) :
A .1.,.....3.....;...5.... D ==> A .....1.....,.3.;...5.... D
: (N2) : (N2)
: :
2 2
: Tree 0 : Tree 1
: :
B B
you would specify that the branch to be moved is number 2; the branch
to move it along (towards) is number 3; and the branch to move it from
(away) is number 1. Note that the total distance from taxon A to node
N2 remains constant.
If the tree is clock-like, EDIBLE will correct the distances involved
to keep the evolutionary time to all other nodes constant (maybe best
to verify this with the -t option!).
Behaviour of option 4 on anything other than simple ('binary'---three
branches meet at the node in question) trees has undefined effects.
Tree and parameter specification
--------------------------------
* Tree specification
EDIBLE accepts trees written in the standard 'bracket' format. A tree
is written as branches separated by commas; each branch may either
lead to a leaf (taxon) or be another (sub)tree. The whole tree is
encased in brackets.
The format for a branch leading to a leaf is "name:length". 'name' is
the leaf name; length is the length of the branch to this leaf. The
name must be less than 16 characters and must not be of the form
"Node-n", where 'n' is an integer (such names being reserved for
internal nodes). Names are case sensitive and may be composed of all
standard alpha-numeric characters except white space, colons and
asterisks.
For a subtree, the format is as if it was a normal tree followed by
":length", where 'length' is the branch length to the subtree.
Branch lengths may be written as normal decimals or in exponential
format, the exponent separated from the decimal by 'E' or 'e'. For
example, 0.025674 = 2.5674E-02 = 0.25674e-1.
An important concept in what follows is the 'base node' of a tree
described in bracket notation. The base node is defined to be the
internal node at which the subtrees appearing in the outer-most
brackets (separated by commas) are connected.
For example, consider the tree:
A B with lengths 1 to A = 2.3
\ / 1 to C = 1.7
1---2 1 to 2 = 0.9862
/ \ 2 to B = 0.7138
C 3--Dog 2 to 3 = 0.71234
| 3 to Dog = 0.00146
E 3 to E = 0.00146
(_Note:_ the following expressions for this tree are not unique.)
Using node 1 as the base node, this can be written as:
(A:2.3,C:0.17e+1,(B:7.138e-1,(Dog:0.146e-2,E:1.46E-03):0.71234):0.9862) (1)
The program can be told to convert the unrooted form of the tree to a
rooted form. A node can be specified as the root by placing an 'r'
after the branch length of any branch leading away from that node
('away' being defined with respect to the base node). So, placing the
root in the branch from 1 to A in the example above and at a distance
2.0 from A gives:
(A:2.0r,(C:0.17e+1,(B:7.138e-1,(Dog:0.146e-2,E:1.46E-03):0.71234):0.9862):0.3) (2)
which we notice is now a rooted, molecular clock-like phylogeny (the
same total distance from the root to each leaf).
Alternatively, it is possible to specify simply which branch contains
the root node. The branch is indicated by placing an 'R' after its
length. This branch _must_ be one the branches connected to the base
node. For example, returning to the unrooted form of the example
above and specifying that the root is on the branch from A to 1:
(A:2.3R,C:0.17e+1,(B:7.138e-1,(Dog:0.146e-2,E:1.46E-03):0.71234):0.9862) (3)
If the root is to be placed instead on the branch from 2 to 3 in the
above example, we have to rewrite the tree so that this branch leads
from the new base node:
((A:2.3,C:0.17e+1):0.9862R,B:7.138e-1,(Dog:0.146e-2,E:1.46E-02):0.71234) (4)
If rooted trees are being used to study information scores for
molecular clock-like trees, it is the _user's_ responsibility to
ensure that the branch lengths given and the specified rooting do
indeed result in a clock-like tree. Beware of interactive option 3 in
particular, take care with option 4, and use the -t option to verify
you have got what you want! If more than one root is specified, the
behaviour of EDIBLE is undefined.
* Parameter specification
Under the JC69 model and using unrooted trees, the parameters
regarding which information can be calculated according to the method
of Goldman (1998) are simply the branch lengths of the tree. If the
information measure required is the determinant of the information
matrix for all branches, no further (explicit) parameter specification
is necessary (and the -d option is invoked automatically). If only
certain branches are to be used in a determinant calculation (option
-d required), or if information scores for an individual branch (or
branches) are required (instead of the determinant form), the branch
or branches in question are indicated by using an asterisk instead of
a colon preceding their length. For example, if we want to compute
the information regarding only the branch from 1 to A in tree (1)
above:
(A*2.3,C:0.17e+1,(B:7.138e-1,(Dog:0.146e-2,E:1.46E-03):0.71234):0.9862) (5)
or, if we want information regarding the branch from 3 to E in (1):
(A:2.3,C:0.17e+1,(B:7.138e-1,(Dog:0.146e-2,E*1.46E-03):0.71234):0.9862) (6)
When using rooted trees, the parameters change and are now the
positions of nodes within the tree (i.e. times before present---see
Goldman 1998). The same general rules apply regarding selection of
parameters and the choice between the determinant and individual
parameter information measures. Nodes regarding which information is
required are still indicated by asterisks: if a branch length is
marked by an asterisk, the node which is at the end of that branch
_towards_ the base node of the tree is selected. For example, if we
want to compute the information regarding the node at which Dog and E
meet in tree (2) above:
(A:2.0r,(C:0.17e+1,(B:7.138e-1,(Dog*0.146e-2,E:1.46E-03):0.71234):0.9862):0.3) (7)
If information about the length leading to the root node is desired,
this may be indicated by an asterisk after the outer-most closing
bracket. The result of placing an asterisk here in the unrooted case
is undefined.
If no parameters are explicitly specified, information scores are
computed for all parameters and the determinant measure ( |E(I)| ) of
overall information is reported (option -d invoked automatically). If
one or more parameters (indexed 'i') are explicitly specified, then
either the information scores for each ( E(I_{ii}) ), irrespective of
other parameters, are reported (default behaviour), or else the
determinant measure for these parameters is reported (if option -d is
specified). If exactly one parameter is explicitly specified, then
the determinant measure is equivalent to the diagonal element measure.
In this version of EDIBLE, when using the HKY85 model (-h option) the
information regarding the instantaneous transition/transversion rate
ratio kappa can also be computed. (This can be controlled with the -k
option.) Information regarding kappa is treated separately from other
parameters (as there are no natural or meaningful units for comparing
substitution rate ratios with evolutionary times), and is always
reported in the form E(I_{kappa,kappa}) and never used in determinant
calculations.
Portability considerations and other features
---------------------------------------------
For exact calculations (default information calculations; not using
the sampling options -s or -b), tree size is limited by machine
integer size. Most current machines have 31-bit (signed) integers
(let's see how quickly this looks dated!). For exact calculations we
need to differentiate the probability of every possible nucleotide
pattern, and so EDIBLE is limited to 15-leaf trees on most machines.
This rises to 31 leaves on platforms such as the DEC Alpha and most of
the next generation of microprocessors. EDIBLE's implementation of
Monte Carlo estimation of information values does not have these
limitations.
On most machines memory will not be a concern: EDIBLE requires less
than 1Mb for a 13-leaf tree.
Example files
-------------
This distribution of EDIBLE contains four example base tree files
(mtdna.base.tree, mtdnaEx4.base.tree, psieta.base.tree,
psieta2.base.tree) and four corresponding output files (mtdna.out,
mtdnaEx4.out, psieta.out, psieta2.out). These base trees were used in
the preparation of figures 2c and 3b of Goldman (1998).
* mtdna.base.tree
This is the 5-taxon tree of Goldman (1998: fig. 2a), used to calculate
the 'baseline' per-site information score relating to the node 'G'
where gorilla joins that tree. Using the command line:
| edible mtdna.base.tree Test1.out
and selecting interactive option 1 should result in an output file
Test1.out which ought to match mtdna.out. This output should contain
the information score of approximately 80.89 referred to by Goldman
(1998:1783).
* mtdnaEx4.base.tree
This is a 6-taxon tree used to generate part of figure 2c of Goldman
(1998). Using the command line:
| edible mtdnaEx4.base.tree Test2.out
and then selecting interactive option 4 followed by: (enter only the
numbers; the remainder of each line given here is a comment)
| 5 # move branch 5 (leads to Extra4)...
| 6 # ...along branch 6 (connects the branches leading to Extra4 and orang)...
| 4 # ...and away from branch 4 (connects the branches leading to gorilla and Extra4)...
| 0 # ...by a distance between 0...
| 0.0412874 # ...and 0.0412874 (inclusive)...
| 99 # ...in 99 (equal) steps (i.e. 0, 0.0004213, 0.0008426, ..., 0.0412874)
should result in an output file Test2.out which ought to match
mtdnaEx4.out. The resulting per-site information scores, when
multiplied by 895 to convert E(I_G) into E^{tot}(I_G), give the values
plotted in section 4 of figure 2c of Goldman (1998).
* psieta.base.tree
This is the 6-taxon tree of Goldman (1998: fig. 3a). No parameters
are explicitly specified, so option -d is automatically invoked and
the determinant information measure calculated for all parameters.
The command line:
| edible psieta.base.tree Test3.out
followed by interactive option 2 and then:
| 0.1 # scale entire tree by a factor ranging from 0.1...
| 10 # ...to 10...
| 100 # ...in 100 (equal) steps (i.e. 0.1, 0.2, ..., 10)
should result in an output file Test3.out which ought to match
psieta.out. The information scores contained in psieta.out, raised to
the power 1/9, are the values of |E(I)|^{1/9} plotted for rate factor
mu <= 10 in figure 3b of Goldman (1998).
* psieta2.base.tree
Figure 3b of Goldman (1998) also shows E(I_h) and E(I_{sm+rm}). To
calculate these values, psieta2.base.tree specifies the appropriate
parameters using asterisks. Using the command line:
| edible psieta2.base.tree Test4.out
followed by interactive option 2 and then:
| 0.1 # scale entire tree by a factor ranging from 0.1...
| 10 # ...to 10...
| 100 # ...in 100 (equal) steps (i.e. 0.1, 0.2, ..., 10)
should result in an output file Test4.out which ought to match
psieta2.out. The information scores contained in psieta2.out are
exactly the values of E(I_h) and E(I_{sm+rm}) plotted for rate factor
mu <= 10 in figure 3b of Goldman (1998).
References
----------
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a
maximum likelihood approach. J. Mol. Evol. 17, 368--376.
Goldman, N. 1991. Statistical Estimation of Evolutionary Trees.
Ph.D. Thesis, Department of Zoology, University of Cambridge.
Goldman, N. 1998. Phylogenetic information and experimental design
in molecular systematics. Proc. R. Soc. Lond. B 265, 1779--1786.
Hasegawa, M., H. Kishino and T. Yano. 1985. Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol.
22, 160--174.
Jukes, T.H., and C.R. Cantor. 1969. Evolution of protein molecules.
Pp. 21--132 in H.N. Munro (ed.), Mammalian protein metabolism,
Vol. 3. Academic Press, New York.
Massingham, T., and N. Goldman. 2000. EDIBLE: experimental design
and information calculations in phylogenetics. Bioinformatics.