WARNING
- These are the steps that we have used to perform the SOWH test.
There is currently no automated "easy" was to perform
this test so you may have to manipulate very large data files
in order to perform the test. These notes describe the use of
the computer programs Seq-Gen (Rambaut and Grassly, 1997) and
PAUP* (Swofford, 1999), as in the analysis of the HIV-1 data set
in Goldman, Anderson, and Rodrigo (2000). Equivalent analyses
are possible using the evolver and baseml or codeml programs of
the PAML package (Yang, 1997).
Introduction to Programs
Seq-Gen (Sequence Generator) (Rambaut
and Grassly, 1997) produces Monte Carlo simulations of multiple
sequence alignments. Seq-Gen is a freeware program written by
Andrew Rambaut and Nick Grassly that can be downloaded at:
PAUP* (Swofford, 1999) is a package
of phylogenetic analyses designed for the inference of evolutionary
topologies. PAUP* is a program written by David Swofford that
can be ordered from:
PAML (Yang, 1997) is a package of phylogenetic
analyses that uses maximum likelihood methods to infer evolutionary
topologies and perform comparisons of models of sequence evolution
for DNA and protein sequences. PAML is a program written by Ziheng
Yang that can be freely downloaded from:
This manual is intended to give you
step-by-step instructions on how to do the SOWH test. These notes
should be read alongside our in depth discussion of likelihood-based
tests of topology and the SOWH test:
Goldman, N., Anderson, J. P., and Rodrigo, A. G. (2000) Likelihood-based
tests of topologies in phylogenetics. Systematic Biology 49:
652-670.
More information and sequence alignments may also be obtained
from:
The SOWH test is a likelihood-based
test used to compare tree topologies which are not specified a
priori. (i.e. The SOWH test can be used to test the maximum
likelihood tree topology and an alternative topology). The basic
principle behind the SOWH test is first getting a null distribution
for the difference in likelihood scores and then testing the observed
data against the null distribution. We use a parametric re-sampling/reanalysis
approach to produce the null distribution. This is accomplished
by simulating sequence alignments under the null hypothesis topology
with the program Seq-Gen and then calculating the likelihood scores
of the null and best topology given the simulated alignments.
Part 1: Instructions for the SOWH test
SECOND WARNING - The correct use
of the SOWH test with full optimization is computationally intense
and may be difficult to use on sample sets of 15 or greater taxa.
Numerous partial optimization procedures may also be performed
on data sets. A discussion of partial optimization procedures,
including speculation on its effects, can be found in Goldman,
Anderson, and Rodrigo (2000). One partial optimization version
of the SOWH test will be outlined and it may be useful for larger
data sets. The analysis time for these tests will vary depending
on the sample sets and the speed of the computer.
The following example data sets are presented in a format suitable
for PAUP*
For my examples I will use the following
observed data set:
#NEXUS
Begin data;
Dimensions ntax=6 nchar=50;
Format gap=- datatype=DNA matchchar=.
interleave;
Matrix
SEQ1 GAGGGTAGACGGAGGGAGATGGGTGCGAGACCATCAGTATAACGTGGTGG
SEQ2 GAGGCGAGAAGGAGAGAGATGGCTGCGAAAGCTACAGTATTACGTAGGCG
SEQ3 GAGGCTAGAAGGTGGGAGTTGAGTCCGAGAGCTACAGTATAAAGTGGTCG
SEQ4 GAGGCTAGACGGAGAGAGTTGGCTGCGAAAGCGACAATATTAAGTAGGCG
SEQ5 GAGCCTAGAACGTGAGAGATGGGTCCGAGAGCGTCAGTATTAAGTGGTGG
SEQ6 GAGGGTAGAAGGAAGGAGATGAGTGCAAGACCATCAGTATAACGTGGGGA
;
End;
The following Null hypothesis topology
(T1):
#NEXUS
begin trees;
translate
1
SEQ1,
2
SEQ2,
3
SEQ3,
4
SEQ4,
5
SEQ5,
6
SEQ6
;
tree T1 = [&U] (1,((((2,3),4),5),6));
end;
The initial analysis (not shown) of the
above sequence data using a GTR (a.k.a. REV) model of nucleotide
substitution with estimated substitution rates and an estimated
gamma distribution of rate heterogeneity, produced the following
ML tree topology (TML):
#NEXUS
begin trees;
translate
1
SEQ1,
2
SEQ2,
3
SEQ3,
4
SEQ4,
5
SEQ5,
6
SEQ6
;
tree TML = [&U] (1,((((2,4),3),5),6));
end;
Step 1: Calculate the test Statistic for the difference in log likelihoods.
Throughout these instructions I will assume that you are familiar with the program PAUP* and with phylogenetic analyses in general.
1) To make all of the following steps much easier, you should limit the names of your sequences to 9 characters, and use only letters and numbers in the names. (It may be a little work now but it will pay off in the end)
2) Using the program PAUP* we will estimate the free parameters for the null tree topology from the observed data. The free parameters that we used included the nucleotide frequencies, (_A, _C, _G, and _T), the rate matrix values (µa, µb, µc, µd, µe, and µf), and the shape parameter for the gamma rate heterogeneity (a). NOTE: The values for the rate matrix are scaled so that µf is 1.0
3) Given the observed data calculate the difference in log
likelihoods for T1 and TML
For this example: LML - L1 = (-207.18076)
- (-213.93310) = 6.75234
The Rate estimates and PAUP* settings for this data set are
shown below. Since PAUP* is estimating the rates on the fly, you
do not need to know the exact rates, but I thought they would
be useful to see when running through this example. This data
is copied from the PAUP* display buffer. NOTE:
Readers should remember that his data set is totally made-up and
the estimated rates may not be realistic.
ML settings:
Assumed nucleotide frequencies (empirical frequencies):
A = 0.30000
C = 0.12667
G = 0.40667
T = 0.16667
Number of substitution types = 6
Rates assumed to follow a gamma distribution with
shape parameter estimated via maximum likelihood
Settings for discrete gamma approximation:
Number of rate categories = 4
Average rate for each category represented
by mean
These settings correspond to the general-time-reversible
model (e.g., Yang, 1994) (with rate heterogeneity)
Substitution rate-matrix parameters estimated via
maximum likelihood
Number of distinct data patterns under this model
= 24
Molecular clock not enforced
Starting branch lengths obtained using Rogers-Swofford
approximation method
Convergence cutoffs: pass limit=20, delta ln L=1e-06
-ln L (unconstrained) = 138.63966
Tree number 1:
-Ln likelihood = 207.27984
Estimated R-matrix:
-
3.1094131 1.5751787
2.990047
3.1094131
-
3.5000516 4.7301092e-10
1.5751787
3.5000516 -
1
2.990047
4.7301092e-10 1
-
Corresponding Q-matrix:
-1.0826178
0.27818783 0.45244499 0.35198499
0.65886592 -1.6642
1.005334 5.5682316e-11
0.33377089 0.31313683
-0.76462661 0.11771888
0.63357298 4.231856e-11
0.28723407 -0.92080705
Estimated value of gamma shape parameter = 0.877220
RATES FOR THE NULL topOLOGY
ML settings:
Assumed nucleotide frequencies (empirical frequencies):
A = 0.30000
C = 0.12667
G = 0.40667
T = 0.16667
Number of substitution types = 6
Rates assumed to follow a gamma distribution with
shape parameter estimated via maximum likelihood
Settings for discrete gamma approximation:
Number of rate categories = 4
Average rate for each category represented
by mean
These settings correspond to the general-time-reversible
model (e.g., Yang, 1994) (with rate heterogeneity)
Substitution rate-matrix parameters estimated via
maximum likelihood
Number of distinct data patterns under this model
= 24
Molecular clock not enforced
Starting branch lengths obtained using Rogers-Swofford
approximation method
Convergence cutoffs: pass limit=20, delta ln L=1e-06
-ln L (unconstrained) = 138.63966
Tree number 1:
-Ln likelihood = 214.02264
Estimated R-matrix:
-
2.9508387 1.4961897
2.5025365
2.9508387
-
4.0185278 4.1054598e-11
1.4961897
4.0185278 -
1
2.5025365
4.1054598e-11 1
-
Corresponding Q-matrix:
-1.0073224 0.26906767
0.43800487 0.3002499
0.63726553 -1.813677
1.1764115 4.9256579e-12
0.32311835 0.36642325
-0.80951983 0.11997823
0.54044981 3.7435e-12
0.29274688 -0.83319669
Estimated value of gamma shape parameter = 0.603666
Step 2: Simulating Data Sets by Parametric Bootstrapping.
This step will use your null hypothesis
tree topology (T1) and the ML estimated free parameters
for T1 from the observed data set.
1) Given the observed data, import the null topology (T1) and estimate the free parameters. (Set the analysis to likelihood, and set the following likelihood settings - Substitution model to General time-reversible and estimate the rate matrix; Base frequencies to estimate; and for the Among-site rate variation, set variable sites to gamma distribution and estimate the shape parameter)
2) Open the file under Trees > Tree Scores > Likelihood and then select OK to estimate the free parameters. You will need these estimated numbers for the program Seq-Gen.
3) Save the null tree topology to a new file with branch lengths included and in PHYLIP format. This tree file will be used by the program Seq-Gen.
4) Using Seq-Gen, simulate the required number of data sets (e.g. 100 to 1000) using the topology T1 that you just got from PAUP*. The settings for Seq-Gen must match the null hypothesis output from PAUP* (i.e. the same base frequencies, nucleotide substitution rates, alpha, ect. ). For my example data set, the argument for Seq-Gen should look something like:
-m REV -l 50 -n 100 -a 0. 603666 -f 0.3,0.12667,0.40667,0.16667
-r 2.9508387,1.4961897,2.5025365,4.0185278,0,1
5) You now should have a file that contains the multiple simulation
sets from Seq-Gen. For my example I got the following file (only
two of 100 data sets shown):
6 50
SEQ1 ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC
SEQ2 GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC
SEQ3 CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC
SEQ4 GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC
SEQ5 TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC
SEQ6 TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC
6 50
SEQ1 AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA
SEQ2 AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA
SEQ3 CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT
SEQ4 TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT
SEQ5 AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA
SEQ6 GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC
Step 3: Changing the Seq-Gen output files to PAUP* files.
This step will require you to find and replace sections of the Seq-Gen output file so that it will run on PAUP*.
1) Open the Seq-Gen output file in a text editor (I suggest using BBEdit - this programs does not place returns on long lines and has a great find and replace feature)
2) Below will be a portion of a PAUP block that you will need to edit slightly. You will need to change the number of taxa (ntaxa=6) to reflect the number you are working with, you will need to change the sequence length (nchar=50) to your sequence length, and you may need to change the name of your null topology to reflect the null topology that you have saved in nexus format (gettrees file = T1;). NOTE: The PAUP block that is used should create the same settings that were used to analyze the observed (real) data.
;
end;
begin paup;
[! Calculating Likelihood Scores
]
set criterion = likelihood;
Lset
NST = 6 RMatrix = estimate
Rates
= gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;
Begin data;
Dimensions
ntax=6 nchar=50;
Format
datatype=nucleotide gap=- missing=? matchchar=.;
Matrix
3) Replace the numbers that correspond to the number of sequences and the length of the sequences (e.g. 6 50 in my example) with the entire block that you have edited above. This should begin with the semicolon ì;î and end with ìMatrixî. (You will need to replace every occurrence of ì6 50î in the data set (e.g. 100 to 1000 times).
4) You will now need to make a few small changes to the beginning and the end of the file. (I suggest saving this file under a new name before you make any more changes)
5) Delete the beginning lines of the file up to but not including ìBegin Dataî and add ì#NEXUSî to the first line of the file.
6) At the very end of the file you will also need to paste the block from above and then remove the section beginning with ìBegin Dataî and ending with ìMatrixî. Now you should have a true PAUP* Block that looks something like the following:
#NEXUS
Begin data;
Dimensions ntax=6 nchar=50;
Format datatype=nucleotide gap=-
missing=? matchchar=.;
Matrix
SEQ1 ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC
SEQ2 GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC
SEQ3 CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC
SEQ4 GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC
SEQ5 TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC
SEQ6 TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC
;
end;
begin paup;
[! Calculating Likelihood Scores
]
set criterion = likelihood;
Lset NST = 6 RMatrix = estimate
Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;
Begin data;
Dimensions ntax=6 nchar=50;
Format datatype=nucleotide gap=-
missing=? matchchar=.;
Matrix
SEQ1
AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA
SEQ2 AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA
SEQ3 CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT
SEQ4 TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT
SEQ5 AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA
SEQ6 GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC
begin paup;
[! Calculating Likelihood Scores
]
set criterion = likelihood;
Lset NST = 6 RMatrix = estimate
Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;
7) Save your new PAUP* file in the same folder that you have
your null hypothesis tree topology in Nexus format (T1).
Step 4: Run the PAUP* Block and analyze
the results.
1) You can now open your new PAUP* file and execute it. It will not give you any feedback on how it is going because you have turned the status off. (This is needed so that it will run all of the files without you having to press the return key after every completed run - how would you like to have to hit the return 1000 times.)
2) You can check up on the program by opening the files named "TML_Score" or "T1_Score". The program will finish with n number of lines in the file; n being the number of simulation sets you specified in the Seq-Gen program.
3) When the block is finished and you have all of the data, you will need to get the distribution of likelihood differences. I used Microsoft Excel to open the files "TML_Score" and "T1_Score" and then calculated the difference (LML - L1) for each sample set. (NOTE: the likelihood scores in each of these files are in order so you need to calculate the difference using the order that they are in)
4) Finally, test whether the difference in log likelihoods
from the observed data (calculated in Step 1) falls below the
95% point of the ranked list of differences calculated from the
simulated data sets. (This corresponds to a one tailed test at
a significance level of 5%).
PART 2: Partial Optimization of the SOWH
Test
The only real difference between the
full optimization and the partial optimization is that you will
not be estimating the free parameters while you are searching
for the best tree. Instead, you will be using the free parameters
estimated from the null topology when searching for the best topology.
If you run this partial optimization program and you do not reject the
null topology then you would have gotten the same result using
the full optimization version. But if do reject the null
topology then it is uncertain if you would have rejected it using
the full optimization version.
This partial optimization method is
further discussed as the second of two partial optimization "posPpud";
tests in Goldman et al. (2000).
In this partial optimization method, all of the parameters
of the substitution model are fixed at their values under T1 and
are not re-optimised as in the full optimization SOWH method.
The maximum likelihood topology is determined using these fixed
rates and the branch lengths of the ML topology are optimised.
Thus, much computer time is saved by not simultaneously searching
for the ML topology and estimating the free parameters.
1) To run the partial optimization version, you will just need to replace the partial paup Block in Step 3 number 2 with the partial block below:
;
end;
begin paup;
[! Calculating Likelihood Scores
]
set criterion = likelihood;
Lset
NST = 6 RMatrix = estimate
Rates
= gamma shape = estimate;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
set criterion = likelihood;
Lset
NST = 6 RMatrix = previous
Rates
= gamma shape = previous;
HSearch start = stepwise nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
Begin data;
Dimensions
ntax=6 nchar=50;
Format
datatype=nucleotide gap=- missing=? matchchar=.;
Matrix
REFERENCES.
Rambaut, A. and N. C. Grassly (1997). "Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees." Comput. Appl. Biosci. 13(3): 235-238.
Swofford, D. L. (1999). PAUP 4.0: Phylogenetic Analysis Using Parsimony (And Other Methods). Sunderland, MA, Sinauer Associates, Inc.
Yang, Z. (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood." CABIOS 13: 555-556.