Guidelines for performing the SOWH Test

Jon Anderson¹, Nick Goldman², and Allen Rodrigo³

¹Department of Molecular Biotechnology, University of Washington, Seattle, USA

²Department of Zoology, University of Cambridge, Cambridge, UK

³School of Biological Sciences, University of Auckland, Auckland, New Zealand

WARNING - These are the steps that we have used to perform the SOWH test. There is currently no automated "easy" was to perform this test so you may have to manipulate very large data files in order to perform the test. These notes describe the use of the computer programs Seq-Gen (Rambaut and Grassly, 1997) and PAUP* (Swofford, 1999), as in the analysis of the HIV-1 data set in Goldman, Anderson, and Rodrigo (2000). Equivalent analyses are possible using the evolver and baseml or codeml programs of the PAML package (Yang, 1997).

Introduction to Programs

Seq-Gen (Sequence Generator) (Rambaut and Grassly, 1997) produces Monte Carlo simulations of multiple sequence alignments. Seq-Gen is a freeware program written by Andrew Rambaut and Nick Grassly that can be downloaded at:

http://evolve.zps.ox.ac.uk/Seq-Gen/Seq-Gen.html

PAUP* (Swofford, 1999) is a package of phylogenetic analyses designed for the inference of evolutionary topologies. PAUP* is a program written by David Swofford that can be ordered from:

http://www.lms.si.edu/PAUP/

PAML (Yang, 1997) is a package of phylogenetic analyses that uses maximum likelihood methods to infer evolutionary topologies and perform comparisons of models of sequence evolution for DNA and protein sequences. PAML is a program written by Ziheng Yang that can be freely downloaded from:

http://abacus.gene.ucl.ac.uk/software/paml.html

Introduction to SOWH

This manual is intended to give you step-by-step instructions on how to do the SOWH test. These notes should be read alongside our in depth discussion of likelihood-based tests of topology and the SOWH test:

Goldman, N., Anderson, J. P., and Rodrigo, A. G. (2000) Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49: 652-670.

More information and sequence alignments may also be obtained from:

http://www.ebi.ac.uk/goldman/tests/

The SOWH test is a likelihood-based test used to compare tree topologies which are not specified a priori. (i.e. The SOWH test can be used to test the maximum likelihood tree topology and an alternative topology). The basic principle behind the SOWH test is first getting a null distribution for the difference in likelihood scores and then testing the observed data against the null distribution. We use a parametric re-sampling/reanalysis approach to produce the null distribution. This is accomplished by simulating sequence alignments under the null hypothesis topology with the program Seq-Gen and then calculating the likelihood scores of the null and best topology given the simulated alignments.

Part 1: Instructions for the SOWH test

SECOND WARNING - The correct use of the SOWH test with full optimization is computationally intense and may be difficult to use on sample sets of 15 or greater taxa. Numerous partial optimization procedures may also be performed on data sets. A discussion of partial optimization procedures, including speculation on its effects, can be found in Goldman, Anderson, and Rodrigo (2000). One partial optimization version of the SOWH test will be outlined and it may be useful for larger data sets. The analysis time for these tests will vary depending on the sample sets and the speed of the computer.

The following example data sets are presented in a format suitable for PAUP*

For my examples I will use the following observed data set:

#NEXUS

Begin data; Dimensions ntax=6 nchar=50; Format gap=- datatype=DNA matchchar=. interleave; Matrix

SEQ1 GAGGGTAGACGGAGGGAGATGGGTGCGAGACCATCAGTATAACGTGGTGG SEQ2 GAGGCGAGAAGGAGAGAGATGGCTGCGAAAGCTACAGTATTACGTAGGCG SEQ3 GAGGCTAGAAGGTGGGAGTTGAGTCCGAGAGCTACAGTATAAAGTGGTCG SEQ4 GAGGCTAGACGGAGAGAGTTGGCTGCGAAAGCGACAATATTAAGTAGGCG SEQ5 GAGCCTAGAACGTGAGAGATGGGTCCGAGAGCGTCAGTATTAAGTGGTGG SEQ6 GAGGGTAGAAGGAAGGAGATGAGTGCAAGACCATCAGTATAACGTGGGGA ; End;

The following Null hypothesis topology (T1):

#NEXUS

begin trees;

translate 1 SEQ1, 2 SEQ2, 3 SEQ3, 4 SEQ4, 5 SEQ5, 6 SEQ6 ; tree T1 = [&U] (1,((((2,3),4),5),6)); end;

The initial analysis (not shown) of the above sequence data using a GTR (a.k.a. REV) model of nucleotide substitution with estimated substitution rates and an estimated gamma distribution of rate heterogeneity, produced the following ML tree topology (TML):

#NEXUS

begin trees; translate 1 SEQ1, 2 SEQ2, 3 SEQ3, 4 SEQ4, 5 SEQ5, 6 SEQ6 ; tree TML = [&U] (1,((((2,4),3),5),6)); end;

Step 1: Calculate the test Statistic for the difference in log likelihoods.

Throughout these instructions I will assume that you are familiar with the program PAUP* and with phylogenetic analyses in general.

1) To make all of the following steps much easier, you should limit the names of your sequences to 9 characters, and use only letters and numbers in the names. (It may be a little work now but it will pay off in the end)

2) Using the program PAUP* we will estimate the free parameters for the null tree topology from the observed data. The free parameters that we used included the nucleotide frequencies, (__A, __C, __G, and __T), the rate matrix values (µa, µb, µc, µd, µe, and µf), and the shape parameter for the gamma rate heterogeneity (a). NOTE: The values for the rate matrix are scaled so that µf is 1.0

3) Given the observed data calculate the difference in log likelihoods for T₁ and T_ML

For this example: L_ML - L₁ = (-207.18076) - (-213.93310) = 6.75234

The Rate estimates and PAUP* settings for this data set are shown below. Since PAUP* is estimating the rates on the fly, you do not need to know the exact rates, but I thought they would be useful to see when running through this example. This data is copied from the PAUP* display buffer. NOTE: Readers should remember that his data set is totally made-up and the estimated rates may not be realistic.

RATES FOR THE ML topOLOGY

ML settings: Assumed nucleotide frequencies (empirical frequencies): A = 0.30000 C = 0.12667 G = 0.40667 T = 0.16667 Number of substitution types = 6 Rates assumed to follow a gamma distribution with shape parameter estimated via maximum likelihood Settings for discrete gamma approximation: Number of rate categories = 4 Average rate for each category represented by mean These settings correspond to the general-time-reversible model (e.g., Yang, 1994) (with rate heterogeneity) Substitution rate-matrix parameters estimated via maximum likelihood Number of distinct data patterns under this model = 24 Molecular clock not enforced Starting branch lengths obtained using Rogers-Swofford approximation method Convergence cutoffs: pass limit=20, delta ln L=1e-06 -ln L (unconstrained) = 138.63966 Tree number 1: -Ln likelihood = 207.27984 Estimated R-matrix: - 3.1094131 1.5751787 2.990047 3.1094131 - 3.5000516 4.7301092e-10 1.5751787 3.5000516 - 1 2.990047 4.7301092e-10 1 - Corresponding Q-matrix: -1.0826178 0.27818783 0.45244499 0.35198499 0.65886592 -1.6642 1.005334 5.5682316e-11 0.33377089 0.31313683 -0.76462661 0.11771888 0.63357298 4.231856e-11 0.28723407 -0.92080705 Estimated value of gamma shape parameter = 0.877220

RATES FOR THE NULL topOLOGY

ML settings: Assumed nucleotide frequencies (empirical frequencies): A = 0.30000 C = 0.12667 G = 0.40667 T = 0.16667 Number of substitution types = 6 Rates assumed to follow a gamma distribution with shape parameter estimated via maximum likelihood Settings for discrete gamma approximation: Number of rate categories = 4 Average rate for each category represented by mean These settings correspond to the general-time-reversible model (e.g., Yang, 1994) (with rate heterogeneity) Substitution rate-matrix parameters estimated via maximum likelihood Number of distinct data patterns under this model = 24 Molecular clock not enforced Starting branch lengths obtained using Rogers-Swofford approximation method Convergence cutoffs: pass limit=20, delta ln L=1e-06 -ln L (unconstrained) = 138.63966 Tree number 1: -Ln likelihood = 214.02264 Estimated R-matrix: - 2.9508387 1.4961897 2.5025365 2.9508387 - 4.0185278 4.1054598e-11 1.4961897 4.0185278 - 1 2.5025365 4.1054598e-11 1 - Corresponding Q-matrix: -1.0073224 0.26906767 0.43800487 0.3002499 0.63726553 -1.813677 1.1764115 4.9256579e-12 0.32311835 0.36642325 -0.80951983 0.11997823 0.54044981 3.7435e-12 0.29274688 -0.83319669 Estimated value of gamma shape parameter = 0.603666

Step 2: Simulating Data Sets by Parametric Bootstrapping.

This step will use your null hypothesis tree topology (T₁) and the ML estimated free parameters for T₁ from the observed data set.

1) Given the observed data, import the null topology (T1) and estimate the free parameters. (Set the analysis to likelihood, and set the following likelihood settings - Substitution model to General time-reversible and estimate the rate matrix; Base frequencies to estimate; and for the Among-site rate variation, set variable sites to gamma distribution and estimate the shape parameter)

2) Open the file under Trees > Tree Scores > Likelihood and then select OK to estimate the free parameters. You will need these estimated numbers for the program Seq-Gen.

3) Save the null tree topology to a new file with branch lengths included and in PHYLIP format. This tree file will be used by the program Seq-Gen.

4) Using Seq-Gen, simulate the required number of data sets (e.g. 100 to 1000) using the topology T1 that you just got from PAUP*. The settings for Seq-Gen must match the null hypothesis output from PAUP* (i.e. the same base frequencies, nucleotide substitution rates, alpha, ect. ). For my example data set, the argument for Seq-Gen should look something like:

-m REV -l 50 -n 100 -a 0. 603666 -f 0.3,0.12667,0.40667,0.16667 -r 2.9508387,1.4961897,2.5025365,4.0185278,0,1

5) You now should have a file that contains the multiple simulation sets from Seq-Gen. For my example I got the following file (only two of 100 data sets shown):

6 50 SEQ1 ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC SEQ2 GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC SEQ3 CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC SEQ4 GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC SEQ5 TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC SEQ6 TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC 6 50 SEQ1 AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA SEQ2 AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA SEQ3 CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT SEQ4 TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT SEQ5 AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA SEQ6 GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC

Step 3: Changing the Seq-Gen output files to PAUP* files.

This step will require you to find and replace sections of the Seq-Gen output file so that it will run on PAUP*.

1) Open the Seq-Gen output file in a text editor (I suggest using BBEdit - this programs does not place returns on long lines and has a great find and replace feature)

2) Below will be a portion of a PAUP block that you will need to edit slightly. You will need to change the number of taxa (ntaxa=6) to reflect the number you are working with, you will need to change the sequence length (nchar=50) to your sequence length, and you may need to change the name of your null topology to reflect the null topology that you have saved in nexus format (gettrees file = T1;). NOTE: The PAUP block that is used should create the same settings that were used to analyze the observed (real) data.

;
end;

begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
Lset NST = 6 RMatrix = estimate
Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;

Begin data;
           Dimensions ntax=6 nchar=50;
           Format datatype=nucleotide gap=- missing=? matchchar=.;
           Matrix

3) Replace the numbers that correspond to the number of sequences and the length of the sequences (e.g. 6 50 in my example) with the entire block that you have edited above. This should begin with the semicolon ì;î and end with ìMatrixî. (You will need to replace every occurrence of ì6 50î in the data set (e.g. 100 to 1000 times).

4) You will now need to make a few small changes to the beginning and the end of the file. (I suggest saving this file under a new name before you make any more changes)

5) Delete the beginning lines of the file up to but not including ìBegin Dataî and add ì#NEXUSî to the first line of the file.

6) At the very end of the file you will also need to paste the block from above and then remove the section beginning with ìBegin Dataî and ending with ìMatrixî. Now you should have a true PAUP* Block that looks something like the following:

#NEXUS

Begin data; Dimensions ntax=6 nchar=50; Format datatype=nucleotide gap=- missing=? matchchar=.; Matrix SEQ1 ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC SEQ2 GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC SEQ3 CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC SEQ4 GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC SEQ5 TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC SEQ6 TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC ; end;

begin paup; [! • • • Calculating Likelihood Scores • • •] set criterion = likelihood; Lset NST = 6 RMatrix = estimate Rates = gamma shape = estimate; HSearch start = NJ nbest = 1 status = no; savetrees file = TML append = yes format = phylip; Lscore / scorefile = TML_Score append = yes; gettrees file=T1; Lscore / scorefile = T1_Score append = yes; end;

Begin data; Dimensions ntax=6 nchar=50; Format datatype=nucleotide gap=- missing=? matchchar=.; Matrix

SEQ1 AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA SEQ2 AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA SEQ3 CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT SEQ4 TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT SEQ5 AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA SEQ6 GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC begin paup; [! • • • Calculating Likelihood Scores • • •] set criterion = likelihood; Lset NST = 6 RMatrix = estimate Rates = gamma shape = estimate; HSearch start = NJ nbest = 1 status = no; savetrees file = TML append = yes format = phylip; Lscore / scorefile = TML_Score append = yes; gettrees file=T1; Lscore / scorefile = T1_Score append = yes; end;

7) Save your new PAUP* file in the same folder that you have your null hypothesis tree topology in Nexus format (T1).

Step 4: Run the PAUP* Block and analyze the results.

1) You can now open your new PAUP* file and execute it. It will not give you any feedback on how it is going because you have turned the status off. (This is needed so that it will run all of the files without you having to press the return key after every completed run - how would you like to have to hit the return 1000 times.)

2) You can check up on the program by opening the files named "TML_Score" or "T1_Score". The program will finish with n number of lines in the file; n being the number of simulation sets you specified in the Seq-Gen program.

3) When the block is finished and you have all of the data, you will need to get the distribution of likelihood differences. I used Microsoft Excel to open the files "TML_Score" and "T1_Score" and then calculated the difference (L_ML - L₁) for each sample set. (NOTE: the likelihood scores in each of these files are in order so you need to calculate the difference using the order that they are in)

4) Finally, test whether the difference in log likelihoods from the observed data (calculated in Step 1) falls below the 95% point of the ranked list of differences calculated from the simulated data sets. (This corresponds to a one tailed test at a significance level of 5%).

PART 2: Partial Optimization of the SOWH Test

The only real difference between the full optimization and the partial optimization is that you will not be estimating the free parameters while you are searching for the best tree. Instead, you will be using the free parameters estimated from the null topology when searching for the best topology. If you run this partial optimization program and you do not reject the null topology then you would have gotten the same result using the full optimization version. But if do reject the null topology then it is uncertain if you would have rejected it using the full optimization version.

This partial optimization method is further discussed as the second of two partial optimization "posPpud"; tests in Goldman et al. (2000).

In this partial optimization method, all of the parameters of the substitution model are fixed at their values under T1 and are not re-optimised as in the full optimization SOWH method. The maximum likelihood topology is determined using these fixed rates and the branch lengths of the ML topology are optimised. Thus, much computer time is saved by not simultaneously searching for the ML topology and estimating the free parameters.

1) To run the partial optimization version, you will just need to replace the partial paup Block in Step 3 number 2 with the partial block below:

;
end;

begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
           Lset NST = 6 RMatrix = estimate
           Rates = gamma shape = estimate;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
set criterion = likelihood;
           Lset NST = 6 RMatrix = previous
           Rates = gamma shape = previous;
HSearch start = stepwise nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;

Begin data;
           Dimensions ntax=6 nchar=50;
           Format datatype=nucleotide gap=- missing=? matchchar=.;
           Matrix

REFERENCES.

Rambaut, A. and N. C. Grassly (1997). "Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees." Comput. Appl. Biosci. 13(3): 235-238.

Swofford, D. L. (1999). PAUP 4.0: Phylogenetic Analysis Using Parsimony (And Other Methods). Sunderland, MA, Sinauer Associates, Inc.

Yang, Z. (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood." CABIOS 13: 555-556.