Guidelines for performing the SOWH Test
    

Jon Anderson1, Nick Goldman2, and Allen Rodrigo3

1Department of Molecular Biotechnology, University of Washington, Seattle, USA

2Department of Zoology, University of Cambridge, Cambridge, UK

3School of Biological Sciences, University of Auckland, Auckland, New Zealand

WARNING - These are the steps that we have used to perform the SOWH test. There is currently no automated "easy" was to perform this test so you may have to manipulate very large data files in order to perform the test. These notes describe the use of the computer programs Seq-Gen (Rambaut and Grassly, 1997) and PAUP* (Swofford, 1999), as in the analysis of the HIV-1 data set in Goldman, Anderson, and Rodrigo (2000). Equivalent analyses are possible using the evolver and baseml or codeml programs of the PAML package (Yang, 1997).

Introduction to Programs

    Seq-Gen (Sequence Generator) (Rambaut and Grassly, 1997) produces Monte Carlo simulations of multiple sequence alignments. Seq-Gen is a freeware program written by Andrew Rambaut and Nick Grassly that can be downloaded at:

http://evolve.zps.ox.ac.uk/Seq-Gen/Seq-Gen.html

    PAUP* (Swofford, 1999) is a package of phylogenetic analyses designed for the inference of evolutionary topologies. PAUP* is a program written by David Swofford that can be ordered from:

http://www.lms.si.edu/PAUP/

    PAML (Yang, 1997) is a package of phylogenetic analyses that uses maximum likelihood methods to infer evolutionary topologies and perform comparisons of models of sequence evolution for DNA and protein sequences. PAML is a program written by Ziheng Yang that can be freely downloaded from:

http://abacus.gene.ucl.ac.uk/software/paml.html

Introduction to SOWH

    This manual is intended to give you step-by-step instructions on how to do the SOWH test. These notes should be read alongside our in depth discussion of likelihood-based tests of topology and the SOWH test:

Goldman, N., Anderson, J. P., and Rodrigo, A. G. (2000) Likelihood-based tests of topologies in phylogenetics. Systematic Biology 49: 652-670.

More information and sequence alignments may also be obtained from:

http://www.ebi.ac.uk/goldman/tests/

    The SOWH test is a likelihood-based test used to compare tree topologies which are not specified a priori. (i.e. The SOWH test can be used to test the maximum likelihood tree topology and an alternative topology). The basic principle behind the SOWH test is first getting a null distribution for the difference in likelihood scores and then testing the observed data against the null distribution. We use a parametric re-sampling/reanalysis approach to produce the null distribution. This is accomplished by simulating sequence alignments under the null hypothesis topology with the program Seq-Gen and then calculating the likelihood scores of the null and best topology given the simulated alignments.

Part 1: Instructions for the SOWH test

SECOND WARNING - The correct use of the SOWH test with full optimization is computationally intense and may be difficult to use on sample sets of 15 or greater taxa. Numerous partial optimization procedures may also be performed on data sets. A discussion of partial optimization procedures, including speculation on its effects, can be found in Goldman, Anderson, and Rodrigo (2000). One partial optimization version of the SOWH test will be outlined and it may be useful for larger data sets. The analysis time for these tests will vary depending on the sample sets and the speed of the computer.

The following example data sets are presented in a format suitable for PAUP*

For my examples I will use the following observed data set:

#NEXUS

Begin data;
     Dimensions ntax=6 nchar=50;
     Format gap=- datatype=DNA matchchar=. interleave;
     Matrix

SEQ1   GAGGGTAGACGGAGGGAGATGGGTGCGAGACCATCAGTATAACGTGGTGG
SEQ2   GAGGCGAGAAGGAGAGAGATGGCTGCGAAAGCTACAGTATTACGTAGGCG
SEQ3   GAGGCTAGAAGGTGGGAGTTGAGTCCGAGAGCTACAGTATAAAGTGGTCG
SEQ4   GAGGCTAGACGGAGAGAGTTGGCTGCGAAAGCGACAATATTAAGTAGGCG
SEQ5   GAGCCTAGAACGTGAGAGATGGGTCCGAGAGCGTCAGTATTAAGTGGTGG
SEQ6   GAGGGTAGAAGGAAGGAGATGAGTGCAAGACCATCAGTATAACGTGGGGA
     ;
End;

The following Null hypothesis topology (T1):

#NEXUS

begin trees;

     translate
          1 SEQ1,
          2 SEQ2,
          3 SEQ3,
          4 SEQ4,
          5 SEQ5,
          6 SEQ6
          ;
tree T1 = [&U] (1,((((2,3),4),5),6));
end;

The initial analysis (not shown) of the above sequence data using a GTR (a.k.a. REV) model of nucleotide substitution with estimated substitution rates and an estimated gamma distribution of rate heterogeneity, produced the following ML tree topology (TML):

#NEXUS

begin trees;
     translate
          1 SEQ1,
          2 SEQ2,
          3 SEQ3,
          4 SEQ4,
          5 SEQ5,
          6 SEQ6
          ;
tree TML = [&U] (1,((((2,4),3),5),6));
end;

Step 1: Calculate the test Statistic for the difference in log likelihoods.

    Throughout these instructions I will assume that you are familiar with the program PAUP* and with phylogenetic analyses in general.

1) To make all of the following steps much easier, you should limit the names of your sequences to 9 characters, and use only letters and numbers in the names. (It may be a little work now but it will pay off in the end)

2) Using the program PAUP* we will estimate the free parameters for the null tree topology from the observed data. The free parameters that we used included the nucleotide frequencies, (_A, _C, _G, and _T), the rate matrix values (µa, µb, µc, µd, µe, and µf), and the shape parameter for the gamma rate heterogeneity (a). NOTE: The values for the rate matrix are scaled so that µf is 1.0

3) Given the observed data calculate the difference in log likelihoods for T1 and TML

For this example: LML - L1 = (-207.18076) - (-213.93310) = 6.75234

The Rate estimates and PAUP* settings for this data set are shown below. Since PAUP* is estimating the rates on the fly, you do not need to know the exact rates, but I thought they would be useful to see when running through this example. This data is copied from the PAUP* display buffer. NOTE: Readers should remember that his data set is totally made-up and the estimated rates may not be realistic.


RATES FOR THE ML topOLOGY

ML settings:
  Assumed nucleotide frequencies (empirical frequencies):
    A = 0.30000
    C = 0.12667
    G = 0.40667
    T = 0.16667
  Number of substitution types = 6
  Rates assumed to follow a gamma distribution with shape parameter estimated via maximum likelihood
  Settings for discrete gamma approximation:
    Number of rate categories = 4
    Average rate for each category represented by mean
  These settings correspond to the general-time-reversible model (e.g., Yang, 1994) (with rate heterogeneity)
  Substitution rate-matrix parameters estimated via maximum likelihood
  Number of distinct data patterns under this model = 24
  Molecular clock not enforced
  Starting branch lengths obtained using Rogers-Swofford approximation method
  Convergence cutoffs: pass limit=20, delta ln L=1e-06
  -ln L (unconstrained) = 138.63966

Tree number 1:
  -Ln likelihood = 207.27984
  Estimated R-matrix:
                 -     3.1094131     1.5751787      2.990047
         3.1094131             -     3.5000516 4.7301092e-10
         1.5751787     3.5000516             -             1
         2.990047  4.7301092e-10             1             -
  Corresponding Q-matrix:
        -1.0826178    0.27818783    0.45244499    0.35198499
        0.65886592       -1.6642      1.005334 5.5682316e-11
        0.33377089    0.31313683   -0.76462661    0.11771888
        0.63357298  4.231856e-11    0.28723407   -0.92080705
  Estimated value of gamma shape parameter = 0.877220


RATES FOR THE NULL topOLOGY

ML settings:
  Assumed nucleotide frequencies (empirical frequencies):
    A = 0.30000
    C = 0.12667
    G = 0.40667
    T = 0.16667
  Number of substitution types = 6
  Rates assumed to follow a gamma distribution with shape parameter estimated via maximum likelihood
  Settings for discrete gamma approximation:
    Number of rate categories = 4
    Average rate for each category represented by mean
  These settings correspond to the general-time-reversible model (e.g., Yang, 1994) (with rate heterogeneity)
  Substitution rate-matrix parameters estimated via maximum likelihood
  Number of distinct data patterns under this model = 24
  Molecular clock not enforced
  Starting branch lengths obtained using Rogers-Swofford approximation method
  Convergence cutoffs: pass limit=20, delta ln L=1e-06
  -ln L (unconstrained) = 138.63966

Tree number 1:
  -Ln likelihood = 214.02264
  Estimated R-matrix:
                 -     2.9508387     1.4961897     2.5025365
         2.9508387             -     4.0185278 4.1054598e-11
         1.4961897     4.0185278             -             1
         2.5025365 4.1054598e-11             1             -
  Corresponding Q-matrix:
        -1.0073224    0.26906767    0.43800487     0.3002499
        0.63726553     -1.813677     1.1764115 4.9256579e-12
        0.32311835    0.36642325   -0.80951983    0.11997823
        0.54044981    3.7435e-12    0.29274688   -0.83319669
  Estimated value of gamma shape parameter = 0.603666


Step 2: Simulating Data Sets by Parametric Bootstrapping.

    This step will use your null hypothesis tree topology (T1) and the ML estimated free parameters for T1 from the observed data set.

1) Given the observed data, import the null topology (T1) and estimate the free parameters. (Set the analysis to likelihood, and set the following likelihood settings - Substitution model to General time-reversible and estimate the rate matrix; Base frequencies to estimate; and for the Among-site rate variation, set variable sites to gamma distribution and estimate the shape parameter)

2) Open the file under Trees > Tree Scores > Likelihood and then select OK to estimate the free parameters. You will need these estimated numbers for the program Seq-Gen.

3) Save the null tree topology to a new file with branch lengths included and in PHYLIP format. This tree file will be used by the program Seq-Gen.

4) Using Seq-Gen, simulate the required number of data sets (e.g. 100 to 1000) using the topology T1 that you just got from PAUP*. The settings for Seq-Gen must match the null hypothesis output from PAUP* (i.e. the same base frequencies, nucleotide substitution rates, alpha, ect. ). For my example data set, the argument for Seq-Gen should look something like:

-m REV -l 50 -n 100 -a 0. 603666 -f 0.3,0.12667,0.40667,0.16667 -r 2.9508387,1.4961897,2.5025365,4.0185278,0,1

5) You now should have a file that contains the multiple simulation sets from Seq-Gen. For my example I got the following file (only two of 100 data sets shown):

 6 50
SEQ1 ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC
SEQ2 GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC
SEQ3 CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC
SEQ4 GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC
SEQ5 TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC
SEQ6 TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC
 6 50
SEQ1 AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA
SEQ2 AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA
SEQ3 CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT
SEQ4 TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT
SEQ5 AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA
SEQ6 GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC

Step 3: Changing the Seq-Gen output files to PAUP* files.

    This step will require you to find and replace sections of the Seq-Gen output file so that it will run on PAUP*.

1) Open the Seq-Gen output file in a text editor (I suggest using BBEdit - this programs does not place returns on long lines and has a great find and replace feature)

2) Below will be a portion of a PAUP block that you will need to edit slightly. You will need to change the number of taxa (ntaxa=6) to reflect the number you are working with, you will need to change the sequence length (nchar=50) to your sequence length, and you may need to change the name of your null topology to reflect the null topology that you have saved in nexus format (gettrees file = T1;). NOTE: The PAUP block that is used should create the same settings that were used to analyze the observed (real) data.


            ;
end;

begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
           Lset NST = 6 RMatrix = estimate
           Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;

Begin data;
           Dimensions ntax=6 nchar=50;
           Format datatype=nucleotide gap=- missing=? matchchar=.;
           Matrix


3) Replace the numbers that correspond to the number of sequences and the length of the sequences (e.g. 6 50 in my example) with the entire block that you have edited above. This should begin with the semicolon ì;î and end with ìMatrixî. (You will need to replace every occurrence of ì6 50î in the data set (e.g. 100 to 1000 times).

4) You will now need to make a few small changes to the beginning and the end of the file. (I suggest saving this file under a new name before you make any more changes)

5) Delete the beginning lines of the file up to but not including ìBegin Dataî and add ì#NEXUSî to the first line of the file.

6) At the very end of the file you will also need to paste the block from above and then remove the section beginning with ìBegin Dataî and ending with ìMatrixî. Now you should have a true PAUP* Block that looks something like the following:


#NEXUS

Begin data;
     Dimensions ntax=6 nchar=50;
     Format datatype=nucleotide gap=- missing=? matchchar=.;
     Matrix

SEQ1     ATACGCATTATTACGCAACTTATACGGAGCCCCGTTTCAAACCGTGTCCC
SEQ2     GTACGGATTATTAGAAAACTTTAGCGCAGCCCGGTTTCGAACCGTGTGTC
SEQ3     CTACGCATTACTATCACACTTTATCGCAGCCCGGTTTCCAACCGTGTCAC
SEQ4     GTACCCATTATTAGAACACTTTAACGCAGCCCGGTATCCAACCGTGTGGC
SEQ5     TTACGCATTATTAGGAAACTTTATGGAAGCCCGGTATCAGACCGTGTCCC
SEQ6     TTAAGCATTATTACCCGACTTATACGGAGCCCCGTTTCATACGGTGTCGC
     ;
end;

begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
     Lset NST = 6 RMatrix = estimate
     Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;

Begin data;
     Dimensions ntax=6 nchar=50;
     Format datatype=nucleotide gap=- missing=? matchchar=.;
     Matrix

SEQ1     AAACTTTCTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCA
SEQ2     AATTTTTGGTATCAAATGGACGAAGAGCGGACCCGCGCCTTGGCCAGCCA
SEQ3     CATCGTTGGTATCAAATAGACGAAGAGCCGACCCGCGCCTTGGCACGCCT
SEQ4     TATTTTTGGTATCAAATGGACGAAGACCGGACCCGCGCCTTGGCCAGCCT
SEQ5     AAATATTCGTATCAACTCGACGAAGAAGGGACCCGCCCCTTGGCCAACCA
SEQ6     GAACTTTGTTATCAACTGGACGAAGAAGTCACCCGCCCCTTGGCGAACCC
begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
     Lset NST = 6 RMatrix = estimate
     Rates = gamma shape = estimate;
HSearch start = NJ nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
end;


7) Save your new PAUP* file in the same folder that you have your null hypothesis tree topology in Nexus format (T1).

Step 4: Run the PAUP* Block and analyze the results.

1) You can now open your new PAUP* file and execute it. It will not give you any feedback on how it is going because you have turned the status off. (This is needed so that it will run all of the files without you having to press the return key after every completed run - how would you like to have to hit the return 1000 times.)

2) You can check up on the program by opening the files named "TML_Score" or "T1_Score". The program will finish with n number of lines in the file; n being the number of simulation sets you specified in the Seq-Gen program.

3) When the block is finished and you have all of the data, you will need to get the distribution of likelihood differences. I used Microsoft Excel to open the files "TML_Score" and "T1_Score" and then calculated the difference (LML - L1) for each sample set. (NOTE: the likelihood scores in each of these files are in order so you need to calculate the difference using the order that they are in)

4) Finally, test whether the difference in log likelihoods from the observed data (calculated in Step 1) falls below the 95% point of the ranked list of differences calculated from the simulated data sets. (This corresponds to a one tailed test at a significance level of 5%).



PART 2: Partial Optimization of the SOWH Test

    The only real difference between the full optimization and the partial optimization is that you will not be estimating the free parameters while you are searching for the best tree. Instead, you will be using the free parameters estimated from the null topology when searching for the best topology. If you run this partial optimization program and you do not reject the null topology then you would have gotten the same result using the full optimization version. But if do reject the null topology then it is uncertain if you would have rejected it using the full optimization version.

    This partial optimization method is further discussed as the second of two partial optimization "posPpud"; tests in Goldman et al. (2000).

In this partial optimization method, all of the parameters of the substitution model are fixed at their values under T1 and are not re-optimised as in the full optimization SOWH method. The maximum likelihood topology is determined using these fixed rates and the branch lengths of the ML topology are optimised. Thus, much computer time is saved by not simultaneously searching for the ML topology and estimating the free parameters.

1) To run the partial optimization version, you will just need to replace the partial paup Block in Step 3 number 2 with the partial block below:


           ;
end;

begin paup;
[! • • • Calculating Likelihood Scores • • •]
set criterion = likelihood;
           Lset NST = 6 RMatrix = estimate
           Rates = gamma shape = estimate;
gettrees file=T1;
Lscore / scorefile = T1_Score append = yes;
set criterion = likelihood;
           Lset NST = 6 RMatrix = previous
           Rates = gamma shape = previous;
HSearch start = stepwise nbest = 1 status = no;
savetrees file = TML append = yes format = phylip;
Lscore / scorefile = TML_Score append = yes;

Begin data;
           Dimensions ntax=6 nchar=50;
           Format datatype=nucleotide gap=- missing=? matchchar=.;
           Matrix


REFERENCES.

Rambaut, A. and N. C. Grassly (1997). "Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees." Comput. Appl. Biosci. 13(3): 235-238.

Swofford, D. L. (1999). PAUP 4.0: Phylogenetic Analysis Using Parsimony (And Other Methods). Sunderland, MA, Sinauer Associates, Inc.

Yang, Z. (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood." CABIOS 13: 555-556.