3 Running FREGENE

FREGENE can be run in a terminal or using a shell file. The minimal command has the form
fregene -i infile1 -p infile2 -recomb infile3 -o outfile
Input parameters are specified in the three input files given by the -i, -p, and -recomb arguments; some can also be set via additional command line options (see Section 5). The three input files are always required and can be used to set all parameters. If a variable is assigned a value in both the command line and an input file, the latter setting prevails. However, if the -SELECT command line option is not set, any selection-related parameters in the input files are ignored. Similarly, if the -mg option is not set, any parameters related to population subdivision are ignored.
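
When FREGENE is launched from a script, the command line can be assembled programmatically. The following is a minimal Python sketch, not part of FREGENE: the helper name build_fregene_command and the file names are hypothetical, and only the flags named in this section (-i, -p, -recomb, -o, -SELECT) are used.

```python
import shlex


def build_fregene_command(init_file, param_file, recomb_file, out_file, select=False):
    """Return the argument list for a FREGENE run (hypothetical helper)."""
    cmd = ["fregene"]
    if select:
        # Without -SELECT, selection-related parameters in the input files are ignored.
        cmd.append("-SELECT")
    cmd += ["-i", init_file, "-p", param_file, "-recomb", recomb_file, "-o", out_file]
    return cmd


cmd = build_fregene_command("infile1", "infile2", "infile3", "outfile", select=True)
# Prints: fregene -SELECT -i infile1 -p infile2 -recomb infile3 -o outfile
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if the fregene executable is on the search path.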

FREGENE can readily be used without studying all the options in this document, by modifying the files specified by the -i, -p, and -recomb arguments of Example/fregene_example.sh.

3.1 Input Files


Initial population (-i file_name; required)


Table 1: Optional tags for the -i input file. These are written to the -o output file and hence are always set if the output file is used as input for a subsequent run of FREGENE. Variables are described in further detail in Section 4.2.
Variable Name     (Default, if applicable)  Description
<IS_SCALED>     (1)  1 if the output population is scaled, 0 if unscaled
<SCALING_FACTOR>     (1)  Specifies the scaling factor (real $\ge 1$)
<GROUPS>     (1)  Number of subpopulations (integer $\ge 1$)
<GROUPS_SIZE>     Sizes of the subpopulations (# chromosomes; one even integer for each subpopulation, separated by spaces)
<SEED>     (1)  Seed for the random number generator (integer)
<NB_SWAPPED_SITES>     (0)  Number of swapped sites (i.e. sites at which the minor allele is ancestral)
<LIST_SWAPPED_SITES>     Positions of all "swapped" sites
<NB_SELECTED_SITES>     (0)  Number of sites under selection
<POSN_SELECTED_SITES>     Positions of sites under selection
<SEL_GENERATION>     (0)  The generations when each selected allele arose
<SEL_COEF>     The selection coefficient $s$ of selected sites
<SEL_DOM>     The dominance coefficient $h$ of selected sites
<GROUP_SELECTED_SITES>     The subpopulation(s) in which the site is under selection. If equal to 100, the site is globally under selection


This xml-format input file details the chromosomes in the starting population. Most often the initial population will either be

The -o output file is structured so that it can immediately become the -i input file for a subsequent run of FREGENE.

The following tags are required and do not have a default value:

The optional tags, mainly related to simulation and output options, are briefly described in Table 1. Some of these can also be set in the command line (see Section 5).

See Example/data/in_example.xml for an example with an invariant starting population, and Example/data/rin_example.xml for an example in which the starting population has been generated by a previous FREGENE run.

3.1.1 Evolutionary and simulation parameters (-p file_name; required)


Table 2: Tags that can appear in the -p parameter file.
Simulation and mutation parameters
Variable Name     (Default, if applicable)  Description

<NO_GENER>     (0)  Length of the simulation run in generations.
<DELETION_INTERVAL>     (0)  Number of generations between deletion operations. Deletion operations check each site for alleles that have gone to fixation or vanished, or for a minor allele that has become the major allele or vice versa. Homozygosity is also computed. The default value of 0 means these operations are performed every generation, which carries a computational overhead.
<MIGRATION>     (I)  Matrix giving backward migration rates between subpopulations. Ignored if the -mg option is not in the command line.
<MUTAT_RATE>     ($2.5\times10^{-8}$)  Mutation rate (/site/generation)
Parameters of the selection model (-SELECT must be set in command line)
Variable Name     (All default to 0)  Description

<PROP_SEL>     Proportion of new alleles that are non-neutral
<SEL_COEF_POS>     Mean for the positive distribution of $s$
<SEL_COEF_SD_POS>     SD for the positive distribution of $s$
<SEL_COEF_NEG>     Mean for the negative distribution of $s$
<SEL_COEF_SD_NEG>     SD for the negative distribution of $s$
<PROP_POS_SEL_COEF>     Mixture weight of the positive distribution of $s$
<SEL_DOM_POS>     Mean for the positive distribution of $h$
<SEL_DOM_SD_POS>     SD for the positive distribution of $h$
<SEL_DOM_NEG>     Mean for the negative distribution of $h$
<SEL_DOM_SD_NEG>     SD for the negative distribution of $h$
<PROP_POS_SEL_DOM>     Mixture weight of the positive distribution of $h$
<PROP_SEL_LOCAL>     Proportion of selected sites that are under selection only in the subpopulation where they arose (only used if -mg is in the command line)


This file specifies mutation and selection parameters, and parameters that control some details of the simulation run. See Example/data/par_example.xml for an example. Table 2 briefly describes the tags. To implement selection, the minimal FREGENE command is

fregene -SELECT -i infile1 -p infile2 -recomb infile3 -o outfile
The fitness ($W$) of an individual is obtained by summation over non-neutral SNPs:

$\displaystyle W_{i}=1+\sum_{j}x_{j}$ (1)

where

$\displaystyle x_j=\left\{\begin{array}{rl}
0&\mbox{if the individual is an ancestral homozygote}\\
sh&\mbox{if heterozygote}\\
s&\mbox{if derived homozygote.}
\end{array}\right.$
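
As an illustration of equation (1), the sketch below computes the fitness of one individual. It is not FREGENE code: the function names and the genotype encoding (number of derived alleles at each selected site, 0, 1 or 2) are hypothetical, and the heterozygote contribution is assumed to be the product of $s$ and $h$, the conventional reading of a dominance coefficient.

```python
def site_effect(n_derived, s, h):
    """Contribution x_j of one selected site, given the count of derived alleles."""
    if n_derived == 0:
        return 0.0      # ancestral homozygote
    if n_derived == 1:
        return s * h    # heterozygote (assumed to contribute s*h)
    if n_derived == 2:
        return s        # derived homozygote
    raise ValueError("n_derived must be 0, 1 or 2")


def fitness(genotypes, sel_coefs, dom_coefs):
    """W = 1 + sum over selected sites of x_j."""
    effects = (site_effect(g, s, h) for g, s, h in zip(genotypes, sel_coefs, dom_coefs))
    return 1.0 + sum(effects)


# Three selected sites: ancestral homozygote, heterozygote, derived homozygote.
w = fitness([0, 1, 2], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5])
print(w)
```

Because the effects are summed rather than multiplied, each site contributes independently of the genotype at the other sites.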

When a mutation occurs, it is under selection with probability <PROP_SEL>. The intensity coefficient $s$ (identified as *_COEF_* in the parameter file) and the dominance coefficient $h$ (referred to as *_DOM_*) are each sampled from a mixture of two Gaussian distributions. For convenience, the first of these distributions is called "positive" (labelled *_POS) and the second "negative" (*_NEG), but their values need not reflect these labels. The user specifies the relative weight (between 0 and 1) of the positive distribution (PROP_POS_SEL*). If this weight is 1, the negative distribution parameters are ignored; if it is 0, the positive distribution parameters are ignored.
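
The sampling scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the FREGENE implementation: the function names are hypothetical, the parameters are held in a plain dictionary keyed by the tag names of Table 2, $s$ and $h$ are assumed to be drawn independently, and a neutral mutation is represented by the pair (0, 0).

```python
import random


def sample_mixture(rng, weight_pos, mean_pos, sd_pos, mean_neg, sd_neg):
    """Draw from a two-component Gaussian mixture; weight_pos is the 'positive' weight."""
    if rng.random() < weight_pos:   # weight 1 always picks 'positive', weight 0 never does
        return rng.gauss(mean_pos, sd_pos)
    return rng.gauss(mean_neg, sd_neg)


def new_mutation(rng, p):
    """Return (s, h) for a new mutation; (0, 0) stands for a neutral allele."""
    if rng.random() >= p["PROP_SEL"]:
        return 0.0, 0.0
    s = sample_mixture(rng, p["PROP_POS_SEL_COEF"],
                       p["SEL_COEF_POS"], p["SEL_COEF_SD_POS"],
                       p["SEL_COEF_NEG"], p["SEL_COEF_SD_NEG"])
    h = sample_mixture(rng, p["PROP_POS_SEL_DOM"],
                       p["SEL_DOM_POS"], p["SEL_DOM_SD_POS"],
                       p["SEL_DOM_NEG"], p["SEL_DOM_SD_NEG"])
    return s, h


# Illustrative values only; they are not FREGENE defaults.
params = {
    "PROP_SEL": 0.1,
    "PROP_POS_SEL_COEF": 0.2, "SEL_COEF_POS": 0.01, "SEL_COEF_SD_POS": 0.002,
    "SEL_COEF_NEG": -0.01, "SEL_COEF_SD_NEG": 0.002,
    "PROP_POS_SEL_DOM": 1.0, "SEL_DOM_POS": 0.5, "SEL_DOM_SD_POS": 0.1,
    "SEL_DOM_NEG": 0.0, "SEL_DOM_SD_NEG": 0.0,
}
print(new_mutation(random.Random(42), params))
```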

When a new selected site arises in a subdivided population, with probability PROP_SEL_LOCAL it is under selection only in the subpopulation where it arose. Otherwise, the site is under selection in all subpopulations.

Finally, each selected site is "switched off" (i.e. its selection and dominance coefficients are set to 0) with a probability specified by the -sel_LE option (Table 4). This is intended to allow the user to avoid accumulation of large numbers of sites under balancing selection, and also allows an equilibrium to be reached even when balancing selection is present. At each generation, a selected site is switched off with default probability 1/75,000 (corresponding to a mean time under selection of 75,000 generations if neither allele reaches fixation).
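
The correspondence between the switch-off probability and the mean time under selection follows from a geometric waiting time. As a short worked derivation, write $p$ for the per-generation switch-off probability and $T$ for the number of generations until switch-off (both symbols are introduced here for illustration), and assume that neither allele reaches fixation first:

```latex
P(T = t) = (1-p)^{t-1}\,p, \qquad
E[T] = \sum_{t \ge 1} t\,(1-p)^{t-1}\,p = \frac{1}{p}, \qquad
p = \frac{1}{75000} \;\Rightarrow\; E[T] = 75000 \text{ generations.}
```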

3.2 Recombination parameters (-recomb file_name; required)


Table 3: Tags that can appear in the recombination file.
Variable Name     (Default, if applicable)  Description
<GC_RATE>     (0)  Rate of gene conversion (GC) start sites (/bp/generation)
<GC_LENGTH>     (500)  GC tract length (bp)
<RECOM_RATE>     ($10^{-8}$)  Average crossover (CO) rate (/bp/generation)
<N_REGIONS>     (1)  Number of regions in each chromosome
<SUBS_PER_REGION>     (1)  Number of sub-regions per region
<REGION_GAMMA_SCALE>     (1)  Scale parameter of the Gamma distribution used to determine the rate for each region
<REGION_GAMMA_SHAPE>     (1)  Shape parameter of the Gamma distribution used to determine the rate for each region
<SUB_REGION_GAMMA_SHAPE>     (1)  Shape parameter of the Gamma distribution for rates within each sub-region (scale defined by the overall rate in the region)
<PROP_RECOM_HS>     (0)  Proportion of CO occurring in hotspots
<HS_LENGTH>     (200)  Length of CO hotspots (bp)
<HS_SPACING>     (5000)  Average distance between hotspots (HS) (bp)
<HS_SPACING_GAMMA_SHAPE>     (1)  Shape parameter of the Gamma distribution for distance between HS
<INTENSITY_GAMMA_SHAPE>     (1)  Shape parameter of the Gamma distribution for additional CO intensity within HS
<HS_COMB>     (0)  1 if GC start sites have the same distribution as CO; 0 if GC start sites are sampled uniformly


The recombination model is hierarchical and highly flexible: it allows a uniform recombination rate, or rates that vary on both a fine scale (hotspots) and a broad scale.

Chromosomes are divided into N_REGIONS equal-size regions, each of which is subdivided into SUBS_PER_REGION equal-size subregions. The mean per-site recombination rate within a region is initially sampled from a Gamma distribution, with scale and shape parameters REGION_GAMMA_SCALE and REGION_GAMMA_SHAPE. However, the realised values are normalised so that the overall mean recombination rate is equal to RECOM_RATE. Thus, if there is only one region, its recombination rate is equal to RECOM_RATE irrespective of the parameters of the Gamma distribution. (NB in our parameterisation, the Gamma distribution with scale parameter $\alpha$ and shape parameter $\beta$ has mean $\alpha\beta$ and variance $\alpha\beta^2$.)
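
The region-level step can be sketched as follows. This is an illustration under stated assumptions rather than FREGENE's code: the function name is hypothetical, normalisation is assumed to be a multiplicative rescaling, and the two Gamma parameters are passed straight to Python's random.gammavariate(alpha, beta), whose mean is alpha*beta and whose variance is alpha*beta^2. Which of the two FREGENE tags corresponds to which argument is not settled here.

```python
import random


def sample_region_rates(rng, n_regions, recom_rate, gamma_alpha, gamma_beta):
    """Sample one raw rate per region, then rescale so the mean equals recom_rate."""
    raw = [rng.gammavariate(gamma_alpha, gamma_beta) for _ in range(n_regions)]
    mean_raw = sum(raw) / n_regions   # regions are equal-size, so a plain mean suffices
    return [recom_rate * r / mean_raw for r in raw]


rng = random.Random(2008)
rates = sample_region_rates(rng, n_regions=5, recom_rate=1e-8,
                            gamma_alpha=1.0, gamma_beta=1.0)
print(rates)
print(sum(rates) / len(rates))   # equals RECOM_RATE up to rounding
```

With n_regions=1 the single raw value cancels in the rescaling, which is why the Gamma parameters have no effect in that case.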

Similarly, in each subregion the recombination rate is sampled from a Gamma distribution, but in this case there is no normalising. The shape parameter is specified by the user (SUB_REGION_GAMMA_SHAPE), but the scale parameter is fixed by FREGENE so that the mean equals the region mean rate. Within each subregion, hotspots of fixed length (HS_LENGTH) are sampled such that the distance between hotspots follows a Gamma distribution with mean HS_SPACING and shape parameter HS_SPACING_GAMMA_SHAPE. From the mean recombination rate for the subregion, and the proportion of recombinations that occur in hotspots (PROP_RECOM_HS), FREGENE computes a background rate that applies to all sites as well as a mean rate within hotspots. The excess rate above background in a particular hotspot is sampled from a Gamma distribution with variance defined by its shape parameter (INTENSITY_GAMMA_SHAPE).
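
Hotspot placement within one subregion might be sketched as below. Several details are assumptions made for illustration: the function name is hypothetical, the Gamma-distributed distance is measured from the end of one hotspot to the start of the next, positions are truncated to whole base pairs, and the Gamma scale is chosen so that the mean spacing equals HS_SPACING. The computation of background and hotspot intensities is not covered.

```python
import random


def place_hotspots(rng, subregion_length, hs_length, hs_spacing, spacing_shape):
    """Return the start positions (bp) of hotspots within one subregion."""
    scale = hs_spacing / spacing_shape   # Gamma mean = shape * scale = hs_spacing
    starts = []
    pos = rng.gammavariate(spacing_shape, scale)
    while pos + hs_length <= subregion_length:
        starts.append(int(pos))
        # Jump past this hotspot, then draw the gap to the next one.
        pos += hs_length + rng.gammavariate(spacing_shape, scale)
    return starts


rng = random.Random(7)
starts = place_hotspots(rng, subregion_length=100_000, hs_length=200,
                        hs_spacing=5000, spacing_shape=1.0)
print(len(starts), starts[:5])
```

A shape of 1 gives exponentially distributed gaps; larger shapes make the hotspots more evenly spaced around the same mean.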

The start sites of gene conversions, with tract length GC_LENGTH, can be sampled uniformly (HS_COMB=0) or in proportion to crossover rates (HS_COMB=1), but with the overall rate specified by GC_RATE.
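
The two sampling modes for gene conversion start sites can be illustrated with a short Python sketch. It is not FREGENE's implementation: the function name and the per-site list co_rates are hypothetical, and the flag argument stands for the tag written HS_COMB in Table 3.

```python
import random


def sample_gc_start(rng, co_rates, comb_flag):
    """Pick a start site index for one gene conversion event."""
    n_sites = len(co_rates)
    if comb_flag == 0:
        return rng.randrange(n_sites)   # uniform over all sites
    # Otherwise sample in proportion to the per-site crossover rates.
    return rng.choices(range(n_sites), weights=co_rates, k=1)[0]


# Ten sites with a crossover hotspot at sites 4 and 5 (illustrative rates).
co_rates = [1e-9] * 4 + [5e-8] * 2 + [1e-9] * 4
rng = random.Random(3)
print(sample_gc_start(rng, co_rates, comb_flag=0))
print(sample_gc_start(rng, co_rates, comb_flag=1))
```

Only the location of each event depends on the flag; how many events occur is governed separately by GC_RATE.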

Imperial College -- August 2008