simNGS [-a adapter] [-A file] [-b shape1:scale1:shape2:scale2] [-c correlation] [-d] [-D prob] [-F factor] [-f nimpure:ncycle:threshold] [-g prob] [-i filename] [-I] [-j range:a:b] [-l lane] [-n ncycle] [-N file] [-o output_format] [-p option] [-q quantile] [-r mu] [-R] [-s seed] [-t tile] [-v factor ] runfile [seq.fa … ]
simNGS is software for simulating "perfect" observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software. Observations are perfect in the sense that they are generated according the model and do not incorporate any additional effects that may be present in real data ("dust", bubbles, merged clusters, sequence-heterogeneous clusters, etc).
simNGS takes fasta format sequences and a file describing the covariance of noise between bases and cycles observed in an actual run of the machine, randomly generates noisy intensities representing the signals for the sequence at each cycle and calculates likelihoods for all possible base calls.
Sequences are either read from files, whose names are given on the commandline, or from stdin. It is up to the user to make sure that the sequence names are unique. Results are written to stdout in the format specified by the -o, --output flag. Messages, progress indicators and a summary of errors in the generated data are written to stderr.
Ambiguous characters are interpretated as "no base present" and treated as a null call: the intensities generated are just noise and do not have an additional component from the sequence, leading to a less bright cluster. Artifacts such as bubbles and registration errors could be simulated by placing ambiguity characters in the sequence. Where the sequence to be called is shorter than the expected read length, either because the input was too short or deletion from the mutation model have shortened it, the sequence is padded with ambiguity characters for the remaining cycles.
- -a, --adapter sequence [default: AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT]
Sequence to pad reads with if they are shorter than the number of cycles required, reflecting the adapter sequence used for sample preparation. Different adapters for each end can be specified by -a ADAPTER1:ADAPTER2 The default is for both adapters to be the same (-a ADAPTER). A null adapter (i.e. pad with ambiguity characters) can be specified by --adapter (no argument)
- -A, --interaction filename [default: none]
File to read interaction matrix from. Not required for general simulation of sequence and qualities.
- -b, --brightness shape1:scale1:shape2:scale2 [default: as runfile]
Set parameters for the distribution of cluster brightness. The cluster brightness represents the combination of the incident intensity of the laser on the each cluster and the number of fluropores emitting (roughly the size of the cluster). The distribution of brightness is modelled as a Weibull distribution with given shape and scale, real valued numbers greater than zero, with mean and variance:
Mean: scale \Gamma( 1 + 1/shape)
Variance: scale^2 \Gamma( 1 + 2/shape ) - mean^2
where \Gamma is the Gamma function.
If the brightness distribution is not set, a distribution specified in the runfile will be used. Said distribution may not be Weibull and is specified by one of the following codes::
- -c, --correlation [default: 1.0]
Correlation between brightness of one end of a paired-end run and the other. Correlation is implemented using a Gaussian copula, the marginal distributions being Weibull with parameters specified in the runfile or on the commandline, so its value should belong to [-1,1].
- -d, --describe
Print a description of the runfile and exit.
- -D, --dust probability [default: no dust]
Probability of dust occurring on a particular cycle, resulting in extremely bright observations in the second channel. Cross-talk, phasing and noise matrices must be specified. A value of 1e-5 is typical.
- -F, --final factor [default: see text]
Increase variance of noise for the final cycle by given factor. When the number of cycles to simulate is equal to that which the runfile is trained on, a factor of one is used by default. When the number of cycles required is fewer, a scaling is learned by comparing the variance of the final two cycles. Factor can either be a scalar, in which case the noise for all bases is scaled equally, or a quad of colon separated multipliers to scale each base by a different factor.
- -f, --filter nimpure:ncycle:threshold [default: no filtering]
Use purity filtering on generated intensities, allowing a maximum of nimpure cyles in the first ncycles with a purity greater than threshold. If the greatest intensity at a cycle is M and the second greatest is N then the purity is M / (M+N).
- -g, --generalised, --generalized probability [default: set from mutation rate]
Probability of a generalised error, a mistaken base not due to base calling error.
- -i, --intensities filename [default: none]
Write (processed) intensities to "filename" in a format identical to that as the likelihoods but with each likelihood replaced by its corresponding intensity.
- -I, --illumina
Produce Illumina scaled quality values where required. Ascii representation of quality value is ascii(quality+64) rather than the more usual ascii(quality+33).
- -j, --jumble range:a:b [default: none]
Jumble generated intensities with those of other reads, simulating cases where clusters have merged. A random read is picked from the <range> reads after each each and their intensities mixed to final intensities for analysis. Mixing is according to a Kumaraswamy distribution with parameters a and b, a being the shape parameter for the current read and b that for the randomly picked read. When the final read is inputed, all reads in the last <range> are mixed with each other; if there where less than <range> reads in inputed in total, all reads are mixed together.
- -l, --lane lane [default: as runfile]
Set lane number in the output to "lane". Lane number must be greater than zero but need not be a number valid for a real machine.
- -n, --ncycle ncycle [default: as runfile]
Number of cycles to generate likelihoods for, up to maximum allowed in the runfile. If more cycles are asked for than allowed, a warning is printed and the number of cycles outputted is restricted to number specificed in the runfile.
- -N, --noise filename [default: none]
File to read systematic noise matrix from. Not required for general simulation of sequence and qualities.
- -o, --output format [default: fastq]
Format in which to output results. Either "likelihood", "fasta" or "fastq". "likelihood" is the format specified in the file README.txt, containing the likelihoods of all possible base-calls. "fasta" gives the called sequences in FASTA format, paired-end reads being concatenated together. Similar "fastq" outputs sequences and qualities in fastq format; unless over-ridden setting the output format to "fastq" sets mu to 1e-5 to better calibrate the qualities for polymerase error. See --robust for details.
- -p, --paired option [default: single]
Treat run as paired-end. Valid options are: "single", "paired", "cycle".
For "cycle" reads are treated as paired-end but the results are reported in machine cycle order (the second end is reverse complemented and appended to first). The "paired" option splits the two ends of the read into separate records, indicated by a suffix /1 or /2 to the read name. The second end is reported in the opposite orientation to the first.
For single-ended runs treated as paired, the covariance matrix is duplicated to make two uncorrelated pairs with identical error characteristics. For paired-end runs treated as single, the second end is ignored. The covariance matrices for the ends of a paired-end run were estimated and stored separately, implying an assumption of independence between the two ends. The brightness of the cluster is assumed to be equal for both ends, which can be changed using the --correlation option.
- -q, --quantile quantile [default: 0]
Quantile below which cluster brightness is discarded and redrawn from distribution. If the brightness of either end of the read is in the lower "quantile" quantile of the brightness distribution, then discard the values and redraw. Very dim clusters would be lost in the image analysis of an actual run and never found.
- -r, --robust mu [default: 1e-5]
Output robustified likelihood, equivalent to adding mu to the likelihood of every call. "mu" should be a real number greater than zero. The default for simNGS is to produce raw likelihoods so errors outside the scope of the base calling can be modelled elsewhere; the default for the AYB base calling software uses a value of 1e-5 to compensate for unmodelled effects like "dust" on the slide or sample preparation errors.
The consequences of adding a "mu" is to bound the maximum possible likelihood-ratio statistic, and so the quality of the call, and to pull extremely small likelihoods where the statistical model does fit the observation for any possible base call back towards equi-probable. The theory behind this is consider the observation to be a mixture of the model and small proportion of contaminant, then define the robustified likelihood to be the supremum over all possible contaminants (i.e. allow the contaminant to be the worst possible).
- -R, --raw
Convert intensities to raw intensities before outputing. Requires cross-talk, phasing and noise matrices to be specified.
- -s, --seed seed [default: clock]
Seed used to set random number generator. Where no seed is given, it is initialised from the clock of the machine and written to stderr. Care should be taken if multiple jobs are started simultaneously, on a farm with synchronised clocks for example, and an explicit seed is not set.
- -t, --tile tile [default: as runfile]
Set tile number in the output to "tile". Tile number must be greater than zero but need not be a number valid for a real machine.
- -v, --variance factor [default: 1.0]
Factor with which to scale variance matrix by and so increase or decrease the amount of background noise generated. Multiplying the variance by a factor f is mathematically equivalent to multiplying the scale of the brightness distribution by 1/sqrt(f).
Produces fastq results for sequences in test100.fa and outputs to test.fq, treating the (single-ended) runfile "s_2_0005.runfile" as paired-end.
cat test100.fa | simNGS -p paired data/s_2_0005.runfile > test.fq
Reads from the sequence files seq1.fa and seq2.fa and outputs likelihood information to the file seq.like.A
simNGS -o likelihood data/s_2_0005.runfile seq1.fa seq2.fa > seq.like
Cluster coordinates are randomly generated according to a uniform distribution on the x and y axes and so may coincide. The output may break other software that relies on the coordinates to produce a unique identifier for each cluster.
The jumble option does not locality into account when mixing clusters, so distant clusters are as likely to be mixed as nearby ones. In reality, the mixing process is highly localised.
Written by Tim Massingham, <firstname.lastname@example.org>
simNGS uses the optimised SFMT code for the Mersenne twister random number generator produced by Mutsuo Saito and Makoto Matsumotom, which is available from Hiroshima University under a three-clause BSD style licence.
Copyright © 2010 European Bioinformatics Institute. Free use of this software is granted under the terms of the GNU General Public License (GPL). See the file COPYING in the simNGS distribution or http://www.gnu.org/licenses/gpl.html for details.
The included SFMT library is copyright © 2006,2007 Mutsuo Saito, Makoto Matsumoto and Hiroshima University, and is distributed under a three-clause BSD style licence. Seee http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/.