simNGS and simLibrary – Software for Simulating Next-Gen Sequencing Data


simNGS is software for simulating observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software. By default, observations incorporate only the noise due to sequencing, not effects from more esoteric sources of noise that may be present in real data ("dust", bubbles, merged clusters, sequence-heterogeneous clusters, etc.). Many of these additional sources may optionally be applied.

simNGS takes FASTA-format sequences and a file describing the covariance of noise between bases and cycles, as observed in an actual run of the machine. It randomly generates noisy intensities representing the signals for each sequence at each cycle, and calculates likelihoods for all possible base calls.
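
The flow described above (sequence in, noisy per-cycle intensities, likelihoods for each possible base call) can be illustrated with a deliberately simplified sketch. This is not the simNGS model: it assumes independent Gaussian noise of equal variance in each of four channels, whereas simNGS reads a full noise covariance from a runfile. The function names and the `sigma` parameter are hypothetical and exist only for this illustration.

```python
import math
import random

BASES = "ACGT"


def simulate_intensities(seq, sigma=0.2, rng=None):
    """Return one 4-channel intensity vector per cycle for `seq`.

    Assumption: unit signal in the channel of the true base, plus
    independent Gaussian noise (standard deviation `sigma`) per channel.
    """
    if rng is None:
        rng = random.Random()
    intensities = []
    for base in seq:
        signal = [1.0 if b == base else 0.0 for b in BASES]
        intensities.append([s + rng.gauss(0.0, sigma) for s in signal])
    return intensities


def base_likelihoods(observed, sigma=0.2):
    """Normalised likelihood of each candidate base for one cycle."""
    log_like = []
    for candidate in BASES:
        expected = [1.0 if b == candidate else 0.0 for b in BASES]
        sq_dist = sum((o - e) ** 2 for o, e in zip(observed, expected))
        log_like.append(-sq_dist / (2.0 * sigma ** 2))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)
    return {b: w / total for b, w in zip(BASES, weights)}


def call_bases(seq, sigma=0.2, seed=None):
    """Simulate intensities for `seq` and return the most likely calls."""
    rng = random.Random(seed)
    calls = []
    for observed in simulate_intensities(seq, sigma, rng):
        probs = base_likelihoods(observed, sigma)
        calls.append(max(probs, key=probs.get))
    return "".join(calls)
```

With a small `sigma` the called sequence matches the input; raising `sigma` produces miscalls, and the normalised likelihoods become flatter for the affected cycles.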

More information is available in the documentation accompanying the code.


  • 02 May 2013. Support for Casava
    The current format for naming sequences is not compatible with that produced by Casava, and the sequences produced cannot be used with many tools as a result. This version adds the option of outputting Casava-compatible sequence names, with thanks to Roman Valls Guimera, Guillermo Carrasco & Pär Engström.
  • 01 Jan 2012. simNGS 1.6, simLibrary 1.3
    Mutate option moved from simNGS to simLibrary. Edits during mutation process are described using CIGAR-type strings added to the sequence header.
    Interaction matrix used rather than separate M and P matrices for the cross-talk and phasing.
    Compilation fix when compiler is strict about order of linking (thanks to Dag Lyberg).
    Support different adapters at different ends of paired-end reads (request from Daniel Goodman).
  • 7 July 2011. Bug fix
    Zero-length fragments occasionally generated on some platforms (recent Ubuntu) by simLibrary, causing simNGS to crash. simLibrary fixed and simNGS now detects and skips zero-length fragments.
  • 30 June 2011. simNGS 1.5.1, simLibrary 1.2.2
    Add unique ID to read produced by simLibrary to prevent reads with same ID occurring in output.
    Output of simNGS can be written to a file, with paired ends split into separate files.
  • 02 June 2011. Bug fix
    Single-ended run files handled incorrectly, falsely assumed to be paired-end with second end returning poor sequence.
  • 23 May 2011. simLibrary 1.2.1 released
    Optional upper and lower boundaries on generated fragment length to simulate a gel cut (requested change).
  • 12 May 2011. Version 1.5 released
    Observation error modelled using elliptic distribution with log-normally distributed radius rather than multivariate normal.
    Support for different distribution for brightness. Requires new format of runfiles (version 5)
  • 6 May 2011. Version 1.3.1 released
    Support for generalised error incorporated into qualities.
    New runfiles: paired end 101 cycle HiSeq using TruSeq chemistry.
  • 28 Apr 2011. Version 1.3.0 released
    Raw intensities may be dumped, given cross-talk, phasing and noise matrices.
    Support for dust generation, given cross-talk, phasing and noise matrices.
    Optionally output fastq quality scores on Illumina scale (q+64) rather than Sanger scale (q+33).
  • 26 Apr 2011. Version 1.2.3 released
    Change of defaults to values more sensible for general simulation of sequence.
    Behaviour of --mutation option changed. On by default, turned off by giving flag with no parameters.
  • 21 Apr 2011. Version 1.2.2 released
    Inflate variance of final cycle when runfile is trimmed to shorter length (requested change).
  • 20 Apr 2011. Version 1.2.1 released
    Orientation of second-end of paired-end data now output opposite to that of first end (requested change).
  • 15 Oct 2010. Version 1.2 released
    simNGS now includes the simLibrary software for simulating sequencing libraries.
  • 05 Oct 2010.
    Added man page
  • 27 Apr 2010. Version 1.1 released
    Intensities can be randomly mixed to simulate merging of clusters and dirty the observed likelihood.
  • 12 Mar 2010.
    Important bugfix: error in initial calculation of likelihoods.
  • 03 Mar 2010.
    Output FASTQ format if requested.
  • 18 Feb 2010.
    Filter for dim clusters.
    Tabulate reads by number of errors and summarise.
  • 14 Feb 2010.
    Important bugfix: error in initial calculation of likelihoods.
    User specified adapter sequence, or a default, to pad reads that are shorter than the number of cycles.
  • 13 Feb 2010.
    Approximately 40% speed increase on platforms with optimised BLAS libraries.
  • 12 Feb 2010.
    Simple transform of input sequences (insertion, deletion, mutation) to model errors that may have occurred during sample preparation.
    Behaviour change: input sequences shorter than number of cycles for run get padded with ambiguity characters rather than discarded. In future this will probably change to padding with adapter sequence.
  • 11 Feb 2010.
    Output FASTA format sequences rather than intensities, if requested.
    Bug fixes:
    • Work around for bugs in glibc tgmath implementation.
    • Fixed inlining bug which prevents compilation without optimisation.
    • Exit with error if fail to read runfile given on commandline.
    • Correct parsing of " -n ncycle " argument.
    • Remove debugging code for correlated brightness and actually use values.
  • 30 Jan 2010.
    Correlation between brightness of cluster at one end of paired-end run and the other implemented using Gaussian copula function.
  • 29 Jan 2010.
    • Support for purity filtering of generated intensities.
    • Option to dump processed intensities to a file.
  • 27 Jan 2010.
    Additional features:
    • Display summary of bad calls in output after each run.
    • Describe a particular runfile.
  • 26 Jan 2010.
    Initial release of source code.
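
The 28 Apr 2011 entry above mentions two FASTQ quality scales, Sanger (q+33) and Illumina (q+64). A minimal sketch of what that offset means follows. It assumes qualities are Phred scores, q = -10 log10(p), where p is the error probability. The function names and the `max_q` cap are hypothetical and are not taken from the simNGS source.

```python
import math


def error_prob_to_phred(p_error, max_q=40):
    """Convert an error probability to an integer Phred score, capped at max_q."""
    if p_error <= 0.0:
        return max_q
    q = int(round(-10.0 * math.log10(p_error)))
    return max(0, min(q, max_q))


def encode_qualities(phred_scores, offset=33):
    """Encode Phred scores as a FASTQ quality string.

    offset=33 gives the Sanger scale, offset=64 the Illumina scale.
    """
    return "".join(chr(q + offset) for q in phred_scores)


scores = [error_prob_to_phred(p) for p in (0.1, 0.01, 0.001)]
sanger = encode_qualities(scores, offset=33)
illumina = encode_qualities(scores, offset=64)
```

The same scores give different strings on the two scales, so a downstream tool needs to be told which offset a FASTQ file uses.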


The simNGS software is freely available under the GNU General Public Licence version 3. A copy of the licence is provided with the software.

simNGS uses the optimised SFMT code for the Mersenne twister random number generator produced by Mutsuo Saito and Makoto Matsumoto, which is available from Hiroshima University under a three-clause BSD-style licence.