AYB: Advanced Base Calling for Next Generation Sequencing Machines

About

AYB is a base caller for the Illumina Genome Analyzer, using an explicit statistical model of how errors occur during sequencing to produce more accurate reads from the raw intensity data.

In particular, AYB deals with three sources of error:

  • Cross-talk: There is overlap in the excitation spectra of the fluorophores used to label the nucleotides, leading to light emission being detected under several combinations of lasers and filters ("channels"). This effect is especially noticeable for the fluorophores used to mark adenine and guanine, each of which is bright in two channels.
  • Phasing: As the number of cycles increases, the signal starts to blur as the cluster loses synchronicity. Random failure of nucleotides to incorporate, or failure of the blocking element to prevent incorporation of more than one nucleotide, means that individual strands lag or lead, so the signal detected at each cycle is a mixture of several positions along the read.
  • Contamination: Non-sequence contamination in the flow cell, microscopic particles of dust for example, is illuminated by the lasers and might be detected instead of sequence. Such contamination is generally abnormally bright compared to the surrounding sequence and so does not conform to what AYB expects; as a result, the quality scores for the called bases are automatically down-weighted.
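
The cross-talk effect can be illustrated with a small sketch. The matrix values below are invented for illustration (they are not estimates produced by AYB), and matching each base's channel signature by least squares is a simplified stand-in, not AYB's actual statistical model. The helper names `call_by_brightest` and `call_by_signature` are hypothetical.

```python
BASES = "ACGT"

# Illustrative cross-talk matrix (made-up values, not AYB estimates).
# CROSSTALK[channel][base]: light seen in each channel per unit of dye.
# The A and G dyes are given a strong second channel, as described above.
CROSSTALK = [
    # A    C    G    T
    [1.0, 0.1, 0.0, 0.0],  # channel 0
    [0.7, 1.0, 0.0, 0.0],  # channel 1
    [0.0, 0.0, 1.0, 0.1],  # channel 2
    [0.0, 0.0, 0.7, 1.0],  # channel 3
]


def signature(base_index):
    """Column of the cross-talk matrix: the channel pattern of one base."""
    return [row[base_index] for row in CROSSTALK]


def call_by_brightest(intensities):
    """Naive call: treat channel i as base i and pick the brightest channel."""
    return BASES[max(range(4), key=lambda ch: intensities[ch])]


def call_by_signature(intensities):
    """Pick the base whose scaled signature leaves the smallest squared residual."""
    best_base, best_residual = None, float("inf")
    for b, base in enumerate(BASES):
        sig = signature(b)
        # Best-fitting brightness for this signature (one-parameter least squares).
        scale = sum(x * s for x, s in zip(intensities, sig)) / sum(s * s for s in sig)
        residual = sum((x - scale * s) ** 2 for x, s in zip(intensities, sig))
        if scale > 0 and residual < best_residual:
            best_base, best_residual = base, residual
    return best_base


# A noisy adenine cluster: channel 1 happens to be slightly brighter than channel 0.
observed = [0.8, 0.9, 0.0, 0.0]
print(call_by_brightest(observed), call_by_signature(observed))
```

With these made-up numbers the brightest-channel rule calls C, while signature matching calls A, because the observed pattern is much closer to the two-channel signature assumed for adenine.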

In contrast to other base-calling approaches, AYB uses a general model of phasing estimated directly from the data rather than assuming that it occurs at a constant rate for all cycles. Dealing with phasing in this manner means that the base calls made by AYB at the end of each read tend to be more accurate than those of other methods, making greater read lengths feasible and increasing the number of the highest quality reads: AYB returns 2.8 times as many perfect reads as other base callers for 100 cycle data (with smaller gains for shorter reads).
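
To see why phasing matters most at the end of a read, the sketch below simulates the simpler constant-rate assumption that the paragraph above contrasts AYB with. It is not AYB's general phasing model. The lag and lead rates are arbitrary illustrative values, and `phase_distributions` and `in_phase_fraction` are hypothetical helper names.

```python
def phase_distributions(cycles, lag=0.01, lead=0.005):
    """Position distribution of strands after each cycle, constant-rate phasing.

    Returns a list whose entry n-1 maps template position -> fraction of
    strands that have incorporated that many bases after cycle n.
    """
    dist = {0: 1.0}  # before the first cycle every strand is at position 0
    history = []
    for _ in range(cycles):
        nxt = {}
        for pos, frac in dist.items():
            # Strand fails to incorporate: it falls one position behind.
            nxt[pos] = nxt.get(pos, 0.0) + frac * lag
            # Normal incorporation of exactly one base.
            nxt[pos + 1] = nxt.get(pos + 1, 0.0) + frac * (1.0 - lag - lead)
            # Blocking failure: two bases incorporated, strand runs ahead.
            nxt[pos + 2] = nxt.get(pos + 2, 0.0) + frac * lead
        dist = nxt
        history.append(dist)
    return history


def in_phase_fraction(history, cycle):
    """Fraction of strands at the expected position for a 1-based cycle."""
    return history[cycle - 1].get(cycle, 0.0)


history = phase_distributions(100)
for cycle in (1, 25, 50, 100):
    print(cycle, round(in_phase_fraction(history, cycle), 3))
```

The in-phase fraction shrinks as cycles accumulate, so the signal at late cycles is increasingly a mixture of neighbouring positions.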

By default AYB performs per-tile analysis, estimating phasing and cross-talk separately for every tile. This level of analysis is more processor intensive than the Illumina analysis pipeline but can be efficiently split between machines: an entire 8 lane run of 45 cycle data (95 million clusters) can be analysed within an hour on a modern eight-core server, as could 2 million clusters of much longer 101 cycle data. In addition, AYB offers two options to reduce the total computational burden. Fixing the cross-talk matrix across tiles, at a value previously estimated by either AYB or the Illumina pipeline, allows phasing to be solved analytically in each iteration, speeding up estimation considerably. Alternatively, a Bustard-like approach can be used, estimating the cross-talk and phasing from a few tiles and then holding them fixed while calling bases for the remaining tiles.

Download

AYB is freely available under the GNU General Public Licence version 3 (see www.gnu.org for further information). A copy of the licence is provided with the software.

AYB Version II Source code

Latest version of AYB.

Build instructions for Version II are in the README file.

The Version II AYB Manual contains user information including program options.

AYB with generalised phasing model

This version of AYB is the one on which Massingham and Goldman (2012) is based.

Original AYBc Source code

Original pre-release version of AYB with older phasing model. Historic interest only.

Recalibration tool (suitable for all versions of AYB)

CIFTools

The ciftools package for manipulating CIF format intensities may also be useful.

Examples

AYB intensities

will process the cif intensity file in the current directory as a single block using 5 iterations and output a fastq file, also to the current directory, with log messages sent to stderr.

AYB -b R76R76 -i cifdir -o outputdir s_3_1301

will process a 76 base paired-end run from the file s_3_1301.cif stored in the directory cifdir. Output will be stored in outputdir.

AYB -i runfolder -b R8R108R108 -r L1T1301-2301

will process a 108 base paired-end run, with an additional 8 base index between the pairs, from a run folder. All the tiles between 1301 and 2301 will be processed from lane 1.
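
The values passed to the -b option in the examples above read as a sequence of blocks, each a letter followed by a number of cycles. The sketch below shows one way to split such a string and total the cycles. It assumes only the letter-plus-count pattern visible in these examples; the AYB manual is the authority on which block codes exist and what they mean, and `parse_blockstring` is a hypothetical helper, not part of AYB.

```python
import re


def parse_blockstring(spec):
    """Split a block specification such as 'R8R108R108' into (code, cycles) pairs."""
    blocks = re.findall(r"([A-Za-z])(\d+)", spec)
    # Reject input that is empty or contains anything besides letter+number blocks.
    if not blocks or "".join(code + count for code, count in blocks) != spec:
        raise ValueError(f"unrecognised block specification: {spec!r}")
    return [(code, int(count)) for code, count in blocks]


for spec in ("R76R76", "R8R108R108"):
    blocks = parse_blockstring(spec)
    print(spec, blocks, "total cycles:", sum(n for _, n in blocks))
```

Totalling the cycles this way gives a quick sanity check that a block specification matches the number of cycles in the intensity data before starting a long run.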

Data sets

PhiX - 76 cycle control lane (27 tiles). Sanger Institute.

B. pertussis - 76 cycle paired-end data from a problematic run (100 tiles). Sanger Institute.

HiSeq - 101 cycle paired-end data from a HiSeq machine with PhiX spike-in. Illumina corp.

Ibis Test - 51 cycle test set of data distributed with the Ibis base-caller

NA19240/BGI (archive) - 45 cycle paired-end data from BGI (part of 1KGP, pilot 2, individual NA19240)

NA19240/Illumina (archive) - 51 cycle paired-end data from Illumina (part of 1KGP, pilot 2, individual NA19240)

Paper

AYB paper (Genome Biology open access)

All Your Base: a fast and accurate probabilistic approach to base calling. T. Massingham and N. Goldman (2012) Genome Biology 13:R13

Figures

Fig 1. Comparison of error rates.
Fig 2. Frequency of errors for B. pertussis data.
Fig 3. Quality calibration comparison between AYB and Ibis.

Supplementary - Fitting a block tridiagonal information matrix by ML
Supplementary (old) - Rapid estimation of M, P and N
Basecalls - Basecalls for data sets in manuscript

Contact

Please direct any queries to ayb@ebi.ac.uk

News

20 December 2012

  • Support for compressed output
  • Add samplename to output

26 August 2012

  • Support for reading coordinates from run folder
  • Format of output read names now more in keeping with those from Illumina pipeline

31 May 2012

  • AYB Version 2.11
  • Thin missing data and general tidy up.

25 April 2012

  • AYB Version 2.10
  • Performance improvements including thin option. Memory leak fixed.

04 April 2012

  • AYB Version 2.09
  • Bug fixes to improve handling of certain patterns of missing data.

29 February 2012

  • AYB paper published.
  • Massingham and Goldman (2012) All Your Base: a fast and accurate probabilistic approach to base calling Genome Biology 13:R13

21 February 2012

  • AYB Version 2.08
  • Option to use spike-in data to improve base calling and calibrate qualities.

16 December 2011

  • AYB Version 2.07
  • Option to run with multiple threads (with OpenMP).

01 December 2011

  • AYB Version 2.06
  • Implement improved quality scoring and robustness as in AYBg

26 October 2011

  • AYBg update
  • Improved quality scoring and robustness fixes
  • Basis for revised manuscript

18 October 2011

  • AYB Version 2.05
  • Implement improved modelling algorithm as in AYBg

14 Sept 2011

  • AYBg compilation fixes on Linux, reported by Yves Wetzels
  • Turn on openmp support, compile optimised by default
  • Remove dependency on Fortran compiler (Mac + Linux)

1 Sept 2011

  • AYBg update
  • Addition of quality calculation and calibration. No changes to base-call accuracy

29 July 2011

  • AYBg released
  • Much improved accuracy over previous AYB versions due to generalised phasing model. Basis for manuscript.

22 July 2011

  • AYB Version 2.04
  • Memory use reduction and changes to sim runfile format (version 5)

10 May 2011

  • AYB Version 2.03
  • Automated module and system testing and modelling refactored (no function change)

17 Feb 2011

  • AYB Version 2.02
  • Quality calibration table now contains values to use

08 Feb 2011

  • New release of recalibration tool produces values to use

21 Jan 2011

  • AYB Version 2.01
  • Cif from run-folder and quality calibration table

07 Dec 2010

  • First release of AYB Version II

28 Nov 2010

  • Performance improvements
  • Improved estimation of phasing

07 Oct 2010

  • Translation into C

21 May 2009

  • Initial release