AYB is a base caller for the Illumina Genome Analyzer, using an explicit statistical model of how errors occur during sequencing to produce more accurate reads from the raw intensity data.
In particular, AYB deals with three sources of error:
- Cross-talk: The excitation spectra of the fluorophores used to label the nucleotides overlap, so light emission is detected under several combinations of lasers and filters ("channels"). This effect is especially noticeable for the fluorophores used to mark adenine and guanine, each of which is bright in two channels.
- Phasing: As the number of cycles increases, the signal starts to blur as the cluster loses synchronicity. Random failure of nucleotides to incorporate, or failure of the blocking element to prevent incorporation of more than one nucleotide, means that individual strands lag or lead, so the signal detected at each cycle is a mixture of several positions along the read.
- Contamination: Non-sequence contamination in the flow cell, such as microscopic particles of dust, is illuminated by the lasers and might be detected instead of sequence. Such contamination is generally abnormally bright compared to the surrounding sequence and so does not conform to what AYB expects; the quality scores for the called bases are automatically down-weighted as a result.
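The first two error sources can be illustrated with a small forward simulation. This is only a sketch under stated assumptions, not AYB's actual model or code: the cross-talk values are invented for illustration, phasing is reduced to a fixed fraction of signal from the previous and next read positions, and the names `CROSSTALK`, `mix_channels` and `observe` are hypothetical.

```python
BASES = "ACGT"

# Hypothetical cross-talk matrix: rows are channels, columns are bases A, C, G, T.
# The numbers are invented for illustration; they are not AYB estimates.
# A and G are each bright in two channels, as described above.
CROSSTALK = [
    [1.0, 0.3, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.7, 1.0],
]


def mix_channels(base):
    """Return the four channel intensities produced by a single pure base."""
    column = BASES.index(base)
    return [row[column] for row in CROSSTALK]


def observe(seq, lag=0.1, lead=0.02):
    """Simulate noise-free intensities for each cycle of a read.

    Assumption: at cycle t a fraction `lag` of the signal comes from
    position t-1 and a fraction `lead` from position t+1. Contributions
    that would fall outside the read are simply dropped.
    """
    n = len(seq)
    intensities = []
    for t in range(n):
        weights = {t: 1.0 - lag - lead}
        if t > 0:
            weights[t - 1] = lag
        if t + 1 < n:
            weights[t + 1] = lead
        channels = [0.0] * 4
        for pos, weight in weights.items():
            for c, value in enumerate(mix_channels(seq[pos])):
                channels[c] += weight * value
        intensities.append(channels)
    return intensities


if __name__ == "__main__":
    for cycle, channels in enumerate(observe("ACGT"), start=1):
        print(cycle, [round(x, 3) for x in channels])
```

Even without noise or contamination, no cycle in this toy example produces a signal in a single channel, which is why a base caller has to undo both effects before calling a base.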
In contrast to other base-calling approaches, AYB uses a general model of phasing estimated directly from the data rather than assuming that it occurs at a constant rate for all cycles. Dealing with phasing in this manner means that the base calls made by AYB at the end of each read tend to be more accurate than those of other methods, making greater read lengths feasible and increasing the number of the highest quality reads: AYB returns 2.8 times as many perfect reads as other base callers for 100 cycle data (with smaller gains for shorter reads).
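To make the constant-rate assumption concrete, the sketch below computes how a cluster's strands spread over read positions as cycles accumulate. It is an illustration only: the function name `position_distribution` and the per-cycle `(lag, lead)` parameterisation are assumptions made for this example, and AYB's generalised phasing model need not be parameterised this way. Passing the same pair of rates for every cycle gives the constant-rate model; passing different pairs shows one simple way the assumption can be relaxed.

```python
def position_distribution(rates, max_pos):
    """Spread of strands over read positions after each cycle.

    rates   -- one (lag, lead) pair per cycle: the probability that a strand
               fails to incorporate a base, or incorporates two, in that cycle
               (hypothetical parameterisation, for illustration only)
    max_pos -- largest position tracked; probability mass that would move
               beyond it is dropped

    Returns rows where rows[t][k] is the probability that a strand has
    incorporated k bases after cycle t + 1.
    """
    dist = [0.0] * (max_pos + 1)
    dist[0] = 1.0  # every strand starts with nothing incorporated
    rows = []
    for lag, lead in rates:
        new = [0.0] * (max_pos + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            new[k] += p * lag  # strand falls behind
            if k + 1 <= max_pos:
                new[k + 1] += p * (1.0 - lag - lead)  # strand stays in step
            if k + 2 <= max_pos:
                new[k + 2] += p * lead  # strand runs ahead
        dist = new
        rows.append(dist)
    return rows


if __name__ == "__main__":
    cycles = 100
    constant = position_distribution([(0.01, 0.002)] * cycles, 2 * cycles)
    varying = position_distribution(
        [(0.005 + 0.0001 * t, 0.002) for t in range(cycles)], 2 * cycles
    )
    # Fraction of strands still in step with the sequencer at the final cycle.
    print(constant[-1][cycles], varying[-1][cycles])
```

In both cases the in-step fraction shrinks as the read gets longer, which is why the treatment of phasing matters most for the bases at the end of each read.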
By default AYB performs per-tile analysis, estimating phasing and cross-talk separately for every tile. This level of analysis is more processor intensive than the Illumina analysis pipeline but can be efficiently split between machines: an entire 8 lane run of 45 cycle data (95 million clusters) can be analysed within an hour on a modern eight-core server, as could 2 million clusters of much longer 101 cycle data. In addition, AYB offers two options to reduce the total computational burden. Fixing the cross-talk matrix across tiles, at a value previously estimated either by AYB or by the Illumina pipeline, allows phasing to be solved analytically in each iteration and so speeds up estimation considerably. Alternatively, a Bustard-like approach can be used, estimating the cross-talk and phasing from a few tiles and then holding them fixed while calling bases for the remaining tiles.
AYB is freely available under the GNU General Public Licence version 3 (see www.gnu.org for further information). A copy of the licence is provided with the software.
AYB Version II Source code
Latest version of AYB.
Build instructions for Version II are in the README file.
The Version II AYB Manual contains user information including program options.
AYB with generalised phasing model
This version of AYB is the one on which Massingham and Goldman (2012) is based.
Original AYBc Source code
Original pre-release version of AYB with older phasing model. Historic interest only.
Recalibration tool (suitable for all versions of AYB)
The ciftools package for manipulating CIF format intensities may also be useful.
will process cif file intensities in one block using 5 iterations and output a fastq file, both in the current directory with log messages to stderr.
AYB -b R76R76 -i cifdir -o outputdir s_3_1301
will process 76 base paired-end data from the file s_3_1301.cif stored in the directory cifdir. Output will be stored in outputdir.
AYB -i runfolder -b R8R108R108 -r L1T1301-2301
will process a 108 base paired-end run, with an additional 8 base index between the pairs, from a run folder. All the tiles between 1301 and 2301 will be processed from lane 1.
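Because tiles are analysed independently, one command line per tile can be prepared and handed to whatever job scheduler or set of machines is available. The sketch below only builds such command lines and does not run anything. It uses just the `-b`, `-i` and `-o` options shown in the examples above; the helper name `tile_commands` and the `s_<lane>_<tile>` naming pattern (inferred from `s_3_1301`) are assumptions, so check them against the AYB manual before use.

```python
import shlex


def tile_commands(lane, tiles, blockstring, cifdir, outdir):
    """Build one AYB argument list per tile, without executing anything.

    Assumes each tile's intensities are stored as s_<lane>_<tile>.cif in
    `cifdir`, following the s_3_1301 example; this pattern is an inference,
    not something taken from the AYB documentation.
    """
    return [
        ["AYB", "-b", blockstring, "-i", cifdir, "-o", outdir, f"s_{lane}_{tile}"]
        for tile in tiles
    ]


if __name__ == "__main__":
    for argv in tile_commands(3, [1301, 1302, 1303], "R76R76", "cifdir", "outputdir"):
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Each printed line can then be submitted as a separate job, which is one way to spread a run across several machines.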
PhiX - 76 cycle control lane (27 tiles). Sanger Institute.
B. pertussis - 76 cycle paired-end data from a problematic run (100 tiles). Sanger Institute.
HiSeq - 101 cycle paired-end data from a HiSeq machine with PhiX spike-in. Illumina corp.
Ibis Test - 51 cycle test set of data distributed with the Ibis base-caller
AYB paper (Genome Biology open access)
All Your Base: a fast and accurate probabilistic approach to base calling. T. Massingham and N. Goldman (2012) Genome Biology 13:R13
Please direct any queries to email@example.com
20 December 2012
- Support for compressed output
- Add samplename to output
26 August 2012
- Support for reading coordinates from run folder
- Format of output read names now more in keeping with those from Illumina pipeline
31 May 2012
- AYB Version 2.11
- Thin missing data and general tidy up.
25 April 2012
- AYB Version 2.10
- Performance improvements including thin option. Memory leak fixed.
04 April 2012
- AYB Version 2.09
- Bug fixes to improve handling of certain patterns of missing data.
29 February 2012
- AYB paper published.
- Massingham and Goldman (2012) All Your Base: a fast and accurate probabilistic approach to base calling Genome Biology 13:R13
21 February 2012
- AYB Version 2.08
- Option to use spike-in data to improve base calling and calibrate qualities.
16 December 2011
- AYB Version 2.07
- Option to run with multiple threads (with OpenMP).
01 December 2011
- AYB Version 2.06
- Implement improved quality scoring and robustness as in AYBg
26 October 2011
- AYBg update
- Improved quality scoring and robustness fixes
- Basis for revised manuscript
18 October 2011
- AYB Version 2.05
- Implement improved modelling algorithm as in AYBg
14 Sept 2011
- AYBg compilation fixes on Linux, reported by Yves Wetzels
- Turn on OpenMP support, compile optimised by default
- Remove dependency on Fortran compiler (Mac + Linux)
1 Sept 2011
- AYBg update
- Addition of quality calculation and calibration. No changes to base-call accuracy
29 July 2011
- AYBg released
- Much improved accuracy over previous AYB versions due to generalised phasing model. Basis for manuscript.
22 July 2011
- AYB Version 2.04
- Memory use reduction and changes to sim runfile format (version 5)
10 May 2011
- AYB Version 2.03
- Automated module and system testing and modelling refactored (no function change)
17 Feb 2011
- AYB Version 2.02
- Quality calibration table now contains values to use
08 Feb 2011
- New release of recalibration tool produces values to use
21 Jan 2011
- AYB Version 2.01
- Cif from run-folder and quality calibration table
07 Dec 2010
- First release of AYB Version II
28 Nov 2010
- Performance improvements
- Improved estimation of phasing
07 Oct 2010
- Translation into C
21 May 2009
- Initial release