SYNOPSIS

AYB [-b blockstring] [-d input format] [-e log file] [-f output format] [-g generr] [-i input path] [-k] [-l log level] [-n iterations] [-o output path] [-p threads] [-q] [-r] [-s header] [-t factor] [-w level] [-A Parameter A] [-K spike-in path] [-M Crosstalk] [-N Noise] [-Q quality tab] prefix[+]/lane tile range [prefix[+]/lane tile range …]

AYB --help

AYB --licence

AYB --license

AYB --version

EXAMPLE

AYB intensities

will process cif file intensities in one block using 5 iterations and output a fastq file, both in the current directory with log messages to stderr.

DESCRIPTION

AYB is an advanced basecaller for the Illumina sequencing platform, producing basecalls and associated quality measures from raw intensity information.

AYB selects intensity files using the input option location (if any) and command line prefix arguments supplied. A prefix may also contain a full or partial path. If a prefix is followed by a ‘+’ then it is treated as a prefix, else the file match is exact.

Raw intensities can be either cif or standard illumina (txt) format. AYB looks for files matching one of the following templates:

cif

{prefix}[*].cif

txt

{prefix}[*]_int.txt*[.{zipext}]

If cif is selected then intensities may alternatively be located in multiple files in a run-folder. See the runfolder option for details.

The name of an intensities file without the extension (cif) or the part of the name up to the ‘_int’ (txt) will be referred to elsewhere as the ‘filename’.

AYB can process an intensities file as a single block or be instructed to group the data by cycle into multiple blocks and process separately. This allows for paired-end reads, tags and filtering of poor quality data. See the blockstring option for details.

The normal output from AYB is a sequence file written to the output option location (if any). The file format may be either fasta or fastq (option dataformat) and is named:

cif

{filename}[x].fasta/q

txt

{filename}[x]_seq.txt

The ‘x’ represents a, b, c … and is used only if multiple blocks are specified.

Program information messages, including errors, are written to stderr which can be redirected to a file in the standard way or through the logfile option.

OPTIONS

-A, --A <filepath>

Predetermined fixed Parameter A matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size 4 x ncycle by 4 x ncycle). Parameter A incorporates Crosstalk and Phasing. If supplied then Noise matrix file path must also be supplied. If not supplied then initially set from initial Crosstalk before estimation during modelling.

-b, --blockstring <Rn[InCn…]> [default: all in a single block]

How to group cycle data in intensity files for analysis, decoded as:

  • R ⇒ Read

  • I ⇒ Ignore

  • C ⇒ Concatenate onto previous block (first R must precede first C)

Some examples of use are:

  • R76R76 specifies two 76 cycle paired-end data blocks.

  • R50I26R50I26 specifies two 76 cycle paired-end data blocks trimmed so that only the first 50 bases of each are used e.g. because of poor quality.

  • R50I10C50 specifies a single 100 cycle block made up of two 50 cycle blocks separated by an unwanted 10 cycle block e.g. a tag.

-d, --dataformat <format> [default: cif]

Input format (cif/txt).

-e, --logfile <filepath> [default: none]

File path of message output (alternative to script redirect of error output). Program messages include information messages (selected options, input file processing, zero lambda count), errors and warnings.

-f, --format <format> [default: fastq]

Output format (fasta/fastq).

-g, --generr <num> [default: 38]

Generalised error value (higher value for higher quality scores).

-i, --input <path> [default: ""]

Location of input files. A prefix may also contain a full or partial path.

-k, --spikeuse

The spike-in data is used to calibrate the quality scores and the quality counts output file is not produced. Ignored if the spikein option is not selected.

-K, --spikein <path>

Location of spike-in data files. Files are named {filename}[x].spike (cif) or {filename}[x]_spike.txt (txt). Format is tab delimited with a cluster number and sequence on each line. The spike-in data is used to produce a table of counts of differences and observed for each quality value that can be used to calibrate quality scores. These are output as a tab delimited file with name {filename}[x].qspike (cif) or {filename}[x]_qspike.txt (txt). Quality scores are output without calibration unless the spikeuse option is selected.

-l, --loglevel <level> [default: warning]

Level of message output (none/fatal/error/information/warning/debug).

-M, --M <filepath>

Predetermined initial Crosstalk matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size 4 by 4). Used to initialise Parameter A matrix. If not supplied then a standard set of initial values are used. Ignored if fixed Parameter A and N matrices also supplied.

-n, --niter <num> [default: 5]

Number of model iterations.

-N, --N <filepath>

Predetermined fixed Noise matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size 4 by ncycle). If supplied then Parameter A matrix file path must also be supplied. If not supplied then initially set to zero before estimation during modelling.

-o, --output <path> [default: ""]

Location to create output files. Will be created if does not exist.

-p, --parallel <num> [default: 1]

Request multiple threads to speed up run time. Requesting more than available does not help performance.

-q, --noqualout

Do not output quality calibration table.

-Q, --qualtab <filepath>

Quality calibration table file path. Files are created in the correct format by the ayb_recal utility. Values used are output unless disabled (option noqualout). One output file per program run is created with name:

When option logfile used to redirect message output

{logname}.tab

Otherwise

ayb_xxxxxx_yymmdd_hhmm.tab where ‘xxxxxx’ is a random number string.

-r, --runfolder

Read cif files from a run-folder (supplied in the input option). The prefix is replaced by a lane tile (range) with format LnTn (Ln-nTn-n). An error will occur if cif input format is not selected. The run-folder sub-structure and filenames are prescribed as follows:

Sub-folder structure

/Data/Intensities/L00x/Cy.1/ where ‘x’ is the lane number and ‘y’ is the cycle number.

Single cycle filenames

s_x_z.cif where ‘x’ is the lane number and ‘z’ is the tile number.

Virtual intensities filename for output

s_x_zzzz where ‘x’ is the lane number and ‘zzzz’ is the tile number in 4 digits.

-s, --simdata <header>

Output simulation data as used by simNGS program (lambda fit and full covariance matrix). The header argument text is included in the file with limited interpretation. Spaces can be used if the whole thing is enclosed in double quotes ("). If quotes are required within the header then use either the double quote escape sequence (\") or single quotes ('). Use ANSI-C style bash quoting ($'…') to allow escape sequences such as newline (\n) to be interpreted. The output file name is {filename}.runfile (cif) or {filename}_runfile.txt (txt).

-t, --thin <factor> [default: 1]

Factor used to thin out clusters. Only every factor cluster, and those which are marked as spike-in data, will be used to estimate parameter values. The bases of all clusters will be called on the final iteration. Factor should be an integer greater than zero, with larger values decreasing runtime but potentially also decreasing accuracy.

-w, --working <level> [default: none]

Output final working values. All files up to a given level are created. Levels and files created are:

  • none

    No additional working will be output

    This is the default option.

  • matrices

    Parameter A and Noise matrices

    Format as predetermined matrix input. Filenames {filename}[x].A/N (cif) or {filename}[x]_A/N.txt (txt).

  • values

    Final model values

    Format as a collection of matrices. Filenames {filename}[x].final (cif) or {filename}[x]_final.txt (txt).

  • processed

    Final processed intensities

    Format as intensities input, cif or txt. Filenames {filename}[x].pif (cif) or {filename}[x]_pif.txt (txt).

-z, --zerothin <num> [default: 3]

Thin clusters with too many zero data cycles (those with num or more). See the thin option for details of the effects of thinning. Can be disabled by using a num larger than ncycle.

--help

Display this help.

--licence
--license

Display AYB licence information.

--version

Display AYB version information.

DIAGNOSTICS

Program Behaviour

AYB will issue an error message and stop if:

  • No prefix (or lane tile range) argument is supplied.

  • There is an error in the program options.

  • A predetermined input matrix or quality calibration table cannot be read.

  • A sequence or message file cannot be written to.

AYB will issue an error message and go on to the next prefix (or lane tile range) if:

  • There are no intensities files matching a prefix.

  • An intensities file does not contain enough cycles for the specified blockstring.

  • A lane tile range contains a syntax error.

AYB will issue an error message and go on to the next intensities file (or lane tile) if:

  • An intensities file cannot be read.

  • A run-folder lane tile does not exist.

AYB will issue an error message and go on to the next block if:

  • A predetermined input matrix is the wrong size.

  • The program runs out of memory to process.

FAQ

What is an ‘N’ base call?

‘N’ indicates that all the raw intensities for that cycle had value zero.

What causes a sequence to be all A’s with quality ‘!’?

Lambda has evaluated to zero for that cluster meaning base calls cannot be made. Zero lambda counts (if any) are shown in the message log. In newer versions base calls are made and zero lambdas are indicated by qualities being all set to ‘"’.

TO DO

Quality scores can be calibrated to be in line with empirical observations using a table. A default table is supplied and a description of how to adjust the table for local observations is to follow.

AUTHOR

Written by Hazel Marsden <hazelm@ebi.ac.uk> and Tim Massingham <tim.massingham@ebi.ac.uk>.

Contains the Non-Negative Least Squares routine of Charles L. Lawson and Richard J. Hanson (Jet Propulsion Laboratory, 1973). See http://www.netlib.org/lawson-hanson/ for details.

RESOURCES

COPYING

Copyright © 2010 European Bioinformatics Institute. Free use of this software is granted under the terms of the GNU General Public License (GPL). See the file COPYING in the AYB distribution or http://www.gnu.org/licenses/gpl.html for details.