SYNOPSIS

AYB [-b blockstring] [-c composition] [-d input format] [-e log file] [-f output format] [-i input path] [-l log level] [-m mu] [-n iterations] [-o output path] [-q] [-r] [-s header] [-w] [-M Crosstalk] [-N Noise] [-P Phasing] [-Q quality tab] [-S Solver] prefix[+]/lane tile range [prefix[+]/lane tile range …]

AYB --help

AYB --licence

AYB --license

AYB --version

EXAMPLE

AYB intensities

will process the cif file intensities.cif as a single block using 5 iterations and write a fastq file. Both files are in the current directory and log messages go to stderr.

DESCRIPTION

AYB is an advanced basecaller for the Illumina sequencing platform, producing basecalls and associated quality measures from raw intensity information.

AYB selects intensity files using the input option location (if any) and the prefix arguments supplied on the command line. A prefix may also contain a full or partial path. If a prefix is followed by a ‘+’ then it is treated as a prefix and matches any file whose name starts with it, else the file match is exact.

Raw intensities can be in either cif or standard Illumina (txt) format. AYB looks for files matching one of the following templates:

cif

{prefix}[*].cif

txt

{prefix}[*]_int.txt*[.{zipext}]

If cif is selected then intensities may alternatively be located in multiple files in a run-folder. See the runfolder option for details.

The name of an intensities file without the extension (cif) or the part of the name up to the ‘_int’ (txt) will be referred to elsewhere as the ‘filename’.
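As an illustration of the naming rules above, the following Python sketch derives the ‘filename’ from an intensities file name and tests it against a prefix. The function names are hypothetical and not part of AYB, compressed-file extensions are simply ignored, and an exact match is assumed to mean that the prefix equals the whole filename.

```python
import re


def intensity_filename(name, dataformat="cif"):
    """Return the 'filename' part of an intensities file name, or None."""
    if dataformat == "cif":
        # cif: the name without the .cif extension
        return name[:-4] if name.endswith(".cif") else None
    # txt: the part of the name up to '_int' (anything after '.txt' is ignored)
    match = re.match(r"(.*)_int\.txt", name)
    return match.group(1) if match else None


def matches_prefix(name, prefix, dataformat="cif", partial=False):
    """partial=True mimics a prefix followed by '+'; otherwise match exactly."""
    filename = intensity_filename(name, dataformat)
    if filename is None:
        return False
    return filename.startswith(prefix) if partial else filename == prefix


print(intensity_filename("s_1_0001.cif"))                    # s_1_0001
print(intensity_filename("s_1_0001_int.txt.gz", "txt"))      # s_1_0001
print(matches_prefix("s_1_0001.cif", "s_1", partial=True))   # True
print(matches_prefix("s_1_0001.cif", "s_1"))                 # False
```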

AYB can process an intensities file as a single block or be instructed to group the data by cycle into multiple blocks and process each block separately. This allows for paired-end reads, tags and filtering of poor quality data. See the blockstring option for details.

The normal output from AYB is a sequence file written to the output option location (if any). The file format may be either fasta or fastq (option format) and the file is named:

cif

{filename}[x].fasta/q

txt

{filename}[x]_seq.txt

The ‘x’ represents a, b, c … and is used only if multiple blocks are specified.
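The output naming scheme can be sketched in Python as follows. The helper name is hypothetical, and the sketch assumes no more than 26 blocks so that a single letter is enough for the block suffix.

```python
import string


def output_names(filename, nblocks, dataformat="cif", fmt="fastq"):
    """List the sequence file names expected for one intensities file."""
    # The block letter is used only when more than one block is specified.
    suffixes = [""] if nblocks == 1 else list(string.ascii_lowercase[:nblocks])
    if dataformat == "cif":
        return [f"{filename}{x}.{fmt}" for x in suffixes]
    return [f"{filename}{x}_seq.txt" for x in suffixes]


print(output_names("s_1_0001", 1))                    # ['s_1_0001.fastq']
print(output_names("s_1_0001", 2, fmt="fasta"))       # ['s_1_0001a.fasta', 's_1_0001b.fasta']
print(output_names("s_1_0001", 2, dataformat="txt"))  # ['s_1_0001a_seq.txt', 's_1_0001b_seq.txt']
```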

Program information messages, including errors, are written to stderr, which can be redirected to a file in the standard way or through the logfile option.

OPTIONS

-b, --blockstring <Rn[InCn…]> [default: all in a single block]

How to group cycle data in intensity files for analysis, decoded as:

  • R ⇒ Read

  • I ⇒ Ignore

  • C ⇒ Concatenate onto previous block (first R must precede first C)
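To make the decoding concrete, here is a minimal Python sketch of a blockstring parser. It is not AYB's own parser: the function name is hypothetical, and it assumes that the number after each letter is a count of consecutive cycles and that letters are upper case.

```python
import re


def parse_blockstring(blockstring):
    """Return one list of zero-based cycle indices per block."""
    if not re.fullmatch(r"(?:[RIC]\d+)+", blockstring):
        raise ValueError(f"invalid blockstring: {blockstring!r}")
    blocks = []
    cycle = 0
    for code, count in re.findall(r"([RIC])(\d+)", blockstring):
        cycles = list(range(cycle, cycle + int(count)))
        if code == "R":
            # Read: start a new block
            blocks.append(cycles)
        elif code == "C":
            # Concatenate: extend the previous block, which must exist
            if not blocks:
                raise ValueError("first R must precede first C")
            blocks[-1].extend(cycles)
        # Ignore: the cycles are skipped and belong to no block
        cycle += int(count)
    return blocks


# Read 3 cycles, ignore 2, add 2 more to the first block, then read 4 as a second block.
print(parse_blockstring("R3I2C2R4"))
# [[0, 1, 2, 5, 6], [7, 8, 9, 10]]
```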

-c, --composition <proportion GC> [default: 0.5]

The GC content of the material being sequenced, for use as a prior when calling bases. The default setting is equivalent to an equal prior on all bases. The composition should be a proportion strictly between zero and one.
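One plausible reading of this option, shown as a Python sketch, splits the GC proportion evenly between G and C and the remainder evenly between A and T. The even split and the function name are assumptions made for illustration; the text above only states that the default of 0.5 gives an equal prior on all bases.

```python
def base_prior(gc):
    """Prior probability of each base for a given GC proportion."""
    if not 0.0 < gc < 1.0:
        raise ValueError("composition must be strictly between zero and one")
    at = (1.0 - gc) / 2.0  # share of each of A and T
    cg = gc / 2.0          # share of each of C and G
    return {"A": at, "C": cg, "G": cg, "T": at}


print(base_prior(0.5))  # {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
```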

-d, --dataformat <format> [default: cif]

Input format (cif/txt).

-e, --logfile <filepath> [default: none]

File path for message output (an alternative to redirecting error output in the calling script). Program messages include information messages (selected options, input file processing, zero lambda count), errors and warnings.

-f, --format <format> [default: fastq]

Output format (fasta/fastq).

-i, --input <path> [default: ""]

Location of input files. A prefix may also contain a full or partial path.

-l, --loglevel <level> [default: warning]

Level of message output (none/fatal/error/warning/information/debug).

-m, --mu <num> [default: 1.0E-5]

Adjust range of quality scores (smaller value for higher maximum quality score).

-M, --M <filepath>

Predetermined Crosstalk matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size 4 x 4). If not supplied then a standard set of initial values is used.

-n, --niter <num> [default: 5]

Number of model iterations.

-N, --N <filepath>

Predetermined Noise matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size 4 x ncycle). If not supplied then initially set to zero.

-o, --output <path> [default: ""]

Location to create output files. The location will be created if it does not exist.

-P, --P <filepath>

Predetermined Phasing matrix file path. Format is a list of columns, one column per row with the first row containing the number of rows and columns (size ncycle x ncycle). If not supplied then initially set to identity.
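The matrix file layout shared by the M, N and P options can be sketched in Python as below. The text does not state the field separator, so a single space is an assumption here, as are the function name and the number formatting.

```python
def format_matrix(matrix):
    """Format a matrix (a list of rows) as a size line, then one column per line."""
    nrow, ncol = len(matrix), len(matrix[0])
    lines = [f"{nrow} {ncol}"]  # first row: number of rows and columns
    for j in range(ncol):
        # each following row of the file holds one column of the matrix
        lines.append(" ".join(f"{matrix[i][j]:g}" for i in range(nrow)))
    return "\n".join(lines) + "\n"


# A 2 x 3 matrix is written as three lines of two values after the size line.
print(format_matrix([[1, 2, 3], [4, 5, 6]]), end="")
# 2 3
# 1 4
# 2 5
# 3 6
```

An identity matrix of size ncycle x ncycle formatted this way would correspond to the documented starting value for Phasing.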

-q, --noqualout

Do not output quality calibration table.

-Q, --qualtab <filepath>

Quality calibration table file path. Files are created in the correct format by the ayb_recal utility. Values used are output unless disabled (option noqualout). One output file per program run is created with name:

When option logfile used to redirect message output

{logname}.tab

Otherwise

ayb_xxxxxx_yymmdd_hhmm.tab where ‘xxxxxx’ is a random number string.

-r, --runfolder

Read cif files from a run-folder (supplied in the input option). The prefix is replaced by a lane tile (range) with format LnTn (Ln-nTn-n). An error will occur if cif input format is not selected. The run-folder sub-structure and filenames are prescribed as follows:

Sub-folder structure

/Data/Intensities/L00x/Cy.1/ where ‘x’ is the lane number and ‘y’ is the cycle number.

Single cycle filenames

s_x_z.cif where ‘x’ is the lane number and ‘z’ is the tile number.

Virtual intensities filename for output

s_x_zzzz where ‘x’ is the lane number and ‘zzzz’ is the tile number in 4 digits.
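The run-folder layout can be illustrated with a short Python sketch that expands a lane tile range and builds the expected paths. The function names are hypothetical. The sketch also assumes that the lane is zero-padded to three digits in the sub-folder name, that the tile number is unpadded in the single cycle filename, and that a range may be given for the lane, the tile, or both.

```python
import re


def parse_lane_tile(spec):
    """Expand 'LnTn' or 'Ln-nTn-n' into a list of (lane, tile) pairs."""
    match = re.fullmatch(r"L(\d+)(?:-(\d+))?T(\d+)(?:-(\d+))?", spec)
    if not match:
        raise ValueError(f"invalid lane tile range: {spec!r}")
    lane_lo, lane_hi, tile_lo, tile_hi = match.groups()
    lanes = range(int(lane_lo), int(lane_hi or lane_lo) + 1)
    tiles = range(int(tile_lo), int(tile_hi or tile_lo) + 1)
    return [(lane, tile) for lane in lanes for tile in tiles]


def cycle_path(runfolder, lane, tile, cycle):
    """Path of the cif file holding one cycle of one lane tile."""
    return f"{runfolder}/Data/Intensities/L{lane:03d}/C{cycle}.1/s_{lane}_{tile}.cif"


def virtual_filename(lane, tile):
    """Virtual intensities filename used for output, with a 4 digit tile."""
    return f"s_{lane}_{tile:04d}"


print(parse_lane_tile("L1-2T3-4"))   # [(1, 3), (1, 4), (2, 3), (2, 4)]
print(cycle_path("run", 1, 3, 5))    # run/Data/Intensities/L001/C5.1/s_1_3.cif
print(virtual_filename(1, 3))        # s_1_0003
```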

-s, --simdata <header>

Output simulation data as used by the simNGS program (lambda fit and full covariance matrix). The header argument text is included in the file with limited interpretation. Spaces can be used if the whole header is enclosed in double quotes ("). If quotes are required within the header then use either the double quote escape sequence (\") or single quotes ('). Use ANSI-C style bash quoting ($'…') to allow escape sequences such as newline (\n) to be interpreted. The output file name is {filename}.runfile (cif) or {filename}_runfile.txt (txt).
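The quoting rules for the header can be previewed without running AYB. The Python sketch below uses the standard shlex module, which approximates POSIX shell word splitting, to show the header text that would reach the program. The helper name and the header strings are made up for illustration, and shlex does not implement the bash $'…' form, so that case is not shown.

```python
import shlex


def header_argument(command):
    """Return the argument following -s after shell-style word splitting."""
    words = shlex.split(command)
    return words[words.index("-s") + 1]


# Spaces are kept when the whole header is inside double quotes.
print(header_argument('AYB -s "run 1 header" intensities'))      # run 1 header
# Double quotes inside the header use the escape sequence \".
print(header_argument(r'AYB -s "lab \"A\" run" intensities'))    # lab "A" run
# Single quotes inside double quotes need no escape.
print(header_argument("AYB -s \"lab 'A' run\" intensities"))     # lab 'A' run
```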

-S, --solver <solver> [default: zero]

Linear equation solver to use for P matrix. Options are:

  • ls least squares, allow negatives.

  • zero least squares then set negatives to zero.

  • nnls non-negative least squares.
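The difference between the three choices can be seen on a tiny system. The Python sketch below is not AYB's solver code: it uses normal equations for the least squares fit and a brute force search over column subsets for the non-negative fit, which is practical only for very small problems, and it assumes the matrix has full column rank. All function names are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_squares(a, b, keep=None):
    """Least squares fit using only the columns in keep; the rest stay at zero."""
    ncol = len(a[0])
    keep = list(range(ncol)) if keep is None else list(keep)
    ata = [[sum(row[i] * row[j] for row in a) for j in keep] for i in keep]
    atb = [sum(row[i] * y for row, y in zip(a, b)) for i in keep]
    x = [0.0] * ncol
    for i, value in zip(keep, solve(ata, atb)):
        x[i] = value
    return x


def residual(a, b, x):
    """Sum of squared differences between a x and b."""
    return sum((sum(r * v for r, v in zip(row, x)) - y) ** 2 for row, y in zip(a, b))


def solve_zero(a, b):
    """Least squares, then set negatives to zero."""
    return [max(v, 0.0) for v in least_squares(a, b)]


def solve_nnls(a, b):
    """Best non-negative fit, found by trying every subset of free columns."""
    ncol = len(a[0])
    best = [0.0] * ncol
    for size in range(1, ncol + 1):
        for keep in combinations(range(ncol), size):
            x = least_squares(a, b, keep)
            if min(x) >= 0.0 and residual(a, b, x) < residual(a, b, best):
                best = x
    return best


a = [[1.0, 0.0], [1.0, 1.0]]
b = [1.0, 0.5]
print(least_squares(a, b))  # [1.0, -0.5]  (ls: a negative value is allowed)
print(solve_zero(a, b))     # [1.0, 0.0]   (zero: the negative is clipped)
print(solve_nnls(a, b))     # [0.75, 0.0]  (nnls: refitted under the constraint)
```

In this example the clipped solution has a squared residual of 0.25 while the non-negative fit achieves 0.125, which shows why the two options can give different answers.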

-w, --working

Output final working values. Files created are:

Final processed intensities

Format as intensities input, cif or txt. Filenames {filename}[x].pif (cif) or {filename}[x]_pif.txt (txt).

Final model values

Format as a collection of matrices. Filenames {filename}[x].final (cif) or {filename}[x]_final.txt (txt).

Crosstalk, Noise and Phasing matrices

Format as predetermined matrix input. Filenames {filename}[x].M/N/P (cif) or {filename}[x]_M/N/P.txt (txt).

--help

Display this help.

--licence
--license

Display AYB licence information.

--version

Display AYB version information.

DIAGNOSTICS

Program Behaviour

AYB will issue an error message and stop if:

  • No prefix (or lane tile range) argument is supplied.

  • There is an error in the program options.

  • A predetermined input matrix or quality calibration table cannot be read.

  • A sequence or message file cannot be written to.

AYB will issue an error message and go on to the next prefix (or lane tile range) if:

  • There are no intensities files matching a prefix.

  • An intensities file does not contain enough cycles for the specified blockstring.

  • A lane tile range contains a syntax error.

  • A predetermined input matrix is the wrong size.

  • The program runs out of memory during processing.

AYB will issue an error message and go on to the next intensities file (or lane tile) if:

  • An intensities file cannot be read.

  • A run-folder lane tile does not exist.

FAQ

What is an ‘N’ base call?

‘N’ indicates that all the raw intensities for that cycle had value zero.

What causes a sequence to be all A’s with quality ‘!’?

Lambda has evaluated to zero for that cluster, meaning that base calls cannot be made. Zero lambda counts (if any) are shown in the message log.

TO DO

Quality scores are calibrated to be in line with empirical observations using a table. A default table is supplied and a description of how to adjust the table for local observations is to follow.

AUTHOR

Written by Hazel Marsden <hazelm@ebi.ac.uk> and Tim Massingham <tim.massingham@ebi.ac.uk>.

Contains the Non-Negative Least Squares routine of Charles L. Lawson and Richard J. Hanson (Jet Propulsion Laboratory, 1973). See http://www.netlib.org/lawson-hanson/ for details.

RESOURCES

COPYING

Copyright © 2010 European Bioinformatics Institute. Free use of this software is granted under the terms of the GNU General Public License (GPL). See the file COPYING in the AYB distribution or http://www.gnu.org/licenses/gpl.html for details.