Help - About Gaps In Sequence Alignments


Introduction

A gap is a maximal consecutive run of spaces in a single string of a given alignment. It corresponds to an atomic insertion or deletion of a substring.

Causes of gaps



Example of two aligned sequences:



ATCTTCAGTGTTTCCCCAACCTGTTTTGCGCC..AGCCTTTCAGTTCCGCTTCTATTTTCTCAATCGCGCCGC
|||||||||||||| ||   ||||||||| ||  |   ||| ||||| || || | ||| || || || || |
ATCTTCAGTGTTTCGCCTGTCTGTTTTGCACCGGAATTTTTGAGTTCTGCCTCGAGTTTATCGATAGCCCCAC


Gaps can occur


Gap Penalties

Introducing gaps into a sequence alignment allows the alignment to be extended into regions where one sequence may have lost or gained characters not found in the other. A penalty is subtracted for each gap introduced because a gap increases the uncertainty of the alignment. If the gap penalty is too low, a high alignment score is achievable even between unrelated or random sequences. If gaps are introduced without any penalty, they can be placed at random until all characters are aligned, even in random sequences.

The gap penalty helps decide whether or not to accept a gap or insertion in an alignment when a good residue-to-residue alignment can be achieved at some neighbouring point in the sequence. Gaps and insertions cannot be allowed without penalty, because an unreasonably 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue at a position than to have parts of its sequence removed or inserted. Gaps and insertions should therefore be rarer than point mutations (substitutions).

Thus, when aligning two sequences it is often necessary to insert gaps in order to optimise the alignment. This can be done on the basis of identities alone, inserting gaps wherever there are no matches. However, this is not recommended for biological sequence comparisons because similarities are then not taken into consideration. Instead, a scoring scheme, often referred to as a comparison matrix, is used. It gives a high positive score when identical residues or bases are aligned, a slightly lower score when the aligned pair is similar (i.e. a conservative substitution), and a negative score for aligned pairs which are not biologically significant.
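The three-tier scoring idea described above can be sketched in a few lines of Python. This is an illustration only: the conservative groups and the score values (5, 2, -3) are assumptions made up for this example, and the names pair_score and score_ungapped are hypothetical. Real comparisons use a published matrix such as one of the PAM or BLOSUM series.

```python
# Toy scoring scheme: identical > conservative substitution > anything else.
# The groups and score values below are illustrative assumptions only.
CONSERVATIVE_GROUPS = [set("ILVM"), set("DE"), set("KR"), set("ST"), set("FYW")]


def pair_score(x, y, identity=5, conservative=2, other=-3):
    """Score one aligned pair of residues under the toy scheme."""
    if x == y:
        return identity
    if any(x in group and y in group for group in CONSERVATIVE_GROUPS):
        return conservative
    return other


def score_ungapped(a, b):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


# L/I is conservative (+2), K/K is identical (+5), D/G is neither (-3).
print(score_ungapped("LKD", "IKG"))  # prints 4
```

Scoring on identities alone would treat the L/I pair exactly like the D/G pair, which is the loss of information the paragraph above warns about.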

When two sequences are aligned, a diagonal is created which depicts the best alignment path between them. This diagonal may be broken in places due to mismatches. If there are too many mismatches, the diagonal is subdivided into several smaller ones. To improve the alignment, gap initiation and gap extension penalties are introduced, which are subtracted from the total alignment score.

In general, the lower the gap penalties, the more gaps and the more identities are detected, but this should be weighed against biological significance.
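The effect of lowering the gap penalty can be seen with a small global alignment routine. The sketch below is a textbook Needleman-Wunsch dynamic programme, not the algorithm used by FASTA, BLAST or ClustalW, and it makes two simplifying assumptions: a single per-position (linear) gap penalty instead of separate initiation and extension penalties, and match/mismatch scores of +1/-1. The function name global_align is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b with a linear gap penalty.

    Returns (aligned_a, aligned_b, score); gaps are shown as '.'.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(".")
            i -= 1
        else:
            out_a.append(".")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Two unrelated sequences, aligned with a free gap and with a costly gap.
for penalty in (0, -5):
    top, bottom, total = global_align("ACGT", "TGCA", gap=penalty)
    print(f"gap={penalty}: {top} / {bottom} (score {total})")
```

With a penalty of 0 the routine avoids every mismatch by inserting gaps, so even unrelated sequences receive a positive score. With a penalty of -5 the same pair is aligned without gaps and scores -4.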

Adjusting gap penalties

FASTA, BLAST and ClustalW use slightly different terms to refer to gap initiation and gap extension penalties. In general, gapopen and opengap refer to the former, while gapext and extendgap refer to the latter.

Later improvements to these programs include the option to penalise gaps in the database sequences and in the query sequences separately. In ClustalW2, separate penalties exist for the length of a gap, for closing a gap and for introducing a pairwise gap in both sequences.

Gap penalty values are designed to reduce the score when an alignment has been broken by an insertion in one of the sequences. The value should be small enough to allow a previously accumulated alignment to continue across an insertion in one of the sequences. In other words, it should not be so large that the previous alignment score is removed completely.

You could tweak the gap open and gap extension penalties (which together produce the overall gap penalty) in 2 ways:

For example, consider the following pairs of sequences:

A sequence with a short gap:


ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC

A sequence with a long gap:

ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
|||||||||||||||||||||||||||                    |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGCCCCCCCCCCCCCCCCCCCATTTAGTTCGCTC
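How the two penalties combine for gaps like these can be worked through in Python. The sketch assumes the convention described on this page, in which the gap open penalty is charged for the first residue of a gap and the gap extension penalty for each additional residue, and it borrows the FASTA DNA defaults quoted on this page (-16 and -4). The function names gap_lengths and gap_penalty are hypothetical, and the 20-residue gap is a length chosen for illustration.

```python
import re


def gap_lengths(aligned, gap_char="."):
    """Return the length of every maximal run of gap characters in an aligned string."""
    return [len(m.group()) for m in re.finditer(re.escape(gap_char) + "+", aligned)]


def gap_penalty(length, gap_open, gap_ext):
    """Penalty for one gap: gap_open for the first residue, gap_ext for each further one."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_ext


GAP_OPEN, GAP_EXT = -16, -4  # FASTA DNA defaults as quoted on this page

short = "ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC"
print(gap_lengths(short))                    # prints [1]
print(gap_penalty(1, GAP_OPEN, GAP_EXT))     # prints -16

# One gap of 20 residues (illustrative length) versus 20 separate single-residue gaps.
print(gap_penalty(20, GAP_OPEN, GAP_EXT))    # prints -92
print(20 * gap_penalty(1, GAP_OPEN, GAP_EXT))  # prints -320
```

Because the extension penalty is smaller than the open penalty, one long gap costs far less than the same number of gapped positions scattered as single-residue gaps. Setting the open penalty close to zero removes that preference.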


Programs and Gaps

Utility: FASTA3, BLAST2 and ClustalW2

GAPOPEN or OPENGAP or OPEN GAP PENALTY: Penalty for the first residue in a gap (e.g. FASTA defaults: -12 for proteins, -16 for DNA).

GAPEXT or EXTENDGAP or EXTEND GAP PENALTY: Penalty for additional residues in a gap (e.g. FASTA defaults: -2 for proteins, -4 for DNA).

Utility: ClustalW2

ENDGAP: Penalty for closing a gap.

GAPDIST: Penalty for gap separation.

PAIRGAP: Penalty for generating pairwise gaps.

Utility: NCBI-BLAST2, BLAST2 EVEC

OPENGAP: The gap open penalty is the score taken away for the initiation of a gap in sequence or in structure. To make the match more significant you can try making the gap open penalty larger. This will decrease the number of gaps, and if you have a good alignment without many gaps, its Z-score will be higher.

EXTENDGAP: The gap extension penalty is added to the gap open penalty for each base or residue in the gap. This is how long gaps are penalised. If you don't like long gaps, just increase the gap extension penalty. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap open penalty. An exception is where one or both sequences are single reads with possible sequencing errors, in which case you would expect many single-base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring.

GAPALIGN: This is a y/n (true/false) option that tells the program to perform optimised alignments within regions involving gaps. If set to TRUE, the program will perform an alignment using gaps. Otherwise it will report only the individual HSPs where the two sequences match each other.