Help - About Gaps In Sequence Alignments
Introduction
A gap is a maximal consecutive run of spaces in a single string of a given alignment. It corresponds to an atomic insertion or deletion of a substring.
Causes of gaps
- A single mutation can create a gap (very common).
- Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
- DNA slippage in the replication procedure can result in the repetition of a string.
- Retrovirus insertions.
- Translocations of DNA between chromosomes.
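Gaps can occur
- Before the first character of a string
- Inside a string
- After the last character of a string
Example of 2 aligned sequences showing all three cases:
CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-
The second sequence starts with a gap (--) before its first character. Both sequences contain gaps inside the string (--- in the first, two single-position gaps in the second). The second sequence ends with a gap after its last character.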
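The definition of a gap as a maximal run of spaces can be made concrete with a short sketch. The helper name `find_gaps` is hypothetical and introduced only for illustration. It assumes gaps are written with `-` or `.`, as in the examples on this page, and reports the start position (counting from 0) and the length of each run.

```python
import re


def find_gaps(aligned, gap_chars="-."):
    """Return (start, length) for every maximal run of gap characters.

    `gap_chars` is an assumption: the characters used to draw gaps.
    """
    pattern = "[" + re.escape(gap_chars) + "]+"
    return [(m.start(), m.end() - m.start()) for m in re.finditer(pattern, aligned)]


print(find_gaps("CTGCGGG---GGTAAT"))  # [(7, 3)]
print(find_gaps("--GCGG-AGAGG-AA-"))  # [(0, 2), (6, 1), (12, 1), (15, 1)]
```

In the second sequence the run starting at position 0 is a gap before the first character, the runs at positions 6 and 12 are gaps inside the string, and the run at position 15 is a gap after the last character.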
Gap Penalties
Introducing gaps into sequence alignments allows the alignment to be extended into regions where one sequence may have lost or gained sequence characters not found in the other. A penalty is subtracted for each gap introduced because a gap increases the uncertainty of an alignment. If the gap penalty is too low, a high alignment score is achievable even between unrelated or random sequences. If gaps are introduced without any penalty, then they can be placed at random and eventually all characters will be aligned, even in random sequences.
The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good residue-to-residue alignment at some other neighbouring point in the sequence. Gaps and insertions cannot be allowed without penalty, because an unreasonably 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue in a position than to have parts of its sequence removed or inserted. Gaps and insertions should therefore be rarer than point mutations (substitutions).
Thus, when aligning two sequences it is often necessary to insert gaps in them in order to optimise the alignment. This can be done on the basis of identities alone, inserting gaps as required where there are no matches. However, this is not recommended for biological sequence comparisons because similarities are then not taken into consideration. Instead, a scoring scheme, often referred to as a comparison matrix, is used. It gives a high positive score when identical residues or bases are aligned, a slightly lower score when the aligned pair is similar (i.e. a conservative substitution), and negative scores for aligned pairs which are not biologically significant.
When two sequences are aligned, a diagonal is created which depicts the best alignment path between them. This diagonal may be broken in places due to mismatches. If there are too many of these, the diagonal is subdivided into several smaller ones. To improve the alignment, gap initiation and gap extension penalties are introduced, which penalise the total alignment score.
In general, the lower the gap penalties, the more gaps and the more identities are detected, but this should be considered in relation to biological significance.
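The effect of the gap penalty on the number of gaps can be illustrated with a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). This is a simplification, not the method used by FASTA, BLAST or ClustalW: it assumes a single penalty per gap position rather than separate initiation and extension penalties, and the function name `global_align` and the match/mismatch scores of +1/-1 are illustrative choices.

```python
def global_align(a, b, match=1, mismatch=-1, gap=2):
    """Global alignment with a linear gap penalty (`gap` per gap position).

    Returns (score, aligned_a, aligned_b). All scoring values are
    illustrative assumptions, not defaults of any real program.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] - gap,      # gap in b
                score[i][j - 1] - gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_align("AAAC", "CAAA", gap=0))   # (3, '-AAAC', 'CAAA-')
print(global_align("AAAC", "CAAA", gap=10))  # (0, 'AAAC', 'CAAA')
```

With no gap penalty the alignment gains an extra identity (three instead of two) at the cost of two gaps. With a high penalty the sequences are aligned without gaps and the mismatches are accepted instead.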
Adjusting gap penalties
FASTA, BLAST and ClustalW use slightly different terms to refer to gap initiation and gap extension penalties. In general, gapopen and opengap are the former while gapext and extendgap are the latter. Some of the later improvements to these programs include the possibility to penalise gaps on the database sequences and on the query sequences separately. In ClustalW2, gap penalties exist which separately penalise the length of a gap, closing a gap and the introduction of a pairwise gap in both sequences.
Gap penalty values are designed to reduce the score when an alignment has been broken by an insertion in one of the sequences. The value should be small enough to allow a previously accumulated alignment to continue across an insertion in one of the sequences. In other words, it should not be so large that the previously accumulated alignment score is removed completely.
You could tweak gap open and gap extension penalties (which combined produce the overall gap penalty) in 2 ways:
A sequence with a short gap:
ATCTTCAGTGTTTCCCCTGTTTTGCCC.ATTTAGTTCGCTC
||||||||||||||||||||||||||| |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGATTTAGTTCGCTC
A sequence with a long gap:
ATCTTCAGTGTTTCCCCTGTTTTGCCC....................ATTTAGTTCGCTC
|||||||||||||||||||||||||||                    |||||||||||||
ATCTTCAGTGTTTCCCCTGTTTTGCCCGCCCCCCCCCCCCCCCCCCCATTTAGTTCGCTC
- Keep the score similar regardless of gap length.
Allow a constant overall gap penalty regardless of gap length, in other words have a zero gap extension penalty and just penalise when you open a gap.
These types of penalty schemes assume that sequences are just as likely to change by large as by small insertions and deletions. This will penalise a large gap by the same extent as a small gap.
- Make the penalty grow as a linear function of gap length:
Have a larger gap opening penalty followed by a gap extension penalty that is smaller than the gap open penalty.
Because every new gap incurs the opening penalty again, this will penalise several small gaps more heavily than 1 large gap of the same total length.
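The two schemes can be compared with a small sketch. The function name `gap_penalty` is hypothetical, penalties are written as positive magnitudes to be subtracted from the score, and the convention assumed here is the one described for FASTA on this page: the opening penalty covers the first residue in a gap and the extension penalty applies to each additional residue. Other programs may count the first residue differently. The values 12 and 2 are the FASTA protein defaults quoted in the table; the gap lengths 1 and 20 are those of the short-gap and long-gap examples.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Total penalty for one gap of `length` positions.

    Assumed convention: `gap_open` is charged for the first position,
    `gap_extend` for each additional position.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


short_gap, long_gap = 1, 20  # gap lengths in the two example alignments

# Scheme 1: zero extension penalty, so the penalty is constant.
print(gap_penalty(short_gap, 12, 0), gap_penalty(long_gap, 12, 0))  # 12 12

# Scheme 2: the penalty grows linearly with gap length.
print(gap_penalty(short_gap, 12, 2), gap_penalty(long_gap, 12, 2))  # 12 50
```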
| Utility | Details |
|---|---|
| FASTA3, BLAST2 and ClustalW2 | GAPOPEN or OPENGAP or OPEN GAP PENALTY: Penalty for the first residue in a gap (e.g. FASTA defaults: -12 with proteins, -16 for DNA). GAPEXT or EXTENDGAP or EXTEND GAP PENALTY: Penalty for additional residues in a gap (e.g. FASTA defaults: -2 with proteins, -4 for DNA). |
| ClustalW2 | ENDGAP: Penalty for closing a gap. |
| NCBI-BLAST2, BLAST2 EVEC. | OPENGAP: The gap open penalty is the score taken away for the initiation of the gap in sequence or in structure. To make the match more significant you can try to make the gap penalty larger. It will decrease the number of gaps and if you have good alignment without many gaps, its Z-score will be higher. EXTENDGAP: The gap extension penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalised. If you don't like long gaps, just increase the extension gap penalty. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. GAPALIGN: This is a y/n (true/none) that tells the program to perform optimised alignments within regions involving GAPS. If set to TRUE the program will perform an alignment using GAPS. Otherwise it will report only individual HSP where two sequences match each other. |

ATCTTCAGTGTTTCCCCAACCTGTTTTGCGCC..AGCCTTTCAGTTCCGCTTCTATTTTCTCAATCGCGCCGC
|||||||||||||| ||   ||||||||| ||  |   ||| ||||| || || | ||| || || || || |
ATCTTCAGTGTTTCGCCTGTCTGTTTTGCACCGGAATTTTTGAGTTCTGCCTCGAGTTTATCGATAGCCCCAC