2Can Support Portal - About Gaps
Introduction
A gap is a maximal consecutive run of spaces in a single string of a given alignment. It corresponds to an atomic insertion or deletion of a substring.
Causes of gaps
- A single mutation can create a gap (very common).
- Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
- DNA slippage in the replication procedure can result in the repetition of a string.
- Retrovirus insertions.
- Translocations of DNA between chromosomes.
example of 2 aligned sequences:
Gaps can occur
- Before the first character of a string
- Inside a string
- After the last character of a string
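As a minimal sketch of the definition above, the following Python snippet lists the gaps in one row of an alignment as maximal runs of a gap character and reports where each one lies. The use of "-" as the gap character, the function name find_gaps and the example row are assumptions made for illustration only.

```python
import re

def find_gaps(aligned, gap_char="-"):
    """Return (start, length, position) for each maximal run of gap_char.

    gap_char="-" is an assumed convention; use " " if gaps are stored as spaces.
    """
    gaps = []
    for m in re.finditer(re.escape(gap_char) + "+", aligned):
        if m.start() == 0:
            where = "before first character"
        elif m.end() == len(aligned):
            where = "after last character"
        else:
            where = "inside"
        gaps.append((m.start(), m.end() - m.start(), where))
    return gaps

# Hypothetical aligned row with a leading, an internal and a trailing gap.
row = "--ACG-T---"
for start, length, where in find_gaps(row):
    print(start, length, where)
```

Each run is counted once regardless of its length, which matches the idea that a gap corresponds to one atomic insertion or deletion.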
Gap Penalties
Introducing gaps into sequence alignments allows the alignment to be extended into regions where one sequence may have lost or gained characters not found in the other. A penalty is subtracted for each gap introduced into an alignment because the gap increases the uncertainty of the alignment. If the gap penalty is too low, a high alignment score is achievable even between unrelated or random sequences. If gaps are introduced without any penalty, then they can be introduced at random and eventually all characters will be aligned, even in random sequences.
The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment when it is possible to achieve a good residue-to-residue alignment at some other neighbouring point in the sequence. One cannot let gaps/insertions occur without penalty, because an unreasonably 'gappy' alignment would result. Biologically, it should in general be easier for a protein to accept a different residue in a position than to have parts of the sequence chopped away or inserted. Gaps/insertions should therefore be rarer than point mutations (substitutions).
Thus, when aligning two sequences it is often necessary to insert gaps in them in order to optimise the alignment. This can be done on the basis of identities alone, inserting gaps in the sequences as required where there are no matches. However, this is not recommended for biological sequence comparisons because similarities are then not taken into consideration. Instead, a scoring scheme, often referred to as a comparison matrix, is used. It gives a high positive score when identical residues or bases are aligned, a slightly lower score when the pair is similar (i.e. a conservative substitution), and even negative scores for aligned pairs which are not biologically significant.
When two sequences are aligned together, a diagonal is created which depicts the best alignment path for them. This diagonal may be broken in places due to mismatches. If there are too many of these, the diagonal is subdivided into several smaller ones. To improve the alignment, gap initiation and gap extension penalties are introduced, which are subtracted from the total alignment score.
In general, the lower the gap penalties, the more gaps and the more identities are detected, but this should be weighed against biological significance.
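The effect of lowering the gap penalty can be sketched with a small global alignment scorer (Needleman-Wunsch dynamic programming). This is an illustrative sketch, not the method used by FASTA, BLAST or ClustalW2: the function name global_score, the match/mismatch values and the single per-position gap penalty are all assumptions chosen to keep the example short.

```python
def global_score(a, b, match=1, mismatch=-1, gap=2):
    """Best global alignment score of a and b.

    Assumed toy scoring: `match` for identical characters, `mismatch`
    otherwise, and `gap` subtracted for every gap position (no separate
    opening penalty).
    """
    # prev[j] holds the best score for a[:i-1] against b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [-gap * i]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                           prev[j] - gap,      # gap in b
                           cur[j - 1] - gap))  # gap in a
        prev = cur
    return prev[-1]

# Two hypothetical, unrelated-looking sequences scored under different penalties.
for penalty in (0, 1, 2, 4):
    print(penalty, global_score("GATTACAGGT", "CTTGACCAAG", gap=penalty))
```

Under this scoring the best score can never increase as the gap penalty grows, which is why a penalty that is too low lets even unrelated sequences reach a comparatively high score.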
Adjusting gap penalties
FASTA, BLAST and ClustalW2 use slightly different terms to refer to gap initiation and gap extension penalties. In general, gapopen and opengap refer to the former, while gapext and extendgap refer to the latter.
Some of the later improvements to these programs include the possibility of penalising gaps in the database sequences and in the query sequences separately. In ClustalW2, a gap penalty exists which penalises separately the length of a gap, closing a gap and the introduction of a pairwise gap in both sequences.
Gap penalty values are designed to reduce the score when an alignment has been broken by an insertion in one of the sequences. The value should be small enough to allow a previously accumulated alignment to continue across an insertion in one of the sequences; that is, it should not be so large that the previous alignment score is removed completely.
You could tweak gap open and gap extension penalties (which combined produce the overall gap penalty) in 2 ways:
e.g. Consider the following pair of sequences...
- Keep the score similar regardless of gap length.
Allow a constant overall gap penalty regardless of gap length; in other words, set the gap extension penalty to zero and penalise only when a gap is opened.
This type of penalty scheme assumes that sequences are just as likely to change by large insertions and deletions as by small ones. It penalises a large gap to the same extent as a small gap.
- Make the penalty grow as a linear function of gap length:
Use a larger gap opening penalty followed by a gap extension penalty that is smaller than the gap opening penalty.
This penalises several small gaps more heavily than 1 large gap of the same total length, because every new gap incurs the opening penalty again.
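The arithmetic behind these schemes can be checked with a short sketch. It assumes one common convention, in which a gap of length n costs the opening penalty for its first position plus the extension penalty for each further position; programs differ on this detail, and the function names and penalty values below are hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap under the assumed convention:
    gap_open for the first position, gap_extend for each further one."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

def total_cost(gap_lengths, gap_open, gap_extend):
    """Sum of the costs of several separate gaps."""
    return sum(gap_cost(n, gap_open, gap_extend) for n in gap_lengths)

# Hypothetical (open, extend) values for three schemes.
schemes = {
    "constant (extend = 0)": (10, 0),
    "open larger than extend": (10, 1),
    "open equal to extend": (2, 2),
}
for name, (gap_open, gap_extend) in schemes.items():
    one_large = total_cost([6], gap_open, gap_extend)
    three_small = total_cost([2, 2, 2], gap_open, gap_extend)
    print(f"{name}: one gap of 6 = {one_large}, three gaps of 2 = {three_small}")
```

With these assumed values, a single gap of six positions costs 10, 15 and 12 under the three schemes, while three gaps of two positions cost 30, 33 and 12. The totals agree only when the opening and extension penalties are equal.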
Programs and Gaps