Pairwise sequence alignment
Pairwise alignment methods compare two sequences (DNA, RNA, or protein) by arranging them so that similar characters line up.
Alignment types
- Global alignment:
  - Aligns the full length of both sequences, end to end
  - Used when the sequences are similar in length and expected to be related along their whole length
  - The classic algorithm used for global alignment is Needleman–Wunsch.
- Local alignment:
  - Finds the best-matching regions and ignores the rest of each sequence
  - Used when sequences differ in length or share only part of their sequence, such as a single conserved domain
  - The classic algorithm used for local alignment is Smith–Waterman.
Scoring system
When we try to align sequences, we need a way to decide which matches are “good” and which differences matter. That’s where the scoring system comes in. The common scoring terms are as follows:
- Match score: Positive score for identical/similar characters
- Mismatch penalty: Negative score for differing characters
- Gap penalty: Negative score for introducing gaps (insertions or deletions) into the alignment, which discourages the insertion of too many gaps. It is often split into two parts: the gap opening penalty is the cost of starting a new gap, and the gap extension penalty is the (usually smaller) cost of lengthening an existing gap.
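To make these terms concrete, here is a minimal sketch that scores an alignment that has already been written out with `-` for gaps. The function name `score_alignment` and the default values (match +1, mismatch -1, gap opening -5, gap extension -1) are illustrative assumptions, not standard settings. It also assumes one common convention in which the first position of a gap costs only the opening penalty.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap).

    Assumed convention: the first gap position costs gap_open,
    and each further position in the same gap costs gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap = None  # row that held a gap in the previous column: 'a', 'b', or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            total += gap_extend if prev_gap == row else gap_open
            prev_gap = row
        else:
            total += match if x == y else mismatch
            prev_gap = None
    return total


one_long_gap = score_alignment("GATTACA", "GA--ACA")    # 5 matches, one gap of length 2
two_short_gaps = score_alignment("GATTACA", "G-T-ACA")  # 5 matches, two gaps of length 1
print(one_long_gap, two_short_gaps)
```

With these assumed values, the alignment with a single two-position gap scores higher than the one with two separate gaps, which is the behaviour the opening and extension penalties are meant to produce.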
Substitution matrices
For protein sequences, not all mismatches are equal. Some amino acids are more likely to substitute for each other than others. That’s why substitution matrices like PAM or BLOSUM are used to assign more realistic scores.
- PAM (Point Accepted Mutation) matrices
  - Based on an evolutionary model of accepted mutations observed in closely related proteins
  - Examples: PAM30, PAM250 (the number is the evolutionary distance in PAM units: 30 or 250 accepted point mutations per 100 amino acids. Because the same position can mutate more than once, PAM250 does not mean that every residue has changed)
  - Lower number = short evolutionary distance (more similar sequences)
  - Higher number = more divergent sequences
- BLOSUM (Blocks Substitution Matrix)
  - Derived from conserved, ungapped blocks in protein families
  - Examples: BLOSUM62, BLOSUM80, BLOSUM45 (the number is the percentage identity threshold used to cluster sequences within each block: sequences at or above that identity are grouped and counted as one, so BLOSUM62 reflects substitutions between sequences that are less than 62% identical)
  - Higher number = used for similar sequences
  - Lower number = used for more divergent sequences
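As a rough illustration of how a substitution matrix is used, the sketch below stores a handful of amino acid pairs in a dictionary and looks up a score for each aligned pair. The names `SUBSET`, `substitution_score`, and `ungapped_score` are hypothetical. The six values are meant to follow BLOSUM62 for these pairs, but they should be checked against a published copy of the matrix before being relied on.

```python
# A small, symmetric subset of a substitution matrix.
# Keys are alphabetically sorted residue pairs. The values are intended to
# match BLOSUM62 for these pairs and are for illustration only.
SUBSET = {
    ("L", "L"): 4,    # identical residues score highest
    ("W", "W"): 11,   # rare residues get a larger identity score
    ("I", "L"): 2,    # similar hydrophobic residues: mild positive score
    ("D", "E"): 2,    # both acidic
    ("K", "R"): 2,    # both basic
    ("D", "L"): -4,   # acidic vs hydrophobic: strongly penalised
}


def substitution_score(x, y, matrix=SUBSET):
    """Look up the score for aligning residue x with residue y."""
    key = tuple(sorted((x, y)))  # the matrix is symmetric, so order does not matter
    return matrix[key]


def ungapped_score(a, b, matrix=SUBSET):
    """Sum the substitution scores of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(substitution_score(x, y, matrix) for x, y in zip(a, b))


print(substitution_score("L", "I"))  # conservative substitution
print(substitution_score("L", "D"))  # non-conservative substitution
print(ungapped_score("LDK", "IER"))  # three conservative substitutions
```

In practice the full 20 x 20 matrix would be loaded from a library or a file rather than typed by hand. The point here is only that a mismatch between similar amino acids can still receive a positive score.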
Dynamic programming
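Once we have these scores, we need a systematic way to find the best alignment. That’s where dynamic programming (Needleman–Wunsch for global alignment, Smith–Waterman for local alignment) comes in. Instead of testing every possible alignment, it fills a table with the best score for every pair of sequence prefixes, choosing at each cell between a match or mismatch and a gap in either sequence. The highest-scoring alignment is then recovered by tracing back through the table.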
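The table-filling step can be sketched as follows. This is a simplified version that returns only the best score, without the traceback. It assumes a single linear gap penalty rather than separate opening and extension penalties, and the function names and default scores (match +1, mismatch -1, gap -2) are illustrative choices.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (linear gap penalty, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Global alignment: leading gaps are penalised in the first row and column.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # gap in b
                F[i][j - 1] + gap,    # gap in a
            )
    return F[-1][-1]  # the score is read from the bottom-right cell


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # local alignment may restart anywhere
                H[i - 1][j - 1] + s,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            best = max(best, H[i][j])
    return best  # the score is the largest value anywhere in the table


x, y = "TTTTGATTACA", "GATTACAGGGG"
print(needleman_wunsch_score(x, y), smith_waterman_score(x, y))
```

The two functions differ in only three places: how the first row and column are initialised, the extra `0` option in the local version, and where the final score is read from. For the example pair, which shares only the region `GATTACA`, the local score is higher than the global score because the unrelated flanking characters are not forced into the alignment.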
Once you are familiar with pairwise alignment, continue to the next page to explore multiple sequence alignment, which lets you compare several sequences at the same time.