Multiple sequence alignment
Multiple sequence alignment (MSA) extends pairwise sequence alignment to three or more biological sequences. Regions that stay conserved across the aligned sequences can indicate important functional or structural elements. MSAs also help infer evolutionary relationships by showing how sequences have diverged from a common ancestor, and aligning an uncharacterised sequence with sequences of known function helps to infer its function.
Approaches
- Progressive Alignment: Builds the alignment by first aligning the most similar sequences and progressively adding more divergent sequences. Examples: ClustalW, Clustal Omega
- Iterative Refinement: Starts with an initial alignment and repeatedly realigns subsets of sequences to improve the overall score. Examples: MUSCLE, MAFFT
- Consistency-Based Methods: Use a combination of pairwise alignments to build a consensus alignment. Example: T-Coffee
- Hidden Markov Models (HMMs): Statistical models that consider the probability of sequence substitutions and indels (i.e. insertions and deletions). Example: HMMER
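The ordering idea behind progressive alignment can be sketched in a few lines of Python. This is a simplified illustration, not how ClustalW or Clustal Omega are implemented: real tools build a guide tree from pairwise distances, whereas this sketch uses a greedy order based on shared k-mers. The function names (kmer_set, similarity, joining_order) and the choice of Jaccard similarity are assumptions made for the example.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=2):
    """Jaccard similarity of the two k-mer sets (an assumed, alignment-free proxy)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


def joining_order(seqs, k=2):
    """Order in which a greedy progressive scheme would add sequences.

    seqs maps a name to an unaligned sequence and needs at least two entries.
    """
    names = list(seqs)
    # Start with the most similar pair.
    first = max(
        combinations(names, 2),
        key=lambda p: similarity(seqs[p[0]], seqs[p[1]], k),
    )
    order = list(first)
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next, add the sequence closest to anything already placed.
        nxt = max(
            remaining,
            key=lambda n: max(similarity(seqs[n], seqs[m], k) for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For example, joining_order({"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}) places a and b first and adds the divergent sequence c last.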
Scoring systems
The scoring approach is similar to that used in pairwise alignment, but adapted for multiple sequences:
- Conservation of residues is given more importance, and gap penalties can vary to account for evolutionary events.
- The Sum-of-Pairs (SP) score is commonly used to compute a final MSA score. Within each column (i.e. aligned position in the MSA), every pair of sequences is scored, and these pairwise scores are summed over all columns. Examples: ClustalW, Clustal Omega, MUSCLE and MAFFT.
- Probabilistic scores, such as bit scores and E-values (more information on E-values is available on the following page), are used in probabilistic approaches such as the HMMs mentioned above. A bit score is a log-odds score: it reflects how much more likely the sequences are under the model than under a random background model. Example: HMMER.
- Consistency-based and weighted scores are employed to avoid bias from highly similar sequences in the final MSA, and typically measure the internal consistency of the final MSA. Example: T-Coffee.
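As an illustration of the Sum-of-Pairs idea, the following minimal Python sketch scores an alignment given as equal-length strings with "-" for gaps. The match, mismatch and gap values are arbitrary assumptions chosen for the example, and scoring a gap paired with a gap as zero is one common convention rather than a fixed rule. Real tools use substitution matrices and affine gap penalties instead.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of characters from the same column (assumed toy values)."""
    if x == "-" and y == "-":
        return 0  # gap aligned with gap: treated as neutral here
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment, **scores):
    """Sum the pairwise scores over every column of the alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_score(x, y, **scores)
    return total
```

With these toy values, sum_of_pairs(["AC", "AC", "AC"]) gives 6: each of the two columns contributes three matching pairs.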
Visualisation
- MSA results are often displayed in a matrix format where each row represents a sequence and each column represents aligned residues or gaps.
- Tools like Jalview or MView are used for visualising and interpreting MSA results.
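The row-and-column layout can be reproduced with a short Python sketch. It prints one sequence per row and adds a final line that marks fully conserved, gap-free columns with an asterisk, similar to the convention in Clustal-style output. The function names are assumptions for the example, and this is only a plain-text approximation of what dedicated viewers provide.

```python
def conservation_line(alignment):
    """Return '*' under columns where all residues are identical, else a space."""
    marks = []
    for column in zip(*alignment):
        conserved = len(set(column)) == 1 and "-" not in column
        marks.append("*" if conserved else " ")
    return "".join(marks)


def render(names, alignment):
    """Lay out the alignment as one row per sequence plus a conservation row."""
    width = max(len(name) for name in names)
    lines = [f"{name:<{width}}  {row}" for name, row in zip(names, alignment)]
    lines.append(" " * width + "  " + conservation_line(alignment))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(["seq1", "seq2", "seq3"], ["ACG-", "ACGT", "A-GT"]))
```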
Challenges
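- Complexity: For an exact (optimal) alignment by dynamic programming, the computational cost grows exponentially with the number of sequences and polynomially with their lengths, roughly on the order of L^N for N sequences of length L. This is why practical tools rely on heuristics.
- Optimal alignment: Finding the optimal alignment is more difficult than in the pairwise case, because the placement of gaps and mismatches must be balanced across all sequences at once.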
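To get a feel for the scale of the problem, the sketch below counts the cells in an exact dynamic-programming table, which has one axis per sequence and length + 1 positions along each axis. The function name dp_cells is an assumption for the example, and the count ignores the extra work done per cell, so it understates the true cost.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in an exact DP table with one axis per sequence."""
    return prod(n + 1 for n in lengths)


if __name__ == "__main__":
    for n_seqs in (2, 3, 5, 10):
        print(n_seqs, dp_cells([100] * n_seqs))
```

Two sequences of length 100 need 10,201 cells, three need 1,030,301, and each further sequence multiplies the table size by 101.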
After learning about aligning sequences, move on to the next step to find out whether your sequence appears elsewhere. This is where sequence similarity searching becomes useful.