spacer
spacer

2Can Support Portal - Nucleotide Analysis


Pairwise local/global alignment - Introduction


The EMBOSS-Align tool can be used for pairwise alignment of sequences. There are three types of alignments we can consider when aligning sequnces, optimal, global and local alignments.
  • Optimal alignments - The alignment that is the best, given a defined set of rules and parameter values for comparing different alignments. There is no such thing as the single best alignment, since optimality always depends on the assumptions one bases the alignment on. For example, what penalty should gaps carry? All sequence alignment procedures make some such assumptions.

  • Global alignment - An alignment that assumes that the two proteins are basically similar over the entire length of one another. The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing. A tiny example:


              aaagcggaagtcacag
              ||.||.||||| |.||                     
              aaggctgaagt-atag

  • Local alignment - An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:


              aaagcggaagtcacag
              ......||||| ....                      
              aaggctgaagt-atag

    It may seem that one should always use local alignments. However, it may be difficult to spot an overall similarlity, as opposed to just a domain-to-domain similarity, if one uses only local alignment, so global alignment is useful in some cases. You can produce a global or a local alignment with the EMBOSS-Align pairwise global and local alignment tool.


EMBOSS-Align is the tool we will use to compare 2 sequences:



The EMBOSS-Align tool contains two programs each using a different algorithm:
  • When you want an alignment that covers the whole length of both sequences, use the needle program.
  • When you are trying to find the best region of similarity between two sequences, use the water program.

The following is a slightly more deatiled explanation of the two programs, needle and water, used in EMBOSS-Align:
  • Needle program - This program uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.

    The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.

    What is the optimal alignment? Dynamic programming methods ensure the optimal global alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Needle finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.

    An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).

    Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

    This is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.

    Needle is for aligning two sequences over their entire length. This works best with closely related sequences. If you use needle to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.

    A true Needleman Wunsch implementation like needle needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floats so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.



  • Water program - Water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.

    The Smith-Waterman algorithm is a member of the class of algorithms that can calculate the best score and local alignment in the order of mn steps, (where 'n' and 'm' are the lengths of the two sequences). These dynamic programming algorithms were first developed for protein sequence comparison by Smith and Waterman, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.

    A local alignment searches for regions of local similarity between two sequences and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.

    Dynamic programming methods ensure the optimal local alignment by exploring all possible alignments and choosing the best. It does this by reading in a scoring matrix that contains values for every possible residue or nucleotide match. Water finds an alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix.

    An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty).

    Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

    Water is a true implementation of the Smith-Waterman algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you have a lot of memory and a lot of time.

    Local alignment methods only report the best matching areas between two sequences - there may be a large number of alternative local alignments that do not score as highly. If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" the second and subsequent alignments. You will be able to see this situation if you have done a dotplot and your local alignment does not show all the features you expected to see. Water is for aligning the best matching subsequences of two sequences. It does not necessarily align whole sequences against each other; you should use needle if you wish to align closely related sequences along their whole lengths.

    A true Smith Waterman implementation like water needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of integers and one of floating point (real) numbers so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.


We will next look at an example of using the EMBOSS-Align tool >>>


<<< Previous || Start of Lesson || Next >>>

spacer
spacer