 |
2Can Support Portal - Nucleotide AnalysisPairwise local/global alignment <<< 1/4 >>>
Pairwise local/global alignment - Introduction
The EMBOSS-Align tool can be used for pairwise alignment of sequences. There are three types of alignments we can consider when aligning sequnces, optimal, global and local alignments.
-
Optimal alignments - The alignment that is the best, given a defined set of rules and
parameter values for comparing different alignments. There is
no such thing as the single best alignment, since optimality
always depends on the assumptions one bases the alignment on. For
example, what penalty should gaps carry? All sequence alignment
procedures make some such assumptions.
- Global alignment - An alignment that assumes that the two proteins are basically similar
over the entire length of one another. The alignment attempts to match
them to each other from end to end, even though parts of the alignment
are not very convincing. A tiny example:
- Local alignment - An alignment that searches for segments of the two sequences that
match well. There is no attempt to force entire sequences into an
alignment, just those parts that appear to have good similarity,
according to some criterion. Using the same sequences as above, one
could get:
It may seem that one should always use local alignments. However, it
may be difficult to spot an overall similarlity, as opposed to just a
domain-to-domain similarity, if one uses only local alignment, so
global alignment is useful in some cases.
You can produce a global or a local alignment with the EMBOSS-Align pairwise global and local alignment tool.
EMBOSS-Align is the tool we will use to compare 2 sequences:
The EMBOSS-Align tool contains two programs each using a different algorithm:
- When you want an alignment that covers the whole length of both
sequences, use the needle program.
- When you are trying to find the best region of similarity between
two sequences, use the water program.
The following is a slightly more deatiled explanation of the two programs, needle and water, used in EMBOSS-Align:
- Needle program - This program uses the Needleman-Wunsch global alignment algorithm to
find the optimum alignment (including gaps) of two sequences when considering
their entire length.
The Needleman-Wunsch algorithm is a member of the class of algorithms
that can calculate the best score and alignment in the order of mn
steps, (where 'n' and 'm' are the lengths of the two sequences). These
dynamic programming algorithms were first developed for protein sequence
comparison by Needleman and Wunsch, though similar methods were independently
devised during the late 1960's and early 1970's for use in the fields
of speech processing and computer science.
What is the optimal alignment? Dynamic programming methods ensure
the optimal global alignment by exploring all possible alignments
and choosing the best. It does this by reading in a scoring matrix
that contains values for every possible residue or nucleotide match.
Needle finds an alignment with the maximum possible score where the
score of an alignment is equal to the sum of the matches taken from
the scoring matrix.
An important problem is the treatment of gaps, i.e., spaces inserted
to optimise the alignment score. A penalty is subtracted from the
score for each gap opened (the 'gap open' penalty) and a penalty is
subtracted from the score for the total number of gap spaces multiplied
by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower
than the cost for opening a gap.
This is a true implementation of the Needleman-Wunsch algorithm and so produces
a full path matrix. It therefore cannot be used with genome sized sequences
unless you've a lot of memory and a lot of time.
Needle is for aligning two sequences over their entire length. This
works best with closely related sequences. If you use needle to align
very distantly-related sequences, it will produce a result but much
of the alignment may have little or no biological significance.
A true Needleman Wunsch implementation like needle needs
memory proportional to the product of the sequence lengths. For two
sequences of length 10,000,000 and 1,000 it therefore needs memory
proportional to 10,000,000,000 characters. Two arrays of this size
are produced, one of integers and one of floats so multiply that figure
by 8 to get the memory usage in bytes. That doesn't include other
overheads. Therefore only use water and needle for accurate alignment
of reasonably short sequences.
- Water program - Water uses the Smith-Waterman algorithm (modified for speed enhancements)
to calculate the local alignment.
The Smith-Waterman algorithm is a member of the class of algorithms
that can calculate the best score and local alignment in the order
of mn steps, (where 'n' and 'm' are the lengths of the two sequences).
These dynamic programming algorithms were first developed for protein
sequence comparison by Smith and Waterman, though similar methods
were independently devised during the late 1960's and early 1970's
for use in the fields of speech processing and computer science.
A local alignment searches for regions of local similarity between
two sequences and need not include the entire length of the sequences.
Local alignment methods are very useful for scanning databases or
other circumstances when you wish to find matches between small regions
of sequences, for example between protein domains.
Dynamic programming methods ensure the optimal local alignment by
exploring all possible alignments and choosing the best. It does this
by reading in a scoring matrix that contains values for every possible
residue or nucleotide match. Water finds an alignment with
the maximum possible score where the score of an alignment is equal
to the sum of the matches taken from the scoring matrix.
An important problem is the treatment of gaps, i.e., spaces inserted
to optimise the alignment score. A penalty is subtracted from the
score for each gap opened (the 'gap open' penalty) and a penalty is
subtracted from the score for the total number of gap spaces multiplied
by a cost (the 'gap extension' penalty).
Typically, the cost of extending a gap is set to be 5-10 times lower
than the cost for opening a gap.
Water is a true implementation of the Smith-Waterman algorithm
and so produces a full path matrix. It therefore cannot be used
with genome sized sequences unless you have a lot of memory
and a lot of time.
Local alignment methods only report the best matching areas between
two sequences - there may be a large number of alternative local alignments
that do not score as highly. If two proteins share more than one common
region, for example one has a single copy of a particular domain while
the other has two copies, it may be possible to "miss" the second
and subsequent alignments. You will be able to see this situation
if you have done a dotplot and your local alignment does not show
all the features you expected to see.
Water is for aligning the best matching subsequences of two
sequences. It does not necessarily align whole sequences against each
other; you should use needle if you wish to align closely related
sequences along their whole lengths.
A true Smith Waterman implementation like water needs memory
proportional to the product of the sequence lengths. For two sequences
of length 10,000,000 and 1,000 it therefore needs memory proportional
to 10,000,000,000 characters. Two arrays of this size are produced,
one of integers and one of floating point (real) numbers so multiply
that figure by 8 to get the memory usage in bytes. That doesn't include
other overheads. Therefore only use water and needle for accurate
alignment of reasonably short sequences.
We will next look at an example of using the EMBOSS-Align tool >>> |
|
aaagcggaagtcacag ||.||.||||| |.|| aaggctgaagt-atag