FASTA similarity search - Introduction
FASTA (pronounced FAST-AYE) stands for FAST-ALL, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison. This program achieves a high level of sensitivity for similarity searching at high speed. This is achieved by performing optimised searches for local alignments using a substitution matrix, in this case a DNA identity matrix.
The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing ktup decreases the number of background hits. Not every word hit is investigated; instead, the program initially looks for segments containing several nearby hits. This program is more sensitive than the BLAST programs, which is reflected in the longer time it needs to produce results.
FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database. The majority of these scores involve unrelated sequences, and can therefore be used to estimate the lambda and K values. These are statistical parameters estimated from the distribution of unrelated sequence similarity scores. This approach avoids the artificiality of a random sequence model by employing real sequences, with their natural correlations.
FASTA uses four steps to calculate three scores that characterise sequence
similarity. These steps are outlined below. A representation of these steps
is given in a PostScript figure taken from
Barton (1994), Protein Sequence Alignment and Database Scanning.
Step 1 : Identify regions shared by the two sequences with the highest
density of identities (ktup=1) or pairs of identities (ktup=2).
The first step uses a rapid technique for finding identities shared
between two sequences; the method is similar to an earlier technique
described by Wilbur and Lipman.
FASTA achieves much of its speed and selectivity in this step by using
a lookup table to locate all identities or groups of identities between
two DNA or amino acid sequences during the first step of the comparison.
The ktup parameter determines how many consecutive identities are
required in a match. A ktup value of 2 is frequently used for protein
sequence comparison, which means that the program examines only those
portions of the two sequences being compared that have at least two
adjacent identical residues in both sequences. More sensitive searches
can be done using ktup = 1. For DNA sequence comparisons, the ktup
parameter can range from 1 to 6; values between 4 and 6 are recommended.
When the query sequence is a short oligonucleotide or oligopeptide, ktup = 1
should be used.
In conjunction with the lookup table, FASTA uses the "diagonal" method to find
all regions of similarity between the two sequences, counting ktup matches
and penalizing for intervening mismatches. This method identifies regions
of a diagonal that have the highest density of ktup matches. The term
diagonal refers to the diagonal line that is seen on a dot matrix plot
when a sequence is compared with itself, and it denotes an alignment between
two sequences without gaps.
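The word lookup and diagonal counting described above can be sketched in a few lines of Python. This is a minimal illustration, not FASTA's own code: the function names are hypothetical, a diagonal is taken to be the library offset minus the query offset, and the sketch only counts ktup hits per diagonal without penalizing intervening mismatches.

```python
from collections import defaultdict


def word_positions(seq, ktup):
    """Map every ktup-length word in seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, library, ktup):
    """Count ktup word matches on each diagonal (library offset - query offset)."""
    table = word_positions(query, ktup)
    hits = defaultdict(int)
    for j in range(len(library) - ktup + 1):
        # Look up the library word in the query table instead of rescanning the query.
        for i in table.get(library[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


hits = diagonal_hits("ACGTACGT", "TTACGTAC", ktup=4)
best = max(hits, key=hits.get)
print(hits, best)  # {-2: 2, 2: 3} 2
```

In this toy example the diagonal with offset 2 collects the most word hits, so it would be the first candidate region to examine further.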
FASTA uses a formula for scoring ktup matches that incorporates the actual
PAM250 values for the aligned residues. Thus, groups of identities with high
similarity scores contribute more to the local diagonal score than do
identities with low similarity scores.
This more sensitive formula is used for protein sequence comparisons;
a constant value for ktup matches is used for DNA sequence comparisons.
FASTA saves the 10 best local regions, regardless of whether they are on
the same or different diagonals.
Step 2 : Rescan the 10 regions with the highest density of identities using
the PAM250 matrix. Trim the ends of the region to include only those
residues contributing to the highest score. Each region is a partial
alignment without gaps.
After the 10 best local regions are found in the first step, they are
rescored using a scoring matrix that allows runs of identities shorter than
ktup residues and conservative replacements to contribute to the similarity
score. For protein sequences, this score is usually calculated using the
PAM250 matrix, although scoring matrices based on the minimum number of base
changes required for a specific replacement, on identities alone, or on an
alternative measure of similarity, can also be used with FASTA. The PAM250
scoring matrix was derived from the analysis of the amino acid replacements
occurring among related proteins, and it specifies a range of positive scores
for replacements that commonly occur among related proteins and negative
scores for unlikely replacements.
FASTA can also be used for DNA sequence comparisons, and matrices can be
constructed that allow separate penalties for transitions and transversions.
For each of the best diagonal regions rescanned with the scoring matrix, a
subregion with the maximal score is identified.
Initial scores are used to rank the library sequences. These scores are
referred to as init1 scores.
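Rescoring one diagonal and trimming it to its maximal-scoring subregion can be sketched as a running sum that restarts whenever it drops to zero or below. The function name and the match and mismatch values (+5 and -4) are illustrative assumptions for a DNA identity-style matrix; a protein search would look up PAM250 values instead.

```python
def best_subregion(query, library, diagonal, match=5, mismatch=-4):
    """Return (score, start, end) of the best ungapped run on one diagonal.

    diagonal is the library offset minus the query offset; start and end are
    query indices, with end exclusive. Scores are illustrative assumptions.
    """
    best = (0, 0, 0)
    running, run_start = 0, 0
    for i in range(max(0, -diagonal), len(query)):
        j = i + diagonal
        if j >= len(library):
            break
        if running <= 0:
            # A non-positive prefix can never help, so start a new run here.
            running, run_start = 0, i
        running += match if query[i] == library[j] else mismatch
        if running > best[0]:
            best = (running, run_start, i + 1)
    return best


print(best_subregion("TTACGT", "GGACGT", diagonal=0))  # (20, 2, 6)
```

The two leading mismatches are trimmed away, leaving the four matching residues as the scored subregion; the best such score over all saved regions plays the role of init1.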
Step 3 : If there are several initial regions with scores greater than the
CUTOFF value, check to see whether the trimmed initial regions can be joined
to form an approximate alignment with gaps. Calculate a similarity score
that is the sum of the joined initial regions minus a penalty (usually 20)
for each gap. This initial similarity score (initn) is used to rank the
library sequences. The score of the single best initial region found in
step 2 is reported (init1).
FASTA checks, during a library search, to see whether several initial
regions can be joined together in a single alignment to increase the initial
score. FASTA calculates an optimal alignment of initial regions as a
combination of compatible regions with maximal score. This optimal alignment
of initial regions can be rapidly calculated using a dynamic programming
algorithm. FASTA uses the resulting score, referred to as the initn score,
to rank the library sequences.
The third "joining" step in the computation of the initial score increases
the sensitivity of the search method because it allows for insertions and
deletions as well as conservative replacements. The modification does,
however, decrease selectivity. The degradation in selectivity is limited by
including in the optimization step only those initial regions whose scores
are above an empirically determined threshold: FASTA joins an initial region
only if its similarity score is greater than the cutoff value, a value that
is approximately one standard deviation above the average score expected
from unrelated sequences in the library. For a 200-residue query sequence
and ktup = 2, this value is 28.
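The joining step can be illustrated with a small dynamic programming sketch over the saved regions. The region representation, the function name and the compatibility rule (a region may follow another only if it starts after the other ends in both sequences) are assumptions made for this example; the penalty of 20 and the cutoff of 28 are the values quoted above.

```python
def initn_score(regions, cutoff, join_penalty=20):
    """Best score of a chain of compatible regions (illustrative sketch).

    regions: list of (q_start, q_end, l_start, l_end, score), ends exclusive.
    Only regions scoring above cutoff take part; each join costs join_penalty.
    """
    kept = sorted(r for r in regions if r[4] > cutoff)  # ordered by query start
    best = []  # best[k] = best chain score ending with kept[k]
    for k, (qs, qe, ls, le, score) in enumerate(kept):
        chain = score
        for m in range(k):
            _, prev_qe, _, prev_le, _ = kept[m]
            # Compatible: the earlier region ends before this one starts
            # in both the query and the library sequence.
            if prev_qe <= qs and prev_le <= ls:
                chain = max(chain, best[m] + score - join_penalty)
        best.append(chain)
    return max(best, default=0)


regions = [
    (0, 10, 0, 10, 50),    # best single region
    (15, 25, 18, 28, 40),  # compatible with the first region
    (30, 35, 5, 9, 30),    # out of order in the library, cannot be joined
    (40, 45, 50, 55, 20),  # below the cutoff, ignored
]
print(initn_score(regions, cutoff=28))  # 70
```

Here joining the first two regions gives 50 + 40 - 20 = 70, which exceeds the score of 50 for the best single region.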
Step 4 : Construct an NWS (Needleman-Wunsch-Sellers algorithm) optimal
alignment of the query sequence and the library sequence, considering only
those residues that lie in a band 32 residues wide centered on the best
initial region found in Step 2. FASTA reports this score as the optimized
(opt) score.
After a complete search of the library, FASTA plots the initial scores of
each library sequence in a histogram, calculates the mean similarity score
for the query sequence against each sequence in the library, and determines
the standard deviation of the distribution of initial scores. The initial
scores are used to rank the library sequences, and, in the fourth and final
step of the comparison, the highest scoring library sequences are aligned
using a modification of the standard NWS optimization method. The
optimization employs the same scoring matrix used in determining the initial
regions; the resulting optimized alignments are calculated for further
analysis of potential relationships, and the optimized similarity score is
reported.
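The idea of restricting the optimization to a band around one diagonal can be sketched as follows. This is not FASTA's modified NWS routine: it is a simple local-alignment recurrence with a linear gap penalty, and the function name, the scoring values and the gap penalty are all assumptions chosen for illustration. Only cells within width // 2 diagonals of the center diagonal are evaluated.

```python
def banded_local_score(query, library, center, width=32,
                       match=5, mismatch=-4, gap=-12):
    """Best local alignment score inside a band around one diagonal.

    center is the diagonal (library offset minus query offset) of the best
    initial region. Scoring values are illustrative assumptions.
    """
    half = width // 2
    neg_inf = float("-inf")
    prev = {}  # scores of the previous row, keyed by library column
    best = 0
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i + center - half)
        hi = min(len(library), i + center + half)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == library[j - 1] else mismatch
            score = max(
                0,                                # start a new local alignment
                prev.get(j - 1, 0) + sub,         # extend along the diagonal
                prev.get(j, neg_inf) + gap,       # gap in the library sequence
                cur.get(j - 1, neg_inf) + gap,    # gap in the query sequence
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


query = "AAAACCCCGGGGTTTT"
library = "AAAACCCCAGGGGTTTT"  # one extra residue in the middle
print(banded_local_score(query, library, center=0, width=4))  # 68
print(banded_local_score(query, library, center=0, width=0))  # 62
```

With a band wide enough to hold one gap, all 16 residues align for 16 x 5 - 12 = 68; with a band of zero width only the ungapped diagonal is scored, giving 62.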
Lookup table
A lookup table is a rapid method for finding the position of a residue
in a sequence. One way to find the "A" in the sequence "NDAPL" is to
compare "A" to each residue in the sequence. A faster way is to make
a table of all possible residues (23 for proteins) so that the computer
representation for the residue (e.g. "A" is 1, "R" is 2, "N" is 3) is the
same as its position in the table. A value is then placed in the table that
indicates whether the residue is present in the sequence and, if it is,
where it is present. For this
example the table has the value 1 at position 3, 2 at position 4, 3 at
position 1, 4 at 15, 5 at 11, and the remaining 18 positions are 0.
The position of the "A" in the sequence can then be determined in a single
step by looking it up at position 1 in the table.
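The "NDAPL" example can be reproduced with a short Python sketch. The 23-letter residue ordering below is an assumption (the text only fixes A, R and N as 1, 2 and 3), and Python lists are indexed from 0, so table position 1 in the text corresponds to index 0 here.

```python
# Assumed residue ordering; only A=1, R=2, N=3 are given in the text.
RESIDUES = "ARNDCQEGHILKMFPSTWYVBZX"
CODE = {residue: index for index, residue in enumerate(RESIDUES)}


def build_lookup(sequence):
    """Return a table holding, for each residue code, its 1-based position
    in the sequence, or 0 if the residue does not occur."""
    table = [0] * len(RESIDUES)
    for position, residue in enumerate(sequence, start=1):
        table[CODE[residue]] = position
    return table


table = build_lookup("NDAPL")
print(table[CODE["A"]])  # 3: "A" is the third residue of NDAPL
```

This simple table keeps a single position per residue, which is enough for the example; a sequence with repeated residues would need a list of positions for each entry.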
We will consider a sequence, sequence 4, and look for similar sequences in the EMBL Nucleotide Sequence Database. This sequence is a real entry in this database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar sequences, perhaps from closely related animals, or from nucleotide coding sequences for closely related proteins.
Consider the following FASTA submission form:
- The sequence, sequence 4, is entered into the textbox in FASTA format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name. The sequence is then entered on new line(s). You can find out more about sequence formats here.
- "email" is chosen so that the results are delivered to the email address as soon as they are available. As a FASTA search is very resource-intensive, it is not usually possible to search the whole of the EMBL database interactively.
- The title of the search is left as "_Sequence", although you can give your search any title you wish to help you identify the results.
- The FASTA3 program is used, which is designed to search a nucleotide query sequence against a DNA databank, in this case the EMBL Standard.
- The number of scores (hits to the database) is limited to 10 and the number of alignments of these against the query sequence is also limited to 10, in order to limit the size of the output results.
- Other options have been left on "default".
- You can now either go to the FASTA page and run the search yourself or view the sample results for sequence 4.
- Which protein does this RNA sequence code for?
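The FASTA format used in the submission form above (a ">" header line followed by one or more sequence lines) is simple enough to read with a few lines of Python. The function name and the sample record are hypothetical; the sketch assumes plain text input and ignores blank lines.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


sample = ">example_sequence\nACGTACGT\nTTGACCA\n"
print(parse_fasta(sample))  # [('example_sequence', 'ACGTACGTTTGACCA')]
```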