spacer
spacer

2Can Support Portal - Nucleotide Analysis


FASTA similarity search - Introduction


FASTA (pronounced FAST-AYE) stands for FAST-ALL, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison. This program achieves a high level of sensitivity for similarity searching at high speed. This is achieved by performing optimised searches for local alignments using a substitution matrix, in this case a DNA identity matrix.

The high speed of this program is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing the ktup decreases the number of background hits. Not every word hit is investigated but instead it initially looks for segment's containing several nearby hits. This program is much more sensitive than BLAST programs, which is reflected by the length of time required to produce results. FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database. The majority of these scores involve unrelated sequences, and therefore can be used to estimate lambda and K values. These are statistical parameters estimated from the distribution of unrelated sequence similarity scores. This approach avoids the artificiality of a random sequence model by employing real sequences, with their natural correlations.

FASTA uses four steps to calculate three scores that characterise sequence similarity. These steps are outlined below. A representation of these steps is reported in a postscript format figure drawn from Barton (1994) Protein Sequence Alignment and Database Scanning.

Step 1 : Identify regions shared by the two sequences with the highest density of identities (ktup=1) or pairs of identities (ktup=2).

The first step uses a rapid technique for finding identities shared between two sequences; the method is similar to an earlier technique described by Wilbur and Lipman. FASTA achieves much of its speed and selectivity in this step by using a lookup table to locate all identities or groups of identities between two DNA or amino acid sequences during the first step of the comparison. The ktup parameter determines how many consecutive identities are required in a match. A ktup value of 2 is frequrntly used for protein sequence comparison, which means that the program examines only those portions of the two sequences being compared that have at least two adjacent identical residues in both sequences. More sensitive searches can be done using ktup = 1. For DNA sequence comparisons, the ktup parameter can range from 1 to 6; values between 4 and 6 are recommanded. When the query sequence is a short oliginucleotide of oligopeptude, ktup = 1 should be used.

In conjunction with the lookup table, we use the "diagonal" method to find all regions of similarity between the two sequences, counting ktup matches and penalizing for intervening mismatches. This method identified regions of a diagonal that have the highest densitu of ktup matches. The term diagonal refers to the diagonal line that is seen on a dot matrix plot when a sequence is compared with itself, and it denotes an alignment between two sequenves without gaps. FASTA uses a formula for scoring ktup matches that incorporates the actual PAM250 values for the aligned residues. Thus, groups of identities with high similarity scores contribute more to the local diagonal score than to identities with low similarity scores. This more sensitive formula is used for protein sequence comparisons; the constant value for ktup matches is used for DNA sequence comparisons. FASTA saves the 10 best local regions, regardless of whether they are on the same of different diagonals.

Step 2 : Rescan the 10 regions with the highest density of identities using the PAM250 matrix. Trim the ends of the region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps.

After the 10 best local regions are found in the first step, they are rescored using a scoring matrix that allows runs of identities shorter than ktup residues and conservative replacements to contribute to the similarity score. For protein sequences, this score is usually caculated using the PAM250 matrix, although scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity, can also be used with FASTA. The PAM250 scoring matrix was derived from the analysis of the amino acid replacements occuring among related proteins, and it specifies a range of positive scores for replacements that commonly occur among related proteins and negative scores for unlikely replacements. FASTA can also be used for DNA sequence comparisons, and matrices can be constructed that allow separate penalties for transitions and transversions.

For each of the best diagonal regions rescanned with the scoring matrix, a subregion with the maximal score is identified. Initial scores are used to rank the library sequences. These scores are referred to as init1 score.

Step 3 : If there are several initial regions with scores greater than the CUTOFF value, check to see whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty (usually 20) for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1).

FASTA checks, during a library search, to see whether several initial regions can be joined together in a single alignment to increase the initial score. FASTA calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidily calculated using a dynamic programming algorithm. FASTA uses the resulting score, referred to as the initn score, to rank the library sequences. The third "joining" step in the computation of the initial score increases the sensitivity of the search method because it allows for insertions and deletions as well as conservative replacements. The modification does, however, decrease selectivity. The degradation selectivity is limited by including in the optimization step only those initial regions whose scores are above an empirically determined threshold : FASTA joins an initial region only if its similarity score is greater than the cutoff value, a value that is approximately one standard deviation above the average score expected from unrelated sequences in the library. For a 200-residue query sequence and ktup-2, this value is 28.

Step 4 : constructs NWS (Needleman-Wunch-Sellers algorithm) optimal alignment of the query sequence and the library sequence, considering only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2. FASTA reports this score as the optimized (opt) score. After a complete search of the library, FASTA plots the initial scores of each library sequence in a histogram, calculates the mean similarity score for the query sequence against each sequence in the library, and determines the standard deviation of the distribution of initial scores. The initial scores are used to rank the library sequences, and, in the fourth and final step of the comparison, the highest scoring library sequences are aligned using a modification of the standard NWS optimization method. The optimization employs the same scoring matrix used in determining the initial regions; the resulting optimized alignments are calculated for further analysis of potential relationships, and the optimized similarity score is reported.

Lookup table

A lookup table is a rapid method for finding the position of a residue in a sequence. One way to find the "A" in the sequence "NDAPL" is to compare "A" to each residue in the sequence. A faster way, is to make a table of all possible residues (23 for proteins) so that the computer representation for the residue (i.e "A" is 1, "R" is 2, "N" is 3) is the same as its position in the table. A value is then placed in the table that indicates whether the residue is present in the sequence and, if it is, where it is present. For this example the table has the value 1 at position 3, 2 at position 4, 3 at position 1, 4 at 15, 5 at 11, and the remainning 18 positions are 0. The position of the "A" in the sequence can then be determined in a single step by looking it up at position 1 in the table.

We will consider a sequence, sequence 4, and look for sequences that are similar in the EMBL Nucleotide Sequence Database. This sequence is a real entry in this database, so we will expect to find a sequence that is a perfect match to our test sequence. Also we expect to find similar sequences, perhaps from closely related animals, or from nucleotide coding sequences for closely related proteins.

Consider the following FASTA submission form:

Go to the main EBI website

N.B. This tool can also be programmatically accessed as a web service.

  • The sequence, sequence 4 is entered into the textbox in FASTA format, which consists of a one-line header starting with a ">" symbol, followed by the sequence name. The sequence is then entered on new line(s). You can find out more about sequence formats here.

  • "email" is chosen so that I will have the results delivered to the email address as soon as they are available. As a FASTA search is very resource-intensive, it is not usually possible to search the whole of the EMBL database interactively.

  • The title of the search is left as "_Sequence" although you can give your search title any name you wish to help you identify the results.

  • The FASTA3 program is used, which is designed to search a nucleotide query sequence against a DNA databank, in this case the EMBL Standard .

  • The number of scores (hits to the database) to is limited 10 and the number of alignments of these against the query sequence to is also limited 10, this is in order to limit the size of the output results.

  • Other options have been left on "default"

  • You now can either go to the FASTA page and run the search yourself or view the sample results for sequence 4.

  • Which protein does this RNA sequence code for?

See an explanation of the results of this FASTA search >>>


<<< Previous || Start of Lesson || Next >>>

spacer
spacer