FASTA whole genome search - Introduction
FASTA (pronounced FAST-Aye) stands for FAST-All, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison. The program combines high sensitivity in similarity searching with high speed. The sensitivity comes from performing optimised searches for local alignments using a substitution matrix, in this case a DNA identity matrix. The speed comes from using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing ktup decreases the number of background hits, which makes the search faster but less sensitive. Not every word hit is investigated; instead, the program initially looks for segments containing several nearby hits.
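The word-hit stage described above can be illustrated with a minimal sketch. It is not the FASTA implementation: the function name `word_hits_by_diagonal` and the toy sequences are assumptions made for illustration. It indexes every ktup-length word in the query, scans a database sequence for the same words, and counts the hits on each diagonal (database position minus query position). A diagonal with many hits marks a segment containing several nearby hits.

```python
from collections import defaultdict


def word_hits_by_diagonal(query, subject, ktup=6):
    """Count shared words of length ktup on each diagonal.

    The diagonal of a hit is (subject position - query position).
    Illustrative helper only; not part of any FASTA distribution.
    """
    # Index every word of length ktup in the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject and record the diagonal of every word hit.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


# Toy sequences, made up for this example.
query = "ACGTACGTGA"
subject = "TTACGTACGTGATT"

hits = word_hits_by_diagonal(query, subject, ktup=4)
best = max(hits, key=hits.get)
print(hits)   # hits per diagonal
print(best)   # diagonal with the most word hits
```

Running the same function with a larger ktup on these sequences yields fewer hits overall, which is the reduction in background hits that the ktup parameter controls.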
This program is much more sensitive than the BLAST programs, and this is reflected in the longer time required to produce results. FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database. The majority of these scores involve unrelated sequences, and can therefore be used to estimate lambda and K, the statistical parameters that describe the distribution of similarity scores between unrelated sequences. This approach avoids the artificiality of a random sequence model by employing real sequences, with their natural correlations.
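As a rough illustration of how lambda and K can be recovered from a set of unrelated-sequence scores, the sketch below fits an extreme value (Gumbel) distribution by the method of moments. This is an assumption chosen for simplicity, not the procedure FASTA itself uses. It also assumes that all database sequences share one length n, so that the location of the distribution is ln(K·m·n)/lambda for a query of length m. The scores are simulated, and the function names and the values of m and n are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate lambda and the location mu of a Gumbel distribution.

    Uses mean = mu + gamma / lambda and variance = pi^2 / (6 * lambda^2).
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6))
    mu = mean - EULER_GAMMA / lam
    return lam, mu


def k_from_location(lam, mu, m, n):
    """Convert the location mu into K, assuming mu = ln(K * m * n) / lambda."""
    return math.exp(lam * mu) / (m * n)


# Simulated scores for unrelated sequences (not real search output).
rng = random.Random(0)
true_lam, true_mu = 0.2, 40.0
scores = []
for _ in range(20000):
    # Clamp the uniform draw away from 0 and 1 so both logs are defined.
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)
    scores.append(true_mu - math.log(-math.log(u)) / true_lam)

lam, mu = fit_gumbel_moments(scores)
# Hypothetical query length m and database sequence length n.
K = k_from_location(lam, mu, m=300, n=1000)
print(lam, mu, K)
```

With real search output, high-scoring related sequences would need to be excluded or down-weighted before fitting, since they do not belong to the unrelated-score distribution.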
We will consider a sequence of rat DNA, sequence 5, and look for sequences that are similar in the EMBL Nucleotide Sequence Database Rat division.
This sequence is part of a real entry in this database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar sequences from nucleotide coding sequences for closely related proteins.
Consider the following Proteomes, Genomes & WGS FASTA submission form:
- The sequence, sequence 5, is entered into the textbox in FASTA format, which consists of a one-line header starting with a ">" symbol followed by the sequence name, with the sequence itself entered on the following line(s). You can find out more about sequence formats here.
- "interactive" is chosen so that the results are delivered to the browser as soon as they are available.
- The title of the search is left as "Sequence", although you can give your search any title you wish to help you identify the results.
- The FASTA3 program is used, which is designed to search a nucleotide query sequence against a DNA databank, in this case the Rat division of the EMBL database, which contains WGS sequences.
- The number of scores (hits to the database) is set to 10, and the number of alignments of these against the query sequence is also limited to 10, to limit the size of the output results.
- Other options have been left on "default".
- You can now either go to the Proteomes, Genomes & WGS FASTA form and run the search yourself, or view the sample results for sequence 5.
- Which protein does this DNA sequence code for?
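The FASTA format described in the first point of the list above (a ">" header line followed by one or more sequence lines) can be read with a few lines of code. The following is a minimal sketch; `parse_fasta` is an illustrative name, and the example record is a made-up placeholder, not the actual sequence 5.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Placeholder record; the residues are invented for illustration.
example = ">example_sequence placeholder\nACGTACGT\nTTGACC\n"
records = parse_fasta(example)
print(records)
```

Sequence lines that appear before any header are ignored here; a stricter parser might raise an error instead.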
See an explanation of this Proteomes, Genomes & WGS FASTA search.