2Can Support Portal - Protein and Proteomic Analysis
FASTA whole proteome search - Introduction
FASTA (pronounced FAST-Aye) stands for FAST-All, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison. The program achieves a high level of sensitivity for similarity searching at high speed by performing optimised searches for local alignments using a substitution matrix; for a protein search such as this one, that is an amino acid substitution matrix rather than the DNA identity matrix used for nucleotide searches. The high speed is achieved by using the observed pattern of word hits to identify potential matches before attempting the more time-consuming optimised search. The trade-off between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the word. Increasing ktup decreases the number of background hits, which speeds up the search but can reduce sensitivity. Not every word hit is investigated; instead, the program initially looks for segments containing several nearby hits.
This program is generally more sensitive than the BLAST programs, which is reflected in the longer time required to produce results. FASTA produces optimal local alignment scores for the comparison of the query sequence to every sequence in the database. The majority of these scores involve unrelated sequences and can therefore be used to estimate lambda and K, the statistical parameters that describe the distribution of similarity scores between unrelated sequences. This approach avoids the artificiality of a random sequence model by employing real sequences, with their natural correlations.
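The word-hit idea can be illustrated with a small Python sketch. It is not the FASTA program's own code: the function names and the toy sequences are invented for illustration, and only the first step (counting word hits that fall on the same diagonal) is shown, assuming exact word matches.

```python
from collections import defaultdict


def word_positions(seq, ktup):
    """Map every ktup-length word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index


def diagonal_hit_counts(query, subject, ktup=2):
    """Count shared words per diagonal (subject offset minus query offset).

    Diagonals that collect several hits mark segments that may be worth
    passing to the slower, optimised alignment step.
    """
    index = word_positions(query, ktup)
    counts = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            counts[j - i] += 1
    return dict(counts)


# Toy sequences (hypothetical, not taken from any database).
query = "MKVLAAGIVG"
subject = "TTMKVLAAGIVGQQ"

for ktup in (1, 2):
    hits = diagonal_hit_counts(query, subject, ktup)
    best = max(hits, key=hits.get)
    print("ktup", ktup, "diagonals with hits:", len(hits),
          "best diagonal:", best, "hits on it:", hits[best])
```

With these toy sequences, ktup = 1 reports the true diagonal plus several single-hit background diagonals, while ktup = 2 leaves only the true diagonal.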
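As a rough illustration of how lambda and K can be estimated from scores of unrelated sequences, the sketch below fits an extreme value distribution by the method of moments. This is a simplification, not the procedure the FASTA program actually uses: it assumes every database sequence has the same length, the function names are hypothetical, and the scores are simulated rather than taken from a real search.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_lambda_k(scores, query_len, subject_len):
    """Method-of-moments estimate of lambda and K.

    Assumes unrelated scores follow the extreme value distribution
    P(S >= x) = 1 - exp(-K * m * n * exp(-lambda * x)),
    with m the query length and n a (single, assumed) subject length.
    """
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam  # location of the distribution
    k = math.exp(lam * mu) / (query_len * subject_len)
    return lam, k


def expected_hits(score, lam, k, query_len, db_len):
    """Expected number of chance matches scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


# Simulated "unrelated" scores with known parameters (illustration only).
random.seed(0)
true_lam, true_mu = 0.25, 30.0
scores = [
    true_mu - math.log(-math.log(max(random.random(), 1e-12))) / true_lam
    for _ in range(20000)
]

lam, k = fit_lambda_k(scores, query_len=300, subject_len=350)
print("estimated lambda:", round(lam, 3))
print("expected chance hits at score 60:",
      expected_hits(60.0, lam, k, query_len=300, db_len=350 * 20000))
```

The estimated lambda should land close to the value used to simulate the scores. In a real search the subject lengths vary, so the length dependence of the scores has to be handled as well.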
We will consider a mouse protein sequence, sequence 9, and look for similar peptide/protein sequences in the UniProt mouse data collection. You could also consider searching against both UniProt and Ensembl, as Ensembl is a good source of complete proteomes.
This sequence is a real entry in this database, so we expect to find a sequence that is a perfect match to our test sequence. We also expect to find similar protein sequences, perhaps from closely related proteins or isoforms.
Consider the following Proteomes & Genomes FASTA submission form:
- The sequence, sequence 9, is entered into the textbox in FASTA format, which consists of a one-line header starting with a ">" symbol followed by the sequence name, with the sequence itself entered on the following line(s). You can find out more about sequence formats here.
- "interactive" is chosen so that I will have the results delivered to the browser as soon as they are available.
- The title of the search is left as "Sequence" although you can give your search title any name you wish to help you identify the results.
- The FASTA3 program is used, which is designed to search a protein query sequence against a protein databank; in this case I have selected the UniProt complete and non-redundant protein sequence collection.
- The number of scores (hits to the database) is limited to 10, and the number of alignments against the query sequence is also limited to 10, to limit the size of the output results.
- Other options have been left on "default".
- You can now either go to the Proteomes & Genomes FASTA page and run the search yourself or view the sample results for sequence 9.
- Which protein does this sequence code for?
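Before pasting a sequence into the form, it can help to check that it follows the FASTA format described in the list above. The following is a minimal Python sketch of such a check. The function name and the example record are hypothetical; the residues shown are placeholders, not the tutorial's sequence 9.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record for illustration only.
example = """>example_sequence
MKVLAAGIVG
ALLAQWERTY
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence), sequence)
```

The sequence lines are joined into one string, so a record split over several lines is read as a single sequence.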
See the results of this Proteomes & Genomes FASTA search >>>