What is sequence similarity search?

Sequence similarity search (SSS) is a technique used to identify sequences in a database that are similar to a query sequence. It helps in finding homologous sequences, inferring functions, and understanding evolutionary relationships between sequences.

Sequence similarity search tools align the query sequence with sequences in the database and score each alignment based on similarity. Results are ranked based on alignment scores and E-value, with the most similar sequences appearing at the top.
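The align, score and rank steps described above can be illustrated with a toy search. This is a minimal sketch, not how production tools work: it runs an exhaustive Smith-Waterman local alignment against every database entry, whereas tools such as BLAST use heuristics to avoid this cost. The scoring values (match +2, mismatch -1, gap -2), the function names and the three database sequences are all assumptions made up for the illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query, database):
    """Score the query against every entry and rank hits, best score first."""
    hits = [(name, local_alignment_score(query, seq)) for name, seq in database.items()]
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical toy database, invented for this example
toy_database = {
    "unrelated": "CCCCCCCC",
    "partial": "GGACGTAAAA",
    "exact": "TTACGTACGTTT",
}

ranked = search("ACGTACGT", toy_database)
for name, score in ranked:
    print(name, score)
```

The entry containing the full query comes out on top, followed by the entry that shares only part of it.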

Scoring and statistics

  • E-value (Expect/Expectation value): The number of matches with an equal or better score that you can expect to see by chance in a database of a particular size. Lower E-values indicate more significant matches, because the result is less likely to have occurred by chance. An E-value threshold between 1e-3 and 1e-5 is commonly used to call a match significant.
  • Bit Score: A normalised score that indicates the quality of the alignment, taking into account the scoring matrix and gap penalties. For comparing hits across searches, it is often considered more useful than the E-value, because it is independent of the database size and of the internal scoring system of the SSS tool. Database size affects the reported E-values: the same alignment is reported with a lower (more significant) E-value when the database is smaller. The higher the bit score, the higher the confidence that the hit sequence is homologous to the query sequence.
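The relationship between the two statistics can be sketched with the standard Karlin-Altschul formulas used in BLAST-style statistics: the bit score is S' = (lambda * S - ln K) / ln 2, and the E-value is E = m * n * 2^(-S'), where S is the raw alignment score, m the query length and n the database length. The lambda and K values below are illustrative assumptions, as are the function names; real tools estimate these parameters for each scoring system and also apply effective-length corrections that this sketch leaves out.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score using the statistical parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits` in a search space of m * n."""
    return query_len * db_len * 2.0 ** (-bits)


# Assumed, illustrative parameters (not taken from any particular tool run)
LAMBDA, K = 0.267, 0.041

bits = bit_score(raw_score=120, lam=LAMBDA, k=K)
small_db = expect_value(bits, query_len=300, db_len=1e6)
large_db = expect_value(bits, query_len=300, db_len=1e9)

print(f"bit score: {bits:.1f}")
print(f"E-value, small database: {small_db:.2e}")
print(f"E-value, large database: {large_db:.2e}")
```

The bit score is the same in both searches, while the E-value grows in proportion to the database length. This is why the same alignment looks more significant in a small database.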

Substitution matrices

  • For proteins, matrices like BLOSUM or PAM are used to score alignments based on the likelihood of amino acid substitutions
  • For nucleotides, simpler scoring systems are used, typically a fixed reward for a match and a fixed penalty for a mismatch
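The difference between the two scoring approaches can be shown by scoring a pair of already-aligned, ungapped sequences. This is a minimal sketch under stated assumptions: the dictionary holds only a handful of entries quoted from BLOSUM62 (a full matrix covers every amino acid pair), the nucleotide match and mismatch values are arbitrary illustrative choices, and all function names are hypothetical.

```python
# A small subset of BLOSUM62 entries, enough for the example sequences below
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("W", "W"): 11,
}


def protein_pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def score_protein_alignment(s1, s2):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have the same length")
    return sum(protein_pair_score(a, b) for a, b in zip(s1, s2))


def score_nucleotide_alignment(s1, s2, match=1, mismatch=-2):
    """Score an ungapped nucleotide alignment with fixed match and mismatch values."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(s1, s2))


identical = score_protein_alignment("LKW", "LKW")
conservative = score_protein_alignment("LKW", "IRW")
nucleotide = score_nucleotide_alignment("ACGT", "ACGA")
print(identical, conservative, nucleotide)
```

With a substitution matrix, the conservative changes L to I and K to R still earn positive scores, whereas the nucleotide scheme treats every mismatch the same way.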

Purpose of sequence similarity search

  • Homology detection: Identifying sequences that share a common ancestor with the query sequence
  • Functional annotation: Predicting the function of a sequence by comparing it to sequences with known functions
  • Structural prediction: Inferring structural information based on similarity to sequences with known structures

In the next section, you will explore Job Dispatcher, an EMBL-EBI resource that provides tools for running sequence comparison analyses efficiently.