Comparing two or more things in biological data allows us to examine how closely related they might be in terms of function, evolution, or both.

The most common type of comparison in bioinformatics is sequence comparison, which is used to work out how closely related a nucleotide or protein sequence is to others in the public databases. This is done by aligning the sequences (arranging them, with gaps where needed, to find the best possible match), which takes into account the insertions, deletions, and substitutions that may have occurred since the sequences diverged from a theoretical common ancestor. If a match is found, we might be able to infer something about the relationship between the sequences. Alignments can be pairwise (two sequences) or multiple (three or more); there are numerous tools for performing them, and the right one to use depends on the context.
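To make the idea of alignment concrete, here is a minimal sketch of a global pairwise alignment using the classic Needleman-Wunsch dynamic programming approach. It is for illustration only: the function name and the scoring values (match, mismatch, and a simple linear gap penalty) are assumptions chosen for this example, and production tools use substitution matrices, more refined gap penalties, and heavily optimised code.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b).

    The scoring parameters are illustrative defaults, not recommended values.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps

    def substitution(x, y):
        return match if x == y else mismatch

    # Fill the table: each cell extends the best of three neighbouring cells.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1]),  # match or substitution
                score[i - 1][j] + gap,  # gap in b (insertion relative to b)
                score[i][j - 1] + gap,  # gap in a (deletion relative to b)
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1])):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("ACGT", "ACT")
    print(total)
    print(top)
    print(bottom)
```

With these toy parameters, the shorter sequence receives a gap character ("-") where a residue appears to have been inserted or deleted. Changing the gap penalty relative to the mismatch penalty changes which alignment is considered best, which is one reason different tools and settings can give different results.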


When comparing a sequence to the entries in a sequence database (sometimes called sequence similarity searching), the challenge lies not in producing an alignment but in assessing whether that alignment is significant. Here, an alignment is significant when it is unlikely to have occurred by chance (i.e. randomly). This is expressed as the expectation value (also known as the E-value): the number of alignments with an equal or better score expected to occur by chance in a database of that size. The smaller the E-value, the more significant the alignment, and the more likely it is to reflect a shared ancestor and thus homology. Controls for checking the validity of a sequence similarity search include comparing random sequences and assessing the scores of unrelated sequences. When performing sequence searches, it is important to consider which tool you are using, as tools vary in speed and accuracy and may be better suited to certain applications.
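As a rough illustration of how an E-value depends on both the alignment score and the size of the search space, the sketch below uses the commonly quoted approximation E = m × n × 2^(-S'), where S' is a normalised (bit) score, m is the query length, and n is the total length of the database. The function name and the example numbers are hypothetical, and real search tools apply further corrections (for example, effective lengths), so treat this as a way to build intuition rather than a way to reproduce a tool's output.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score and the search-space size.

    Assumes the simple relation E = m * n * 2**(-S'); actual tools adjust
    the lengths and may report slightly different values.
    """
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 100-residue query against 1,000,000 residues.
    for bits in (20, 30, 40, 50):
        print(bits, expect_value(bits, 100, 1_000_000))
```

Two things follow directly from the formula: each additional bit of score halves the E-value, and the same score becomes less significant as the database grows, because a larger search space offers more opportunities for a chance match.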

To learn more about sequence similarity searches and how to choose an appropriate tool, watch our webinar on sequence similarity tools.