Problems detecting repeats

Detection of repeat units is a challenge for a number of reasons:

  1. Individual repeat units are difficult to detect because, on average, they are relatively short (<60 aa)
  2. The boundaries of repeats are hard to define
  3. Units of the same repeat can diverge considerably in sequence (i.e. repeats within a single protein, or across members of a protein family, may degenerate at the sequence level)
  4. Repeats often coincide with regions of low compositional complexity, such as disordered regions (i.e. separating the two signals can be challenging)
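
To make the fourth point more concrete, the sketch below flags low-complexity windows by computing the Shannon entropy of the residue composition in a sliding window. It is only an illustration: the function names, the window length of 12 and the threshold of 2.2 bits are assumptions chosen for this example, not values used by any particular tool.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return 0-based start positions of windows with entropy below threshold.

    The window size and threshold are illustrative assumptions.
    """
    starts = []
    for i in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[i:i + window]) < threshold:
            starts.append(i)
    return starts
```

A window made of one residue type has an entropy of 0 bits, whereas a window of 12 different residues reaches about 3.6 bits. A short repeat built from only a few residue types scores low as well, which is why the two signals are hard to separate.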

Despite these problems, there is a continuing effort to identify and classify repeats. One approach is based on sequence homology detection: because sequence is conserved among evolutionarily related proteins, profile hidden Markov models (HMMs) can be used to identify repeats.
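
The idea of scoring a sequence against a profile can be sketched with a much simpler stand-in: a position-specific scoring matrix built from a few aligned repeat units. This is not a profile HMM (it has no insert or delete states and no transition probabilities), and the function names, the uniform background and the pseudocount of 1 are assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_units, pseudocount=1.0):
    """Per-column log-odds scores from equal-length, ungapped repeat units."""
    length = len(aligned_units[0])
    if any(len(unit) != length for unit in aligned_units):
        raise ValueError("all units must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned_units) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(unit[col] for unit in aligned_units)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def scan(sequence, profile):
    """Score every window of the sequence; return a list of (start, score).

    The sequence may only contain the 20 standard residues.
    """
    width = len(profile)
    return [
        (start, sum(profile[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]
```

For example, a profile built from the hypothetical units `GPLTEWA`, `GPLSEWA` and `GPITEWA` gives its highest score to the window starting at position 3 of `MKKGPLTEWAQQ`. Because the profile records which residues are tolerated at each column, a diverged unit can still score well.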

Profile HMMs are described in detail in our Pfam Database: Creating Protein Families tutorial.

EMBL-EBI provides a tool, RADAR, which can be used to identify gapped approximate repeats and complex repeat architectures involving many different types of repeats.
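
As a rough illustration of detecting repeats by comparing a sequence with itself, the sketch below measures how often a residue matches the residue a fixed distance (lag) further along. This is not RADAR's algorithm: it counts exact identities only and ignores gaps and substitution scores, and the function names are hypothetical.

```python
def lag_identity(sequence, lag):
    """Fraction of positions i where sequence[i] == sequence[i + lag]."""
    pairs = len(sequence) - lag
    if lag <= 0 or pairs <= 0:
        raise ValueError("lag must be between 1 and len(sequence) - 1")
    matches = sum(1 for i in range(pairs) if sequence[i] == sequence[i + lag])
    return matches / pairs


def best_repeat_period(sequence, min_lag=2, max_lag=None):
    """Return (lag, identity) for the lag with the highest self-identity.

    Ties are resolved in favour of the shorter lag.
    """
    if max_lag is None:
        max_lag = len(sequence) // 2
    scores = [(lag_identity(sequence, lag), lag)
              for lag in range(min_lag, max_lag + 1)]
    identity, lag = max(scores, key=lambda s: (s[0], -s[1]))
    return lag, identity
```

For a sequence made of four exact copies of a hypothetical seven-residue unit, `best_repeat_period("GPLTEWA" * 4)` returns a lag of 7 with an identity of 1.0. In real proteins the units diverge and contain insertions and deletions, so the signal is far weaker than in this idealised case.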