
Boundary determination

While we have discussed how inclusion thresholds are defined in Pfam, we have not yet discussed how the repeating motif itself is defined. For repeat units, hit boundaries can be ambiguous because the units are short and their sequences diverge. Furthermore, repeats near the termini of a sequence tend to be less similar to the other copies than those in the middle, so defining the correct repeating unit can be challenging. In the case of our perfect 6-repeat protein, however, this is very easy.

Plotting the sequence against itself using software such as Dotter or Dotlet will reveal the start and end of each repeat. In practice, however, sequence divergence often makes it much harder to see where one repeat stops and the next begins.
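The idea behind a self dot plot can be illustrated with a minimal Python sketch. This is not the algorithm used by Dotter or Dotlet; it is a simplified window-identity comparison, and the function names, the window settings and the 10-residue repeat unit are all hypothetical choices made for illustration. For a perfect tandem repeat, the off-diagonal lines sit at offsets that are multiples of the repeat length, so the most populated off-diagonal gives the period.

```python
from collections import Counter


def dot_matrix(seq_a, seq_b, window=5, min_matches=4):
    """Return the set of (i, j) where the windows starting at i in seq_a
    and j in seq_b share at least min_matches identical residues."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def diagonal_counts(dots):
    """Count dots on each off-diagonal above the main one (offset = j - i)."""
    return Counter(j - i for i, j in dots if j > i)


# Hypothetical toy protein: six perfect copies of a 10-residue unit.
unit = "MKTLEDAGHV"
seq = unit * 6

dots = dot_matrix(seq, seq)
counts = diagonal_counts(dots)

# The most populated off-diagonal corresponds to the repeat length.
best = max(counts.values())
period = min(offset for offset, n in counts.items() if n == best)
print("estimated repeat length:", period)
```

With diverged repeats the diagonals become broken and noisy, and lowering min_matches to recover them also admits more spurious dots, which is why boundaries are harder to read from real sequences.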

In Pfam, we adopt a number of different approaches to define the correct boundaries:

(1) Using known structures to determine the repeating unit

(2) Using dot plots with different sequences plotted against each other

(3) Building profiles and shifting the boundaries until an optimal solution is found

Ideally, all Pfam repeat entries would model a single repeat unit (left in Figure 5), and for some repeats, such as TAL_effector, this is the case. However, single-repeat models have a drawback: despite our best optimisation of inclusion thresholds, we cannot detect every instance of the repeat in every sequence. As a result, a search sometimes returns a mosaic pattern of hits across the sequence, with some repeat copies matched and others missed.

In some cases, single repeat units cannot be reliably distinguished from random noise (a problem exacerbated by the entropy weighting scheme employed by HMMER3), so a sequence containing multiple repeats (usually 2 or 3 copies of the unit) is used to construct the profile HMM (centre in Figure 5). While this typically covers more of the tandem repeat in the sequence, it results in a series of overlapping matches, each of which may still be only a partial match to the profile HMM. In such cases, the graphical representation of the Pfam matches on the InterPro website may show unusual patterns of matches of differing lengths.
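When interpreting overlapping or mosaic matches like these, it can help to collapse the individual hits into the regions they cover and to list the gaps between them. The following Python sketch shows one way to do this. It is not part of Pfam or InterPro, the function names are hypothetical, and the hit coordinates are invented for illustration (1-based, inclusive).

```python
def merge_hits(hits):
    """Merge overlapping or directly adjacent (start, end) hits.

    Coordinates are assumed to be 1-based and inclusive.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def gaps_between(regions):
    """Return the uncovered stretches between consecutive merged regions."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(regions, regions[1:])
    ]


# Hypothetical matches of a multi-repeat model on one sequence.
hits = [(12, 78), (45, 110), (150, 215), (180, 240)]

regions = merge_hits(hits)
gaps = gaps_between(regions)
print("covered regions:", regions)
print("uncovered gaps:", gaps)
```

A gap between covered regions may correspond to poorly conserved repeat copies that the model failed to match rather than to a genuine break in the repeat array, so such gaps are worth inspecting by eye.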

Finally, some repeats are modelled as the whole set of tandem repeats (right in Figure 5). While it may seem that this should capture the entire repetitive region, the local-local matching search algorithm means that multiple partial matches may occur, separated by instances of very poorly conserved repeats.

As stated previously, comparing the profile HMM with known structures can help to determine the repeating unit. However, determining the structure of proteins with tandem repeats can itself be difficult, because the repeat region may be intrinsically unstructured.

Figure 5 Modelling of repeat units

In reality, the N- and C-terminal repeat units can have less sequence similarity to the central units. This is because there are fewer steric restrictions at the termini and because the terminal units may have a role in capping the repeat structure. Therefore, the Pfam family model often represents multiple copies of the repeat unit, which offers increased selectivity compared with a single-repeat model.