
Threshold adjustment

To capture all examples of a repeating unit in a sequence, the inclusion thresholds (i.e. the cut-offs that determine which sequence regions are deemed to be part of the family) have to be carefully defined.

As described in our tutorial Pfam Database: Creating Protein Families, there are two different values in the HMMER output that we consider: E-values and bit scores. For each of these, two scores are reported for every match, one for the individual hit and one for the sequence. The sequence bit score is the sum of all hit bit scores found between the sequence and the profile HMM. For most Pfam entries of type domain, the sequence and hit thresholds are very similar, around 25.0 bits, as you do not typically find domains repeated within a sequence.

In the case of repeats, we often define very different sequence and hit bit score thresholds. Consider a sequence with 6 identical repeats. Each individual repeat may score 15 bits, giving the sequence a bit score of 90. However, we have already highlighted that repeat sequence motifs degenerate, so in reality some repeats usually match the consensus very well and others do not. It is therefore more likely that our 6-repeat sequence contains imperfect repeats, and we might find that the hit bit scores vary between 4 and 35 (see example in Figure 4). Setting a sequence threshold of 50 bits and a hit threshold of 4 bits would allow us to capture all instances of the repeat. Having a stringent sequence threshold requires the repeat to be observed multiple times, which prevents the false positives from single matches that a threshold as low as 4 bits would otherwise let through.
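
The interplay of the two thresholds can be sketched in a few lines of Python. This is only an illustration, not how HMMER or Pfam applies thresholds internally. It uses the simplification stated above that the sequence bit score is the sum of the hit bit scores, and it assumes a score equal to a threshold passes. The function name `filter_repeat_hits` and the example scores are hypothetical; the scores were chosen to fall within the 4 to 35 bit range discussed above.

```python
def filter_repeat_hits(hit_scores, seq_threshold=50.0, hit_threshold=4.0):
    """Return the hit bit scores kept for one sequence.

    Simplification: the sequence bit score is taken as the sum of all
    hit bit scores. A sequence below seq_threshold contributes no hits;
    otherwise every hit at or above hit_threshold is kept.
    """
    sequence_score = sum(hit_scores)
    if sequence_score < seq_threshold:
        return []  # sequence fails, so none of its hits are reported
    return [score for score in hit_scores if score >= hit_threshold]


# Hypothetical sequence with six imperfect repeats (scores in bits).
six_repeats = [35.0, 22.0, 15.0, 9.0, 6.0, 4.0]

# Hypothetical sequence with a single weak, probably spurious, match.
single_match = [4.5]

kept_repeats = filter_repeat_hits(six_repeats)
kept_single = filter_repeat_hits(single_match)

print(len(kept_repeats))  # number of repeat units retained
print(len(kept_single))   # number of hits retained for the lone match
```

With these made-up scores, the six-repeat sequence sums to 91 bits, so it clears the 50-bit sequence threshold and every repeat, including the weakest one at 4 bits, is retained. The lone 4.5-bit match clears the hit threshold but not the sequence threshold, so it is discarded.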

Figure 4. To detect all repeat units in one sequence, we set a stringent sequence bit score threshold and a low hit bit score threshold.