What is Pfam?

Proteins generally have one or more functional regions, commonly termed ‘domains’. The presence of different domains in varying combinations in different proteins gives rise to the diverse functional repertoire found in nature. Identifying the domain(s) present in a protein can provide insight into the function of that protein. Pfam is a database of these conserved evolutionary units.

Each Pfam entry is represented by a set of aligned sequences together with a probabilistic representation of that alignment, called a profile hidden Markov model (HMM). The profile HMM is trained on a small representative set of aligned sequences that are known to belong to the family (the ‘seed’ alignment). This model is then used to search exhaustively against a large sequence database (e.g. UniProtKB) to find all homologous sequences. Sequences that are significantly similar to the model are aligned to the profile HMM to produce the full alignment.
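
The seed-then-search workflow described above can be illustrated with a deliberately simplified sketch. It builds a position-specific log-odds profile from a tiny gapless seed alignment and slides it along candidate sequences. This is not a full profile HMM (there are no insert or delete states) and it is not how Pfam itself is implemented. The seed sequences, the database entries, the uniform background frequencies and the 5-bit threshold are all invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Assumes a gapless seed alignment and a uniform background distribution,
    both simplifications made for this toy example.
    """
    width = len(seed_alignment[0])
    if any(len(seq) != width for seq in seed_alignment):
        raise ValueError("all seed sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(seed_alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in seed_alignment)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0 bits.
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical seed alignment and sequence "database".
seed = ["GDSGG", "GDSGA", "GDSGG", "GESGG"]
database = {
    "hit": "MKTAGDSGGPLV",
    "miss": "MKTAYLLWPEEQ",
}

THRESHOLD_BITS = 5.0  # arbitrary cut-off chosen for this example

profile = build_profile(seed)
hits = {}
for name, sequence in database.items():
    score, start = best_window(profile, sequence)
    if score > THRESHOLD_BITS:
        hits[name] = (score, start)

for name, (score, start) in hits.items():
    print(f"{name}: {score:.2f} bits, match starts at position {start}")
```

Sequences scoring above the threshold would then be aligned to the model to build the full alignment. A real profile HMM additionally models insertions and deletions and assesses the statistical significance of each match.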

Related Pfam entries may be grouped into sets, labelled as ‘Clans’. These are typically large and divergent superfamilies, where a single profile HMM is insufficient to capture all member sequences.

Why do we need Pfam?

Our ability to generate sequence data far exceeds the rate at which we can functionally characterise sequences experimentally. Therefore, computational methods are needed to help identify regions of similarity between sequences. Matching sequences to a Pfam entry allows us to transfer the functional information from an experimentally characterised sequence to uncharacterised sequences in the same entry. Pfam also provides comprehensive annotation for each entry.

What can I do with Pfam?

With Pfam you can:

  • search your sequence against our models
  • search the database by keywords
  • browse our entries and clans and view relationships between entries in a clan
  • retrieve text annotation, structure information, protein taxonomy distribution, alignments, and other data about any given entry