
Protein Databases

Secondary protein databases

Very often the sequence of an unknown protein is too distantly related to any protein of known structure for the resemblance to be detected by overall sequence alignment. It can nevertheless be identified by the occurrence in its sequence of a particular cluster of residue types, commonly known as a pattern, motif, signature, or fingerprint.

These motifs arise because of particular requirements on the structure of specific region(s) of a protein, which may be important, for example, for its binding properties or its enzymatic activity. These requirements impose very tight constraints on the evolution of those small but important portion(s) of a protein sequence. A signature modelling such a site should be as short as possible, should detect all or most of the sequences it is designed to describe, and should not give too many false-positive results. In other words, it must exhibit both high sensitivity and high specificity.
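The trade-off between sensitivity and specificity can be made concrete with a small sketch. It assumes the signature is written as a regular expression and that a set of sequences has already been labelled as family members or non-members. The function name `evaluate_signature`, the pattern `C..HC` and all of the sequences are made up for illustration; specificity is taken here as the fraction of non-members that the signature correctly rejects.

```python
import re


def evaluate_signature(pattern, members, non_members):
    """Return (sensitivity, specificity) of a regex signature on labelled sequences."""
    signature = re.compile(pattern)

    # Family members that the signature detects / misses.
    true_pos = sum(1 for seq in members if signature.search(seq))
    false_neg = len(members) - true_pos

    # Unrelated sequences that the signature wrongly hits / correctly rejects.
    false_pos = sum(1 for seq in non_members if signature.search(seq))
    true_neg = len(non_members) - false_pos

    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Toy data, invented for this example only (not real protein sequences).
family = ["MKCAAHCLLG", "TTCGSHCPQR", "AACWEHCKKL", "GGCPPHAVVT"]
others = ["MKLLAAGGTT", "PPCAAHCEEE", "QQRRSSTTVV", "LLKKMMNNPP", "AAGGCCHHEE"]

# Hypothetical signature: Cys, any two residues, His, Cys.
PATTERN = r"C..HC"

sensitivity, specificity = evaluate_signature(PATTERN, family, others)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Making the pattern shorter or more permissive would tend to catch the missed family member but also hit more unrelated sequences, which is why a good signature has to balance the two measures.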

Several such databases are available. They use different methodologies and provide varying degrees of biological information on the characterised protein families, domains and sites.

Examples of secondary protein databases include:
  • PROSITE - The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, they can be used to identify rapidly and reliably the protein family to which a new sequence belongs.

    The profile structure used in PROSITE is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam.

  • PRINTS - A different approach to pattern recognition, termed "fingerprinting", is used by this database. Within a sequence alignment, it is usual to find not one, but several motifs that characterise the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. The ability to tolerate mismatches, both at the level of residues within individual motifs and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.

  • Pfam - Another important secondary protein database is Pfam. Pfam creates its protein family or domain signatures using Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are grounded in probability theory. This allows a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge.

    One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analysing multidomain proteins. The biggest drawback of Pfam is its lack of biological information (annotation) on the protein families.

  • BLOCKS - Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro.

  • SBASE - This is a library of protein domain sequences that contains annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections.
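To illustrate how a regular-expression style pattern of the kind stored in PROSITE can be applied with ordinary computational tools, the sketch below translates a pattern written in PROSITE notation into a Python regular expression. It assumes the usual notation: elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, and "<" or ">" to anchor the pattern to the N- or C-terminus. The function name `prosite_to_regex` is hypothetical, and the sketch is a simplification that does not cover every feature of the notation (for example ">" inside square brackets).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns are often written with a final period
    parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")  # must match at the N-terminus
        anchored_end = element.endswith(">")      # must match at the C-terminus
        element = element.strip("<>")

        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"

        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # any residue except those listed
        else:
            core = element                       # single residue or [ABC] set

        parts.append(
            ("^" if anchored_start else "") + core + repeat + ("$" if anchored_end else "")
        )
    return "".join(parts)


# Example pattern in PROSITE notation (the commonly quoted N-glycosylation site pattern).
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Scan a made-up sequence for the first occurrence of the pattern.
match = re.search(regex, "AANGSAA")
print(match.start(), match.group())
```

A short pattern like this one reports only where the motif occurs. It does not delimit a whole domain, which is the contrast with profiles and HMMs drawn in the Pfam entry above.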




References:

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987). Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.


