Protein Databases
Secondary protein databases
Very often the sequence of an unknown protein is too distantly related to any
protein of known structure to detect its resemblance by overall sequence
alignment, but it can be identified by the occurrence in its sequence of a
particular cluster of residue types which is commonly known as a pattern,
motif, signature, or fingerprint.
These motifs arise because of particular
requirements on the structure of specific
region(s) of a protein, which may be important,
for example, for their binding properties or for
their enzymatic activity. These requirements
impose very tight constraints on the evolution of
those limited (in size) but important portion(s)
of a protein sequence. A signature modelling such
a site must be as short as possible, should detect
all or most of the sequences it is designed to describe
and should not give too many false positive results. In
other words it must exhibit both high sensitivity and high
specificity.
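The two quality measures named above can be made concrete with a small calculation. The sketch below assumes the standard statistical definitions (sensitivity as the fraction of true family members detected, specificity as the fraction of non-members rejected); the function names and the counts are hypothetical and chosen only for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true family members that the signature detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-members that the signature correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts from scanning a sequence database with one signature.
tp, fn = 95, 5       # family members hit / missed
tn, fp = 9990, 10    # non-members rejected / wrongly hit

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
```

A shorter signature tends to match more sequences by chance, which raises the false positive count and lowers specificity, so the two measures usually have to be balanced against each other.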
Several such databases are available. They differ in methodology and in the amount of
biological information they provide on the characterised protein families, domains and sites.
Examples of secondary protein databases include:
- PROSITE -
The special value of this database is its extensive documentation on many protein families,
as defined by sequence domains or motifs. PROSITE contains biologically significant sites
and patterns, formulated so that appropriate computational tools can rapidly and reliably
identify the family of proteins to which a new sequence belongs.
The profile structure used in PROSITE is similar to but slightly more general than
the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised
profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs)
used in Pfam.
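PROSITE patterns are written in a compact notation that maps closely onto regular expressions. The sketch below translates a subset of that notation (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and the handling is deliberately incomplete; it is an illustration, not the scanning software distributed with PROSITE.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Covers only a subset of the notation (an assumption of this sketch).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # Bracketed sets and single letters are already valid regex.
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if anchor_start else "") + piece
                     + ("$" if anchor_end else ""))
    return "".join(parts)


# N-glycosylation site pattern: N, not P, S or T, not P.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "AANGSAA")))   # contains N-G-S-A
print(bool(re.search(regex, "AANPSAA")))   # P after N is excluded
```

Such a short pattern matches many unrelated sequences by chance, which illustrates why short signatures are sensitive but not always specific.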
- PRINTS -
A different approach to pattern recognition, termed "fingerprinting", is used by this
database. Within a sequence alignment, it is usual to find not one, but several motifs that
characterise the aligned family. Diagnostically, it makes sense to use many, or all, of the
conserved regions to build a family signature. In a database search, there is then a greater
chance of identifying a distant relative, whether or not all parts of the signature are matched.
The ability to tolerate mismatches, both at the level of residues within individual motifs, and
at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful
diagnostic technique.
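The idea of tolerating missing motifs can be sketched as follows. In this simplified illustration each motif is a regular expression (character classes stand in for residue-level mismatches), the motifs must occur in order, and a sequence is accepted when at least `min_motifs` of them are found. PRINTS itself scores motifs with more elaborate methods; the function names, motifs and sequence here are hypothetical.

```python
import re


def fingerprint_hits(sequence: str, motifs: list[str]) -> list[bool]:
    """Report, for each motif in order, whether it occurs in the sequence
    downstream of the previously matched motif (greedy, left to right)."""
    hits = []
    position = 0
    for motif in motifs:
        match = re.compile(motif).search(sequence, position)
        if match:
            hits.append(True)
            position = match.end()   # later motifs must lie further along
        else:
            hits.append(False)       # a missing motif is tolerated
    return hits


def matches_fingerprint(sequence: str, motifs: list[str],
                        min_motifs: int) -> bool:
    """Accept the sequence if enough motifs of the fingerprint are present."""
    return sum(fingerprint_hits(sequence, motifs)) >= min_motifs


# Hypothetical three-motif fingerprint and query sequence.
motifs = ["G[AS]GK", "DE[AV]D", "W..F"]
query = "MMGAGKTTLLDEADRRRKKK"

print(fingerprint_hits(query, motifs))
print(matches_fingerprint(query, motifs, min_motifs=2))
```

Here the query lacks the third motif but is still recognised, which is the behaviour that makes a partial fingerprint match useful for distant relatives.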
- Pfam -
Another important secondary protein database is Pfam. Pfam uses Hidden Markov Models
(HMMs) to create protein family or domain signatures.
HMMs are closely related to profiles, but are grounded in probability theory. This
allows a direct statistical approach to identifying and scoring matches, and also to
combining information from a multiple alignment with prior knowledge.
One feature that distinguishes HMMs and profiles from
regular expressions and fingerprints is that the former
allow the full extent of a domain to be identified in a
sequence. They are thus particularly useful when analysing
multidomain proteins. The biggest drawback of Pfam is its
lack of biological information (annotation) for the protein
families.
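Combining alignment information with prior knowledge can be illustrated with a plain position-specific profile, which is a simpler relative of an HMM. The sketch below assumes an ungapped alignment, a uniform background distribution and a fixed pseudocount as the "prior"; a real profile HMM also models insertions and deletions through state transitions, which are omitted here. All names and the example alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped multiple alignment.

    The pseudocount plays the role of prior knowledge: residues never
    observed in a column still receive a small, non-zero probability.
    """
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum of log-odds scores of a segment aligned to the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical ungapped alignment of four family members.
alignment = ["GAGKT", "GSGKS", "GAGKT", "GTGKS"]
profile = build_profile(alignment)

print(round(score(profile, "GAGKT"), 2))   # family-like segment
print(round(score(profile, "WWWWW"), 2))   # unrelated segment
```

Because the scores are log-odds values, a positive total indicates that the segment is more likely under the family model than under the background, which is the kind of direct statistical interpretation that probabilistic models provide.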
- BLOCKS -
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved
regions of proteins. The blocks for the Blocks Database are made automatically by looking
for the most highly conserved regions in groups of proteins documented in InterPro.
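To make the notion of a block concrete, the sketch below scans an existing multiple alignment for runs of columns that contain no gaps and in which one residue dominates. This is only an illustration of what "ungapped, highly conserved segment" means; it is not the procedure used to build the Blocks Database, and the function name, thresholds and alignment are hypothetical.

```python
def find_blocks(alignment, min_length=3, min_identity=0.75):
    """Return (start, end) column ranges, end exclusive, of ungapped runs
    in which the most common residue reaches min_identity in every column."""
    blocks = []
    start = None
    n_cols = len(alignment[0])
    for i in range(n_cols + 1):          # one extra step closes a final run
        conserved = False
        if i < n_cols:
            column = [seq[i] for seq in alignment]
            if "-" not in column:
                top = max(column.count(aa) for aa in set(column))
                conserved = top / len(column) >= min_identity
        if conserved and start is None:
            start = i                    # a new run begins
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Hypothetical alignment; '-' marks a gap.
alignment = [
    "GAGKTA-LDEAD",
    "GSGKTC-MDEAD",
    "GAGKSWALDEVD",
    "GAGKTY-QDEAD",
]
for start, end in find_blocks(alignment):
    print(start, end, [seq[start:end] for seq in alignment])
```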
- SBASE -
This is a library of protein domain sequences that contains annotated structural,
functional, ligand-binding and topogenic segments of proteins, cross-referenced to all
major sequence databases and sequence pattern collections.