The Swiss-Prot and TrEMBL Protein
Sequence Database as a Tool to Model Regulatory and
Metabolic Pathways
Swiss-Prot, established in 1986 and maintained collaboratively
since 1987 by the University of Geneva and the EMBL
Data Library (now the EMBL Outstation - The European
Bioinformatics Institute (EBI)), is the most widely
used protein sequence database. It distinguishes
itself from other sequence databases by three essential
criteria:
MINIMAL REDUNDANCY
Many sequence databases contain, for a given protein
sequence, separate entries which correspond to different
literature reports. In Swiss-Prot we try as much as
possible to merge all these data so as to minimise the
redundancy of the database. If conflicts exist between
various sequencing reports, they are indicated in the
feature table of the corresponding entry.
INTEGRATION WITH OTHER DATABASES
It is important to provide the users of biomolecular
databases with a degree of integration between the three
types of sequence-related databases (nucleic acid sequences,
protein sequences and protein tertiary structures) as
well as with specialised data collections. Swiss-Prot
is currently cross-referenced with 30 different databases.
Cross-references are provided in the form of pointers
to information related to Swiss-Prot entries and found
in data collections other than Swiss-Prot.
ANNOTATION
One of Swiss-Prot's leading concepts from the very
beginning was to provide not a simple collection
of protein sequences but a critical view of
what is known or postulated about each of these sequences.
In Swiss-Prot each sequence entry consists of the sequence
data, the citation information (bibliographical references),
the taxonomic data (description of the biological source
of the protein), and the annotation which describes
the following items:
- Function(s) of the protein
- Post-translational modification(s), for example
carbohydrates, phosphorylation, acetylation, GPI-anchor,
etc.
- Domains and sites, for example calcium-binding regions,
ATP-binding sites, zinc fingers, homeobox, kringle,
etc.
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the
protein
- Sequence conflicts, variants, etc.
In Swiss-Prot, annotation is mainly found in the comment
lines (CC), in the feature table (FT) and in the keyword
lines (KW). We use a controlled vocabulary whenever
possible; this approach permits the easy retrieval of
specific categories of data from the database.
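As a rough illustration of how line codes and a controlled
vocabulary make retrieval easy, the sketch below groups the
lines of a flat-file entry by their two-letter code and then
pulls out the keywords and comment topics. The entry text is
a hypothetical, abbreviated example written for this sketch,
and the function names are assumptions, not part of any
Swiss-Prot software.

```python
from collections import defaultdict

# Hypothetical, abbreviated entry written for illustration only.
ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   120 AA.
CC   -!- FUNCTION: Example calcium-binding protein.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
FT   CA_BIND      10     21       EF-hand.
KW   Calcium-binding; Repeat.
"""


def group_by_line_code(entry_text):
    """Group the content of each line under its two-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2:
            continue
        # Assumes the code sits in columns 1-2 and content starts at column 6.
        code, content = line[:2], line[5:].strip()
        groups[code].append(content)
    return groups


def keywords(groups):
    """Split the KW lines into individual controlled-vocabulary terms."""
    joined = " ".join(groups.get("KW", []))
    return [kw.strip() for kw in joined.rstrip(".").split(";") if kw.strip()]


def comment_topics(groups):
    """Map each CC topic (e.g. FUNCTION) to its free-text comment."""
    topics = {}
    current = None
    for content in groups.get("CC", []):
        if content.startswith("-!-"):
            topic, _, text = content[3:].partition(":")
            current = topic.strip()
            topics[current] = text.strip()
        elif current is not None:
            # Continuation line of the previous topic.
            topics[current] += " " + content
    return topics


groups = group_by_line_code(ENTRY)
print(keywords(groups))
print(comment_topics(groups))
```

Because keywords and comment topics come from a fixed
vocabulary, a query such as "all entries with the keyword
Calcium-binding" reduces to an exact string match on the
parsed terms.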
We include as much annotation as possible in Swiss-Prot.
To obtain this information we use, in addition to the
publications that report new sequence data, review articles
to periodically update the annotations of families or
groups of proteins. We also make use of external experts,
who have been recruited to send us their comments and
updates concerning specific groups of proteins.
However, due to the increased data flow from genome
projects to the sequence databases, we face a number
of challenges to our approach to database annotation. The
attachment of biological knowledge abstracted from publications
to the sequences is a skilled and labour-intensive task.
Maintaining the high quality of sequence and annotation
in Swiss-Prot requires careful sequence analysis and
detailed annotation of every entry. This work is the rate-limiting
step in the production of Swiss-Prot. The ever-increasing
rate of determination of new sequences requires new
approaches if Swiss-Prot is to keep up. While we do
not wish to relax the high editorial standards of Swiss-Prot,
it is clear that there is a limit to how much we can
speed up the annotation procedures. On the other hand,
it is also vital that we make new sequences available
as quickly as possible. To address this concern, we
introduced TrEMBL (Translation of the EMBL nucleotide
sequence database) in 1996. TrEMBL consists of computer-annotated
entries derived from the translation of all coding sequences
(CDS) in the EMBL database, except for CDS already included
in Swiss-Prot.
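The construction just described can be pictured as a set
difference followed by translation. The sketch below is a
minimal illustration under stated assumptions: the CDS
identifiers are hypothetical, the standard genetic code is
used for every sequence, and a CDS counts as "already in
Swiss-Prot" when its identifier appears in a supplied set.
How that membership is actually determined is not stated
here.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def trembl_candidates(embl_cds, swissprot_cds_ids):
    """Translate every CDS whose identifier is not already covered.

    embl_cds: dict mapping a (hypothetical) CDS identifier to its DNA.
    swissprot_cds_ids: identifiers assumed to be in Swiss-Prot already.
    """
    return {
        cds_id: translate(seq)
        for cds_id, seq in embl_cds.items()
        if cds_id not in swissprot_cds_ids
    }


embl_cds = {"CDS0001": "ATGGCCTAA", "CDS0002": "ATGAAATGGTGA"}
print(trembl_candidates(embl_cds, {"CDS0001"}))
```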
Swiss-Prot + TrEMBL represent the most complete and
up-to-date protein sequence database with the lowest
degree of redundancy and the highest standard of annotation
publicly available today. However, to cope with the
flood of sequence and functional data, new techniques
to speed up sequence analysis, information acquisition
and data integration into Swiss-Prot + TrEMBL need to
be developed.
Most sequence data now comes from genome
projects and lacks the biochemical evidence that would provide hard
data on the function of the protein. The prediction
of functional information from primary sequence information
is a comparative problem based on a set of general rules
and relationships derived from the current set of known
proteins. Modern sensitive database search algorithms
find already characterised sequences similar to new
sequences and enable us to annotate new sequences by
analogy to old sequences. Secondary pattern and profile
databases are used to enhance TrEMBL entries by adding
information about the potential functions of proteins,
metabolic pathways, active sites, cofactors, binding
sites, domains, subcellular location, and other annotation.
We are automating the similarity and motif searches
to accelerate the upgrading of TrEMBL entries to Swiss-Prot
standard. The annotation task, whether automated or
carried out by database curators, can proceed far more
quickly if large groups of related proteins, such as
families of sequences sharing a similar motif, can be
annotated together.
A collaborative environment of so-called "agents"
has been implemented which enables the investigation
of different possibilities to store, share and deduce
biological data. We embedded in this environment software
to automate and combine similarity searches, motif searches,
special sequence analysis tools, and the parsing of
verified information from related biomolecular databases.
This serves as a framework for the automation of annotation
and takes advantage of a rule-based system to analyse
sequences by comparison to the biochemically characterised
and well-annotated entries in Swiss-Prot to predict
in a standardised way the functional properties of TrEMBL
entries. The rule-based system consists of a growing
number of rules and hierarchical classifications of
the annotation content of Swiss-Prot entries, where
all nodes in these hierarchical trees are linked to
certain annotation. The rules consider the sequence
analysis results to decide which node(s) in the classification
tree(s) are sufficiently similar to the query sequence
and lead subsequently to the incorporation of the appropriate
annotation (linked to the node) in the TrEMBL entry.
The incorporated annotation is flagged as annotation
based on sequence analysis methods. We add information
based on our automatic analysis to TrEMBL entries only if
we are convinced that the automatic procedure produces
correct annotation in more than 99% of cases.
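The rule-based procedure described above can be sketched
roughly as follows. This is an illustration under
assumptions, not the system itself: the node structure, the
hit labels, the per-node precision estimate and the
"evidence" flag are all hypothetical names chosen for the
example. Only the general idea comes from the text, namely
a classification tree whose nodes carry annotation, rules
that test sequence analysis results against the nodes, a
flag on every transferred annotation, and a reliability
threshold of more than 99%.

```python
from dataclasses import dataclass, field

MIN_PRECISION = 0.99  # threshold from the text; its estimation is assumed


@dataclass
class Node:
    name: str
    required_hits: frozenset  # analysis results that must all be present
    annotation: dict          # annotation linked to this node
    precision: float          # hypothetical estimate of rule reliability
    children: list = field(default_factory=list)


def collect_annotation(node, hits):
    """Return flagged annotation from every node the hits satisfy.

    A child is only examined when its parent's conditions hold, so the
    search descends from general classes to more specific ones.
    """
    if not node.required_hits <= hits:
        return []
    found = []
    if node.precision > MIN_PRECISION:
        for topic, text in node.annotation.items():
            found.append(
                {"topic": topic, "text": text, "evidence": "sequence analysis"}
            )
    for child in node.children:
        found.extend(collect_annotation(child, hits))
    return found


# Illustrative tree; labels and precision values are made up.
tree = Node(
    "protein kinase",
    frozenset({"PROFILE:protein_kinase"}),
    {"SIMILARITY": "Belongs to the protein kinase family."},
    0.995,
    children=[
        Node(
            "Ser/Thr protein kinase",
            frozenset({"PATTERN:ser_thr_active_site"}),
            {"CATALYTIC ACTIVITY": "ATP + a protein = ADP + a phosphoprotein."},
            0.992,
        ),
        Node(
            "Tyr protein kinase",
            frozenset({"PATTERN:tyr_active_site"}),
            {"SIMILARITY": "Belongs to the Tyr protein kinase subfamily."},
            0.97,  # below the threshold, so never transferred
        ),
    ],
)

hits = {"PROFILE:protein_kinase", "PATTERN:ser_thr_active_site"}
for item in collect_annotation(tree, hits):
    print(item)
```

In this sketch a node that matches but falls below the
threshold contributes nothing, which mirrors the policy of
leaving an entry unannotated rather than adding uncertain
information.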
The tools currently in place enable us to add information
about the potential function of the protein, metabolic
pathways, active sites, catalytic activity, cofactors,
binding sites, domains, subcellular location and other
annotation to more than 20% of all new TrEMBL entries
in a highly reliable way.
With this annotation concept for Swiss-Prot + TrEMBL,
we try to combine two strengths: careful annotation
by human experts, who draw on biological knowledge,
the relevant literature and thorough sequence analysis,
and the power of automated sequence analysis
and computer-generated annotation. Since
predicted annotation assignments and assignments based
on hard experimental evidence are clearly distinguishable,
we can present highly reliable, although putative,
functional predictions in TrEMBL without lowering the high editorial
standards of Swiss-Prot entries. The comprehensiveness
of Swiss-Prot + TrEMBL and its high degree of integration
with other databases, together with the combination of
clearly distinguishable experimental and predicted data,
make this protein sequence database
a central tool for modelling regulatory and metabolic pathways.