The Swiss-Prot and TrEMBL Protein Sequence Database as a Tool to Model Regulatory and Metabolic Pathways


Swiss-Prot, established in 1986 and maintained collaboratively since 1987 by the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)), is the most widely used protein sequence database because it distinguishes itself from other sequence databases by three essential criteria:


MINIMAL REDUNDANCY
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimise the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
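
As a rough illustration of how conflicts between merged sequencing reports might be located, the following minimal sketch compares two reports of the same protein position by position. It assumes, purely for illustration, that both reports are ungapped and of equal length; the function name `find_conflicts` and the example sequences are made up and are not part of Swiss-Prot.

```python
def find_conflicts(reference, other):
    """Return (position, reference_residue, other_residue) for each mismatch.

    Positions are 1-based. Assumption for this sketch: the two reports
    are ungapped and have the same length.
    """
    if len(reference) != len(other):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (position, ref_residue, other_residue)
        for position, (ref_residue, other_residue)
        in enumerate(zip(reference, other), start=1)
        if ref_residue != other_residue
    ]


# Made-up sequences standing in for two literature reports.
report_a = "MKTAYIAKQR"
report_b = "MKTAYLAKQR"
conflicts = find_conflicts(report_a, report_b)
```

Each tuple in `conflicts` corresponds to one disagreement that a merged entry would need to record in its feature table.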


INTEGRATION WITH OTHER DATABASES
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialised data collections. Swiss-Prot is currently cross-referenced with 30 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.


ANNOTATION
One of Swiss-Prot's leading concepts from the very beginning was to provide not just a simple collection of protein sequences, but a critical view of what is known or postulated about each of these sequences. In Swiss-Prot, each sequence entry consists of the sequence data, the citation information (bibliographical references), the taxonomic data (description of the biological source of the protein), and the annotation, which describes the following items:

  • Function(s) of the protein
  • Post-translational modification(s). For example, carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
  • Domains and sites. For example, calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
  • Secondary structure
  • Quaternary structure
  • Similarities to other proteins
  • Disease(s) associated with deficiency(ies) in the protein
  • Sequence conflicts, variants, etc.

In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). We use a controlled vocabulary whenever possible; this approach permits the easy retrieval of specific categories of data from the database.
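
The retrieval that the line codes and controlled vocabulary make possible can be sketched as follows. The sketch relies on the flat-file convention that each line starts with a two-letter code followed by three spaces, and that KW lines hold semicolon-separated keywords ending in a period. The sample entry is hypothetical and heavily abbreviated, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical, abbreviated entry text, made up for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN
CC   -!- FUNCTION: Hypothetical calcium-dependent kinase.
CC   -!- SUBUNIT: Homodimer.
FT   DOMAIN       10    120       Protein kinase.
KW   ATP-binding; Calcium; Kinase;
KW   Zinc-finger.
//
"""


def group_by_line_code(entry_text):
    """Collect the data part of each line under its two-letter line code."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # entry terminator
            break
        code, data = line[:2], line[5:]
        groups[code].append(data)
    return dict(groups)


def keywords(groups):
    """Split the KW lines into individual controlled-vocabulary terms."""
    joined = " ".join(groups.get("KW", []))
    return [kw.strip() for kw in joined.rstrip(".").split(";") if kw.strip()]


groups = group_by_line_code(SAMPLE_ENTRY)
entry_keywords = keywords(groups)
```

Because the keywords come from a controlled vocabulary, a test such as `"Kinase" in entry_keywords` is enough to select a whole category of entries.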

We include as much annotation as possible in Swiss-Prot. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.

However, the increased data flow from genome projects to the sequence databases poses a number of challenges to this approach to annotation. Attaching biological knowledge abstracted from publications to the sequences is a skilled and labour-intensive task. Maintaining the high quality of sequence and annotation in Swiss-Prot requires careful sequence analysis and detailed annotation of every entry, and this is the rate-limiting step in the production of Swiss-Prot. The ever-increasing rate at which new sequences are determined requires new approaches if Swiss-Prot is to keep up. While we do not wish to relax the high editorial standards of Swiss-Prot, there is clearly a limit to how much we can speed up the annotation procedures. On the other hand, it is also vital that we make new sequences available as quickly as possible. To address this concern, we introduced TrEMBL (Translation of EMBL nucleotide sequence database) in 1996. TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already included in Swiss-Prot.
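
The principle behind TrEMBL (translate every CDS, skip those already covered by Swiss-Prot) can be sketched in a few lines. This is a simplified illustration, not the production procedure: it assumes the standard genetic code for every sequence, and the names `cds_records` and `already_in_swissprot` are hypothetical stand-ins for however the real pipeline tracks which CDS are already in Swiss-Prot.

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def new_translations(cds_records, already_in_swissprot):
    """Translate every CDS whose identifier is not already covered.

    cds_records: dict mapping a CDS identifier to its nucleotide sequence.
    already_in_swissprot: set of CDS identifiers to skip (hypothetical input).
    """
    return {
        cds_id: translate(sequence)
        for cds_id, sequence in cds_records.items()
        if cds_id not in already_in_swissprot
    }


# Made-up identifiers and sequences for illustration.
example = new_translations(
    {"cds1": "ATGGCCTAA", "cds2": "ATGAAATGA"},
    already_in_swissprot={"cds2"},
)
```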

Swiss-Prot + TrEMBL represent the most complete and up-to-date protein sequence database with the lowest degree of redundancy and the highest standard of annotation publicly available today. However, to cope with the flood of sequence and functional data, new techniques to speed up sequence analysis, information acquisition and data integration into Swiss-Prot + TrEMBL need to be developed.

Most sequence data now come from genome projects and lack the biochemical evidence that would provide hard data on the function of the protein. Predicting functional information from the primary sequence is a comparative problem based on a set of general rules and relationships derived from the current set of known proteins. Modern sensitive database search algorithms find already characterised sequences similar to new sequences and enable us to annotate new sequences by analogy to old sequences. Secondary pattern and profile databases are used to enhance TrEMBL entries by adding information about the potential functions of proteins, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and other annotation. We are automating the similarity and motif searches to accelerate the upgrading of TrEMBL entries to Swiss-Prot standard. The annotation task, whether automated or carried out by database curators, can proceed far more quickly if large groups of related proteins, such as families of sequences sharing a similar motif, can be annotated together.
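
To make the idea of a motif search concrete, the sketch below converts a pattern written in PROSITE-style syntax into a regular expression and scans a sequence with it. The text above does not name a specific pattern database, so the choice of PROSITE syntax is an assumption, and only a subset of that syntax is handled. The function names and the test sequence are made up; the pattern `[AG]-x(4)-G-K-[ST]` is the well-known ATP/GTP-binding site motif A (P-loop).

```python
import re

# One pattern element: optional "<" anchor, a residue, "x", an [allowed] set
# or an {excluded} set, an optional (n) or (n,m) repeat, optional ">" anchor.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset of the syntax) to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return re.compile("".join(parts))


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = prosite_to_regex(pattern)
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Made-up sequence containing one P-loop-like stretch.
hits = find_motif("MTGPPGSGKSTL", "[AG]-x(4)-G-K-[ST]")
```

A hit list like `hits` is the kind of sequence analysis result that can then be used to annotate a whole family of sequences sharing the motif in one pass.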

A collaborative environment of so-called "agents" has been implemented which enables the investigation of different ways to store, share and deduce biological data. We embedded in this environment software to automate and combine similarity searches, motif searches, special sequence analysis tools, and the parsing of verified information from related biomolecular databases. This serves as a framework for the automation of annotation and takes advantage of a rule-based system to analyse sequences by comparison to the biochemically characterised and well-annotated entries in Swiss-Prot, in order to predict in a standardised way the functional properties of TrEMBL entries. The rule-based system consists of a growing number of rules and hierarchical classifications of the annotation content of Swiss-Prot entries, where all nodes in these hierarchical trees are linked to specific annotation. The rules consider the sequence analysis results to decide which node(s) in the classification tree(s) are sufficiently similar to the query sequence, and subsequently lead to the incorporation of the appropriate annotation (linked to the node) in the TrEMBL entry. The incorporated annotation is flagged as annotation based on sequence analysis methods. We add information based on our automatic analysis to TrEMBL entries only if we are convinced that the computer-generated annotation is correct in more than 99% of cases.
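
The interplay of rules and classification trees described above might be sketched as follows. This is a deliberately simplified model under stated assumptions, not the actual system: a "rule" is reduced to a set of analysis hits that must all be present, a node is only considered when its parent matched, and the names `Node`, `predict`, `FLAG` and all hit identifiers and annotation strings are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical marker for annotation derived from sequence analysis.
FLAG = "by sequence analysis"


@dataclass
class Node:
    """One node of a classification tree, linked to its annotation."""
    name: str
    required_hits: frozenset      # analysis results the rule asks for
    annotation: list              # annotation linked to this node
    children: list = field(default_factory=list)


def predict(node, hits):
    """Collect flagged annotation from every matching node, top down.

    A node matches when all of its required hits were found for the query
    sequence; children are only examined if their parent matched.
    """
    if not node.required_hits <= hits:
        return []
    flagged = [(text, FLAG) for text in node.annotation]
    for child in node.children:
        flagged.extend(predict(child, hits))
    return flagged


# Made-up two-level tree for illustration.
kinase_tree = Node(
    name="protein kinase",
    required_hits=frozenset({"kinase_domain_profile"}),
    annotation=["Keyword: Kinase"],
    children=[
        Node(
            name="serine/threonine kinase",
            required_hits=frozenset({"ser_thr_active_site_pattern"}),
            annotation=["Similarity: Ser/Thr protein kinase family"],
        )
    ],
)

predicted = predict(
    kinase_tree, {"kinase_domain_profile", "ser_thr_active_site_pattern"}
)
```

Carrying the flag alongside each piece of annotation keeps predicted assignments separable from those backed by experimental evidence.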

The tools currently in place enable us to add information about the potential function of the protein, metabolic pathways, active sites, catalytic activity, cofactors, binding sites, domains, subcellular location and other annotation to more than 20% of all new TrEMBL entries in a highly reliable way.

With this annotation concept for Swiss-Prot + TrEMBL, we try to combine two strengths: annotation carefully done by human experts, who draw on biological knowledge, consultation of the relevant literature and thorough sequence analysis; and the power of automated sequence analysis and computer-generated annotation. Since predicted annotation assignments and assignments based on hard experimental evidence are clearly distinguishable, we present in TrEMBL highly reliable, although putative, functional predictions without lowering the high editorial standards of Swiss-Prot entries. The comprehensiveness of Swiss-Prot + TrEMBL, its high degree of integration with other databases, and its combination of clearly distinguishable experimental and predicted data make this protein sequence database a central tool to model regulatory and metabolic pathways.


