
 

Swiss-Prot AND ITS COMPUTER-ANNOTATED SUPPLEMENT TrEMBL:
HOW TO PRODUCE HIGH QUALITY AUTOMATIC ANNOTATION


Rolf Apweiler, Claire O'Donovan, Maria Jesus Martin, Wolfgang Fleischmann, Henning Hermjakob, Steffen Moeller, Sergio Contrino, Vivien Junker

The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD, UK.

 

ABSTRACT

Swiss-Prot (http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt) is a protein sequence database with a high level of annotation and integration with other databases, and a minimal level of redundancy [1].

The ongoing genome sequencing projects have dramatically increased the number of known protein sequences. To make the sequence information available as quickly as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to Swiss-Prot. TREMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already included in Swiss-Prot. Swiss-Prot + TREMBL provides the scientific community with a comprehensive non-redundant protein sequence databank. However, there is a clear need for new techniques to enhance the production of Swiss-Prot + TREMBL to cope with the flood of sequence and functional data. To achieve this, we are currently developing new methods to accelerate sequence analysis, information acquisition and data integration. Central to this effort will be EDITtoTREMBL (Environment for Distributed Information Transfer to TREMBL), a system that enables the investigation of different possibilities to share and deduce biological information. EDITtoTREMBL analyzes sequences by comparison to the biochemically characterised and well-annotated entries in Swiss-Prot in order to predict, in a standardised way, the functional properties of the TREMBL entries.


Keywords
Database, Sequence Analysis, Automation, Annotation, Swiss-Prot, TrEMBL, Functional Information

 

1. INTRODUCTION

Swiss-Prot, established in 1986 and maintained collaboratively since 1987 by the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI)), is the most widely used protein sequence database because it distinguishes itself from other sequence databases by three essential criteria:

 

Minimal Redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot, we try as much as possible to merge all these data so as to minimise the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table (FT) of the corresponding entry.

 

Integration with other Databases

It is important to provide the users of biomolecular databases with a high degree of interoperability between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialised data collections. Swiss-Prot is currently cross-referenced by more than 250,000 links with 28 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

 

Annotation

One of Swiss-Prot's leading concepts from the very beginning was to provide not just a simple collection of protein sequences but a critical view of what is known or postulated about each of these sequences. A sample entry is shown in Figure 1. In Swiss-Prot, each sequence entry consists of the sequence data, the citation information (bibliographical references), the taxonomic data (description of the biological source of the protein), and the annotation, which describes the following items:

  • Function(s) of the protein
  • Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
  • Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
  • Secondary structure
  • Quaternary structure
  • Similarities to other proteins
  • Disease(s) associated with deficiency(ies) in the protein
  • Sequence conflicts, variants, etc.

In Swiss-Prot, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). We use a controlled vocabulary whenever possible; this approach permits the easy retrieval of specific categories of data from the database.

We include as much annotation as possible in Swiss-Prot. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins.
 

Figure 1. A sample entry from Swiss-Prot

 
 ID   LDHM_HUMAN     STANDARD;      PRT;   331 AA.
 AC   P00338;
 DE   L-LACTATE DEHYDROGENASE M CHAIN (EC 1.1.1.27) (LDH-A).
 GN   LDHA.
 OS   HOMO SAPIENS (HUMAN).
 OC   EUKARYOTA;  METAZOA; CHORDATA; VERTEBRATA; 
 OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
 RN   [1]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 85127030.
 RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
 RL   EUR. J. BIOCHEM. 147:9-15(1985).
 RN   [2]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 86076881.
 RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
 RA   LI S.S.-L.;
 RL   BIOCHEM. J. 231:537-541(1985).
 RN   [3]
 RP   VARIANT CYS-314.
 RX   MEDLINE; 93075246.
 RA   SUDO K., MAEKAWA M., SHIOYA M., IKEDA K., TAKAHASHI N.,
 RA   ISOGAI Y.,
 RA   LI S.S.-L., KANNO T., MACHIDA K., TORIUMI J.;
 RL   BIOCHEM. INT. 27:1051-1057(1992).
 RN   [4]
 RP   VARIANT GLU-221.
 RX   MEDLINE; 94199831.
 RA   MAEKAWA M., SUDO K., KOBAYASHI A., SUGIYAMA E., 
 RA   LI S.S.-L., KANNO T.;
 RL   CLIN. CHEM. 40:665-668(1994).
 CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE +
 CC       NADH.
 CC   -!- SUBUNIT: HOMOTETRAMER.
 CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
 CC   -!- THERE ARE THREE TYPES OF LDH CHAINS: M (LDH-A) 
 CC       FOUND PREDOMINANTLY  IN MUSCLE TISSUES, H (LDH-B)
 CC       FOUND IN HEART MUSCLE AND X (LDH-C) WHICH IS 
 CC       PRESENT IN THE SPERMATOZOA OF MAMMALS, IN THE 
 CC       COLUMBIDAE OF BIRDS AND IN ACTINOPTERYGIAN FISH.
 CC   -!- DISEASE: EXERTIONAL MYOGLOBINURIA IS DUE TO A 
 CC       DEFECT IN LDH-A.
 DR   EMBL; X02152; G34313; -.
 DR   EMBL; X03077; G780261; -.
 DR   EMBL; X03078; G780261; JOINED.
 DR   EMBL; X03079; G780261; JOINED.
 DR   EMBL; X03080; G780261; JOINED.
 DR   EMBL; X03081; G780261; JOINED.
 DR   EMBL; X03082; G780261; JOINED.
 DR   EMBL; X03083; G780261; JOINED.
 DR   PIR; A00347; DEHULM.
 DR   HSSP; P00344; 1LDB.
 DR   AARHUS/GHENT-2DPAGE; 2207; NEPHGE.
 DR   MIM; 150000; -.
 DR   PROSITE; PS00064; L_LDH; 1.
 KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS; 
 KW   MULTIGENE FAMILY; DISEASE MUTATION; POLYMORPHISM.
 FT   INIT_MET      0      0
 FT   ACT_SITE    192    192       ACCEPTS A PROTON DURING
 FT                                CATALYSIS.
 FT   VARIANT     221    221       K -> E.
 FT   VARIANT     314    314       R -> C (IN LDHA DEFICIENCY).
 SQ   SEQUENCE   331 AA;  36557 MW;  DF367487 CRC32;    
//
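
The two-letter line codes make an entry like the one above easy to read mechanically. The following is a minimal Python sketch, not the software used to maintain Swiss-Prot: it groups the lines of one entry by line code and splits the KW lines into individual keywords. The function names `parse_entry` and `keywords` are illustrative assumptions, and only a few lines of Figure 1 are reproduced as sample data.

```python
from collections import defaultdict

def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":   # "//" terminates an entry
            continue
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    return dict(fields)

def keywords(fields):
    """Join the KW lines and split them into individual keywords."""
    joined = " ".join(fields.get("KW", []))
    return [kw.strip() for kw in joined.rstrip(".").split(";") if kw.strip()]

# A few lines copied from the sample entry in Figure 1.
SAMPLE = """\
ID   LDHM_HUMAN     STANDARD;      PRT;   331 AA.
AC   P00338;
CC   -!- SUBUNIT: HOMOTETRAMER.
KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS;
KW   MULTIGENE FAMILY; DISEASE MUTATION; POLYMORPHISM.
//
"""

entry = parse_entry(SAMPLE)
print(entry["AC"])
print(keywords(entry))
```

Because keywords come from a controlled vocabulary, a retrieval such as "all entries with keyword GLYCOLYSIS" reduces to an exact string comparison on the list returned by `keywords`.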
 

 

The Challenge

Due to the increased data flow from genome projects to the sequence databases, we face a number of challenges to our way of annotating the database. Maintaining the high quality of sequence and annotation in Swiss-Prot requires careful sequence analysis and detailed annotation of every entry; this is the rate-limiting step in the production of Swiss-Prot. While we do not wish to relax the high editorial standards of Swiss-Prot, it is clear that there is a limit to how much we can accelerate the annotation procedures. On the other hand, it is also vital that we make new sequences available as quickly as possible. To address this concern, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database) in 1996. TREMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL database, except for CDS already included in Swiss-Prot [2].

 

2. THE PRODUCTION OF TREMBL

 

Translation and Entry Creation

The production of TREMBL is illustrated in Figure 2. All the EMBL nucleotide sequence database divisions are scanned for CDS features, and these are translated to produce TREMBL division files containing TREMBL entries in Swiss-Prot format. The program that produces TREMBL is written in C and provides the basis for a first-level parsing of EMBL database entries. At this level, the text data are fitted into structures such as ordered lists of features or bibliographic references, and the coding sequences are assembled and translated. Each CDS leading to a correct translation results in one entry whose ID is the PID of the CDS. In the next step, the structures are scanned to extract relevant data, filter it and finally insert it, properly formatted, into the TREMBL entry. Only bibliographic references relevant to the given CDS are kept in the TREMBL entry. This is achieved by scanning the RP (Reference Position) lines of the EMBL entry and matching them with the CDS position in the sequence. The RC (Reference Comment) line is built by assigning the Swiss-Prot equivalent of the following EMBL qualifiers:

"/plasmid","PLASMID=",
"/strain","STRAIN=",
"/isolate","STRAIN=", (2nd choice)
"/cultivar","STRAIN=CV. ",
"/tissue_type","TISSUE=",
"/transposon","TRANSPOSON=",

The description line (DE) comes from the /product qualifier when present; otherwise we make use of the EMBL DE line and the /gene and /note qualifiers. The EMBL DE line is only considered if the EMBL entry contains only one CDS, and it is stripped of non-pertinent information such as the organism name or phrases like 'complete cds'. The /gene qualifier is also used for the TREMBL GN line. At the moment, because the EMBL and Swiss-Prot taxonomies are slightly different, we use equivalence tables to assign OS and OC lines in the entries. Where no equivalent is found, the EMBL OS and OC lines are kept. Fortunately, in the near future, GenBank, EMBL, DDBJ and Swiss-Prot are going to adopt a new common taxonomic scheme [3-4].
The EMBL keywords are included in the TREMBL entry, but only when they match a subset of Swiss-Prot keywords which have the same meaning. This occurs only in cases where the EMBL entry has just one CDS so that no ambiguity is possible. Some extra keywords derived from the features and description lines are added.
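
Returning to the RC line: the qualifier mapping listed above can be sketched as a small lookup table. This is a hedged Python illustration, not the C program described in the text. The function name `build_rc_line`, the semicolon-separated output layout, the upper-casing of values (chosen to match the style of the entries in the figures) and the handling of entries that carry both /strain and /cultivar are all assumptions.

```python
# EMBL qualifier -> Swiss-Prot RC token prefix, as listed in the text.
RC_PREFIX = {
    "plasmid": "PLASMID=",
    "strain": "STRAIN=",
    "isolate": "STRAIN=",       # second choice, used only without /strain
    "cultivar": "STRAIN=CV. ",
    "tissue_type": "TISSUE=",
    "transposon": "TRANSPOSON=",
}

# Order in which qualifiers are emitted (an assumption for this sketch).
ORDER = ("plasmid", "strain", "isolate", "cultivar", "tissue_type", "transposon")

def build_rc_line(qualifiers):
    """Return an RC line for a dict of EMBL qualifiers, or None if none apply."""
    tokens = []
    for name in ORDER:
        if name not in qualifiers:
            continue
        if name == "isolate" and "strain" in qualifiers:
            continue  # /isolate is only the second choice for STRAIN
        tokens.append(RC_PREFIX[name] + qualifiers[name].upper())
    if not tokens:
        return None
    return "RC   " + "; ".join(tokens) + ";"

print(build_rc_line({"strain": "K12", "tissue_type": "liver"}))
```

Keeping the mapping in a table rather than in branching code means a new qualifier can be supported by adding one entry.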

A subset of Swiss-Prot features can be derived from the EMBL entry features.
These are:

  • SIGNAL from sig_peptide
  • TRANSIT from transit_peptide
  • CHAIN from mat_peptide
  • VARIANT from allele, variation, misc_difference and mutation
  • CONFLICT from conflict

Two examples of TREMBL entries, created in the way described above, are shown in Figure 3. In addition to the information parsed into TREMBL entries, data is put in the annotator's section of the entry, which is not visible to the public. This is used for further analysis both by programs and by biologists and consists of:

  • The EMBL entry description lines
  • EMBL CC lines
  • Bibliographic reference titles
  • Full CDS feature text
  • Full text of other relevant features within the CDS range
  • Number of CDS in the EMBL entry
  • The date of the last entry update
  • Information if the organism already exists in Swiss-Prot

 

Figure 3: First level TREMBL entries (after translation and entry creation, sequence not shown)


 ID   G34313  PRELIMINARY;   PRT;  332 AA.
 AC   X02152_1;
 DT   23-DEC-1996 (EMBLREL. 49, CREATED)
 DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
 DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
 DE   LACTATE DEHYDROGENASE.
 OS   HOMO SAPIENS (HUMAN).
 OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; 
 OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
 RN   [1]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 85127030.
 RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
 RL   EUR. J. BIOCHEM. 147:9-15(1985).
 DR   EMBL; X02152; G34313; -.
 SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//


 ID   G780261  PRELIMINARY;   PRT;  332 AA.
 AC   X03077_1;
 DT   23-DEC-1996 (EMBLREL. 49, CREATED)
 DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
 DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
 DE   LACTATE DEHYDROGENASE.
 OS   HOMO SAPIENS (HUMAN).
 OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; 
 OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
 RN   [1]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 86076881.
 RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
 RA   LI S.S.-L.;
 RL   BIOCHEM. J. 231:537-541(1985).
 DR   EMBL; X03077; G780261; -.
 DR   EMBL; X03078; G780261; JOINED.
 DR   EMBL; X03079; G780261; JOINED.
 DR   EMBL; X03080; G780261; JOINED.
 DR   EMBL; X03081; G780261; JOINED.
 DR   EMBL; X03082; G780261; JOINED.
 DR   EMBL; X03083; G780261; JOINED.
 SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
 

 

Sorting the Entries

In the process of building TREMBL, different types of entries are put into different output files:

  • CDS with a /dbxref="Swiss-Prot" or a /dbxref="SPTREMBL" are not translated (already in Swiss-Prot + TREMBL)
  • CDS from mhc genes -> mhc.dat
  • CDS from patent data -> patent.dat
  • CDS from immunoglobulins and t-cell receptors -> immuno.dat
  • CDS smaller than 8 amino acids -> smalls.dat
  • CDS from artificial, synthetic or chimeric genes -> synthetic.dat
  • CDS from pseudogenes -> pseudo.dat
  • remaining CDS -> stay in their relative taxonomic TREMBL divisions
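
The sorting step above amounts to an ordered list of rules, where the first matching rule decides the output file. The following Python sketch makes that explicit. The record layout (keys `dbxref`, `category`, `length`, `division`), the category labels and the assumption that the rules are applied in the order listed are all illustrative; the text does not specify how overlapping cases are resolved.

```python
# (predicate, output file) pairs, in the order given in the list above.
SORT_RULES = [
    (lambda c: c["category"] == "mhc", "mhc.dat"),
    (lambda c: c["category"] == "patent", "patent.dat"),
    (lambda c: c["category"] in ("immunoglobulin", "t-cell receptor"), "immuno.dat"),
    (lambda c: c["length"] < 8, "smalls.dat"),
    (lambda c: c["category"] in ("artificial", "synthetic", "chimeric"), "synthetic.dat"),
    (lambda c: c["category"] == "pseudogene", "pseudo.dat"),
]

def route_cds(cds):
    """Return the output file name for one CDS, or None if it is not translated."""
    if set(cds["dbxref"]) & {"Swiss-Prot", "SPTREMBL"}:
        return None  # already in Swiss-Prot + TREMBL
    for matches, filename in SORT_RULES:
        if matches(cds):
            return filename
    # Remaining CDS stay in their taxonomic division, e.g. "hum" -> "hum.dat".
    return cds["division"] + ".dat"

example = {"dbxref": [], "category": None, "length": 332, "division": "hum"}
print(route_cds(example))
```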

At this stage the entries from the composite divisions of the EMBL database (STS, EST, and UNC) are added to their relative taxonomic TREMBL divisions.
Then all files are searched for entries that have recently been added to Swiss-Prot but which do not yet have a /dbxref="Swiss-Prot" qualifier in EMBL. These entries are removed and TREMBL is split into two sections: SP-TREMBL (Swiss-Prot TREMBL), which contains entries that will be added to Swiss-Prot after complete annotation, and REM-TREMBL (REMaining TREMBL), which contains entries not intended for inclusion in Swiss-Prot. REM-TREMBL consists of 5 files (patent.dat, immuno.dat, smalls.dat, synthetic.dat, and pseudo.dat). SP-TREMBL consists of 13 files (fun.dat, inv.dat, hum.dat, mam.dat, mhc.dat, org.dat, phg.dat, pln.dat, pro.dat, rod.dat, unc.dat, vrl.dat and vrt.dat), which will undergo further post-processing.


Post-processing the SP-TREMBL Entries

To post-process the SP-TREMBL entries, a collection of shell scripts and C and Perl programs is used. The first step is the reduction of redundancy. All full-length proteins in SP-TREMBL with the same sequence are merged into one entry. All fragment proteins with the same sequence from the same organism are merged, provided they do not belong to a highly variable category of proteins such as MHC proteins or viral proteins. For all Swiss-Prot entries, the CRC32 checksums of all the different annotated sequence reports are calculated and compared with the checksums of all SP-TREMBL entries. Identified matches are removed from SP-TREMBL and integrated into the corresponding Swiss-Prot entries. Figure 4 shows an example of an automatically merged TREMBL entry, created by merging the two TREMBL entries shown in Figure 3. Merging sub-fragments with full-length sequences, and merging conflicting sequence reports about the same sequence, further reduces the redundancy. Although these merging operations are automated, all merged entries are finally checked by biologists to avoid merging sequences from two different but highly similar genes into one entry.
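
The checksum comparison can be illustrated with a short Python sketch. It uses `zlib.crc32` as a stand-in checksum; we make no claim that this reproduces the CRC32 values printed in the SQ lines of the figures. The entry layout (`id`, `sequence`, `refs`, `xrefs`) and the toy sequences are assumptions, and the sketch covers only the simplest case, the merging of entries with exactly identical sequences.

```python
import zlib
from collections import defaultdict

def checksum(sequence):
    """CRC32 of a sequence; a cheap key for finding exact duplicates."""
    return zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF

def merge_identical(entries):
    """Merge entries that share exactly the same sequence."""
    buckets = defaultdict(list)
    for e in entries:
        buckets[checksum(e["sequence"])].append(e)
    merged = []
    for group in buckets.values():
        # Different sequences can share a checksum, so compare the strings too.
        by_seq = defaultdict(list)
        for e in group:
            by_seq[e["sequence"]].append(e)
        for seq, same in by_seq.items():
            merged.append({
                "id": same[0]["id"],   # keep the ID of the first report
                "sequence": seq,
                "refs": [r for e in same for r in e["refs"]],
                "xrefs": [x for e in same for x in e["xrefs"]],
            })
    return merged

entries = [
    {"id": "G34313", "sequence": "MATLKDQL", "refs": ["85127030"], "xrefs": ["X02152"]},
    {"id": "G780261", "sequence": "MATLKDQL", "refs": ["86076881"], "xrefs": ["X03077"]},
    {"id": "G000001", "sequence": "MSTKEKLI", "refs": ["00000000"], "xrefs": ["X00000"]},
]
for e in merge_identical(entries):
    print(e["id"], e["refs"], e["xrefs"])
```

Bucketing by checksum first keeps the number of full string comparisons small, which matters when every SP-TREMBL entry has to be compared against every annotated sequence report in Swiss-Prot.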

 

Figure 4: Second level TREMBL entry (after merging, sequence not shown)


 ID   G34313  PRELIMINARY;   PRT;  332 AA.
 AC   X02152_1;
 DT   23-DEC-1996 (EMBLREL. 49, CREATED)
 DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
 DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
 DE   LACTATE DEHYDROGENASE.
 OS   HOMO SAPIENS (HUMAN).
 OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; 
 OC   TETRAPODA; MAMMALIA;
 OC   EUTHERIA; PRIMATES.
 RN   [1]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 85127030.
 RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
 RL   EUR. J. BIOCHEM. 147:9-15(1985).
 RN   [2]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 86076881.
 RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
 RA   LI S.S.-L.;
 RL   BIOCHEM. J. 231:537-541(1985).
 DR   EMBL; X02152; G34313; -.
 DR   EMBL; X03077; G780261; -.
 DR   EMBL; X03078; G780261; JOINED.
 DR   EMBL; X03079; G780261; JOINED.
 DR   EMBL; X03080; G780261; JOINED.
 DR   EMBL; X03081; G780261; JOINED.
 DR   EMBL; X03082; G780261; JOINED.
 DR   EMBL; X03083; G780261; JOINED.
 SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//


The redundancy removal is done in collaboration with Jean-Jacques Codani from INRIA, France. His group developed LASSAP (LArge Scale Sequence compArison Package), a programmable, high performance system designed to overcome current limitations of sequence comparison programs in order to fit the needs of large scale analysis [5]. LASSAP allows the use of several sequence comparison methods: BLAST, FASTA, dynamic programming with local and global similarity searches, string matching with or without errors and pattern matching with or without errors. We use LASSAP to identify sub-fragments to be merged with full-length sequences and to identify conflicting sequence reports about the same sequence. Identified matches are removed from SP-TREMBL and integrated into the corresponding Swiss-Prot or SP-TREMBL entries.
The second post-processing step is the information enhancing process. All SP-TREMBL entries are scanned for PROSITE patterns [6]. If a matching pattern is found, a three-step procedure is used to reduce the number of false positive hits.
Firstly, the taxonomic classification of the SP-TREMBL entry must be within the known taxonomic range of the PROSITE pattern. For instance, a match of an a-priori prokaryotic pattern against a human protein is regarded as false positive and filtered out.
Secondly, the significance of the PROSITE pattern match is checked. This is done by a second check of the SP-TREMBL sequence against a set of secondary patterns derived from the PROSITE pattern. These secondary patterns are computed with the eMotif algorithm [7]. The PROSITE database contains a list of all Swiss-Prot proteins that are true members of the relevant protein family. For each pattern, the true positive sequences are aligned and fed into eMotif, which computes a nearly optimal set of regular expressions based on statistical rather than biological evidence. We used a stringency of 10^-9, so that each eMotif pattern is expected to produce one false positive hit by chance in 10^9 matches.
Thirdly, in cases where a protein family is characterised by more than one PROSITE signature, all signatures must be found in the entry. For instance, bacterial rhodopsins have a signature for a conserved region in helix C and another signature for the retinal binding lysine. If an SP-TREMBL entry matches only the helix C pattern, but not the retinal-binding pattern, it will not be regarded as a bacterial rhodopsin.
The raw PROSITE hits and all results of the confirmation steps are stored in a hidden section of the SP-TREMBL entry, but only those hits that satisfy all confirmation conditions are made publicly visible in a DR PROSITE line.
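
The three confirmation steps can be summarised in a few lines of Python. This is a sketch under stated assumptions, not the production code: patterns are written as Python regular expressions rather than in PROSITE syntax, the patterns and sequences are toy examples and not real PROSITE signatures, and we assume that one matching secondary pattern is enough for step two, which the text does not specify.

```python
import re

def confirm_hit(entry, family):
    """Apply the three confirmation steps to one raw pattern hit."""
    # Step 1: the entry's lineage must fall within the pattern's taxonomic range.
    if not set(entry["lineage"]) & set(family["taxonomic_range"]):
        return False
    # Step 2: a secondary (eMotif-style) pattern must also match.
    if not any(re.search(p, entry["sequence"]) for p in family["secondary_patterns"]):
        return False
    # Step 3: every signature of the family must be found in the sequence.
    return all(re.search(p, entry["sequence"]) for p in family["signatures"])

# Toy family with two signatures (hypothetical patterns).
toy_family = {
    "taxonomic_range": ["ARCHAEA"],
    "signatures": [r"RY.DW", r"D..K"],
    "secondary_patterns": [r"RY[AG]DW"],
}

full_match = {"lineage": ["ARCHAEA"], "sequence": "AAARYADWTTTDGGKAA"}
one_signature = {"lineage": ["ARCHAEA"], "sequence": "AAARYADWTTTAAAA"}
wrong_taxon = {"lineage": ["EUKARYOTA"], "sequence": "AAARYADWTTTDGGKAA"}

for e in (full_match, one_signature, wrong_taxon):
    print(confirm_hit(e, toy_family))
```

Ordering the checks from cheapest to most expensive means most false positives are rejected before any additional pattern matching is done.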
Approximately 35% of all SP-TREMBL entries match a PROSITE signature, but only around 30% of all SP-TREMBL entries are true positive matches. Characterization based on PROSITE alone would therefore lead to 10-20% false positive assignments. The confirmation steps reduce the level of characterization by nearly a third, to 25%. At this stage, we achieve a level of less than 0.07% false positive assignments.
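
As a rough consistency check, the percentages above can be read as fractions of all SP-TREMBL entries. This reading is our interpretation of the figures rather than a separately measured result. The first ratio is the share of raw hits that are false positives, and the second is the relative loss of characterised entries after confirmation:

```latex
\[
\frac{35\% - 30\%}{35\%} \approx 14\%
\qquad\text{and}\qquad
\frac{35\% - 25\%}{35\%} \approx 29\%
\]
```

The first value lies within the stated 10-20% range, and the second corresponds to the reduction "by nearly a third".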
Whenever an SP-TREMBL entry is recognised by our procedures as a true member of a certain protein family, annotation about the potential function, active sites, cofactors, binding sites, domains and subcellular location is added to the entry. The main source of this annotation is the annotation that is common to all Swiss-Prot entries of the relevant protein family. Other sources include manual descriptions of protein families and translations of trustworthy description libraries into Swiss-Prot wording. For example, there is a '/SITE=9,heme_iron' description for the cytochrome_b_heme pattern in PROSITE. This is translated to the correct Swiss-Prot syntax:

'FT METAL nn nn IRON (HEME AXIAL LIGAND) (BY SIMILARITY).'

In other words, for every protein family, a "virtual Swiss-Prot entry" is created computationally, which is based on the specific annotation valid for all Swiss-Prot members of this family. If we are sure that a new SP-TREMBL protein belongs to a certain family, we can immediately transfer the annotation of the virtual entry for this family.
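
One simple way to picture a virtual entry is as the intersection of the annotation of all family members. The Python sketch below assumes that each member is represented as a dict from line code to a set of annotation strings; this representation and the function name `virtual_entry` are illustrative only, and the real system also draws on manual family descriptions, which are not modelled here.

```python
def virtual_entry(members):
    """Return the annotation shared by every member of a protein family.

    members: list of dicts mapping a line code ("KW", "CC", ...) to a set
    of annotation strings.
    """
    if not members:
        return {}
    common = {}
    # Only line codes present in every member can carry shared annotation.
    for kind in set.intersection(*(set(m) for m in members)):
        shared = set.intersection(*(m[kind] for m in members))
        if shared:
            common[kind] = shared
    return common

# Toy family of two annotated members (annotation strings are illustrative).
members = [
    {"KW": {"OXIDOREDUCTASE", "NAD", "GLYCOLYSIS", "POLYMORPHISM"},
     "CC": {"SUBUNIT: HOMOTETRAMER"}},
    {"KW": {"OXIDOREDUCTASE", "NAD", "GLYCOLYSIS"},
     "CC": {"SUBUNIT: HOMOTETRAMER"},
     "FT": {"ACT_SITE"}},
]
print(virtual_entry(members))
```

Annotation that is specific to one member, such as POLYMORPHISM in the toy data, drops out of the intersection and is therefore never transferred to a new family member.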
The "virtual Swiss-Prot entries" have a far-reaching effect on SP-TREMBL. For example, the virtual entry for Rubisco affects 2033 SP-TREMBL entries. Therefore we developed a system to decompose these virtual entries into rules, which are stored in a relational database with proper version control features.
This rule-based system enables us to express the membership criteria for each protein family in a formal language. Furthermore, subfamilies have been introduced to meet the Swiss-Prot standard more closely. For example, the ribosomal protein L1 family contains eukaryotes as well as prokaryotes. But the annotation added to SP-TREMBL entries of this family obviously depends on the taxonomic kingdom. The description reads '50S RIBOSOMAL PROTEIN L1' for prokaryotes, archaebacteria, chloroplasts, and cyanelles, and '60S RIBOSOMAL PROTEIN L10A' for non-chloroplast encoded proteins of eukaryotes.
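
The ribosomal protein L1 example can be written as two rules that share a family but differ in their taxonomic condition. The sketch below is in Python and purely illustrative: the production rules are stored in a relational database, and the field names `kingdom` and `organelle` as well as the rule layout are assumptions made for this example.

```python
# (family, condition on the entry, description to assign); first match wins.
FAMILY_RULES = [
    ("ribosomal protein L1",
     lambda e: e["kingdom"] in ("prokaryote", "archaebacteria")
     or e["organelle"] in ("chloroplast", "cyanelle"),
     "50S RIBOSOMAL PROTEIN L1"),
    ("ribosomal protein L1",
     lambda e: e["kingdom"] == "eukaryote",
     "60S RIBOSOMAL PROTEIN L10A"),
]

def describe(entry, family):
    """Return the DE text for an entry confirmed as a member of `family`."""
    for fam, condition, description in FAMILY_RULES:
        if fam == family and condition(entry):
            return description
    return None  # no rule applies, so nothing is transferred

nuclear = {"kingdom": "eukaryote", "organelle": None}
plastid = {"kingdom": "eukaryote", "organelle": "chloroplast"}
print(describe(nuclear, "ribosomal protein L1"))
print(describe(plastid, "ribosomal protein L1"))
```

Because the organelle rule is tested first, a chloroplast-encoded protein of a eukaryote receives the 50S description, and only the remaining eukaryotic proteins receive the 60S description.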
We also use the ENZYME database, using the EC number as a reference point, to generate standardised description lines for enzyme entries and to allow information such as catalytic activity, cofactors and relevant keywords to be taken from ENZYME and added automatically to SP-TREMBL entries [8]. Furthermore, we use specialised databases like FlyBase and MGD to transfer information such as the correct gene nomenclature and cross-references to these databases into SP-TREMBL entries [9-10]. The automatic analysis and annotation of TREMBL entries is redone and updated with every TREMBL release.
The fully post-processed TREMBL entry for the example used above is shown in Figure 5. Although this computer-generated annotation already adds considerably to the information about the sequence, it is still a long way from the quality of the corresponding Swiss-Prot entry (shown in Figure 1), which was fully annotated by biologists.


Figure 5: Third level TREMBL entry (after complete post-processing, sequence not shown)


 ID   P00338      PRELIMINARY;   PRT;   332 AA.
 AC   P00338;
 DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
 DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
 DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
 DE   L-LACTATE DEHYDROGENASE (EC 1.1.1.27).
 OS   HOMO SAPIENS (HUMAN).
 OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; 
 OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
 RN   [1]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 85127030.
 RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
 RL   EUR. J. BIOCHEM. 147:9-15(1985).
 RN   [2]
 RP   SEQUENCE FROM N.A.
 RX   MEDLINE; 86076881.
 RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
 RA   LI S.S.-L.;
 RL   BIOCHEM. J. 231:537-541(1985).
 CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE +
 CC        NADH.
 CC   -!- SUBUNIT: HOMOTETRAMER (BY SIMILARITY).
 CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
 DR   EMBL; X02152; G34313; -.
 DR   EMBL; X03077; G780261; -.
 DR   EMBL; X03078; G780261; JOINED.
 DR   EMBL; X03079; G780261; JOINED.
 DR   EMBL; X03080; G780261; JOINED.
 DR   EMBL; X03081; G780261; JOINED.
 DR   EMBL; X03082; G780261; JOINED.
 DR   EMBL; X03083; G780261; JOINED.
 DR   PROSITE; PS00064; L_LDH; 1.
 KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS.
 FT   ACT_SITE    193    193       BY SIMILARITY.
 SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
 

 

3. THE CURRENT STATUS OF Swiss-Prot + TREMBL

In February 1998, Swiss-Prot contained 71,000 sequence entries comprising more than 25,000,000 amino acids, and is supplemented by TREMBL release 5. The corresponding EMBL release contained 290,000 CDS. 100,000 of these were already present as sequence reports in Swiss-Prot and were removed from TREMBL. The remaining CDS were merged whenever possible to reduce redundancy, and the resulting 166,000 entries were automatically annotated and distributed as TREMBL release 5. Most of the sequence entries currently in TREMBL are additional sequence reports of entries already in Swiss-Prot and will lead to updates of those Swiss-Prot entries. However, some 60,000 to 70,000 entries now in TREMBL will eventually be included as new sequence entries in Swiss-Prot. Approximately 30% of the SP-TREMBL entries have been post-processed.
Swiss-Prot + TREMBL are currently cross-referenced by 470,000 verified links to 28 other databases. The sequences and annotation of Swiss-Prot + TREMBL entries are constantly updated. The doubling time of the database is now less than 18 months. This underlines the fact that the ever-increasing automation of SP-TREMBL annotation methods is the only viable long-term approach to the constantly increasing data flow. Swiss-Prot + TREMBL represent the most complete and up-to-date protein sequence database with the lowest degree of redundancy and the highest standard of annotation publicly available today. However, to cope with the flood of sequence and functional data, new techniques to accelerate sequence analysis, information acquisition and data integration into Swiss-Prot + TREMBL need to be developed.

 

4. THE FUTURE OF ANNOTATION IN TREMBL

Most of the sequence data nowadays is coming from genome projects and lacks biochemical evidence to provide hard data on the function of the protein. The prediction of functional information from primary sequence information is a comparative problem based on a set of general rules and relationships derived from the current set of known proteins. Sequence similarity searches, pattern and profile searches, and clustering of sequences are currently helping us to take advantage of the relationship between primary sequence and function in the annotation process. Modern sensitive database search algorithms find already characterised sequences similar to new sequences and enable us to annotate new sequences by analogy to these sequences. Secondary pattern and profile databases are used to enhance TREMBL entries by adding information about the potential functions of proteins, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and other annotation. We are automating the similarity and motif searches to accelerate the upgrading of TREMBL entries to Swiss-Prot standard. The annotation task, whether automated or carried out by database curators, can proceed far more quickly if large groups of related proteins, such as families of sequences sharing a similar motif, can be annotated together.
Central to our efforts to automate the annotation of protein sequences is EDITtoTREMBL (Environment for Distributed Information Transfer to TREMBL), a system that enables the investigation of different possibilities to share and deduce biological information (Figure 6). This new automated annotation environment is implemented in Java and facilitates communication between programs using Remote Method Invocation. EDITtoTREMBL allows us to distribute the annotation process over different machines and to integrate programs that are available on specific platforms only. We embedded software in this environment to automate and combine similarity searches, motif searches, special sequence analysis tools, and the transfer of verified information from related biomolecular databases. The central components of EDITtoTREMBL are the so-called Dispatchers and Analyzers. The Dispatcher is a program that allows a supervised information flow by distributing analysis tasks to different Analyzers and by combining their output. Both components take advantage of a rule-based system, in which rules are either created manually to represent biological knowledge or are the result of careful data-mining in Swiss-Prot, in order to predict in a standardised way the functional properties of TREMBL entries. The rule-based system consists of a growing number of rules and hierarchical classifications of the annotation content of Swiss-Prot entries, where all nodes in these hierarchical trees are linked to certain annotation. The rules consider the sequence analysis results to decide which node(s) in the classification tree(s) the query sequence is sufficiently similar to; this subsequently leads to the incorporation of the appropriate annotation (linked to the node) in the TREMBL entry. The incorporated annotation is flagged as annotation based on sequence analysis methods and will be redone whenever a method, or the annotation used as the basis for the automated annotation of this entry, changes.
The rule-based system ensures that information based on our automatic analysis is added to TREMBL entries only if we are convinced that the computer-generated annotation is correct in more than 99% of cases.
With this annotation concept for Swiss-Prot + TREMBL, we try to combine two strengths: the careful annotation done by human experts, who draw on biological knowledge, the relevant literature and thorough sequence analysis, and the power of automated sequence analysis and computer-generated annotation. Since the predicted annotation assignments and the assignments based on hard experimental evidence are clearly distinguishable, we can present highly reliable, although putative, functional predictions in TREMBL without lowering the high editorial standards of the Swiss-Prot entries.
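
The interplay of Dispatcher, Analyzers and the reliability requirement can be sketched as follows. EDITtoTREMBL itself is written in Java and uses Remote Method Invocation; the sketch below is a single-process Python illustration and does not reflect the actual EDITtoTREMBL interfaces. The class layout, the toy analysis functions, the motif and the reliability values are all assumptions.

```python
class Analyzer:
    """Wraps one analysis method that yields (annotation, reliability) pairs."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def analyze(self, sequence):
        return self.func(sequence)


class Dispatcher:
    """Sends a sequence to every Analyzer and combines what comes back."""

    def __init__(self, analyzers, min_reliability=0.99):
        self.analyzers = analyzers
        self.min_reliability = min_reliability

    def annotate(self, sequence):
        accepted = []
        for analyzer in self.analyzers:
            for annotation, reliability in analyzer.analyze(sequence):
                # Keep only annotation expected to be correct in >99% of cases,
                # and record which method produced it.
                if reliability > self.min_reliability:
                    accepted.append((annotation, analyzer.name))
        return accepted


def toy_motif_search(sequence):
    # Hypothetical motif and reliability, for illustration only.
    return [("KW   OXIDOREDUCTASE", 0.999)] if "GAGAVG" in sequence else []


def toy_weak_similarity(sequence):
    # A low-confidence prediction that the Dispatcher should discard.
    return [("KW   KINASE", 0.60)]


dispatcher = Dispatcher([
    Analyzer("toy_motif", toy_motif_search),
    Analyzer("toy_similarity", toy_weak_similarity),
])
print(dispatcher.annotate("MAGAGAVGMA"))
```

Recording the name of the Analyzer next to each accepted annotation mirrors the flagging described above: when a method changes, the annotation it produced can be found and recomputed.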

 

5. REFERENCES

[1] A. Bairoch, R. Apweiler, "The Swiss-Prot protein sequence data bank and its supplement TrEMBL in 1998", Nucleic Acids Research, Vol. 25, 1998, pp. 31-36.

[2] G. Stoesser, M.A. Moseley, J. Sleep, M. McGowran, M. Garcia-Pastor, P. Sterk, "The EMBL Nucleotide Sequence Database", Nucleic Acids Research, Vol. 25, 1998, pp. 7-13.

[3] D.A. Benson, M. Boguski, D.J. Lipman, J. Ostell, "GenBank", Nucleic Acids Research, Vol. 25, 1997, pp. 1-6.

[4] Y. Tateno, T. Gojobori, "DNA Data Bank of Japan in the age of information biology", Nucleic Acids Research, Vol. 25, 1997, pp. 14-17.

[5] E. Glemet, J.-J. Codani, "LASSAP, a Large Scale Sequence compArison Package", Computer Applications in the Biosciences, Vol. 13, 1997, pp. 137-143.

[6] A. Bairoch, P. Bucher, K. Hofmann, "The PROSITE database, its status in 1997", Nucleic Acids Research, Vol. 25, 1997, pp. 217-221.

[7] C.G. Nevill-Manning, K.S. Sethi, T.D. Wu, D.L. Brutlag, "Enumerating and ranking discrete motifs", Proc. Intelligent Systems for Molecular Biology 97, 1997.

[8] A. Bairoch, "The ENZYME data bank in 1995", Nucleic Acids Research, Vol. 24, 1996, pp. 221-222.

[9] FlyBase Consortium, "FlyBase: a Drosophila database", Nucleic Acids Research, Vol. 25, 1997, pp. 63-66.

[10] J.A. Blake, J.E. Richardson, M.T. Davisson, J.T. Eppig, the Mouse Genome Informatics Group, "The Mouse Genome Database (MGD). A comprehensive public resource of genetic, phenotypic and genomic data", Nucleic Acids Research, Vol. 25, 1997, pp. 85-91.

 

