Swiss-Prot AND ITS COMPUTER-ANNOTATED
SUPPLEMENT TrEMBL:
HOW TO PRODUCE HIGH QUALITY AUTOMATIC ANNOTATION
Rolf Apweiler, Claire O'Donovan, Maria Jesus Martin,
Wolfgang Fleischmann, Henning Hermjakob, Steffen Moeller,
Sergio Contrino, Vivien Junker
The EMBL Outstation - The European Bioinformatics
Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD,UK.
ABSTRACT
Swiss-Prot (http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt)
is a protein sequence database with a high level of
annotation and integration with other databases, and
a minimal level of redundancy [1].
The ongoing genome sequencing projects have dramatically
increased the number of known protein sequences. To
make the sequence information available as quickly as
possible, we introduced TREMBL (TRanslation of EMBL
nucleotide sequence database), a supplement to Swiss-Prot.
TREMBL consists of computer-annotated entries derived
from the translation of all coding sequences (CDS) in
the EMBL database, except for CDS already included in
Swiss-Prot. Swiss-Prot + TREMBL provides the scientific
community with a comprehensive non-redundant protein
sequence databank. However, there is a clear need for
new techniques to enhance the production of Swiss-Prot
+ TREMBL to cope with the flood of sequence and functional
data. To achieve this, we are currently developing new
methods to accelerate sequence analysis, information
acquisition and data integration. Central to this effort
will be EDITtoTREMBL (Environment for Distributed
Information Transfer to TREMBL), a system that enables
the investigation of different ways to share
and deduce biological information. EDITtoTREMBL analyzes
sequences by comparison with the biochemically characterised
and well-annotated entries in Swiss-Prot in order to predict
the functional properties of TREMBL entries in a
standardised way.
Keywords
Database, Sequence Analysis, Automation, Annotation,
Swiss-Prot, TrEMBL, Functional Information
1. INTRODUCTION
Swiss-Prot, established in 1986 and maintained
collaboratively since 1987 by the University of Geneva
and the EMBL Data Library (now the EMBL Outstation -
The European Bioinformatics Institute (EBI)), is the
most widely used protein sequence database because it
is distinguished from other sequence databases by
three essential criteria:
Minimal Redundancy
Many sequence databases contain, for a given
protein sequence, separate entries which correspond
to different literature reports. In Swiss-Prot, we try
as much as possible to merge all these data so as to
minimise the redundancy of the database. If conflicts
exist between various sequencing reports, they are indicated
in the feature table (FT) of the corresponding entry.
Integration with other Databases
It is important to provide the users of biomolecular
databases with a high degree of interoperability between
the three types of sequence-related databases (nucleic
acid sequences, protein sequences and protein tertiary
structures) as well as with specialised data collections.
Swiss-Prot is currently cross-referenced by more than
250,000 links with 28 different databases. Cross-references
are provided in the form of pointers to information
related to Swiss-Prot entries and found in data collections
other than Swiss-Prot.
Annotation
One of Swiss-Prot's leading concepts from the
very beginning was to provide not just a simple
collection of protein sequences but a critical
view of what is known or postulated about each of these
sequences. A sample entry is shown in Figure 1.
In Swiss-Prot, each sequence entry consists
of the sequence data, the citation information (bibliographical
references), the taxonomic data (description of the
biological source of the protein), and the annotation,
which describes the following items:
- Function(s) of the protein
- Post-translational modification(s), for example
carbohydrates, phosphorylation, acetylation, GPI-anchor,
etc.
- Domains and sites, for example calcium binding regions,
ATP-binding sites, zinc fingers, homeobox, kringle,
etc.
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the
protein
- Sequence conflicts, variants, etc.
In Swiss-Prot, annotation is mainly found in the comment
lines (CC), in the feature table (FT) and in the keyword
lines (KW). We use a controlled vocabulary whenever
possible; this approach permits the easy retrieval of
specific categories of data from the database.
We include as much annotation as possible in Swiss-Prot.
To obtain this information we use, in addition to the
publications that report new sequence data, review articles
to periodically update the annotations of families or
groups of proteins. We also make use of external experts,
who have been recruited to send us their comments and
updates concerning specific groups of proteins.
Figure 1. A sample entry from Swiss-Prot
ID   LDHM_HUMAN     STANDARD;      PRT;   331 AA.
AC   P00338;
DE   L-LACTATE DEHYDROGENASE M CHAIN (EC 1.1.1.27) (LDH-A).
GN   LDHA.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA;
OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
RA   LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
RN   [3]
RP   VARIANT CYS-314.
RX   MEDLINE; 93075246.
RA   SUDO K., MAEKAWA M., SHIOYA M., IKEDA K., TAKAHASHI N.,
RA   ISOGAI Y.,
RA   LI S.S.-L., KANNO T., MACHIDA K., TORIUMI J.;
RL   BIOCHEM. INT. 27:1051-1057(1992).
RN   [4]
RP   VARIANT GLU-221.
RX   MEDLINE; 94199831.
RA   MAEKAWA M., SUDO K., KOBAYASHI A., SUGIYAMA E.,
RA   LI S.S.-L., KANNO T.;
RL   CLIN. CHEM. 40:665-668(1994).
CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE +
CC       NADH.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
CC   -!- THERE ARE THREE TYPES OF LDH CHAINS: M (LDH-A)
CC       FOUND PREDOMINANTLY IN MUSCLE TISSUES, H (LDH-B)
CC       FOUND IN HEART MUSCLE AND X (LDH-C) WHICH IS
CC       PRESENT IN THE SPERMATOZOA OF MAMMALS, IN THE
CC       COLUMBIDAE OF BIRDS AND IN ACTINOPTERYGIAN FISH.
CC   -!- DISEASE: EXERTIONAL MYOGLOBINURIA IS DUE TO A
CC       DEFECT IN LDH-A.
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
DR   PIR; A00347; DEHULM.
DR   HSSP; P00344; 1LDB.
DR   AARHUS/GHENT-2DPAGE; 2207; NEPHGE.
DR   MIM; 150000; -.
DR   PROSITE; PS00064; L_LDH; 1.
KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS;
KW   MULTIGENE FAMILY; DISEASE MUTATION; POLYMORPHISM.
FT   INIT_MET      0      0
FT   ACT_SITE    192    192       ACCEPTS A PROTON DURING
FT                                CATALYSIS.
FT   VARIANT     221    221       K -> E.
FT   VARIANT     314    314       R -> C (IN LDHA DEFICIENCY).
SQ   SEQUENCE   331 AA;  36557 MW;  DF367487 CRC32;
//
The Challenge
Owing to the increased data flow from genome
projects to the sequence databases, we face a number
of challenges to our way of annotating the database. Maintaining
the high quality of sequence and annotation in Swiss-Prot
requires careful sequence analysis and detailed annotation
of every entry. This is the rate-limiting step in the
production of Swiss-Prot. While we do not wish to relax
the high editorial standards of Swiss-Prot, it is clear
that there is a limit to how much we can accelerate
the annotation procedures. On the other hand, it is
also vital that we make new sequences available as quickly
as possible. To address this concern, in 1996 we introduced
TREMBL (TRanslation of EMBL nucleotide sequence
database). TREMBL consists of computer-annotated entries
derived from the translation of all coding sequences
(CDS) in the EMBL database, except for CDS already included
in Swiss-Prot [2].
2. THE PRODUCTION OF TREMBL
Translation and Entry Creation
The production of TREMBL is illustrated in Figure 2.
All the EMBL nucleotide sequence database divisions
are scanned for CDS features, and these are translated
to produce TREMBL division files containing TREMBL entries
in Swiss-Prot format. The program that produces TREMBL
is written in C and provides the basis for a first-level
parsing of EMBL database entries. At this level, the
text data are fitted into structures such as ordered lists
of features or bibliographic references, and the coding
sequences are assembled and translated. Each CDS
leading to a correct translation results in one entry
whose ID is the PID of the CDS. In the next step the
structures are scanned to extract relevant data, to
filter them and finally to insert them, properly formatted,
into the TREMBL entry. Only bibliographic references
relevant to the given CDS are kept in the TREMBL entry.
This is achieved by scanning the RP (Reference Position)
lines of the EMBL entry and matching them with the CDS
position in the sequence. The RC (Reference Comment) line is
built by assigning the Swiss-Prot equivalent of the
following EMBL qualifiers:
"/plasmid", "PLASMID=",
"/strain", "STRAIN=",
"/isolate", "STRAIN=", (2nd choice)
"/cultivar", "STRAIN=CV. ",
"/tissue_type", "TISSUE=",
"/transposon", "TRANSPOSON=",
The description line (DE) comes from the /product qualifier
when present; otherwise we make use of the EMBL DE line
and the /gene and /note qualifiers. The EMBL DE line is
only considered if the EMBL entry contains only one
CDS, and it is stripped of non-pertinent information such
as the organism name or phrases like 'complete cds'.
The /gene qualifier is also used for the TREMBL GN line.
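The qualifier table given above for the RC line can be read as a small
priority-ordered lookup. The following Python sketch is only an
illustration, not the C program described in the text: the function name
build_rc_line, the dictionary input, the upper-casing of values and the
treatment of /cultivar as a further fallback for STRAIN= are all assumptions.

```python
# Qualifier-to-token table copied from the list in the text. Row order encodes
# priority, so /strain is tried before /isolate for the STRAIN= token.
QUALIFIER_TO_RC = [
    ("plasmid", "PLASMID="),
    ("strain", "STRAIN="),
    ("isolate", "STRAIN="),       # 2nd choice, used only if /strain is absent
    ("cultivar", "STRAIN=CV. "),  # assumption: treated as a further fallback
    ("tissue_type", "TISSUE="),
    ("transposon", "TRANSPOSON="),
]


def build_rc_line(qualifiers):
    """Return an RC line for a dict of EMBL qualifiers, or None if none apply."""
    tokens = {}
    for name, prefix in QUALIFIER_TO_RC:
        value = qualifiers.get(name)
        if not value:
            continue
        token = prefix.split("=")[0]
        # Keep the first match per token; later rows act as fallbacks.
        # Assumption: values are upper-cased to match the figures.
        tokens.setdefault(token, prefix + value.upper())
    if not tokens:
        return None
    return "RC   " + "; ".join(tokens.values()) + ";"
```

For example, an entry carrying both /strain and /isolate yields a single
STRAIN= token taken from /strain.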
At the moment, because the EMBL and Swiss-Prot taxonomies
are slightly different, we use equivalence tables to
assign OS and OC lines in the entries. Where no equivalent
is found, the EMBL OS and OC lines are kept. Fortunately,
in the near future, GenBank, EMBL, DDBJ and Swiss-Prot
are going to adopt a new common taxonomic scheme [3-4].
The EMBL keywords are included in the TREMBL entry,
but only when they match a subset of Swiss-Prot keywords
which have the same meaning. This occurs only in cases
where the EMBL entry has just one CDS so that no ambiguity
is possible. Some extra keywords derived from the features
and description lines are added.
A subset of Swiss-Prot features can be derived from
the EMBL entry features.
These are:
- SIGNAL from sig_peptide
- TRANSIT from transit_peptide
- CHAIN from mat_peptide
- VARIANT from allele, variation, misc_difference
and mutation
- CONFLICT from conflict
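The feature mapping listed above amounts to a lookup table. A minimal
Python sketch follows, assuming features arrive as (key, location) pairs;
that representation and the name derive_features are hypothetical, and the
conversion of nucleotide coordinates to protein coordinates is left out.

```python
# EMBL feature key -> Swiss-Prot feature key, as listed in the text.
EMBL_TO_SWISSPROT_FEATURE = {
    "sig_peptide": "SIGNAL",
    "transit_peptide": "TRANSIT",
    "mat_peptide": "CHAIN",
    "allele": "VARIANT",
    "variation": "VARIANT",
    "misc_difference": "VARIANT",
    "mutation": "VARIANT",
    "conflict": "CONFLICT",
}


def derive_features(embl_features):
    """Keep only features with a Swiss-Prot equivalent and rename their keys."""
    derived = []
    for key, location in embl_features:
        sp_key = EMBL_TO_SWISSPROT_FEATURE.get(key)
        if sp_key is not None:
            derived.append((sp_key, location))
    return derived
```

Features without an equivalent (for example an exon) are simply dropped
in this sketch.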
Two examples of TREMBL entries, created in the way
described above, are shown in Figure 3. In addition to
this information parsed into
TREMBL entries, data is put in the annotator's section
of the entry, which is not visible to the public. This
is used for further analysis both by programs and by
biologists and consists of:
- The EMBL entry description lines
- EMBL CC lines
- Bibliographic reference titles
- Full CDS feature text
- Full text of other relevant features within the
CDS range
- Number of CDS in the EMBL entry
- The date of the last entry update
- Whether the organism already exists in Swiss-Prot
Figure 3: First level TREMBL entries (after translation and entry
creation, sequence not shown)
ID   G34313     PRELIMINARY;      PRT;   332 AA.
AC   X02152_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA;
OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
DR   EMBL; X02152; G34313; -.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
ID   G780261     PRELIMINARY;      PRT;   332 AA.
AC   X03077_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA;
OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
RA   LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
Sorting the Entries
In the process of building TREMBL, different
types of entries are put into different output files:
- CDS with a /dbxref="Swiss-Prot" or a /dbxref="SPTREMBL"
qualifier are not translated (already in Swiss-Prot + TREMBL)
- CDS from MHC genes -> mhc.dat
- CDS from patent data -> patent.dat
- CDS from immunoglobulins and T-cell receptors ->
immuno.dat
- CDS smaller than 8 amino acids -> smalls.dat
- CDS from artificial, synthetic or chimeric genes
-> synthetic.dat
- CDS from pseudogenes -> pseudo.dat
- remaining CDS -> stay in their respective taxonomic
TREMBL divisions
At this stage the entries from the composite divisions
of the EMBL database (STS, EST, and UNC) are added to
their respective taxonomic TREMBL divisions.
Then all files are searched for entries that have recently
been added to Swiss-Prot but which do not yet have a
/dbxref="Swiss-Prot" qualifier in EMBL. These
entries are removed, and TREMBL is split into two
sections: SP-TREMBL (Swiss-Prot TREMBL), which contains
entries that will be added to Swiss-Prot after complete
annotation, and REM-TREMBL (REMaining TREMBL), which
contains entries not intended for inclusion in Swiss-Prot.
REM-TREMBL consists of 5 files (patent.dat, immuno.dat,
smalls.dat, synthetic.dat, and pseudo.dat). SP-TREMBL consists of
13 files (fun.dat, inv.dat, hum.dat, mam.dat, mhc.dat,
org.dat, phg.dat, pln.dat, pro.dat, rod.dat, unc.dat,
vrl.dat and vrt.dat), which will undergo further post-processing.
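The sorting rules of this section can be sketched as a single routing
function. In the Python sketch below the file names are those given in the
text, whereas the entry fields (dbxref, is_mhc, is_patent and so on) are
hypothetical flags that a parser would have to set beforehand, and testing
the rules in the order in which they are listed is an assumption.

```python
# The five REM-TREMBL files named in the text; all other files are SP-TREMBL.
REM_TREMBL_FILES = {"patent.dat", "immuno.dat", "smalls.dat",
                    "synthetic.dat", "pseudo.dat"}


def route_cds(cds):
    """Return the output file for one CDS, or None if it is not translated."""
    if cds.get("dbxref") in ("Swiss-Prot", "SPTREMBL"):
        return None  # already in Swiss-Prot + TREMBL
    if cds.get("is_mhc"):
        return "mhc.dat"
    if cds.get("is_patent"):
        return "patent.dat"
    if cds.get("is_immuno"):
        return "immuno.dat"
    if len(cds["translation"]) < 8:
        return "smalls.dat"
    if cds.get("is_synthetic"):
        return "synthetic.dat"
    if cds.get("is_pseudogene"):
        return "pseudo.dat"
    # Remaining CDS stay in their taxonomic division, e.g. "hum" -> hum.dat.
    return cds["division"] + ".dat"


def section_of(file_name):
    """Tell whether an output file belongs to REM-TREMBL or SP-TREMBL."""
    return "REM-TREMBL" if file_name in REM_TREMBL_FILES else "SP-TREMBL"
```

Note that mhc.dat is routed separately but still counts as an SP-TREMBL
file, in line with the file lists above.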
Post-processing the SP-TREMBL Entries
To post-process the SP-TREMBL entries, a collection
of shell scripts and C and Perl programs is used. The
first step is the reduction of redundancy. All full-length
proteins in SP-TREMBL with the same sequence are merged
into one entry. All fragment proteins with the same
sequence from the same organism are merged, provided
they do not belong to a highly variable category of
proteins such as MHC proteins or viral proteins. For all
Swiss-Prot entries, the CRC32 checksums of all the different
annotated sequence reports are calculated and compared
with the checksums of all SP-TREMBL entries. Identified
matches are removed from SP-TREMBL and integrated into
the corresponding Swiss-Prot entries. Figure 4 shows
an example of an automatically merged TREMBL
entry, created by merging the two TREMBL entries
shown in Figure 3. Merging sub-fragments
with full-length sequences, and merging conflicting sequence
reports about the same sequence, further reduces the
redundancy. Although these merging operations are automated,
all merged entries are finally checked by biologists
to avoid merging sequences from two different
but highly similar genes into one entry.
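The merging of identical sequences described above can be sketched as a
grouping step keyed on a checksum. The Python sketch below is an
illustration under stated assumptions: entries are plain dictionaries with
hypothetical fields (id, sequence, organism, fragment, highly_variable,
references, xrefs), and zlib.crc32 stands in for the CRC32 used in
Swiss-Prot, so it may not reproduce the checksum values shown in the figures.

```python
import zlib
from collections import defaultdict


def crc32_hex(sequence):
    """Checksum of a sequence as 8 hex digits (zlib.crc32 as a stand-in)."""
    return format(zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF, "08X")


def merge_key(entry):
    """Entries with equal keys are merged; the sequence itself is part of
    the key, so a checksum collision cannot merge different proteins."""
    seq = entry["sequence"]
    if not entry.get("fragment", False):
        return ("full", crc32_hex(seq), seq)
    if entry.get("highly_variable", False):
        return ("unmerged", entry["id"])  # e.g. MHC or viral fragments
    return ("fragment", crc32_hex(seq), seq, entry["organism"])


def merge_identical(entries):
    """Merge entries with equal keys, pooling references and cross-references."""
    groups = defaultdict(list)
    for entry in entries:
        groups[merge_key(entry)].append(entry)
    merged = []
    for members in groups.values():
        merged.append({
            "ids": [m["id"] for m in members],
            "sequence": members[0]["sequence"],
            # dict.fromkeys removes duplicates while keeping first-seen order.
            "references": list(dict.fromkeys(
                r for m in members for r in m["references"])),
            "xrefs": list(dict.fromkeys(
                x for m in members for x in m["xrefs"])),
        })
    return merged
```

In this sketch the output is only a candidate merge; the manual check by
biologists mentioned in the text is not modelled.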
Figure 4: Second level TREMBL entry (after merging, sequence not shown)
ID   G34313     PRELIMINARY;      PRT;   332 AA.
AC   X02152_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA;
OC   TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
RA   LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
The redundancy removal is done in collaboration
with Jean-Jacques Codani from INRIA, France. His group
developed LASSAP (LArge Scale Sequence compArison Package),
a programmable, high performance system designed to
overcome current limitations of sequence comparison
programs in order to fit the needs of large scale analysis
[5]. LASSAP allows the use of several
sequence comparison methods: BLAST, FASTA, dynamic programming
with local and global similarity searches, string matching
with or without errors and pattern matching with or
without errors. We use LASSAP to identify sub-fragments
to be merged with full-length sequences and to identify
conflicting sequence reports about the same sequence.
Identified matches are removed from SP-TREMBL and integrated
into the corresponding Swiss-Prot or SP-TREMBL entries.
The second post-processing step is the information-enhancing
process. All SP-TREMBL entries are scanned for PROSITE
patterns [6]. If a matching pattern
is found, a three-step procedure is used to reduce the
number of false positive hits.
Firstly, the taxonomic classification of the SP-TREMBL
entry must be within the known taxonomic range of the
PROSITE pattern. For instance, a match of an exclusively
prokaryotic pattern against a human protein is regarded
as a false positive and filtered out.
Secondly, the significance of the PROSITE pattern match
is checked. This is done by a second check of the SP-TREMBL
sequence against a set of secondary patterns derived from
the PROSITE pattern. These secondary patterns are computed
with the eMotif algorithm [7]. The
PROSITE database contains a list of all Swiss-Prot proteins
that are true members of the relevant protein family.
For each pattern, the true positive sequences are aligned
and fed into eMotif, which computes a nearly optimal
set of regular expressions based on statistical rather
than biological evidence. We used a stringency of 10^-9,
so that each eMotif pattern is expected to produce, by
chance, one false positive hit in 10^9 matches.
Thirdly, in cases where a protein family is characterised
by more than one PROSITE signature, all signatures must
be found in the entry. For instance, bacterial rhodopsins
have a signature for a conserved region in helix C and
another signature for the retinal binding lysine. If
a SP-TREMBL entry matches only the helix-C-pattern,
but not the retinal-binding pattern, it will not be
regarded as a bacterial rhodopsin.
The raw PROSITE hits and all results of the confirmation
steps are stored in a hidden section of the SP-TREMBL
entry, but only those hits that satisfy all confirmation
conditions are made publicly visible in a DR PROSITE
line.
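The three confirmation steps can be combined into one predicate. The
Python sketch below is illustrative only: the family record layout, the
use of Python regular expressions in place of PROSITE and eMotif syntax,
and the rule that a hit is confirmed when at least one secondary pattern
matches are assumptions, and any patterns passed in are toy data rather
than real PROSITE signatures.

```python
import re


def confirm_family(sequence, kingdom, family):
    """Apply the three confirmation steps to one sequence.

    family is a hypothetical record with
      "taxonomic_range": set of kingdoms in which the family is known,
      "signatures": list of {"prosite": regex, "secondary": [regex, ...]}.
    """
    raw_hits = [s for s in family["signatures"]
                if re.search(s["prosite"], sequence)]
    if not raw_hits:
        return False
    # Step 1: the entry must lie within the taxonomic range of the pattern.
    if kingdom not in family["taxonomic_range"]:
        return False
    # Step 2: each raw hit must be backed by a secondary (eMotif-like) pattern.
    for signature in raw_hits:
        if not any(re.search(p, sequence) for p in signature["secondary"]):
            return False
    # Step 3: every signature of the family must have been found.
    return len(raw_hits) == len(family["signatures"])
```

A production system would also record which step rejected a raw hit, since
the text states that all intermediate results are kept in a hidden section.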
Approximately 35% of all SP-TREMBL entries match
a PROSITE signature, but only around 30% of all SP-TREMBL
entries are true positive matches. Characterisation
based only on PROSITE would therefore lead to 10-20% false
positive assignments. The confirmation steps reduce
the fraction of characterised entries by nearly a third,
to 25%. At this stage, less than 0.07% of the
assignments are false positives.
Whenever an SP-TREMBL entry is recognised by our procedures
as a true member of a certain protein family, annotation
about the potential function, active sites, cofactors,
binding sites, domains and subcellular locations is added
to the entry. The main source of this annotation is
the annotation that is common to all Swiss-Prot
entries of the relevant protein family. Other sources
include manual descriptions of protein families and
translations of trustworthy description libraries into
Swiss-Prot wording. For example, there is a '/SITE=9,heme_iron'
description for the cytochrome_b_heme pattern in PROSITE.
This is translated to the correct Swiss-Prot syntax
'FT METAL nn nn IRON (HEME AXIAL LIGAND) (BY SIMILARITY).'
In other words, for every protein family, a "virtual
Swiss-Prot entry" is created computationally, which
is based on the specific annotation valid for all Swiss-Prot
members of this family. If we are sure that a new SP-TREMBL
protein belongs to a certain family, we can immediately
transfer the annotation of the virtual entry for this
family.
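The idea of a "virtual Swiss-Prot entry" can be illustrated as a set
intersection over the annotation of all family members. The Python sketch
below assumes that each member is represented as a mapping from an
annotation category (such as KW or SUBUNIT) to a set of statements; this
representation and the function name build_virtual_entry are hypothetical.

```python
def build_virtual_entry(members):
    """Return the annotation shared by every member of a protein family.

    members: list of dicts mapping an annotation category to a set of
    statements. Categories with no shared statement are omitted.
    """
    if not members:
        return {}
    virtual = {}
    for category, statements in members[0].items():
        common = set(statements)
        for member in members[1:]:
            # A category missing from a member contributes an empty set.
            common &= set(member.get(category, ()))
        if common:
            virtual[category] = common
    return virtual
```

With two members that share the keywords OXIDOREDUCTASE, NAD and
GLYCOLYSIS, but of which only one carries POLYMORPHISM, the virtual entry
keeps the three shared keywords and drops POLYMORPHISM.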
The "virtual Swiss-Prot entries" have a far-reaching
effect on SP-TREMBL. For example, the virtual entry
for Rubisco affects 2033 SP-TREMBL entries. Therefore
we developed a system to decompose these virtual entries
into rules, which are stored in a relational database
with proper version control features.
This rule-based system enables us to express the membership
criteria for each protein family in a formal language.
Furthermore, subfamilies have been introduced to meet
the Swiss-Prot standard more closely. For example, the
ribosomal protein L1 family contains eukaryotic as well
as prokaryotic members, but the annotation added to SP-TREMBL
entries of this family depends on the taxonomic
kingdom. The description reads '50S RIBOSOMAL PROTEIN
L1' for prokaryotes, archaebacteria, chloroplasts, and
cyanelles, and '60S RIBOSOMAL PROTEIN L10A' for non-chloroplast
encoded proteins of eukaryotes.
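A subfamily rule of this kind can be written as a condition paired with
the annotation it triggers. The Python sketch below encodes only the
ribosomal protein L1 example from the text; the entry fields kingdom and
organelle, the family label ribosomal_L1 and the first-match-wins ordering
are assumptions and do not reflect the formal rule language itself.

```python
def is_prokaryotic_or_plastid(entry):
    """Prokaryotes, archaebacteria, chloroplasts and cyanelles."""
    return (entry["kingdom"] in {"prokaryote", "archaebacteria"}
            or entry.get("organelle") in {"chloroplast", "cyanelle"})


def is_non_chloroplast_eukaryote(entry):
    """Non-chloroplast encoded proteins of eukaryotes."""
    return (entry["kingdom"] == "eukaryote"
            and entry.get("organelle") != "chloroplast")


# (family, condition, description); rules are tried in order.
RULES = [
    ("ribosomal_L1", is_prokaryotic_or_plastid, "50S RIBOSOMAL PROTEIN L1"),
    ("ribosomal_L1", is_non_chloroplast_eukaryote, "60S RIBOSOMAL PROTEIN L10A"),
]


def description_for(family, entry):
    """Return the description of the first matching rule, or None."""
    for rule_family, condition, description in RULES:
        if rule_family == family and condition(entry):
            return description
    return None
```

Keeping rules as data rather than code paths is what makes it practical to
store them in a relational database with version control, as described above.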
We also use the ENZYME database, with the EC number
as a reference point, to generate standardised description
lines for enzyme entries and to allow information such
as catalytic activity, cofactors and relevant keywords
to be taken from ENZYME and added automatically
to SP-TREMBL entries [8]. Furthermore,
we use specialised databases such as FlyBase and MGD to
transfer information such as the correct gene nomenclature
and cross-references to these databases into SP-TREMBL
entries [9-10]. The automatic analysis
and annotation of TREMBL entries is redone and updated
with every TREMBL release.
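The EC-number lookup can be sketched as follows. This Python fragment is
a hedged illustration: the in-memory ENZYME dictionary is a hypothetical
stand-in for a parsed ENZYME record, its single entry copies the values
shown for EC 1.1.1.27 in Figure 5, and the function name enzyme_lines is
invented for this example.

```python
# Hypothetical stand-in for one parsed ENZYME record; the values are those
# shown for EC 1.1.1.27 in Figure 5.
ENZYME = {
    "1.1.1.27": {
        "name": "L-LACTATE DEHYDROGENASE",
        "catalytic_activity": "L-LACTATE + NAD(+) = PYRUVATE + NADH",
    },
}


def enzyme_lines(ec_number, enzyme=ENZYME):
    """Return standardised DE and CC lines for an EC number, or []."""
    record = enzyme.get(ec_number)
    if record is None:
        return []
    return [
        "DE   {} (EC {}).".format(record["name"], ec_number),
        "CC   -!- CATALYTIC ACTIVITY: {}.".format(record["catalytic_activity"]),
    ]
```

An unknown EC number yields no lines, so that an entry is left unchanged
rather than annotated with a guess.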
The fully post-processed version of the TREMBL entry used
as an example above is shown in Figure 5.
Although this computer-generated annotation
already enhances the information about the sequence
considerably, it is still a long way from the quality of
the corresponding Swiss-Prot entry (shown in Figure
1), which is fully annotated by biologists.
Figure 5: Third level TREMBL entry (after complete post-processing,
sequence not shown)
ID   P00338     PRELIMINARY;      PRT;   332 AA.
AC   P00338;
DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
DE   L-LACTATE DEHYDROGENASE (EC 1.1.1.27).
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA;
OC   TETRAPODA; MAMMALIA; EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S.,
RA   LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE +
CC       NADH.
CC   -!- SUBUNIT: HOMOTETRAMER (BY SIMILARITY).
CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
DR   PROSITE; PS00064; L_LDH; 1.
KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS.
FT   ACT_SITE    193    193       BY SIMILARITY.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//
3. THE CURRENT STATUS OF Swiss-Prot + TREMBL
In February 1998, Swiss-Prot contained 71,000
sequence entries comprising more than 25,000,000 amino
acids, and was supplemented by TREMBL release 5. The
corresponding EMBL release contained 290,000 CDS. 100,000
of these were already present as sequence reports in
Swiss-Prot and were removed from TREMBL. The remaining CDS
were merged whenever possible to reduce redundancy, and
the resulting 166,000 entries were automatically annotated
and distributed as TREMBL release 5. Most of the sequence
entries currently in TREMBL are additional sequence
reports of entries already in Swiss-Prot and will lead
to updates of those Swiss-Prot entries. However, some
60,000 to 70,000 entries now in TREMBL will eventually
be included as new sequence entries in Swiss-Prot. Approximately
30% of the SP-TREMBL entries have been post-processed.
Swiss-Prot + TREMBL are currently cross-referenced by
470,000 verified links to 28 other databases. The sequences
and annotation of Swiss-Prot + TREMBL entries are constantly
updated. The doubling time of the database is now less
than 18 months. This underlines the fact that the ever-increasing
automation of SP-TREMBL annotation methods is the only
viable long-term approach to the constantly increasing
data flow. Swiss-Prot + TREMBL represent the most complete
and up-to-date protein sequence database with the lowest
degree of redundancy and the highest standard of annotation
publicly available today. However, to cope with the
flood of sequence and functional data, new techniques
to accelerate sequence analysis, information acquisition
and data integration into Swiss-Prot + TREMBL need to
be developed.
4. THE FUTURE OF ANNOTATION IN TREMBL
Most of the sequence data nowadays is coming
from genome projects and lacks biochemical evidence
to provide hard data on the function of the protein.
The prediction of functional information from primary
sequence information is a comparative problem based
on a set of general rules and relationships derived
from the current set of known proteins. Sequence similarity
searches, pattern and profile searches, and clustering
of sequences are currently helping us to take advantage
of the relationship between primary sequence and function
in the annotation process. Modern sensitive database
search algorithms find already characterised sequences
similar to new sequences and enable us to annotate new
sequences by analogy to these sequences. Secondary pattern
and profile databases are used to enhance TREMBL entries
by adding information about the potential functions
of proteins, metabolic pathways, active sites, cofactors,
binding sites, domains, subcellular location, and other
annotation. We are automating the similarity and motif
searches to accelerate the upgrading of TREMBL entries
to Swiss-Prot standard. The annotation task, whether
automated or carried out by database curators, can proceed
far more quickly if large groups of related proteins,
such as families of sequences sharing a similar motif,
can be annotated together.
Central to our efforts to automate the annotation of
protein sequences is EDITtoTREMBL (Environment for Distributed
Information Transfer to TREMBL), a system that enables
the investigation of different possibilities to share
and deduce biological information (Figure 6).
This new automated annotation environment is
implemented in Java and facilitates communication between
programs using Remote Method Invocation. EDITtoTREMBL
allows us to distribute the annotation process on different
machines and to integrate programs that are available
on specific platforms only. We embedded software in
this environment to automate and combine similarity
searches, motif searches, special sequence analysis
tools, and the transfer of verified information from
related biomolecular databases. The central components
of EDITtoTREMBL are the so-called Dispatchers and Analyzers.
The Dispatcher is a program that allows a supervised
information flow by distributing analysis tasks to different
Analyzers and by combining their output. Both components
take advantage of a rule-based system, in which rules
either are created manually to represent biological knowledge
or result from careful data-mining in Swiss-Prot,
to predict the functional properties of TREMBL entries
in a standardised way. The rule-based system consists of
a growing number of rules and hierarchical classifications
of the annotation content of Swiss-Prot entries, where
every node in these hierarchical trees is linked to
certain annotation. The rules use the sequence
analysis results to decide to which node(s) in the classification
tree(s) the query sequence is sufficiently similar;
the appropriate annotation (linked to the node) is then
incorporated into the TREMBL entry. The incorporated
annotation is flagged
as annotation based on sequence analysis methods and
will be redone whenever a method, or the annotation used
as the basis for the automated annotation of this entry,
changes. The rule-based system ensures that we add
automatically derived information to TREMBL
entries only if we are convinced that the computer-generated
annotation is correct in more than 99% of cases.
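The interplay of Dispatcher, Analyzers and rules can be illustrated with
a small in-process model. The sketch below is written in Python rather
than Java and replaces Remote Method Invocation with ordinary function
calls; the Rule fields, the way Analyzer output is combined (a set union
of findings) and the two toy Analyzers are assumptions made for
illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    node: str            # node in a classification tree
    required: frozenset  # findings that must all be present
    annotation: str      # annotation linked to the node
    precision: float     # estimated fraction of correct assignments


class Dispatcher:
    """Distributes a sequence to Analyzers and combines their output."""

    def __init__(self, analyzers, rules, min_precision=0.99):
        self.analyzers = list(analyzers)
        self.rules = list(rules)
        self.min_precision = min_precision

    def annotate(self, sequence):
        findings = set()
        for analyzer in self.analyzers:
            findings |= set(analyzer(sequence))
        # Apply only rules that are trusted in more than 99% of cases and
        # whose required evidence is fully present; flag the evidence type.
        return [
            {"node": rule.node, "annotation": rule.annotation,
             "evidence": "sequence analysis"}
            for rule in self.rules
            if rule.precision > self.min_precision
            and rule.required <= findings
        ]


def toy_motif_analyzer(sequence):
    """Toy Analyzer: reports a made-up motif, not a real PROSITE pattern."""
    return {"motif:toy"} if "GEHG" in sequence else set()


def toy_length_analyzer(sequence):
    """Toy Analyzer: reports sequences of at least 10 residues."""
    return {"length:long"} if len(sequence) >= 10 else set()
```

Because every returned annotation carries its evidence type, it can be
regenerated whenever an Analyzer or a rule changes, as the text requires.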
With this annotation concept for Swiss-Prot + TREMBL,
we try to combine two strengths: annotation carefully
done by human experts with biological knowledge, after
consultation of the relevant literature and thorough
sequence analysis, and the power of automated sequence
analysis and computer-generated annotation. Since
the predicted annotation assignments and the assignments
based on hard experimental evidence are clearly distinguishable,
we can present highly reliable, although putative,
functional predictions in TREMBL without lowering the
high editorial standards of the Swiss-Prot entries.
5. REFERENCES
[1] A. Bairoch, R. Apweiler,
"The Swiss-Prot protein sequence data bank and
its supplement TrEMBL in 1998", Nucleic Acids Research,
Vol. 25, 1998, pp. 31-36.
[2] G. Stoesser, M.A. Moseley, J.
Sleep, M. McGowran, M. Garcia-Pastor, P. Sterk, "The
EMBL Nucleotide Sequence Database", Nucleic Acids
Research, Vol. 25, 1998, pp. 7-13.
[3] D.A. Benson, M. Boguski, D.J.
Lipman, J. Ostell, "GenBank", Nucleic Acids
Research, Vol. 25, 1997, pp. 1-6.
[4] Y. Tateno, T. Gojobori, "DNA
Data Bank of Japan in the age of information biology",
Nucleic Acids Research, Vol. 25, 1997, pp. 14-17.
[5] E. Glemet, J.-J. Codani,
"LASSAP, a Large Scale Sequence compArison Package",
Computer Applications in the Biosciences, Vol. 13, 1997,
pp. 137-143.
[6] A. Bairoch, P. Bucher, K. Hofmann,
"The PROSITE database, its status in 1997",
Nucleic Acids Research, Vol. 25, 1997, pp. 217-221.
[7] C.G. Nevill-Manning, K.S. Sethi,
T.D. Wu, D.L. Brutlag, "Enumerating and ranking
discrete motifs", Proc. Intelligent Systems for
Molecular Biology 97, 1997.
[8] A. Bairoch, "The ENZYME
data bank in 1995", Nucleic Acids Research, Vol.
24, 1996, pp. 221-222.
[9] FlyBase Consortium, "FlyBase:
a Drosophila database", Nucleic Acids Research, Vol.
25, 1997, pp. 63-66.
[10] J.A. Blake, J.E. Richardson,
M.T. Davisson, J.T. Eppig, the Mouse Genome Informatics
Group, "The Mouse Genome Database (MGD). A comprehensive
public resource of genetic, phenotypic and genomic data",
Nucleic Acids Research, Vol. 25, 1997, pp. 85-91.