Mus musculus in the Swiss-Prot database: its
relevance to developmental research
Michele Magrane and Rolf Apweiler.
EMBL Outstation - European Bioinformatics
Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SD, United Kingdom.
Because Mus musculus is the target of a large-scale
sequencing and mapping effort, it has been selected as
a model organism in Swiss-Prot for priority annotation.
This means that we aim to be as complete as possible,
with new sequences and updates added quickly and
cross-references provided to the Mouse Genome Database.
Swiss-Prot mouse entries contain a large amount of
information, including data relating specifically to
developmental proteins, such as their function and the
stages of development at which they are expressed.
The Swiss-Prot database, therefore,
offers a number of features that are specifically useful
in the context of modern developmental research and
is a valuable resource for those working in the field.
The database can be accessed at http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt
and http://www.expasy.ch/sprot/sprot-top.html.
Swiss-Prot (Bairoch and Apweiler,
1999) is a curated protein sequence database which was
established in 1986 at the Department of Medical Biochemistry,
University of Geneva. It is maintained collaboratively
by the Swiss Institute of Bioinformatics (SIB) and the
EMBL Outstation - European Bioinformatics Institute.
The database distinguishes itself from other protein
sequence databases on the basis of three criteria: (i)
it provides a high level of annotation, (ii) it is
non-redundant, and (iii) it provides a high level of
integration with other databases. It is supplemented by
TrEMBL (Bairoch and Apweiler, 1999), which was created
in response to the increasing number of sequences to be
incorporated into Swiss-Prot from genome sequencing and
mapping projects.
TrEMBL is computer-annotated and contains translations
of all coding sequences in the EMBL Nucleotide Sequence
Database (Stoesser et al., 1999) that are not yet integrated
into Swiss-Prot. It is subdivided into two sections:
SP-TrEMBL, which contains entries that will eventually
be incorporated into Swiss-Prot, and REM-TrEMBL, which
contains sequences that will not. The latter include
immunoglobulins and T-cell receptors, synthetic sequences,
patent application sequences, fragments of fewer than 8
amino acids, and coding sequences where there is strong
evidence that the sequence does not code for a real protein.
Mus musculus is one of a number of model organisms
which have been selected in Swiss-Prot for priority annotation
because they are the target of large-scale sequencing and
mapping. For these model organisms, we aim to be as
complete as possible by adding new sequences and updates
quickly while still providing a high level of annotation.
We also provide cross-references to specialised databases
and maintain specific documents relating to the organisms.
In the case of the mouse, we provide cross-links from
Swiss-Prot and TrEMBL entries to the Mouse Genome Database
(MGD) (Blake et al., 1999) and we also maintain an index
of those entries which contain MGD cross-references
(Fig. 1). This index can
be found at http://www.expasy.ch/cgi-bin/lists?mgdtosp.txt.
The inclusion of MGD cross-references in Swiss-Prot
mouse entries is an ongoing collaboration between the
two databases and allows Swiss-Prot users to gain access
to additional information relating to a particular gene
such as mapping data. Cross-linking has also allowed
the standardization of gene nomenclature in the database
by the use of the official gene symbols assigned by
the International Committee on Standardised Gene Nomenclature
for Mice. The official gene names along with any synonyms
are stored in the 'GN' line of a Swiss-Prot entry. There
are 3549 mouse entries in the current Swiss-Prot release
(Release 38) with 3021 cross-references to MGD. There
are an additional 4423 mouse entries in TrEMBL (Release
11) with 1502 MGD cross-references.
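As a rough illustration of how the 'GN' line and the MGD cross-references described above might be read from an entry, the Python sketch below pulls both out of a list of flat-file lines. It is a minimal sketch under stated assumptions: that synonyms on the 'GN' line are separated by ' OR ', and that MGD cross-references appear on 'DR' lines beginning with 'MGD;'. The function names and the sample lines (including the gene symbols and the MGI number) are hypothetical placeholders and are not taken from a real entry.

```python
# Hypothetical flat-file lines; the symbols and accession numbers are
# placeholders, not data from a real Swiss-Prot entry.
SAMPLE_LINES = [
    "GN   ABC1 OR XYZ2.",
    "DR   EMBL; X00000; -; -.",
    "DR   MGD; MGI:0000000; Abc1.",
]


def gene_names(lines):
    """Return (official_symbol, synonyms) from the GN line(s).

    Assumes names are separated by ' OR ' and the list ends with a period.
    """
    text = " ".join(line[5:].strip() for line in lines if line.startswith("GN   "))
    names = [name.strip() for name in text.rstrip(".").split(" OR ") if name.strip()]
    if not names:
        return None, []
    return names[0], names[1:]


def mgd_cross_references(lines):
    """Return (accession, symbol) pairs from 'DR   MGD; ...' lines."""
    refs = []
    for line in lines:
        if line.startswith("DR   MGD;"):
            fields = [field.strip() for field in line[5:].rstrip(".").split(";")]
            if len(fields) >= 3:
                refs.append((fields[1], fields[2]))
    return refs


if __name__ == "__main__":
    print(gene_names(SAMPLE_LINES))
    print(mgd_cross_references(SAMPLE_LINES))
```

Counting the entries for which mgd_cross_references returns a non-empty list would give figures comparable to those quoted above, provided the layout assumptions hold for the release being read.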
Each Swiss-Prot entry (Fig. 2)
contains core data, i.e. sequence, citation and taxonomic
information, with additional information added
from a variety of sources such as scientific literature,
other databases, similar entries, prediction programs
and external experts. This additional information includes
protein function, post-translational modifications,
subcellular location, tissue specificity, domains and
sites, secondary and quaternary structure, similarities
to other proteins, diseases associated with a particular
protein, and polymorphisms.
Some of this information relates specifically to developmental
proteins. For example, many entries contain a description
of the function of the protein within the developmental
process, as in the example below which is taken from
endothelial PAS domain protein 1 (Swiss-Prot accession
number: P97481). This information is stored in the comment
or 'CC' lines of an entry and is prefixed by the token
'Function'.
CC   -!- FUNCTION: TRANSCRIPTION FACTOR INVOLVED IN THE INDUCTION OF OXYGEN
CC       REGULATED GENES. SPECIFICALLY RECOGNISES AN 8 BP HYPOXIA RESPONSE
CC       ELEMENT (HRE). REGULATES THE VASCULAR ENDOTHELIAL GROWTH FACTOR
CC       (VEGF) EXPRESSION AND SEEMS TO BE IMPLICATED IN THE DEVELOPMENT OF
CC       BLOOD VESSELS AND THE TUBULAR SYSTEM OF LUNG. MAY ALSO PLAY A ROLE
CC       IN THE FORMATION OF THE ENDOTHELIUM GIVING RISE TO THE BLOOD BRAIN
CC       BARRIER. POTENT ACTIVATOR OF THE TIE-2 TYROSINE KINASE EXPRESSION.
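Comment lines of this kind lend themselves to simple programmatic grouping by topic. The Python sketch below is one possible way to do this, assuming that each topic opens with a 'CC' line containing '-!-' followed by the topic name and a colon, and that the following 'CC' lines continue it. The function name parse_comments is hypothetical, and the sample reproduces only the first sentences of the FUNCTION comment quoted above.

```python
def parse_comments(lines):
    """Group 'CC' lines into a {topic: text} dictionary.

    A 'CC' line whose content starts with '-!-' opens a new topic; any
    other 'CC' line is treated as a continuation of the current topic.
    """
    comments = {}
    topic = None
    for line in lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            name, _, text = body[3:].partition(":")
            topic = name.strip()
            comments[topic] = text.strip()
        elif topic is not None:
            comments[topic] = (comments[topic] + " " + body).strip()
    return comments


# First sentences of the FUNCTION comment of P97481, as quoted in the text.
SAMPLE = [
    "CC   -!- FUNCTION: TRANSCRIPTION FACTOR INVOLVED IN THE INDUCTION OF OXYGEN",
    "CC       REGULATED GENES. SPECIFICALLY RECOGNISES AN 8 BP HYPOXIA RESPONSE",
    "CC       ELEMENT (HRE).",
]

if __name__ == "__main__":
    for topic, text in parse_comments(SAMPLE).items():
        print(topic, "->", text)
```

The sketch keeps only the last comment when a topic occurs more than once in an entry; collecting repeated topics in a list would be a straightforward extension.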
Many entries also describe the stage of development at
which, and the tissues in which, a protein is expressed,
as in the example below from transcription factor
BF-2 (Swiss-Prot accession number: Q61345). This information
is also stored in the comment or 'CC' lines, prefixed
by the tokens 'Tissue specificity' and 'Developmental
stage'.
CC   -!- TISSUE SPECIFICITY: PREDOMINANTLY EXPRESSED IN THE CNS AND
CC       TEMPORAL HALF OF THE RETINA. ALSO EXPRESSED IN THE CONDENSED HEAD
CC       MESENCHYME, METANEPHRIC BLASTEMA OF THE DEVELOPING KIDNEY, CORTEX
CC       OF THE ADRENAL GLAND, CONDENSED MESENCHYME AT THE BASE OF THE
CC       FOLLICLES OF VIBRASSAE, AND CARTILAGE PERICHONDRIUM OF THE
CC       DEVELOPING VERTEBRATE.
CC   -!- DEVELOPMENTAL STAGE: AT E9.5 EMBRYOS, EXPRESSED IN A LIMITED
CC       REGION OF THE NEUROEPITHELIUM AND ALSO IN THE TEMPORAL HALF OF THE
CC       PRIMARY OPTIC CUP AND THE OPTIC STALK. AT E10.5, SEEN IN THE
CC       HYPOTHALAMUS, TEMPORAL HALF OF THE OPTIC STALK, AND TEMPORAL
CC       HEMIRETINA. AT E12.5 AND E13.5 A HIGH EXPRESSION IS SEEN IN
CC       REGIONS OF CONDENSED MESENCHYME OF THE HEAD, AND AS NEUROEPITHELIAL
CC       CELLS BEGIN TO DIFFERENTIATE AND MIGRATE OUTWARD FROM THE VENTRICULAR
CC       ZONE, EXPRESSION DECLINES MARKEDLY. BY E16.5 LEVELS ARE DIMINISHED
CC       AND RESTRICTED TO UNFUSED POCKETS ALONG THE EXHAUSTED VENTRICULAR
CC       ZONE.
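Because the developmental stage comment is free text, any automatic use of it relies on pattern matching rather than on a controlled vocabulary. As a minimal sketch, the Python fragment below extracts the embryonic day markers from the text of the DEVELOPMENTAL STAGE comment quoted above, assuming stages are written in the form 'E' followed by a number (for example E9.5). The pattern and the function name embryonic_days are illustrative assumptions; entries that describe stages in other words would not be picked up.

```python
import re

# Matches tokens such as E9.5 or E12; an assumed convention, since the
# comment is free text and not a controlled vocabulary.
EMBRYONIC_DAY = re.compile(r"\bE(\d+(?:\.\d+)?)\b")

# Text of the DEVELOPMENTAL STAGE comment of Q61345, as quoted in the text.
STAGE_TEXT = (
    "AT E9.5 EMBRYOS, EXPRESSED IN A LIMITED REGION OF THE NEUROEPITHELIUM "
    "AND ALSO IN THE TEMPORAL HALF OF THE PRIMARY OPTIC CUP AND THE OPTIC "
    "STALK. AT E10.5, SEEN IN THE HYPOTHALAMUS, TEMPORAL HALF OF THE OPTIC "
    "STALK, AND TEMPORAL HEMIRETINA. AT E12.5 AND E13.5 A HIGH EXPRESSION IS "
    "SEEN IN REGIONS OF CONDENSED MESENCHYME OF THE HEAD, AND AS "
    "NEUROEPITHELIAL CELLS BEGIN TO DIFFERENTIATE AND MIGRATE OUTWARD FROM "
    "THE VENTRICULAR ZONE, EXPRESSION DECLINES MARKEDLY. BY E16.5 LEVELS ARE "
    "DIMINISHED AND RESTRICTED TO UNFUSED POCKETS ALONG THE EXHAUSTED "
    "VENTRICULAR ZONE."
)


def embryonic_days(text):
    """Return the distinct embryonic days mentioned in text, in ascending order."""
    return sorted({float(day) for day in EMBRYONIC_DAY.findall(text)})


if __name__ == "__main__":
    print(embryonic_days(STAGE_TEXT))
```

For the comment above, this yields the five stages E9.5, E10.5, E12.5, E13.5 and E16.5 as numbers, which could then be used, for instance, to select entries with expression data for a given day.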
There are also a number of keywords which
are specifically associated with proteins involved in
development. Keywords are contained in the 'KW' line
of a Swiss-Prot entry. These include 'Developmental
protein' which is a general keyword describing proteins
involved in the developmental process. There are also
a number of more specific keywords such as: 'Neurogenesis'
which is used for proteins involved in the formation
of the nervous system during embryonic development;
'Differentiation' which is used for proteins involved
in differentiation of the embryo; 'Homeobox' which is
used for those proteins which contain a homeobox domain;
'Gastrulation' which is used for proteins which play
a role during gastrulation; and 'Myogenesis' which is
used for proteins involved in the differentiation and
development of muscle. The complete list of the keywords
in use in Swiss-Prot can be found at http://www.expasy.ch/txt/keywlist.txt.
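The keywords listed above can be used to pick out development-related entries automatically. The Python sketch below shows one way this might be done, assuming that keywords on 'KW' lines are separated by semicolons and that the list ends with a period. The set contains only the keywords named in this article, the function names are hypothetical, and the sample 'KW' lines are an illustrative combination, not copied from a real entry.

```python
# Development-related keywords named in the text; the full keyword list
# in use in Swiss-Prot is considerably longer.
DEVELOPMENTAL_KEYWORDS = {
    "developmental protein",
    "neurogenesis",
    "differentiation",
    "homeobox",
    "gastrulation",
    "myogenesis",
}


def keywords(lines):
    """Return the keywords found on the 'KW' lines of an entry."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("KW   "))
    return [kw.strip() for kw in text.rstrip(".").split(";") if kw.strip()]


def developmental_keywords(lines):
    """Return only the keywords that belong to the development-related set."""
    return [kw for kw in keywords(lines) if kw.lower() in DEVELOPMENTAL_KEYWORDS]


# Illustrative 'KW' lines, not taken from a real entry.
SAMPLE = [
    "KW   Developmental protein; Homeobox; DNA-binding;",
    "KW   Nuclear protein.",
]

if __name__ == "__main__":
    print(keywords(SAMPLE))
    print(developmental_keywords(SAMPLE))
```

The comparison is case-insensitive so that the sketch does not depend on how a particular release capitalises its keywords.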
As can be seen, the Swiss-Prot database and its supplement,
TrEMBL, provide a useful and valuable resource in the
context of modern developmental research. The databases
can be accessed at http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt
and http://www.expasy.ch/sprot/sprot-top.html.
References:
Bairoch A, Apweiler R. 1999. The Swiss-Prot protein
sequence data bank and its supplement TrEMBL in 1999.
Nucl Acids Res 27:49-54.
Blake JA, Richardson JE, Davisson MT, Eppig JT and
the Mouse Genome Database Group. 1999. The Mouse Genome
Database (MGD): genetic and genomic information about
the laboratory mouse. Nucl Acids Res 27:95-98.
Stoesser G, Tuli MA, Lopez R, Sterk P. 1999. The EMBL
Nucleotide Sequence Database. Nucl Acids Res 27:18-24.