

Mus musculus in the Swiss-Prot database: its relevance to developmental research

Michele Magrane and Rolf Apweiler.

EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

 

Because Mus musculus is the subject of large-scale sequencing and mapping efforts, it has been selected as a model organism for priority annotation in Swiss-Prot. This means that we aim to be as complete as possible, adding new sequences and updates quickly and providing cross-references to the Mouse Genome Database. Swiss-Prot mouse entries contain a large amount of information, including data relating specifically to developmental proteins, such as the function of these proteins and the stages of development at which they are expressed. The Swiss-Prot database therefore offers a number of features that are particularly useful in the context of modern developmental research and is a valuable resource for those working in the field. The database can be accessed at http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt and http://www.expasy.ch/sprot/sprot-top.html.

 

Swiss-Prot (Bairoch and Apweiler, 1999) is a curated protein sequence database which was established in 1986 at the Department of Medical Biochemistry, University of Geneva. It is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the EMBL Outstation - European Bioinformatics Institute. The database distinguishes itself from other protein sequence databases on the basis of three criteria: (i) it provides a high level of annotation, (ii) it is non-redundant, and (iii) it provides a high level of integration with other databases. It is supplemented by TrEMBL (Bairoch and Apweiler, 1999), which was created in response to the increasing number of sequences to be incorporated into Swiss-Prot from genome sequencing and mapping projects. TrEMBL is computer-annotated and contains translations of all coding sequences in the EMBL Nucleotide Sequence Database (Stoesser et al., 1999) which are not yet integrated into Swiss-Prot. It is subdivided into two sections: SP-TrEMBL, which contains entries which will eventually be incorporated into Swiss-Prot, and REM-TrEMBL, which contains sequences which will not. The latter include immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, fragments of fewer than 8 amino acids, and coding sequences where there is strong evidence that the sequence does not code for a real protein.

Mus musculus is one of a number of model organisms which have been selected in Swiss-Prot for priority annotation because they are the target of large-scale sequencing and mapping. For these model organisms, we aim to be as complete as possible by adding new sequences and updates quickly while still providing a high level of annotation. We also provide cross-references to specialised databases and maintain specific documents relating to the organisms. In the case of the mouse, we provide cross-links from Swiss-Prot and TrEMBL entries to the Mouse Genome Database (MGD) (Blake et al., 1999) and we also maintain an index of those entries which contain MGD cross-references (Fig. 1). This index can be found at http://www.expasy.ch/cgi-bin/lists?mgdtosp.txt. The inclusion of MGD cross-references in Swiss-Prot mouse entries is an ongoing collaboration between the two databases and gives Swiss-Prot users access to additional information relating to a particular gene, such as mapping data. Cross-linking has also allowed the standardisation of gene nomenclature in the database through the use of the official gene symbols assigned by the International Committee on Standardized Genetic Nomenclature for Mice. The official gene names, along with any synonyms, are stored in the 'GN' line of a Swiss-Prot entry. There are 3549 mouse entries in the current Swiss-Prot release (Release 38) with 3021 cross-references to MGD. There are an additional 4423 mouse entries in TrEMBL (Release 11) with 1502 MGD cross-references.
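
For readers who work with entries in flat-file form, the following Python fragment is a minimal sketch of how the gene names and MGD cross-references might be read from an entry. It rests on several assumptions that are not stated in the text: that synonyms on the 'GN' line are separated by ' OR ', that cross-references are held on 'DR' lines of the form 'DR   database; identifier; symbol.', and that MGD is named 'MGD' in the first field. The function names are ours, and the sample lines are hypothetical placeholders rather than data from a real entry.

```python
def gene_names(entry_lines):
    """Return the gene names listed on the GN line(s) of an entry."""
    names = []
    for line in entry_lines:
        if line.startswith("GN   "):
            text = line[5:].strip().rstrip(".")
            # Assumption: synonyms are separated by " OR ".
            names.extend(n.strip() for n in text.split(" OR ") if n.strip())
    return names


def mgd_cross_references(entry_lines):
    """Return (identifier, symbol) pairs from DR lines that point to MGD."""
    refs = []
    for line in entry_lines:
        if line.startswith("DR   "):
            text = line[5:].strip().rstrip(".")
            fields = [field.strip() for field in text.split(";")]
            # Assumption: the first field names the target database.
            if len(fields) >= 3 and fields[0] == "MGD":
                refs.append((fields[1], fields[2]))
    return refs


# Hypothetical lines, made up for illustration only.
sample = [
    "GN   GENE1 OR GENE-1.",
    "DR   EMBL; X00000; -.",
    "DR   MGD; MGI:00000; Gene1.",
]

print(gene_names(sample))
print(mgd_cross_references(sample))
```

Counting the entries for which mgd_cross_references returns a non-empty list would give a figure comparable to the cross-reference totals quoted above, provided the assumed line layout holds for the release in question.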

 

Each Swiss-Prot entry (Fig. 2) contains core data, i.e. sequence, citation and taxonomic information, to which additional information is added from a variety of sources such as scientific literature, other databases, similar entries, prediction programs and external experts. This additional information includes protein function, post-translational modifications, subcellular location, tissue specificity, domains and sites, secondary and quaternary structure, similarities to other proteins, diseases associated with a particular protein, and polymorphisms.
 

Some of this information relates specifically to developmental proteins. For example, many entries contain a description of the function of the protein within the developmental process, as in the example below which is taken from endothelial PAS domain protein 1 (Swiss-Prot accession number: P97481). This information is stored in the comment or 'CC' lines of an entry and is prefixed by the token 'Function'.

CC   -!- FUNCTION: TRANSCRIPTION FACTOR INVOLVED IN THE INDUCTION OF OXYGEN
CC       REGULATED GENES. SPECIFICALLY RECOGNISES AN 8 BP HYPOXIA RESPONSE
CC       ELEMENT (HRE). REGULATES THE VASCULAR ENDOTHELIAL GROWTH FACTOR
CC       (VEGF) EXPRESSION AND SEEMS TO BE IMPLICATED IN THE DEVELOPMENT OF
CC       BLOOD VESSELS AND THE TUBULAR SYSTEM OF LUNG. MAY ALSO PLAY A ROLE
CC       IN THE FORMATION OF THE ENDOTHELIUM GIVING RISE TO THE BLOOD BRAIN
CC       BARRIER. POTENT ACTIVATOR OF THE TIE-2 TYROSINE KINASE EXPRESSION.
 

Many entries also describe the stage of development at which a protein is expressed and the tissues in which it is found, as in the example below from transcription factor BF-2 (Swiss-Prot accession number: Q61345). This information is also stored in the comment or 'CC' lines, prefixed by the tokens 'Tissue specificity' and 'Developmental stage'.

CC   -!- TISSUE SPECIFICITY: PREDOMINANTLY EXPRESSED IN THE CNS AND
CC       TEMPORAL HALF OF THE RETINA. ALSO EXPRESSED IN THE CONDENSED HEAD
CC       MESENCHYME, METANEPHRIC BLASTEMA OF THE DEVELOPING KIDNEY, CORTEX
CC       OF THE ADRENAL GLAND, CONDENSED MESENCHYME AT THE BASE OF THE
CC       FOLLICLES OF VIBRASSAE, AND CARTILAGE PERICHONDRIUM OF THE
CC       DEVELOPING VERTEBRATE.


CC   -!- DEVELOPMENTAL STAGE: AT E9.5 EMBRYOS, EXPRESSED IN A LIMITED
CC       REGION OF THE NEUROEPITHELIUM AND ALSO IN THE TEMPORAL HALF OF THE
CC       PRIMARY OPTIC CUP AND THE OPTIC STALK. AT E10.5, SEEN IN THE
CC       HYPOTHALAMUS, TEMPORAL HALF OF THE OPTIC STALK, AND TEMPORAL
CC       HEMIRETINA. AT E12.5 AND E13.5 A HIGH EXPRESSION IS SEEN IN
CC       REGIONS OF CONDENSED MESENCHYME OF THE HEAD, AND AS
CC       NEUROEPITHELIAL CELLS BEGIN TO DIFFERENTIATE AND MIGRATE OUTWARD
CC       FROM THE VENTRICULAR ZONE, EXPRESSION DECLINES MARKEDLY. BY E16.5
CC       LEVELS ARE DIMINISHED AND RESTRICTED TO UNFUSED POCKETS ALONG THE
CC       EXHAUSTED VENTRICULAR ZONE.
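
The layout of the comment lines shown above lends itself to simple parsing. The Python fragment below is a minimal sketch, assuming that each comment block begins with '-!-' followed by a topic and a colon, and that indented 'CC' lines without '-!-' continue the current block. The function name is ours. The sample reuses only the first two lines of each block from the BF-2 example, so the extracted text is deliberately incomplete.

```python
def parse_comments(entry_lines):
    """Map each comment topic to its text, joining continuation lines."""
    blocks = {}
    current = None
    for line in entry_lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            # Start of a new block: "-!- TOPIC: first words of the text"
            topic, _, text = body[3:].partition(":")
            current = topic.strip()
            blocks[current] = [text.strip()] if text.strip() else []
        elif current is not None and body:
            # Continuation line belonging to the current block.
            blocks[current].append(body)
    return {topic: " ".join(parts) for topic, parts in blocks.items()}


# First two lines of each block from the BF-2 example (abridged).
sample = [
    "CC   -!- TISSUE SPECIFICITY: PREDOMINANTLY EXPRESSED IN THE CNS AND",
    "CC       TEMPORAL HALF OF THE RETINA. ALSO EXPRESSED IN THE CONDENSED HEAD",
    "CC   -!- DEVELOPMENTAL STAGE: AT E9.5 EMBRYOS, EXPRESSED IN A LIMITED",
    "CC       REGION OF THE NEUROEPITHELIUM AND ALSO IN THE TEMPORAL HALF OF THE",
]

for topic, text in parse_comments(sample).items():
    print(topic, "->", text)
```

In this sketch a topic that occurs more than once in an entry would overwrite the earlier block; collecting the blocks in a list would avoid that if it matters for the data at hand.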
 

There are also a number of keywords which are specifically associated with proteins involved in development. Keywords are contained in the 'KW' line of a Swiss-Prot entry. These include 'Developmental protein', which is a general keyword describing proteins involved in the developmental process. There are also a number of more specific keywords, such as: 'Neurogenesis', which is used for proteins involved in the formation of the nervous system during embryonic development; 'Differentiation', which is used for proteins involved in differentiation of the embryo; 'Homeobox', which is used for those proteins which contain a homeobox domain; 'Gastrulation', which is used for proteins which play a role during gastrulation; and 'Myogenesis', which is used for proteins involved in the differentiation and development of muscle. The complete list of the keywords in use in Swiss-Prot can be found at http://www.expasy.ch/txt/keywlist.txt.
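
As an illustration of how these keywords might be used to select developmental proteins from a set of entries, the Python fragment below is a minimal sketch. It assumes that keywords on the 'KW' line are separated by semicolons, that the list ends with a full stop and that it may continue over several 'KW' lines. The function names are ours, the sample lines are hypothetical, and the comparison ignores case because the capitalisation used in a given release is not stated here.

```python
# The development-related keywords named in the text above.
DEVELOPMENT_KEYWORDS = {
    "developmental protein",
    "neurogenesis",
    "differentiation",
    "homeobox",
    "gastrulation",
    "myogenesis",
}


def keywords(entry_lines):
    """Collect the keywords from the KW line(s) of an entry."""
    text = " ".join(
        line[2:].strip() for line in entry_lines if line.startswith("KW")
    )
    # Assumption: semicolon-separated list terminated by a full stop.
    return [kw.strip() for kw in text.rstrip(".").split(";") if kw.strip()]


def development_keywords(entry_lines):
    """Return only those keywords that relate to development."""
    return [
        kw for kw in keywords(entry_lines)
        if kw.casefold() in DEVELOPMENT_KEYWORDS
    ]


# Hypothetical KW lines, made up for illustration only.
sample = [
    "KW   Developmental protein; Myogenesis; DNA-binding;",
    "KW   Nuclear protein.",
]

print(keywords(sample))
print(development_keywords(sample))
```

An entry for which development_keywords returns an empty list carries none of the keywords listed above, although it may still describe a developmental role in its comment lines.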

 

The Swiss-Prot database and its supplement, TrEMBL, thus provide a valuable resource in the context of modern developmental research. The databases can be accessed at http://www.ebi.ac.uk/uniprot/Documentation/index.html#SwissProt and http://www.expasy.ch/sprot/sprot-top.html.

 

References:
 

Bairoch A, Apweiler R. 1999. The Swiss-Prot protein sequence data bank and its supplement TrEMBL in 1999. Nucl Acids Res 27:49-54.

Blake JA, Richardson JE, Davisson MT, Eppig JT and the Mouse Genome Database Group. 1999. The Mouse Genome Database (MGD): genetic and genomic information about the laboratory mouse. Nucl Acids Res 27:95-98.

Stoesser G, Tuli MA, Lopez R, Sterk P. 1999. The EMBL Nucleotide Sequence Database. Nucl Acids Res 27:18-24.

 

Figure 1: Portion of the index of Swiss-Prot entries which contain cross-references to MGD.

Figure 2: Example of a Swiss-Prot entry for a mouse developmental protein, myogenic factor 5. Abbreviations used in entry: ID-identification, AC-accession number, DT-date, DE-description, GN-gene name, OS-organism species, OC-organism classification, RN-reference number, RP-reference position (N.A.-nucleic acid), RC-reference comment, RX-reference cross-reference, RA-reference authors, RL-reference location, CC-comments/notes, DR-data bank cross-reference, KW-keywords, FT-feature table, SQ-sequence data. Underlined text indicates hyperlinking through WWW interface.


