UniProtKB/TrEMBL - Old Release Notes

N.B. From release 27, the UniProtKB/TrEMBL release notes are included in the UniProt release notes.

                UniProt/TrEMBL Release Notes

                   Release 27, 5th July 2004

    EMBL Outstation
    European Bioinformatics Institute (EBI)
    Wellcome Trust Genome Campus
    Cambridge CB10 1SD
    United Kingdom

    Telephone: (+44 1223) 494 444
    Fax: (+44 1223) 494 468
    Electronic mail address:
    WWW server:

    Swiss Institute of Bioinformatics (SIB)
    Centre Medical Universitaire
    1, rue Michel Servet
    1211 Geneva 4

    Telephone: (+41 22) 702 50 50
    Fax: (+41 22) 702 58 58
    Electronic mail address: 
    WWW server:

    Protein Information Resource (PIR)
    Georgetown University Medical Center
    3900 Reservoir Road, NW
    Box 571455
    Washington, DC 20057-1455
    United States of America

    Telephone: (+1 202) 687 1039
    Fax: (+1 202) 687 0057
    Electronic mail address:
    WWW server:


    UniProt/TrEMBL has been prepared by:

    o  Claire O'Donovan, Maria Jesus Martin, Yasmin Alam-Faruque,
       Nicola Althorpe, Daniel Barrell, Wei mun Chan, Paul Browne,
       Kirill Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander
       Fedetov, Rodrigo Fernandez, John Garavelli, Andre Hackmann, 
       Alan Horne, Julius Jacobsen, Alexander Kanapin, Youla
       Karavidopoulou, Paul Kersey, Ernst Kretschmann, Kati Laiho,
       Minna Lehvaslaiho, Michele Magrane, Virginie Mittard, Nicola
       Mulder, John F. O'Rourke, Sandra Orchard, Astrid Rakow, 
       Mark Rynbeek, Sandra van den Broek, Eleanor Whitfield, Allyson
       Williams and Rolf Apweiler at the EMBL Outstation - European
       Bioinformatics Institute (EBI) in Hinxton, UK.
    o  Amos Bairoch, Alexandre Gattiker, Karine Michoud, Catherine
       Rivoire, Nicole Redaschi and Sandrine Pilbout at the Swiss
       Institute of Bioinformatics in Geneva, Switzerland.

    Copyright Notice

    UniProt/TrEMBL copyright (c) 2004 EMBL-EBI
    This manual and the database it accompanies may be copied and
    redistributed freely, without advance permission, provided
    that this copyright statement is reproduced with each copy.


    If you want to cite UniProt/TrEMBL in a publication please use  
    the following reference:

    Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro 
    S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., 
    Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L.
    UniProt: the Universal Protein knowledgebase
    Nucleic Acids Res. 32: D115-D119 (2004)

 1. Introduction

UniProt/TrEMBL is a computer-annotated protein sequence database 
complementing the UniProt/Swiss-Prot database. Together they constitute
the UniProt Knowledgebase. The DDBJ/EMBL/GenBank nucleotide sequence 
databases' CDS translations, the sequences of PDB structures, and 
directly sequenced peptides extracted from the literature or submitted 
directly to UniProt are used by default as the raw material for the 
UniProt Knowledgebase. However, some data from DDBJ/EMBL/GenBank including 
most of the Whole Genome Shotgun (WGS) data, CDS translations leading to 
small fragments or not coding for real proteins, synthetic sequences, 
non-germline Immunoglobulins and T-cell receptors, and most patent 
application sequences are actively excluded from the Knowledgebase. 
Including these data would pollute the Knowledgebase with highly 
unstable and low-quality data. However, we do provide all publicly 
available protein sequences in the UniProt archive (UniParc). Sequences 
from other UniParc source records that the UniProt curators identify as 
important and missing from the Knowledgebase are also used to create new 
UniProt Knowledgebase records. This process ensures that the UniProt 
Knowledgebase is not missing any important sequences available in the 
protein sequence repositories, while minimising the amount of unstable 
and low-quality data in the Knowledgebase.

 2. Why a complement to UniProt/Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into 
UniProt/Swiss-Prot. We do not want to dilute the quality standards of 
UniProt/Swiss-Prot by incorporating sequences without proper sequence 
analysis and annotation but we do want to make the sequences available 
as quickly as possible. UniProt/TrEMBL achieves this goal and is a major 
step in the process of speeding up subsequent upgrading of annotation to 
the standard UniProt/Swiss-Prot quality. 

 3. The Release

This UniProt/TrEMBL release has been produced in sync with 
UniProt/Swiss-Prot release 44; together they comprise the UniProt 
Knowledgebase release 2.0. It was created from the EMBL Nucleotide 
Sequence Database release 79 and updates up to 18-June-2004, and 
contains 1'333'917 entries and 413'323'560 amino acids.

UniProt/TrEMBL is organized in subsections:

arc.dat (Archaea):                          4245 entries
arp.dat (Complete Archaeal proteomes):     33050 entries 
fun.dat (Fungi):                           41959 entries
hum.dat (Human):                           49176 entries
inv.dat (Invertebrates):                  147306 entries
mam.dat (Other Mammals):                   18352 entries
mhc.dat (MHC proteins):                    10528 entries
org.dat (Organelles):                     112691 entries
phg.dat (Bacteriophages):                  13750 entries
pln.dat (Plants):                         116371 entries
pro.dat (Prokaryotes):                    169966 entries
prp.dat (Complete Prokaryotic Proteomes): 330392 entries
rod.dat (Rodents):                         47097 entries
unc.dat (Unclassified):                      963 entries
vrl.dat (Viruses):                        124972 entries
vrt.dat (Other Vertebrates):               30294 entries
vrv.dat (Retroviruses):                   116571 entries

275'585 new entries have been integrated in UniProt/TrEMBL. 

More statistics for the UniProt/TrEMBL release are available at

In the document delac_tr.txt, you will find a list of all accession 
numbers which were previously present in UniProt/TrEMBL, but which 
have now been deleted from the database.

 4. Format differences between UniProt/Swiss-Prot and UniProt/TrEMBL

The format and conventions used by UniProt/TrEMBL follow as closely 
as possible those of UniProt/Swiss-Prot. Hence, it is not necessary 
to produce an additional user manual and extensive release notes 
for UniProt/TrEMBL. The information given in the UniProt/Swiss-Prot 
release notes and user manual is in general valid for UniProt/TrEMBL. 
The differences are described below.

The general structure of an entry is identical in both databases. 

The data class used in UniProt/TrEMBL (in the ID line) is always 
'PRELIMINARY', whereas in UniProt/Swiss-Prot it is always 'STANDARD'.

Differences in line types:

The ID line (IDentification):

The entry name used in UniProt/TrEMBL is the same as the Accession 
Number of the entry.
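
The two ID line conventions above can be checked programmatically. The 
following is a minimal sketch, not an official parser. It assumes the ID 
line layout "ID   ENTRY_NAME  DATA_CLASS; MOLECULE_TYPE; LENGTH AA." as 
described in the UniProt/Swiss-Prot user manual; the function names and 
the sample lines are hypothetical.

```python
def parse_id_line(line):
    """Split an ID line into its fields.

    Assumed layout (see the UniProt/Swiss-Prot user manual):
    ID   ENTRY_NAME  DATA_CLASS; MOLECULE_TYPE; LENGTH AA.
    """
    if not line.startswith("ID   "):
        raise ValueError("not an ID line: %r" % line)
    fields = line[5:].split()
    if len(fields) < 4:
        raise ValueError("unexpected ID line layout: %r" % line)
    return {
        "entry_name": fields[0],
        "data_class": fields[1].rstrip(";"),
        "molecule_type": fields[2].rstrip(";"),
        "length": int(fields[3]),
    }


def source_database(id_fields):
    """Map the data class of a parsed ID line to the database section."""
    data_class = id_fields["data_class"]
    if data_class == "PRELIMINARY":
        return "UniProt/TrEMBL"
    if data_class == "STANDARD":
        return "UniProt/Swiss-Prot"
    raise ValueError("unknown data class: %r" % data_class)


if __name__ == "__main__":
    # Hypothetical sample lines, written for illustration only.
    samples = [
        "ID   Q12345      PRELIMINARY;      PRT;   256 AA.",
        "ID   ABCD_HUMAN     STANDARD;      PRT;   104 AA.",
    ]
    for sample in samples:
        parsed = parse_id_line(sample)
        print(parsed["entry_name"], source_database(parsed))
```

In a UniProt/TrEMBL entry the value returned as "entry_name" is expected 
to equal the accession number of the entry.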

The DT line (DaTe):

The format of the DT lines, which indicate when an entry was created 
and updated, is identical to that defined in UniProt/Swiss-Prot, but 
the DT lines in UniProt/TrEMBL refer to the UniProt/TrEMBL release. 
The difference is shown in the example below.

    DT lines in a UniProt/Swiss-Prot entry:

    DT   01-JAN-1988 (Rel. 06, Created)
    DT   01-JUL-1989 (Rel. 11, Last sequence update)
    DT   01-AUG-1992 (Rel. 23, Last annotation update)

    DT lines in a UniProt/TrEMBL entry:

    DT   01-NOV-1996 (TrEMBLrel. 01, Created)
    DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
    DT   01-MAR-2004 (TrEMBLrel. 26, Last annotation update)
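
A parser can tell the two DT line variants apart by the release prefix 
("Rel." or "TrEMBLrel."). The following is a minimal sketch based only on 
the example lines above; the function name, the regular expression and 
the returned dictionary keys are assumptions made for illustration.

```python
import re
from datetime import date

# Month abbreviations as they appear in DT lines (upper case).
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}

# Matches e.g. "DT   01-NOV-1996 (TrEMBLrel. 01, Created)".
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), (.+)\)$"
)


def parse_dt_line(line):
    """Parse one DT line into date, database, release number and event."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a recognised DT line: %r" % line)
    day, month, year, prefix, release, event = match.groups()
    if month not in MONTHS:
        raise ValueError("unknown month abbreviation: %r" % month)
    if prefix == "TrEMBLrel":
        database = "UniProt/TrEMBL"
    else:
        database = "UniProt/Swiss-Prot"
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "database": database,
        "release": int(release),
        "event": event,
    }


if __name__ == "__main__":
    for text in (
        "DT   01-JUL-1989 (Rel. 11, Last sequence update)",
        "DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)",
    ):
        print(parse_dt_line(text))
```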

 5. Biweekly incremental UniProt Knowledgebase releases 

5.1 UniProt Knowledgebase

In addition to full releases, we also provide two compressed files 
every two weeks, uniprot.sprot.dat.gz and uniprot.trembl.dat.gz, giving 
users access to the latest data.
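
The compressed files can be read without unpacking them first. The 
following is a minimal sketch, assuming the files use the flat-file 
convention in which each entry ends with a line containing only "//"; 
the function names and the local file path are hypothetical.

```python
import gzip


def iter_entries(lines):
    """Yield each entry as a list of lines, without the '//' terminator.

    Assumes every complete entry ends with a line containing only '//'.
    Trailing lines that are not followed by a terminator are ignored.
    """
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def count_entries_in_gz(path):
    """Count the entries in a gzip-compressed flat file, streaming it."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return sum(1 for _ in iter_entries(handle))


if __name__ == "__main__":
    # Hypothetical local copy of one of the biweekly files.
    print(count_entries_in_gz("uniprot.trembl.dat.gz"))
```

Streaming line by line keeps memory use independent of the file size, 
which matters for a file holding more than a million entries.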

5.2 XML

A version of the UniProt Knowledgebase in XML format has been developed 
and is provided with this release. More information is available at and the data can be 
downloaded from

We would welcome any feedback from the user community.

5.3 Varsplic Expand

We also provide Varsplic Expand, a program that generates "expanded" 
sequences from UniProt Knowledgebase records, i.e. sequences that 
include the variants specified by the varsplic, variant and conflict 
annotations. New records are produced in either pseudo-UniProt/Swiss-Prot 
or FASTA format for each specified variant. More information and the data 
are available at
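
The idea of an expanded sequence can be illustrated with a minimal 
sketch. This is not the Varsplic Expand program and does not reproduce 
its options or output. The function names, the 1-based inclusive 
coordinates, and the sample sequence and variant are all assumptions 
made for illustration.

```python
def apply_variant(sequence, start, end, replacement):
    """Return the sequence with residues start..end replaced.

    Coordinates are assumed to be 1-based and inclusive. An empty
    replacement string models a deletion of the given range.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("variant coordinates fall outside the sequence")
    return sequence[:start - 1] + replacement + sequence[end:]


def to_fasta(identifier, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [">" + identifier]
    for offset in range(0, len(sequence), width):
        lines.append(sequence[offset:offset + width])
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sequence and variant, for illustration only.
    original = "MKTAYIAK"
    expanded = apply_variant(original, 3, 4, "G")
    print(to_fasta("EXAMPLE_variant_1", expanded))
```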

 6. Access/Data Distribution

The UniProt/TrEMBL release 27 is available at:

The biweekly UniProt Knowledgebase release is available for searches and 
download from

The UniProt Knowledgebase release is also available on CD-ROM from the EBI.

 7. General announcements and Forthcoming changes

7.1 Recent and Forthcoming changes documentation for users

We have introduced two new resources for users to enable us to communicate 
effectively between releases about what is new in the UniProt Knowledgebase 
and what is planned for the future.
These are available at:

7.2 TrEMBL enhancements

This release of TrEMBL has been produced from a new relational
database system. This new system enables the biweekly synchronization of 
UniProt/TrEMBL with its source EMBL/DDBJ/GenBank nucleotide sequence 
databases. It has also facilitated the integration of various bioinformatic 
tools to enhance the UniProt/TrEMBL annotation. As a result, the annotation 
in this release differs significantly from that of previous releases, and 
we are committed to further raising the annotation standards. 
We welcome feedback from the user community.