EMBL Nucleotide Sequence Database: Release Notes

      Release 59, June 1999

         EMBL - European Bioinformatics Institute
         Wellcome Trust Genome Campus, Hinxton
         Cambridge CB10 1SD, United Kingdom

         Telephone: +44-1223-494444   Telefax: +44-1223-494468
         Electronic mail: DataLib@EBI.AC.UK
         URL: http://www.ebi.ac.uk

1 RELEASE 59

The EMBL Nucleotide Sequence Database was frozen to make Release 59 on 08-June-1999. The release contains 3,952,878 sequence entries comprising 2,924,568,545 nucleotides. This represents an increase of about 24% in nucleotides over Release 58. A breakdown of Release 59 by division is shown below:
 

Division              Entries      Nucleotides

Bacteriophage           1,457        3,242,798
ESTs                2,516,840      972,399,720
Fungi                  25,253       56,724,027
GSSs                  759,940      383,333,922
HTG                     3,011      372,091,224
Human                  93,487      446,649,858
Invertebrates          37,526      161,010,283
Organelles             43,497       37,121,772
Other Mammals          19,734       18,338,028
Other Vertebrates      17,614       20,634,238
Patents               137,552       43,517,073
Plants                 38,071       97,391,542
Prokaryotes            61,364      145,119,922
Rodents                45,750       67,301,115
STSs                   75,797       26,590,343
Synthetic               3,168        7,291,368
Unclassified            1,596        1,834,854
Viruses                71,221       63,976,458

Total               3,952,878    2,924,568,545

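
The growth figure quoted for this release can be checked against the Release 58 totals listed in Appendix A. The following is a minimal Python sketch of that arithmetic; the function and constant names are illustrative only and are not part of any EBI software.

```python
def percent_increase(old: int, new: int) -> float:
    """Return the growth from `old` to `new` as a percentage of `old`."""
    return (new - old) / old * 100.0


# Totals copied from Appendix A of these release notes.
R58_NUCLEOTIDES = 2_355_200_790
R59_NUCLEOTIDES = 2_924_568_545
R58_ENTRIES = 3_272_064
R59_ENTRIES = 3_952_878

nucleotide_growth = percent_increase(R58_NUCLEOTIDES, R59_NUCLEOTIDES)
entry_growth = percent_increase(R58_ENTRIES, R59_ENTRIES)

print(f"nucleotides: {nucleotide_growth:.1f}%")
print(f"entries:     {entry_growth:.1f}%")
```

The figure of about 24% corresponds to the growth in nucleotides; the number of entries grew somewhat less over the same period.
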
1.1 'gzip' Compression Of Release 59 Files

Release 59 files distributed in compressed form via the EBI FTP site or on CD-ROM use "gzip" compression instead of the previous Unix "compress". Files are named with the suffix ".gz" instead of ".Z" and can be unpacked using "gunzip". "gzip" significantly decreases the size of the compressed files, thus reducing network traffic and download time. Daily and weekly update files will also be gzip compressed from Release 59 onwards. GCG version 8.1 .seq and .ref files (compatible with GCG 9.1 and GCG 10.0) are also gzip compressed. These files can be found on the FTP server in ftp://ftp.ebi.ac.uk/pub/databases/embl/release/gcg/.

If you need help locating gzip/gunzip software for your operating system, please contact support@ebi.ac.uk.
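
Where no gunzip program is at hand, a ".gz" file can also be unpacked with a general-purpose scripting language. The following is a minimal sketch using the gzip module of the Python standard library; the function name gunzip_file and the file name in the usage note are hypothetical and only illustrate the approach.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip_file(src: Path, dest: Optional[Path] = None) -> Path:
    """Decompress a gzip file and return the path of the unpacked copy.

    If no destination is given, the ".gz" suffix is dropped from the
    source name. Unlike the gunzip program, the compressed file is kept.
    """
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError(f"expected a .gz file, got {src.name!r}")
        dest = src.with_suffix("")
    # Stream the data in chunks so that large division files
    # do not have to fit into memory.
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as unpacked:
        shutil.copyfileobj(compressed, unpacked)
    return dest
```

For example, gunzip_file(Path("est1.dat.gz")) would write est1.dat next to the compressed file, assuming a file of that name has been downloaded.
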
 

1.2 Cross-reference Information

Interconnectivity between the nucleic acid database and other related databases is becoming an essential prerequisite for utilising the wealth of information becoming available. Currently, the database contains over 4,338,600 cross-references. Links to a growing list of external databases will be expanded, allowing integration with specialised data collections. These will include protein databases, species-specific databases and taxonomy databases, as well as other specialised data collections. The WWW-based Sequence Retrieval System (SRS) will enable users to navigate easily between cross-referenced database entries. We are currently updating the database cross-reference information to take into account the new protein_identifier format. Cross-reference information to SGD, CPGISLE and TFD is not available in Release 59.

1.3 Authorin Submission Tool Phased Out

Sequin has replaced Authorin as the stand-alone submission tool. Authorin is no longer available from the EBI. Authorin submissions are no longer accepted as of this date. Instead, please use the enhanced submission tools Webin and Sequin, described below.

1.4 Database Files

In order to keep the size of the data files within reasonable limits for handling purposes, additional division files will be added in subsequent releases as appropriate.

1.5 EST Database Files

In order to keep the size of the data files within reasonable limits for handling purposes, the EST division has been split into 26 files (EST1.DAT - EST26.DAT).

1.6 GSS Database Files

In order to keep the size of the data files within reasonable limits for handling purposes, the GSS division has been split into 8 files (GSS1.dat - GSS8.dat).

1.7 HUM Database Files

In order to keep the size of the data files within reasonable limits for handling purposes, the HUM division has been split into 4 files (HUM1.dat - HUM4.dat). Additional files will be added in subsequent releases as appropriate.

2 FORTHCOMING CHANGES

2.1 Genome Representation

An experimental directory representing genome data has been available from the EBI anonymous FTP server in directory

/pub/databases/embl/genomes

It includes a number of files for each genome: a file listing construct information and all accession numbers relevant to the project, the complete single entry in EMBL format (DNA and features), and the complete DNA sequence in FASTA format. For example, the directory ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Eubacteria/bsubtilis/ contains:

README
bsubt.con
bsubt.embl
bsubt.embl.Z
bsubt.fasta
bsubt.fasta.Z

Efforts are now under way to develop a new database division (CON) to represent these complete genomes, or other long sequences constructed from segment entries. The CON division will contain construct information, accession numbers and sequence locations involved in building the constructs.

Additionally, the corresponding complete entries, including descriptive information, references, features and DNA sequence, will be linked, searchable and retrievable through SRS and available for BLAST and FASTA homology searching.

2.2 New Nucleotide And Protein Sequence Identifiers

Nucleotide Sequence Identifiers (NID) and Protein Sequence Identifiers (PID) until now have been represented in the NI linetype of the EMBL flat-file and the /db_xref qualifier, respectively. A new form of sequence identifiers for both nucleotide (SV) and protein sequences (/protein_id) has been introduced at Release 58.

New Sequence Version Example:

    SV     AJ000012.1

The new nucleotide sequence identifier has the form 'Accession.Version' (e.g. AJ000012.1), where the accession number part will be stable, but the version part will be incremented when the sequence changes.

New Protein Identifier Example:

    /protein_id="CAA03857.1"

The new protein sequence identifier consists of a stable ID portion (3+5 format: 3 letters followed by 5 digits) plus a version number after a decimal point.
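
Programs that read the new identifiers need to separate the stable part from the version number. The following is a minimal Python sketch under stated assumptions: the names VersionedId, parse_sequence_version and parse_protein_id are hypothetical, and the regular expression simply encodes the 3+5 format described above rather than any official validation rule.

```python
import re
from typing import NamedTuple


class VersionedId(NamedTuple):
    stable_id: str  # part that stays fixed across sequence changes
    version: int    # part that is incremented when the sequence changes


# Assumed pattern for /protein_id values: 3 letters, 5 digits, '.', version.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}[0-9]{5})\.([0-9]+)$")


def parse_sequence_version(value: str) -> VersionedId:
    """Split an 'Accession.Version' string such as 'AJ000012.1'."""
    accession, dot, version = value.strip().rpartition(".")
    if not dot or not accession or not version.isdecimal():
        raise ValueError(f"not an Accession.Version identifier: {value!r}")
    return VersionedId(accession, int(version))


def parse_protein_id(value: str) -> VersionedId:
    """Split a protein identifier such as 'CAA03857.1'."""
    match = PROTEIN_ID_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a protein identifier: {value!r}")
    return VersionedId(match.group(1), int(match.group(2)))


print(parse_sequence_version("AJ000012.1"))
print(parse_protein_id("CAA03857.1"))
```

Comparing only the stable part tells whether two identifiers refer to the same record; comparing the version as an integer tells which sequence is the more recent.
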

Transition Phase
During a transition phase (6 months) both the old (NI, PID) and new (SV and /protein_id) forms of identifiers will be provided (see example below). Starting from Release 60 (September 1999) only the new form of identifiers will be included.
 

3 SEQUENCE SUBMISSION SYSTEMS

3.1 Checking Sequence Data For Vector Contamination

We urge submitters to remove vector contamination from sequence data before submitting to the database. To assist submitters the EBI is providing a Vector Screening Service using the latest implementation of the BLAST algorithm and a special sequence databank known as EMVEC. EMVEC is an extraction of sequences from the SYNthetic division of EMBL containing more than 2000 sequences commonly used in cloning and sequencing experiments. EMVEC is by no means a complete vector databank but EBI believes it is representative of the kind of material used in modern sequencing and should be useful to submitters. The databank will be updated with each release of EMBL and made publicly available on the EBI's ftp (ftp.ebi.ac.uk) server for those who wish to have it.

The interactive WWW service can be found at:

http://www.ebi.ac.uk/submission/webin.html
http://www.ebi.ac.uk/blastall/vectors.html

The results will list sequences producing significant alignments and associated information such as vector name, score and alignment.

3.2 WebIn - WWW Sequence Submission System

WebIn is the preferred WWW Sequence Submission System for submitting nucleotide sequence data and associated biological information to the EMBL Nucleotide Sequence Database at the European Bioinformatics Institute (EBI).  To access WebIn at the EBI please use the following URL:

http://www.ebi.ac.uk/submission/webin.html

Database entries created by the new WWW submission tool and submitted to the EMBL Nucleotide Sequence Database at the EBI will be exchanged and shared among the International Collaboration of Nucleotide Sequence Databases (DDBJ/EMBL/GenBank).

WebIn guides the user through a sequence of WWW forms allowing the submission of sequence data and descriptive information in an interactive and easy way. All the information required to create a database entry will be collected during this process:

1 Submitter Information
2 Release Date Information
3 Sequence Data, Description and Source Information
4 Reference Citation Information
5 Feature Information (e.g. coding regions, regulatory signals etc.)

EBI staff will process data submissions within 2 working days and send the database accession number(s) assigned to your data to your e-mail address.

3.3 Bulk Submissions

To make bulk sequence submission more efficient and less time-consuming for submitters, a prototype of a new web-based bulk sequence submission system can now be accessed from WebIn. Authors planning to submit a large number of similar sequences (i.e. more than 25) are presented with an option for "Bulk WebIn Submission". Submitters who choose the bulk path follow the usual WebIn submission procedure until they have completed a single representative sequence. During the submission process, database staff will interactively assist in making the submission of the data as convenient as possible, thus saving the author the time and effort required to complete numerous submission events individually.

Alternatively, authors planning to submit very large numbers of similar sequences should contact the database before submitting the data. Database staff will then assist in making the submission of this data as convenient as possible, thus saving the author the time and effort required to complete numerous submission events individually. When contacting database staff, authors should indicate the number of sequences they plan to submit. Database staff will create a series of templates and communicate these to the author for completion with just the information unique to each sequence. These templates, once resubmitted, will then be processed en masse by database curators.

Please contact database staff if you require further information.

e-mail: datasubs@ebi.ac.uk

Tel:   +44-1223-494499
Fax:  +44-1223-494472
 

3.4 SEQUIN - Stand-alone Submission Program

Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool developed by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ sequence databases. The Sequin program, along with detailed downloading and installation instructions and general information, is available from the EBI via WWW browser and anonymous FTP.

http://www.ebi.ac.uk/Sequin/
ftp://ftp.ebi.ac.uk/pub/software/sequin/

3.5 Sequence Alignment Submissions

The EBI accepts submissions of alignment data (e.g. from phylogenetic and population analyses) of either nucleotide or amino-acid sequences. These data are assigned an alignment number (e.g. ds38200) by database staff, which is then communicated to the submitter. We suggest that this number be quoted in the resulting publication.

Alignment data and associated information are made available via EBI's network servers (see below).

ALIGNMENT FORMATS:

As well as your alignment data, we require information describing your alignment (see table below). Please provide information for all fields.

Description Field: Information required

TITLE: Title of alignment
SUBMITTER: Name, Affiliation, Phone, Fax, Email
RELEASE DATE: Public immediately / if confidential, please provide hold date
CITATION: If known, please provide complete Author list, Title, Journal, Year of publication, Page numbers
ALIGNMENT METHOD: Method of alignment and format submitted, parameters of alignment sequences used (if appropriate)
DESCRIPTION OF SYMBOLS: e.g. Gaps indicated by a dash '-'
DESCRIPTION OF ALIGNMENT: Describe sequences aligned, including accession numbers (if known) and abbreviation of clones or taxon used in alignment file. If your alignment contains sequences derived from multiple taxonomic sources, please provide the full name of each organism
FILE FORMAT: We are currently updating and improving both the access to and alignment output of this archive due to an increase in the submission of alignment data. The compilation of text files and the issue of format standardisation are undergoing review and are being discussed by the database staff, external users and experts in the field.

We suggest submission in STANDARD ALIGNMENT FORMATS (e.g. NEXUS, PHYLIP, ClustalW) or Sequin output.

A sample alignment in NEXUS format can be viewed at ftp://ftp.ebi.ac.uk/pub/databases/embl/align/ds32096.dat

NOTE 1: Alignments can be created within Sequin or imported into Sequin from files in a standard alignment format like NEXUS or PHYLIP.

NOTE 2: If reporting new primary sequence data, we suggest that you submit the complete individual sequence files (e.g. via Sequin or Webin), in order to include the sequence data as individual entries in the database. If gaps have been introduced for the alignment, please leave them out when sending the individual sequence files.

SENDING ALIGNMENT DATA to the EMBL Nucleotide Sequence Database
Sequence Alignment Data can be sent to the Nucleotide Sequence Database by Electronic mail to DATASUBS@EBI.AC.UK

ACCESSING ALIGNMENT DATA
Alignment data and additional information are available via the EBI servers:

EBI WWW server:   ftp://ftp.ebi.ac.uk/pub/databases/embl/align/
EBI FTP server:   by anonymous FTP from FTP.EBI.AC.UK in directory /pub/databases/embl/align
EBI File server:  by sending an e-mail message to netserv@ebi.ac.uk including the line HELP ALIGN or GET ALIGN:DS8200.DAT
 

3.6 Further Submission Information
3.6.1 New Annotation Guides

To help and guide submitters in annotating their sequences, two new online guides are now available via hyperlinks from within WebIn: EMBL Annotation Examples and EMBL Features and Qualifiers. The annotation examples consist of a list of EMBL-approved feature table annotations for common biological sequences. EMBL Features and Qualifiers is a complete list of feature table key and qualifier definitions, providing detailed descriptions, mandatory and optional qualifiers and usage examples.

For further information on submission of sequence data to the EMBL Nucleotide Sequence Database please access:

http://www.ebi.ac.uk/emblSubmission/index.html

or contact database staff at:

EMBL Nucleotide Sequence Submissions
e-mail: datasubs@EBI.AC.UK
telephone: +44-1223-494499
telefax: +44-1223-494472

4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE

We encourage authors to include a reference to the EMBL Database in publications related to their research.

When citing data in the EMBL Database, we suggest giving the corresponding primary accession number(s) and the publication in which the sequence first appeared. For unpublished data, we suggest contacting the original submitters for recent publication information or revisions of the data.

We also suggest providing a reference for the EMBL Database itself. Our recent publication in Nucl. Acids Res., 1999, Vol. 27 (1), 18-24, which describes the EMBL Database, should be cited:

Stoesser, G., Tuli, M.A., Lopez, R., and Sterk, P. The EMBL Nucleotide Sequence Database. Nucl. Acids Res. 27:18-24(1999)

Example: The numbers in parentheses refer to the REFERENCE in the EMBL database entry, and to the EMBL citation above.

"Sequence entry X56734 (1) has been retrieved from the EMBL Database (2) and showed significant sequence similarity to ..."

(1) Oxtoby E., et al., Plant Mol. Biol. 17:209-219(1991).
(2) Stoesser, G., et al., Nucl. Acids Res. 27:18-24(1999)

5 EBI NETWORK SERVICES

5.1 Electronic Mail Server

Computer users with access to the Internet (directly or via a gateway) can obtain copies of database entries, documentation or the data submission form by sending commands to a file server running on the computer systems at EBI. New and updated EMBL nucleotide sequence entries are made available on the server on a daily basis.

To use this facility, send file server commands (as electronic mail) to the address Netserv@EBI.AC.UK. Each line of the mail message should consist of a single file server request.

The most important file server request, to get started, is:

HELP

If the file server receives this command, it will return a help file to the sender, explaining in some detail how to use the facility. For example, to request a copy of the data submission form and the nucleotide sequence with accession number X12399, use the commands:

GET DOC:DATASUB.TXT
GET NUC:X12399

The file server offers various other services (e.g. access to nucleotide and protein sequence data, protein structure data and software), details of which are provided in the HELP file.
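
A request of this kind can also be prepared by a program. The following is a minimal Python sketch that builds, but does not send, a mail message with one file server request per line. The function name build_netserv_request and the sender address are hypothetical, and the sketch sets no subject line because these notes do not say whether the server expects one.

```python
from email.message import EmailMessage
from typing import Sequence

NETSERV_ADDRESS = "Netserv@EBI.AC.UK"


def build_netserv_request(sender: str, commands: Sequence[str]) -> EmailMessage:
    """Build a mail message with one file server request per line."""
    for command in commands:
        # Each line of the message must hold exactly one request.
        if not command.strip() or "\n" in command:
            raise ValueError(f"invalid file server request: {command!r}")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = NETSERV_ADDRESS
    message.set_content("\n".join(commands) + "\n")
    return message


# The sender address below is a placeholder.
request = build_netserv_request(
    "user@example.org",
    ["GET DOC:DATASUB.TXT", "GET NUC:X12399"],
)
print(request.get_content())
```

The resulting message object can be handed to whatever mail transport is available locally; delivery is deliberately left out of the sketch.
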

5.2 Anonymous FTP Server

An alternative method of accessing the EBI archives is to use the Internet file transfer protocol (ftp). Researchers with direct access to the Internet can use the FTP program on their local machine to connect to the host FTP.EBI.AC.UK and enter the user name "anonymous" and their email address as password. The directory pub/help contains detailed information about the data available from the EBI anonymous FTP server which includes the complete EMBL Nucleotide Sequence Database releases as well as daily and weekly updates and a cumulative update file (in UNIX-compressed format) in the following directories:

EMBL quarterly release:   pub/databases/embl/release
EMBL updates:   pub/databases/embl/new
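
The login procedure described above can be scripted. The following is a minimal sketch using the ftplib module of the Python standard library; the function name list_release_files and its ftp_factory parameter are hypothetical, and the host and directory are simply those given in these notes, which may differ from what the server offers at any later time.

```python
import ftplib
from typing import Callable, List

EBI_FTP_HOST = "ftp.ebi.ac.uk"
RELEASE_DIR = "pub/databases/embl/release"


def list_release_files(
    email: str,
    directory: str = RELEASE_DIR,
    host: str = EBI_FTP_HOST,
    ftp_factory: Callable[[str], ftplib.FTP] = ftplib.FTP,
) -> List[str]:
    """Log in anonymously and return the file names found in `directory`.

    `ftp_factory` exists only so that the connection can be replaced
    by a stand-in object when no network is available.
    """
    with ftp_factory(host) as ftp:
        # User name "anonymous", e-mail address as the password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        return ftp.nlst()
```

Calling list_release_files("user@example.org") would attempt a real connection; passing directory="pub/databases/embl/new" would list the update files instead.
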

5.3 World Wide Web (WWW) Server

The EBI operates a WWW server with URL http://www.ebi.ac.uk/ which gives access to information about the EBI and its products and services. Nucleotide sequences can be retrieved by a simple query by accession number, or more complex queries can be constructed using an SRS WWW databank browser. Nucleotide sequences can also be submitted to the database using the interactive submission system WebIn at URL:

http://www.ebi.ac.uk/emblSubmission/webin.html

5.4 Sequence Search Servers

The EBI offers the following network servers for sequence similarity searches via electronic mail or interactive WWW forms:

FASTA   Based on W. Pearson's FASTA algorithm. Allows local similarity searches of protein and nucleotide sequence databases. Send "help" to fasta@ebi.ac.uk or use URL http://www.ebi.ac.uk/fasta33/

BLAST   Based on the NCBI and WU-BLAST software. Send "help" to blast@ebi.ac.uk or use URL http://www.ebi.ac.uk/blast2/

BLITZ   Allows very fast searches of protein sequence databases for local similarities using an exhaustive Smith-Waterman matching algorithm. Compugen's BIC_SW software runs on a Bioccelerator (BIC-2). Send "help" to Blitz@EBI.AC.UK or use URL http://www.ebi.ac.uk/bic_sw/

6 DISTRIBUTION FILES

6.1 Documentation

The documentation files are in text format, with the file extension '.txt' (relnotes.txt, usrman.txt).

6.2 SRS Indices

SRS indices can be found on the FTP server in the srs directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/srs/.
Please read the README file for details.

6.3 Release 59 Files

The release contains the files shown below, in the order listed. File sizes are given as numbers of records.

File Number   File Name   Description   Number of Records

1 USRMAN.TXT User Manual       1550
2 RELNOTES.TXT Release Notes (this document)       1027
3 DATASUB.TXT Data Submission Form         330
4 DATASUB.DOC Data Submission Documentation         311
5 UPDATE.DOC Data Update Form           86
6 FTABLE.DOC Feature Table Documentation         447
7 ACNUMBER.NDX Accession Number Index 3996712
8 DIVISION.NDX Division Index           23
9 SHORTDIR.NDX Short Directory Index 9221499
10 SPECIES.NDX Species Index   207197
11 CITATION.NDX Citation Index   406936
12 KEYWORD.NDX Keyword Index 1541681
13 EST1.DAT EST Sequences 7062409
14 EST2.DAT EST Sequences 7033968 
15 EST3.DAT EST Sequences 7113147 
16 EST4.DAT EST Sequences 6967045
17 EST5.DAT EST Sequences 7021242
18 EST6.DAT EST Sequences 7114459
19 EST7.DAT EST Sequences 6823278
20 EST8.DAT EST Sequences 6851996
21 EST9.DAT EST Sequences 6705558
22 EST10.DAT EST Sequences 7008292
23 EST11.DAT EST Sequences 7102351
24 EST12.DAT EST Sequences 7055036
25 EST13.DAT EST Sequences 6448557
26 EST14.DAT EST Sequences 5815070
27 EST15.DAT EST Sequences 5798907
28 EST16.DAT EST Sequences 6853928
29 EST17.DAT EST Sequences 5822371
30 EST18.DAT EST Sequences 5640234
31 EST19.DAT EST Sequences 5633560
32 EST20.DAT EST Sequences 6671731
33 EST21.DAT EST Sequences 6501889
34 EST22.DAT EST Sequences 6955080
35 EST23.DAT EST Sequences 7167734
36 EST24.DAT EST Sequences 7038660
37 EST25.DAT EST Sequences 6324337
38 EST26.DAT EST Sequences 1094595
39 FUN.DAT Fungi Sequences 2641735
40 GSS1.DAT Genome Survey Sequences 6406607 
41 GSS2.DAT Genome Survey Sequences 6293119 
42 GSS3.DAT Genome Survey Sequences 6455019
43 GSS4.DAT Genome Survey Sequences 6828223 
44 GSS5.DAT Genome Survey Sequences 6625917 
45 GSS6.DAT Genome Survey Sequences 6737196
46 GSS7.DAT Genome Survey Sequences 6471607
47 GSS8.DAT Genome Survey Sequences 3566273
48 HTG.DAT High Throughput Genome Sequences 6446495 
49 HUM1.DAT Human Sequences 6715037 
50 HUM2.DAT Human Sequences 3622945 
51 HUM3.DAT Human Sequences 2150782
52 HUM4.DAT Human Sequences 1585576
53 INV.DAT Invertebrate Sequences 5367594 
54 MAM.DAT Other Mammal Sequences 1444791 
55 ORG.DAT Organelle Sequences 3208680
56 PATENT.DAT Patent Sequences 5710464 
57 PHG.DAT Bacteriophage Sequences   180433
58 PLN.DAT Plant Sequences 4239644
59 PRO1.DAT Prokaryote Sequences 6167851
60 PRO2.DAT Prokaryote Sequences 1094457
61 ROD.DAT Rodent Sequences 3909502
62 STS.DAT STS Sequences 4827205
63 SYN.DAT Synthetic Sequences   315717
64 UNC.DAT Unclassified Sequences   112675
65 VRL.DAT Viral Sequences 5495692
66 VRT.DAT Other Vertebrate Sequences 1366461
APPENDIX A

DATABASE GROWTH TABLE

The following table shows the growth of the EMBL Nucleotide Sequence Database at each release.

Release   Month   Entries   Nucleotides

1 06/1982         568         585433
2 04/1983         811       1114447
3 12/1983       1481       1654863
4 08/1984       1698       2147205
5 04/1985       2378       2874493
6 08/1985       4835       4567592
7 12/1985       5789       5622638
8 04/1986       6395       6353040
9 09/1986       7630       7813214
10 12/1986       8817       9766948
11 04/1987     11621     12189783
12 07/1987     12706     13638061
13 10/1987     14397     16023478
14 01/1988     15344     17272160
15 05/1988     17961     20318442
16 08/1988     19592     22625941
17 11/1988     20695     24211054
18 02/1989     22938     27249830
19 05/1989     24365     29066676
20 08/1989     26223     31240948
21 11/1989     28679     34748087
22 02/1990     31508     38165786
23 05/1990     34902     42923803
24 08/1990     37784     47354438
25 11/1990     41580     52900354
26 02/1991     43745     55859549
27 05/1991     46871     59915244
28 09/1991     54558     70448052
29 12/1991       5765     75400487
30 03/1992     63378     83574342
31 06/1992     72481     94390065
32 09/1992     79377   101292310
33 12/1992     89100   111413979
34 03/1993     99591   121420828
35 06/1993   108973   131880111
36 09/1993   127933   145401156
37 12/1993   146576   158171400
38 03/1994   167777   177550115
39 06/1994   182615   192195819
40 09/1994   209352   211017104
41 12/1994   230950   226259607
42 03/1995   303206   262559786
43 06/1995   420111   315840053
44 09/1995   506190   363273777
45 12/1995   622566   427620278
46 03/1996   701246   473691480
47  06/1996   827174   550739395
48 09/1996   928067   608931850
49 12/1996 1047263   696183789
50 03/1997 1187455   789755858
51 06/1997 1432941   931351601
52 10/1997 1787004 1181167498
53 12/1997 1917868 1281391651
54 03/1998 2125225 1427634373
55 06/1998 2330040 1607673907
56 09/1998 2689618 1904091473
57 12/1998 3046471 2164718256
58 03/1999 3272064 2355200790
59 06/1999 3952878 2924568545