The EMBL Nucleotide Sequence Database was frozen to make Release 59
on 08-June-1999. The release contains 3,952,878 sequence entries
comprising 2,924,568,545 nucleotides. This represents an increase
of about 24% in nucleotides over Release 58. A breakdown of Release 59
by division is shown below:
Division          | Entries   | Nucleotides
------------------|-----------|--------------
Bacteriophage     |     1,457 |     3,242,798
ESTs              | 2,516,840 |   972,399,720
Fungi             |    25,253 |    56,724,027
GSSs              |   759,940 |   383,333,922
HTG               |     3,011 |   372,091,224
Human             |    93,487 |   446,649,858
Invertebrates     |    37,526 |   161,010,283
Organelles        |    43,497 |    37,121,772
Other Mammals     |    19,734 |    18,338,028
Other Vertebrates |    17,614 |    20,634,238
Patents           |   137,552 |    43,517,073
Plants            |    38,071 |    97,391,542
Prokaryotes       |    61,364 |   145,119,922
Rodents           |    45,750 |    67,301,115
STSs              |    75,797 |    26,590,343
Synthetic         |     3,168 |     7,291,368
Unclassified      |     1,596 |     1,834,854
Viruses           |    71,221 |    63,976,458
Total             | 3,952,878 | 2,924,568,545
Release 59 files distributed in compressed form via the EBI FTP site or on CD-ROM use "gzip" compression instead of the previous Unix "compress". Files are named with the suffix ".gz" instead of ".Z" and can be unpacked using "gunzip". "gzip" significantly decreases the size of the compressed files, thus reducing network traffic and download time. Daily and weekly update files will also be gzipped from Release 59 onwards. GCG version 8.1 .seq and .ref files (compatible with GCG 9.1 and GCG 10.0) are also gzip compressed. These files can be found on the FTP server in ftp://ftp.ebi.ac.uk/pub/databases/embl/release/gcg/.
If you need help locating gzip/gunzip software for your operating system,
please contact support@ebi.ac.uk.
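Where gunzip is not to hand, a gzip-compressed file can also be read directly from a script. The following is a minimal sketch in Python, not an EBI-supplied tool: the function name count_entries is hypothetical, and the sketch assumes that each entry in an EMBL flat file begins with an "ID" line.

```python
import gzip


def count_entries(path):
    """Count entries in a gzip-compressed EMBL flat file.

    Assumption: every entry starts with a line whose first
    three characters are 'ID ' (the EMBL ID line).
    """
    count = 0
    # "rt" opens the compressed file as text, decompressing on the fly.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("ID "):
                count += 1
    return count
```

The file is decompressed as it is read, so no unpacked copy needs to be written to disk.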
1.2 Cross-reference Information
Interconnectivity between the nucleic acid database and other related databases is becoming an essential prerequisite for making use of the wealth of information now available. Currently, the database contains over 4,338,600 cross-references. Links to a growing list of external databases will be expanded, allowing integration with specialised data collections. These will include protein databases, species-specific databases and taxonomy databases, as well as other specialised data collections. The WWW-based Sequence Retrieval System (SRS) will enable users to navigate easily between cross-referenced database entries. We are currently updating the database cross-reference information to take into account the new protein_identifier format. Cross-reference information to SGD, CPGISLE and TFD is not available in Release 59.
1.3 Authorin Submission Tool Phased Out
Sequin has replaced Authorin as the stand-alone submission tool. Authorin is no longer available from the EBI. Authorin submissions are no longer accepted as of this date. Instead, please use the enhanced submission tools Webin and Sequin, described below.
In order to keep the size of the data files within reasonable limits for handling purposes, additional division files will be added in subsequent releases as appropriate.
In order to keep the size of the data files within reasonable limits for handling purposes, the EST division has been split into 26 files (EST1.DAT - EST26.DAT).
In order to keep the size of the data files within reasonable limits for handling purposes, the GSS division has been split into 8 files (GSS1.dat - GSS8.dat).
In order to keep the size of the data files within reasonable limits for handling purposes, the HUM division has been split into 4 files (HUM1.dat - HUM4.dat). Additional files will be added in subsequent releases as appropriate.
An experimental directory representing genome data has been available from the EBI anonymous FTP server in directory
/pub/databases/embl/genomes
including a number of files for each genome: a file listing construct information and all accession numbers relevant to the project, the complete single entry in EMBL format (DNA and features), the complete DNA sequence in FASTA format, e.g.
in ftp://ftp.ebi.ac.uk/pub/databases/embl/genomes/Eubacteria/bsubtilis/
README
bsubt.con
bsubt.embl
bsubt.embl.Z
bsubt.fasta
bsubt.fasta.Z
Efforts are now under way to develop a new database division (CON) to represent these complete genomes, or other long sequences constructed from segment entries. The CON division will contain construct information, accession numbers and sequence locations involved in building the constructs.
Additionally, the corresponding complete entries, including descriptive information, references, features and DNA sequence, will be linked, searchable and retrievable through SRS and available for BLAST and FASTA homology searching.
2.2 New Nucleotide And Protein Sequence Identifiers
Until now, Nucleotide Sequence Identifiers (NID) and Protein Sequence Identifiers (PID) have been represented in the NI line type of the EMBL flat file and in the /db_xref qualifier, respectively. A new form of sequence identifier for both nucleotide sequences (SV) and protein sequences (/protein_id) was introduced in Release 58.
New Sequence Version
Example:
SV   AJ000012.1
The new nucleotide sequence identifier has the form 'Accession.Version'
(e.g. AJ000012.1), where the accession number part will be stable, but the
version part will be incremented when the sequence changes.
New Protein Identifier
Example:
/protein_id="CAA03857.1"
The new protein sequence identifier consists of a stable ID portion
(3+5 format: three letters followed by five digits) plus a version number
after a decimal point.
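For users who need to handle the new identifiers in their own scripts, the two formats described above can be split and checked as in the following Python sketch. The function names are hypothetical and the regular expression encodes only the "3+5 plus version" rule stated here; it is an illustration, not an official validator.

```python
import re

# Protein identifier: three letters, five digits, a dot, then a version number.
PROTEIN_ID = re.compile(r"^[A-Z]{3}\d{5}\.\d+$")


def split_versioned_id(identifier):
    """Split 'Accession.Version' (or a protein ID) into (stable part, version)."""
    stable, dot, version = identifier.rpartition(".")
    if not dot or not stable or not version.isdigit():
        raise ValueError(f"not a versioned identifier: {identifier!r}")
    return stable, int(version)


def is_protein_id(identifier):
    """Return True if the identifier matches the 3+5 format plus version."""
    return PROTEIN_ID.match(identifier) is not None
```

Comparing only the stable part of two identifiers shows whether they refer to the same record; comparing the version shows whether the sequence has changed.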
Transition Phase
3.1 Checking Sequence Data For Vector Contamination
We urge submitters to remove vector contamination from sequence data
before submitting to the database. To assist submitters the EBI is providing
a Vector Screening Service using the latest implementation of the BLAST
algorithm and a special sequence databank known as EMVEC. EMVEC is an extraction
of sequences from the SYNthetic division of EMBL containing more than 2000
sequences commonly used in cloning and sequencing experiments. EMVEC is
by no means a complete vector databank but EBI believes it is representative
of the kind of material used in modern sequencing and should be useful
to submitters. The databank will be updated with each release of EMBL and
made publicly available on the EBI's ftp (ftp.ebi.ac.uk) server for those
who wish to have it.
The interactive WWW service can be found at:
http://www.ebi.ac.uk/submission/webin.html
The results will list the sequences producing significant alignments,
together with associated information such as vector name, score and alignment.
3.2 WebIn - WWW Sequence Submission System
WebIn is the preferred WWW Sequence Submission System for submitting
nucleotide sequence data and associated biological information to the EMBL
Nucleotide Sequence Database at the European Bioinformatics Institute (EBI).
To access WebIn at the EBI please use the following URL:
http://www.ebi.ac.uk/submission/webin.html
Database entries created by the new WWW submission tool and submitted
to the EMBL Nucleotide Sequence Database at the EBI will be exchanged and
shared among the International Collaboration of Nucleotide Sequence Databases
(DDBJ/EMBL/GenBank).
WebIn guides the user through a sequence of WWW forms allowing the submission
of sequence data and descriptive information in an interactive and easy
way. All the information required to create a database entry will be collected
during this process:
1 Submitter Information
EBI staff will process data submissions within 2 working days and send
the database accession number(s) assigned to your data to your e-mail address.
To make bulk sequence submission more efficient and less time-consuming
for submitters, a prototype of a new web-based bulk sequence submission
system can now be accessed from WebIn. Authors planning to submit a large
number of similar sequences (i.e. more than 25) are presented with an
option for "Bulk WebIn Submission". When choosing the bulk path, submitters
follow the usual WebIn submission procedure until they have completed a
first, single representative sequence. During the submission process
database staff will interactively assist in making the submission of this
specific data as convenient as possible, thus saving the author the time
and effort required to complete numerous submission events individually.
Alternatively, authors planning to submit very large numbers
of similar sequences should contact the database
Please contact database staff if you require further information.
e-mail: datasubs@ebi.ac.uk
Tel: +44-1223-494499
3.4 SEQUIN - Stand-alone Submission Program
Sequin is the multi-platform (Mac/PC/Unix) stand-alone software tool
developed by the NCBI for submitting entries to the EMBL, GenBank, or DDBJ
sequence databases. The Sequin program, together with detailed downloading
and installation instructions and general information, is available from
the EBI via WWW browser and anonymous FTP.
http://www.ebi.ac.uk/Sequin/
3.5 Sequence Alignment Submissions
The EBI accepts submissions of alignment data (e.g. from phylogenetic
and population analyses) of either nucleotide or amino-acid sequences.
Each alignment is assigned an alignment number (e.g. ds38200) by database
staff, which is then communicated to the submitter. We suggest that this
number be quoted in the resulting publication.
Alignment data and associated information are made available via EBI's
network servers (see below).
ALIGNMENT FORMATS:
As well as your alignment data, we require information describing your
alignment (see table below). Please provide information for all fields.
We suggest submission in standard alignment formats (e.g. NEXUS, PHYLIP,
ClustalW) or as Sequin output.
NOTE 1: Alignments can be created within Sequin or imported into
Sequin from files in a standard alignment format like NEXUS or PHYLIP.
NOTE 2: If reporting new primary sequence data, we suggest that
you submit the complete individual sequence files (e.g. via Sequin or Webin),
in order to include the sequence data as individual entries in the database.
If gaps have been introduced for the alignment, please leave them out when
sending the individual sequence files.
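Removing alignment gaps before preparing the individual sequence files, as requested in NOTE 2, can be scripted. The following Python sketch assumes that gaps are written as a dash '-'; the function name and the gap_chars parameter are hypothetical, and other gap symbols would need to be passed in explicitly.

```python
def strip_gaps(aligned_sequence, gap_chars="-"):
    """Return the sequence with all alignment gap characters removed.

    Assumption: gaps are marked with '-'; pass other symbols
    (for example '.') via gap_chars if the alignment uses them.
    """
    return "".join(ch for ch in aligned_sequence if ch not in gap_chars)


# Example: one aligned row and the ungapped sequence to submit.
aligned = "ACG--TTA-G"
ungapped = strip_gaps(aligned)
```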
SENDING ALIGNMENT DATA to the EMBL Nucleotide Sequence Database
ACCESSING ALIGNMENT DATA
EBI WWW server: ftp://ftp.ebi.ac.uk/pub/databases/embl/align/
3.6 Further Submission Information
To help and guide submitters in annotating their sequences, two new
online guides are now available via hyperlinks from within WEBIN:
EMBL Annotation Examples and EMBL Features and Qualifiers. The annotation
examples consist of a list of EMBL approved feature table annotations for
common biological sequences. The EMBL Features and Qualifiers is a complete
list of feature table key and qualifier definitions providing detailed
descriptions, mandatory and optional qualifiers and usage examples.
For further information on submission of sequence data to the EMBL Nucleotide
Sequence Database please access:
http://www.ebi.ac.uk/emblSubmission/index.html
or contact database staff at:
EMBL Nucleotide Sequence Submissions
4 CITING THE EMBL NUCLEOTIDE SEQUENCE DATABASE
We encourage authors to include a reference to the EMBL Database in
publications related to their research.
When citing data in the EMBL Database, we suggest giving the corresponding
primary accession number(s) and the publication in which the sequence first
appeared. For unpublished data, we suggest contacting the original submitters
for recent publication information or revisions of the data.
We also suggest providing a reference for the EMBL Database itself.
Our recent publication in Nucl. Acids Res., 1999, Vol. 27 (1), 18-24,
which describes the EMBL database, should be cited:
Stoesser, G., Tuli, M.A., Lopez, R., and Sterk, P. The EMBL Nucleotide
Sequence Database. Nucl. Acids Res., 27:18-24(1999)
Example: The numbers in parentheses refer to the REFERENCE in
the EMBL database entry, and to the EMBL citation above.
"Sequence entry X56734 (1) has been retrieved from the EMBL Database
(2) and showed significant sequence similarity to ..."
(1) Oxtoby E., et al., Plant Mol. Biol. 17:209-219(1991).
Computer users with access to the Internet (directly or via a gateway) can
obtain copies of database entries, documentation or the data submission
form by sending commands to a file server running on the computer systems
at EBI. New and updated EMBL nucleotide sequence entries are made available
on the server on a daily basis.
To use this facility, send file server commands (as electronic mail)
to the address Netserv@EBI.AC.UK.
Each line of the mail message should consist of a single file server request.
The most important file server request, to get started, is:
HELP
If the file server receives this command, it will return a help file
to the sender, explaining in some detail how to use the facility. For example,
to request a copy of the data submission form and the nucleotide sequence
with accession number X12399, use the commands:
GET DOC:DATASUB.TXT
The file server offers various other services (e.g. access to nucleotide
and protein sequence data, protein structure data, software), details of
which are provided in the HELP file.
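A file server request is simply a mail message with one command per line. As an illustration, the Python sketch below builds such a message with the standard email library. The function name build_netserv_request and the sender address are hypothetical, and the sketch only constructs the message; it does not send it.

```python
from email.message import EmailMessage


def build_netserv_request(sender, commands, server="Netserv@EBI.AC.UK"):
    """Build a mail message with one file server command per line."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = server
    # Each line of the body is a single file server request.
    message.set_content("\n".join(commands) + "\n")
    return message


# Request the data submission form and the entry with accession number X12399.
request = build_netserv_request(
    "user@example.org",  # hypothetical sender address
    ["GET DOC:DATASUB.TXT", "GET NUC:X12399"],
)
```

The resulting message object can be handed to any mail client or library that sends messages on the user's system.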
An alternative method of accessing the EBI archives is to use the Internet
file transfer protocol (ftp). Researchers with direct access to the Internet
can use the FTP program on their local machine to connect to the host FTP.EBI.AC.UK
and enter the user name "anonymous" and their email address as password.
The directory pub/help contains detailed information about the data available
from the EBI anonymous FTP server which includes the complete EMBL Nucleotide
Sequence Database releases as well as daily and weekly updates and a cumulative
update file (in UNIX-compressed format) in the following directories:
EMBL quarterly release: pub/databases/embl/release
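The anonymous FTP login described above can also be scripted. The following Python sketch uses the standard ftplib module; the function names are hypothetical, the host and directory are the ones given in this section, and the e-mail address passed as password is supplied by the caller.

```python
from ftplib import FTP


def connect_anonymous(email, host="FTP.EBI.AC.UK"):
    """Log in as 'anonymous', giving the e-mail address as password."""
    ftp = FTP(host)
    ftp.login(user="anonymous", passwd=email)
    return ftp


def list_directory(ftp, directory="pub/databases/embl/release"):
    """Change to the given directory and return the names of its files."""
    ftp.cwd(directory)
    return ftp.nlst()
```

A session would call connect_anonymous with the user's own e-mail address, pass the returned connection to list_directory, and close the connection with its quit() method when finished.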
5.3 World Wide Web (WWW) Server
The EBI operates a WWW server with URL http://www.ebi.ac.uk/ which gives
access to information about the EBI and its products and services. Nucleotide
sequences can be retrieved by a simple query by accession number, or more
complex queries can be constructed using an SRS WWW databank browser. Nucleotide
sequences can also be submitted to the database using the interactive submission
system WebIn at URL:
http://www.ebi.ac.uk/emblSubmission/webin.html
The EBI offers two network servers for sequence similarity searches
via electronic mail or interactive WWW forms:
The documentation files are in text format ending with a file
extension of '.txt'.
SRS indices can be found on the FTP server in the srs directory
The release contains the files shown below, in the order listed. File
sizes are given as numbers of records.
DATABASE GROWTH TABLE
The following table shows the growth of the EMBL Nucleotide Sequence
Database at each release.
During a transition phase (6 months) both the old (NI, PID) and new
(SV and /protein_id) forms of identifiers will be provided (see
example below). Starting from Release 60 (September 1999) only the new
form of identifiers will be included.
http://www.ebi.ac.uk/blastall/vectors.html
2 Release Date Information
3 Sequence Data, Description and Source Information
4 Reference Citation Information
5 Feature Information (e.g. coding regions, regulatory signals etc.)
before submitting the data. Database staff will then assist in making
the submission of this data as convenient as
possible, thus saving the author the time and effort required to complete
numerous submission events individually.
When contacting database staff, authors should indicate the number
of sequences they plan to submit. Database staff will create a series
of templates and send these to the author, who completes them with just
the information unique to each sequence. These templates, once
resubmitted, will then be processed en masse by database curators.
Fax: +44-1223-494472
ftp://ftp.ebi.ac.uk/pub/software/sequin/
A sample alignment in NEXUS format can be viewed at ftp://ftp.ebi.ac.uk/pub/databases/embl/align/ds32096.dat
Description Field         | Information required
--------------------------|---------------------------------------------
TITLE:                    | Title of alignment
SUBMITTER:                | Name, Affiliation, Phone, Fax, Email
RELEASE DATE:             | Public Immediately / if Confidential please provide hold date
CITATION:                 | If known please provide complete Author list, Title, Journal, Year of publication, Page numbers
ALIGNMENT METHOD:         | Method of alignment and format submitted, parameters of alignment sequences used (if appropriate)
DESCRIPTION OF SYMBOLS:   | e.g. Gaps indicated by a dash '-'
DESCRIPTION OF ALIGNMENT: | Describe sequences aligned, including accession numbers (if known) and abbreviation of clones or taxon used in alignment file. If your alignment contains sequences derived from multiple taxonomic sources, please provide the full name of each organism
FILE FORMAT:
We are currently updating and improving both the access to and alignment
output of this archive due to an increase in the submission of alignment
data. The compilation of text files and the issue of format standardisation
are undergoing review and are being discussed by the database staff, external
users and experts in the field.
Sequence Alignment Data can be sent to the Nucleotide Sequence Database
by Electronic mail to DATASUBS@EBI.AC.UK
Alignment data and additional information are available via the EBI
servers:
EBI FTP server:
by anonymous FTP from FTP.EBI.AC.UK in directory /pub/databases/embl/align
EBI File server:
by sending an e-mail message to netserv@ebi.ac.uk
including the line HELP ALIGN or
GET ALIGN:DS8200.DAT
3.6.1 New Annotation Guides
e-mail: datasubs@EBI.AC.UK
telephone: +44-1223-494499
telefax: +44-1223-494472
(2) Stoesser, G., et al., Nucl. Acids Res. 27:18-24(1999)
GET NUC:X12399
EMBL updates: pub/databases/embl/new
6 DISTRIBUTION FILES
FASTA
based on W. Pearson's FASTA algorithm. Allows local similarity searches
of protein and nucleotide sequence databases. Send "help" to fasta@ebi.ac.uk
or use URL http://www.ebi.ac.uk/fasta33/
BLAST
based on the NCBI and WU-BLAST software. Send "help" to blast@ebi.ac.uk
or use URL http://www.ebi.ac.uk/blast2/
BLITZ
BLITZ allows very fast searches of protein sequence databases for local
similarities using an exhaustive Smith-Waterman matching algorithm. Compugen's
BIC_SW software is running on a Biocellerator (BIC-2). Send "help" to Blitz@EBI.AC.UK
or use URL http://www.ebi.ac.uk/bic_sw/
(relnotes.txt, usrman.txt)
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/srs/.
Please read the README file for details.
APPENDIX A
File Number | File Name    | Description                      | Number of Records
------------|--------------|----------------------------------|------------------
1           | USRMAN.TXT   | User Manual                      | 1550
2           | RELNOTES.TXT | Release Notes (this document)    | 1027
3           | DATASUB.TXT  | Data Submission Form             | 330
4           | DATASUB.DOC  | Data Submission Documentation    | 311
5           | UPDATE.DOC   | Data Update Form                 | 86
6           | FTABLE.DOC   | Feature Table Documentation      | 447
7           | ACNUMBER.NDX | Accession Number Index           | 3996712
8           | DIVISION.NDX | Division Index                   | 23
9           | SHORTDIR.NDX | Short Directory Index            | 9221499
10          | SPECIES.NDX  | Species Index                    | 207197
11          | CITATION.NDX | Citation Index                   | 406936
12          | KEYWORD.NDX  | Keyword Index                    | 1541681
13          | EST1.DAT     | EST Sequences                    | 7062409
14          | EST2.DAT     | EST Sequences                    | 7033968
15          | EST3.DAT     | EST Sequences                    | 7113147
16          | EST4.DAT     | EST Sequences                    | 6967045
17          | EST5.DAT     | EST Sequences                    | 7021242
18          | EST6.DAT     | EST Sequences                    | 7114459
19          | EST7.DAT     | EST Sequences                    | 6823278
20          | EST8.DAT     | EST Sequences                    | 6851996
21          | EST9.DAT     | EST Sequences                    | 6705558
22          | EST10.DAT    | EST Sequences                    | 7008292
23          | EST11.DAT    | EST Sequences                    | 7102351
24          | EST12.DAT    | EST Sequences                    | 7055036
25          | EST13.DAT    | EST Sequences                    | 6448557
26          | EST14.DAT    | EST Sequences                    | 5815070
27          | EST15.DAT    | EST Sequences                    | 5798907
28          | EST16.DAT    | EST Sequences                    | 6853928
29          | EST17.DAT    | EST Sequences                    | 5822371
30          | EST18.DAT    | EST Sequences                    | 5640234
31          | EST19.DAT    | EST Sequences                    | 5633560
32          | EST20.DAT    | EST Sequences                    | 6671731
33          | EST21.DAT    | EST Sequences                    | 6501889
34          | EST22.DAT    | EST Sequences                    | 6955080
35          | EST23.DAT    | EST Sequences                    | 7167734
36          | EST24.DAT    | EST Sequences                    | 7038660
37          | EST25.DAT    | EST Sequences                    | 6324337
38          | EST26.DAT    | EST Sequences                    | 1094595
39          | FUN.DAT      | Fungi Sequences                  | 2641735
40          | GSS1.DAT     | Genome Survey Sequences          | 6406607
41          | GSS2.DAT     | Genome Survey Sequences          | 6293119
42          | GSS3.DAT     | Genome Survey Sequences          | 6455019
43          | GSS4.DAT     | Genome Survey Sequences          | 6828223
44          | GSS5.DAT     | Genome Survey Sequences          | 6625917
45          | GSS6.DAT     | Genome Survey Sequences          | 6737196
46          | GSS7.DAT     | Genome Survey Sequences          | 6471607
47          | GSS8.DAT     | Genome Survey Sequences          | 3566273
48          | HTG.DAT      | High Throughput Genome Sequences | 6446495
49          | HUM1.DAT     | Human Sequences                  | 6715037
50          | HUM2.DAT     | Human Sequences                  | 3622945
51          | HUM3.DAT     | Human Sequences                  | 2150782
52          | HUM4.DAT     | Human Sequences                  | 1585576
53          | INV.DAT      | Invertebrate Sequences           | 5367594
54          | MAM.DAT      | Other Mammal Sequences           | 1444791
55          | ORG.DAT      | Organelle Sequences              | 3208680
56          | PATENT.DAT   | Patent Sequences                 | 5710464
57          | PHG.DAT      | Bacteriophage Sequences          | 180433
58          | PLN.DAT      | Plant Sequences                  | 4239644
59          | PRO1.DAT     | Prokaryote Sequences             | 6167851
60          | PRO2.DAT     | Prokaryote Sequences             | 1094457
61          | ROD.DAT      | Rodent Sequences                 | 3909502
62          | STS.DAT      | STS Sequences                    | 4827205
63          | SYN.DAT      | Synthetic Sequences              | 315717
64          | UNC.DAT      | Unclassified Sequences           | 112675
65          | VRL.DAT      | Viral Sequences                  | 5495692
66          | VRT.DAT      | Other Vertebrate Sequences       | 1366461
Release | Month   | Entries | Nucleotides
--------|---------|---------|------------
1       | 06/1982 | 568     | 585433
2       | 04/1983 | 811     | 1114447
3       | 12/1983 | 1481    | 1654863
4       | 08/1984 | 1698    | 2147205
5       | 04/1985 | 2378    | 2874493
6       | 08/1985 | 4835    | 4567592
7       | 12/1985 | 5789    | 5622638
8       | 04/1986 | 6395    | 6353040
9       | 09/1986 | 7630    | 7813214
10      | 12/1986 | 8817    | 9766948
11      | 04/1987 | 11621   | 12189783
12      | 07/1987 | 12706   | 13638061
13      | 10/1987 | 14397   | 16023478
14      | 01/1988 | 15344   | 17272160
15      | 05/1988 | 17961   | 20318442
16      | 08/1988 | 19592   | 22625941
17      | 11/1988 | 20695   | 24211054
18      | 02/1989 | 22938   | 27249830
19      | 05/1989 | 24365   | 29066676
20      | 08/1989 | 26223   | 31240948
21      | 11/1989 | 28679   | 34748087
22      | 02/1990 | 31508   | 38165786
23      | 05/1990 | 34902   | 42923803
24      | 08/1990 | 37784   | 47354438
25      | 11/1990 | 41580   | 52900354
26      | 02/1991 | 43745   | 55859549
27      | 05/1991 | 46871   | 59915244
28      | 09/1991 | 54558   | 70448052
29      | 12/1991 | 5765    | 75400487
30      | 03/1992 | 63378   | 83574342
31      | 06/1992 | 72481   | 94390065
32      | 09/1992 | 79377   | 101292310
33      | 12/1992 | 89100   | 111413979
34      | 03/1993 | 99591   | 121420828
35      | 06/1993 | 108973  | 131880111
36      | 09/1993 | 127933  | 145401156
37      | 12/1993 | 146576  | 158171400
38      | 03/1994 | 167777  | 177550115
39      | 06/1994 | 182615  | 192195819
40      | 09/1994 | 209352  | 211017104
41      | 12/1994 | 230950  | 226259607
42      | 03/1995 | 303206  | 262559786
43      | 06/1995 | 420111  | 315840053
44      | 09/1995 | 506190  | 363273777
45      | 12/1995 | 622566  | 427620278
46      | 03/1996 | 701246  | 473691480
47      | 06/1996 | 827174  | 550739395
48      | 09/1996 | 928067  | 608931850
49      | 12/1996 | 1047263 | 696183789
50      | 03/1997 | 1187455 | 789755858
51      | 06/1997 | 1432941 | 931351601
52      | 10/1997 | 1787004 | 1181167498
53      | 12/1997 | 1917868 | 1281391651
54      | 03/1998 | 2125225 | 1427634373
55      | 06/1998 | 2330040 | 1607673907
56      | 09/1998 | 2689618 | 1904091473
57      | 12/1998 | 3046471 | 2164718256
58      | 03/1999 | 3272064 | 2355200790
59      | 06/1999 | 3952878 | 2924568545
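
The growth between two releases can be worked out directly from the figures in this table. The Python sketch below uses the Release 58 and Release 59 rows; the function and variable names are hypothetical. The nucleotide figure comes to roughly 24%, in line with the increase quoted at the start of these notes, while the number of entries grew by a somewhat smaller percentage.

```python
def percent_growth(previous, current):
    """Percentage increase from one release to the next."""
    return (current - previous) / previous * 100.0


# Figures taken from the growth table rows for Releases 58 and 59.
RELEASE_58 = {"entries": 3272064, "nucleotides": 2355200790}
RELEASE_59 = {"entries": 3952878, "nucleotides": 2924568545}

growth = {
    key: percent_growth(RELEASE_58[key], RELEASE_59[key])
    for key in ("entries", "nucleotides")
}

for key, value in growth.items():
    print(f"{key}: {value:.1f}% increase over Release 58")
```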