New Transcriptome Shotgun Assembly (TSA) dataclass
From June 2008, EMBL will introduce a new dataclass for Transcriptome
Shotgun Assembly (TSA) data. TSA entries will be available as part of
future releases, update products and SRS (srs.ebi.ac.uk) during 2008.
The structure of TSA entries is similar to that of TPA entries, but with
a modified AH line; instead of 'TPA_SPAN', the term 'LOCAL_SPAN' will be
used.
Done
December 2008
AH line change in TPA entries
In December 2008, the AH line in TPA entries will be changed so
that 'TPA_SPAN' is replaced with 'LOCAL_SPAN', making the term uniform
across all dataclasses with AH lines.
Done
October 2008
Changes to /mol_type qualifier
In October 2008, a new /mol_type qualifier value, 'transcribed RNA', will
be added and the values 'snoRNA', 'snRNA', 'scRNA', 'pre-RNA' and 'tmRNA'
will be dropped.
Done
Now and December 2008
Change to information content of AC lines
In some entries, AC lines cite not only primary, or primary and secondary,
accession numbers, but also accession numbers of CON entries that
have been assembled from the entry in which the AC line appears. To
make usage of the AC line consistent, we will disallow
references to CON entries henceforth and remove legacy instances by
December 2008.
Done
December 2008
Removal of <er> publication type in RL lines
In December 2008, the electronic publication token in RL lines ('<er>')
will become invalid. Legacy records will be converted to conventional
article citations where possible.
Done
15th of October
/specific_host qualifier will become /host qualifier
For clarity, the /specific_host qualifier will be replaced with a new
qualifier, /host, on the 15th of October. Legacy records will be updated
to reflect this change.
Done
October 2008
Removal of the /virion qualifier
The /virion qualifier will become illegal in October 2008. The /proviral
qualifier will remain in use.
Done
October 2008
Change to the /frequency qualifier
In October 2008, we will permit additional value formats for the /frequency
qualifier, besides decimal fractions, in order to represent sample
size; an example of such a value format is '2 in 25'.
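As a rough illustration of the two value formats, the following Python sketch accepts either a decimal fraction or a count such as '2 in 25'. The function name parse_frequency and both patterns are assumptions made for this example; the announcement gives only the single sample value and no formal grammar.

```python
import re

# Hypothetical patterns: the announcement only shows the example '2 in 25'.
_COUNT = re.compile(r"(\d+) in (\d+)")
_DECIMAL = re.compile(r"\d*\.\d+|\d+")


def parse_frequency(value):
    """Return a /frequency value as a float between 0 and 1."""
    value = value.strip()
    count = _COUNT.fullmatch(value)
    if count:
        observed, sample_size = int(count.group(1)), int(count.group(2))
        if sample_size == 0 or observed > sample_size:
            raise ValueError(f"implausible count: {value!r}")
        return observed / sample_size
    if _DECIMAL.fullmatch(value):
        frequency = float(value)
        if frequency <= 1:
            return frequency
    raise ValueError(f"unrecognised /frequency value: {value!r}")


print(parse_frequency("2 in 25"))
print(parse_frequency("0.08"))
```

Converting the count form to a fraction discards the sample size, so a parser that needs it would keep the two integers instead.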
Done
October 2008
Removal of /cons_splice
In October 2008, the /cons_splice qualifier will become illegal.
Done
October 2008
Expected further feature table format changes
In October 2008, we expect further minor changes to a number of
qualifiers. In particular, we expect to add the new qualifier
/mating_type and to modify usage of /gene, /germline and /inference.
Details of changes will be made available on this page shortly.
Done
December 2007
New citation cross-reference resource, AGRICOLA, to appear on RX line
AGRICOLA is a bibliographic database of over 4 million publications and
resources encompassing all aspects of agriculture and allied
disciplines, including animal and veterinary sciences, entomology, plant
sciences, forestry, aquaculture and fisheries, farming and farming
systems, agricultural economics, extension and education, food and human
nutrition, and earth and environmental sciences. AGRICOLA is maintained
by the US National Agricultural Library (NAL) of the US Department of
Agriculture (USDA). Please see http://agricola.nal.usda.gov/ for more
details.
Done
December 2007
Change to FTP organisation, affecting ANN dataclass entries
Annotated constructed (ANN dataclass) entries have been included in the
EMBL release from release 92. To reflect this, from December 2007, the
annotated_con FTP directory
(ftp://ftp.ebi.ac.uk/pub/databases/embl/annotated_con) will be organised
in the same way as the CON and TPA directories
(ftp://ftp.ebi.ac.uk/pub/databases/embl/con and
ftp://ftp.ebi.ac.uk/pub/databases/embl/tpa, respectively). Release files
will no longer be present in the annotated_con directory (but rather in
the release directory as rel_ann_*.dat). List and cumulative files will
be present in the annotated_con directory.
Done
December 2007
Change to WGS Master flatfile CON linetype
Currently, ANN and CON accession ranges in WGS Masters are reported
separately. Because ANN and CON accession numbers can be interleaved,
this significantly increases the size of some WGS Masters (e.g.
AACY00000000). Therefore, ANN accession ranges will be merged together
with CON accession ranges in a single line, starting with the text 'CON'.
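To see why merging shortens the listing, consider the following Python sketch. It is only an illustration: the tuple representation, the function name merge_ranges and the "XX" accession prefix are hypothetical, and the actual layout of the 'CON' line is not modelled.

```python
def merge_ranges(ranges):
    """Merge overlapping or directly adjacent (prefix, first, last) ranges."""
    merged = []
    for prefix, first, last in sorted(ranges):
        previous = merged[-1] if merged else None
        if previous and previous[0] == prefix and first <= previous[2] + 1:
            # Extends or touches the previous range: grow it in place.
            previous[2] = max(previous[2], last)
        else:
            merged.append([prefix, first, last])
    return [tuple(r) for r in merged]


# Hypothetical interleaved accession ranges (prefix "XX" is made up).
con_ranges = [("XX", 1, 3), ("XX", 7, 9)]
ann_ranges = [("XX", 4, 6), ("XX", 10, 12)]

print(len(con_ranges) + len(ann_ranges), "ranges when reported separately")
print(merge_ranges(con_ranges + ann_ranges))
```

When the two sets alternate, each separate listing is fragmented into many short ranges, whereas the combined listing collapses into a few long ones.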
Done
October 2007
New feature "tmRNA" and qualifier /tag_peptide
A new feature, "tmRNA" (definition: "transfer messenger RNA"), is going to
be introduced into the Feature Table document in October 2007, to be
implemented in December 2007. The "tmRNA" feature will have a new qualifier,
/tag_peptide.
Done
October/December 2007
New qualifiers /culture_collection and /bio_material
New qualifiers /culture_collection and /bio_material will be introduced
into the Feature Table document in October 2007 and implemented in
December 2007.
New qualifier structures:
/culture_collection="<institution-code>:[<collection-code>:]<culture_id>"
where the <collection-code> token is optional
/bio_material="[<institution-code>:[<collection-code>:]]<material_id>"
where the <collection-code> and <institution-code> tokens are optional
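The two structures differ only in whether the institution code may be omitted. The Python sketch below splits a value into its tokens under that reading; the function names and the sample values are illustrative assumptions, and no check is made against any registry of institution or collection codes.

```python
def _split_tokens(value, institution_required):
    """Split 'inst[:coll]:id' (or a bare 'id') into a 3-tuple."""
    parts = value.split(":")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"malformed value: {value!r}")
    if len(parts) == 1:
        if institution_required:
            raise ValueError(f"institution code missing: {value!r}")
        return (None, None, parts[0])
    if len(parts) == 2:
        # Two tokens: institution code and identifier, no collection code.
        return (parts[0], None, parts[1])
    return tuple(parts)


def parse_culture_collection(value):
    """<institution-code>:[<collection-code>:]<culture_id>"""
    return _split_tokens(value, institution_required=True)


def parse_bio_material(value):
    """[<institution-code>:[<collection-code>:]]<material_id>"""
    return _split_tokens(value, institution_required=False)


# Sample values are made up for illustration.
print(parse_culture_collection("ATCC:26370"))
print(parse_bio_material("INST:COLL:12345"))
print(parse_bio_material("12345"))
```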
Done
September 2007
Annotated CON (ANN) entries will be included in the quarterly release
starting from release 92 in September 2007.
Done
September 2007
Line type change for expanded CON dataclass entries
CC lines are currently in use for the representation of assembly
information in expanded CON dataclass entries. For consistency between
unexpanded and expanded CON dataclass entries, assembly information will
be represented in CO lines in expanded CON dataclass entries.
Done
October 2007
"old_sequence" feature becomes illegal for new entries
"old_sequence" feature becomes illegal for new entries starting from
October 2007, with the new edition of the Feature Table Document.
Done
October 2007
"5'clip" and "3'clip" features become illegal
"5'clip" and "3'clip" features become illegal starting from October
2007, with the new edition of the Feature Table Document. Existing
instances of those features are going to be retrofitted.
Done
October 2007
/organism qualifier becomes illegal on "misc_recomb" feature
The /organism qualifier becomes illegal on the "misc_recomb" feature starting
from October 2007, with the new edition of the Feature Table Document.
Done
October 2007
/operon qualifier becomes legal on "protein_bind" feature
The /operon qualifier becomes legal on the "protein_bind" feature starting from
October 2007, with the new edition of the Feature Table Document.
Done
October 2007
/specimen_voucher qualifier becomes structured
The /specimen_voucher qualifier becomes structured starting from October 2007.
New qualifier structure:
/specimen_voucher="[<institution-code>:[<collection-code>:]]<specimen_id>"
where both the <collection-code> and <institution-code> tokens are
optional.
Because the first and second tokens are optional, no retrofit is
required for the existing entries.
Done
October/December 2007
New feature "ncRNA" and qualifier /ncRNA_class
New feature "ncRNA" (definition: "a non-protein-coding gene, other than
ribosomal RNA and transfer RNA, the functional molecule of which is the
RNA transcript") is going to be introduced into the Feature Table
document in October 2007, to be implemented in December 2007. This
feature replaces the scRNA, snRNA and snoRNA features; it also replaces
the misc_RNA feature where it is currently used to annotate microRNAs.
The "ncRNA" feature will have the mandatory qualifier /ncRNA_class.
Done
March 2007
New XML attribute
A new XML attribute, projectAccession, will be introduced into the EMBL XML
entry element to contain the INSDC-assigned ID for the sequencing project.
At the same time, EMBL XML will start supporting the entry types ANN (annotated constructed entry) and TPA (Third Party Annotation entry).
Done
March 2007
March release of EMBL database - New line type for project IDs
A new line type, with the two-character line type code PR, will be
introduced into EMBL flatfiles with the March release of the EMBL database.
The line will contain the INSDC-assigned ID for the sequencing project.
Line structure for the PR lines:
PR Project:17285;
where "17285" is the project identifier (integer)
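A minimal Python sketch for reading the project identifier from such a line follows. The regular expression is an assumption derived from the single example above, and the function name project_id is hypothetical.

```python
import re

# Assumed pattern, based only on the example 'PR Project:17285;'.
_PR_LINE = re.compile(r"PR\s+Project:(\d+);\s*")


def project_id(line):
    """Return the integer project identifier, or None if not a PR line."""
    match = _PR_LINE.fullmatch(line)
    return int(match.group(1)) if match else None


print(project_id("PR   Project:17285;"))
print(project_id("AC   Y00001;"))
```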
Done
December 2006
Creation of a new division - Transgenic (TGN)
A new database taxonomic division, Transgenic (TGN), will be created in
the December 2006 release. Entries representing transgenic organisms
(indicated by the inclusion of the /transgenic qualifier in one of the
source features), currently stored in the Synthetic (SYN) division, will
be stored in the new TGN division.
Done
December 2006
New qualifier /mobile_element and dropping of two existing qualifiers
A new qualifier, /mobile_element, will be introduced in December 2006 to
hold the type and name or identifier of the mobile
element described by the parent feature. At the same time, two
less generic qualifiers, /transposon and /insertion_seq, will
be dropped and all existing instances of them will be retrofitted to
make use of the new qualifier.
Done
October 2006
Amino Acid Abbreviation Change
The single-letter amino acid abbreviation "O" will be used to
represent pyrrolysine in CDS translations starting from October
2006.
Done
October 2006
Usage of qualifier /operon
Qualifier /operon will become valid on the "rRNA" feature.
Done
19 June 2006
EMBL database release 87 will become public on Monday, 19th June (afternoon)
The data will be published in the following FTP directory:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release
Changes to the release file names and changes to the ID line (described
below) will be implemented in this release.
The daily data distribution will start producing files with the new-style
ID line after release 87 is public; the first distribution is scheduled to
start on Monday 19th, and the first daily files with the new-format ID line
will appear on the FTP site on Tuesday, 20th June.
Done
June 2006
Release file names change
Starting from EMBL release 87 (June 2006), the naming of the release data files
will change in accordance with the new ID line structure (see relevant item).
Data will be split according to the data class and the taxonomic division.
The data file names will look as follows:
rel_dtc_tax_nn_rRN.dat
where
"dtc" is a three-letter lowercase abbreviation for the dataclass
"tax" is a three-letter lowercase abbreviation for the taxonomic division
"nn" is the number of the file in a particular sequence (starting from "01")
"RN" is the number of the release to which the file belongs
Examples:
rel_est_hum_01_r87.dat
rel_htg_mus_04_r87.dat
cum_est_hum_01_r87.dat
cum_htg_mus_04_r87.dat
Dataclass list: EST, GSS, HTC, HTG, PAT, STS, STD, TPA, CON
Taxonomic division list: HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL, PHG
File size will be kept under 4 Gb by regulating the number of entries in each file.
The file name change doesn't affect WGS data, indexes and accompanying documentation.
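Scripts that select files by dataclass or division may find it convenient to split the new names into their parts. The following Python sketch does so using the naming pattern, the examples and the two lists given above; the function name, the dictionary keys and the acceptance of more than two digits for the file number are assumptions.

```python
import re

DATACLASSES = {"EST", "GSS", "HTC", "HTG", "PAT", "STS", "STD", "TPA", "CON"}
DIVISIONS = {"HUM", "MUS", "ROD", "PRO", "MAM", "VRT", "FUN",
             "PLN", "ENV", "INV", "SYN", "UNC", "VRL", "PHG"}

# 'rel' and 'cum' prefixes are both taken from the examples above.
_NAME = re.compile(r"(rel|cum)_([a-z]{3})_([a-z]{3})_(\d{2,})_r(\d+)\.dat")


def parse_release_file_name(name):
    """Split e.g. 'rel_est_hum_01_r87.dat' into its components."""
    match = _NAME.fullmatch(name)
    if not match:
        raise ValueError(f"not a release data file name: {name!r}")
    prefix, dataclass, division, number, release = match.groups()
    dataclass, division = dataclass.upper(), division.upper()
    if dataclass not in DATACLASSES or division not in DIVISIONS:
        raise ValueError(f"unknown dataclass or division in {name!r}")
    return {
        "prefix": prefix,
        "dataclass": dataclass,
        "division": division,
        "file_number": int(number),
        "release": int(release),
    }


print(parse_release_file_name("rel_htg_mus_04_r87.dat"))
```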
Done
June 2006
ID line changes
With release 87 (available since June 2006), the format of the EMBL
flat file has changed: the ID line now has a different structure
(see below) and the SV line has been removed.
The changes affecting the ID line structure are:
All tokens will be separated by a semicolon.
The entry name will not be displayed, in its place there will be the
primary accession number.
The sequence version will be indicated.
The topology will be a separate token and will be indicated for both
circular and linear molecules.
Both the data class and the taxonomic divisions will be displayed.
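The changes listed above can be illustrated with a small parser for a new-style ID line. Note the assumptions: the token order, the presence of molecule type and sequence length tokens, and the sample line itself are not given in this announcement and are taken here as a plausible layout; the function name is hypothetical.

```python
def parse_id_line(line):
    """Parse an assumed new-style ID line into a dictionary."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # All tokens are separated by semicolons; the line ends with a full stop.
    tokens = [t.strip() for t in line[2:].strip().rstrip(".").split(";")]
    if len(tokens) != 7:
        raise ValueError(f"expected 7 tokens, found {len(tokens)}")
    accession, version, topology, mol_type, dataclass, division, length = tokens
    if not version.startswith("SV "):
        raise ValueError("sequence version token missing")
    if topology not in ("linear", "circular"):
        raise ValueError(f"unexpected topology: {topology!r}")
    return {
        "accession": accession,          # replaces the old entry name
        "version": int(version[3:]),     # formerly on the SV line
        "topology": topology,
        "mol_type": mol_type,
        "dataclass": dataclass,
        "division": division,
        "length": int(length.split()[0]),
    }


# Assumed sample line, for illustration only.
print(parse_id_line("ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."))
```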
The entry name will no longer be displayed in the ID line. Since EMBL
release 3 (Dec 1983), the stable identifier of an entry has been the primary
accession number.
A mapping file (entry name to accession number) will be provided in the future
for those entries where the entry name doesn't coincide with the accession number.
To give users a test dataset, one file with new-style ID lines called
new_id_line.test.gz was provided together with the March release of the
EMBL database. The file should be used for testing purposes only, i.e.
the data contained in it shouldn't be considered a part of the release;
the data will not be included into any of the release statistics.
In order to facilitate the changeover two small utilities were released:
'new2oldID.pl' and 'old2newID.pl'. They can be used to convert EMBL flat
files from the old to the new format and vice-versa.
In the same directory, a new version of SynCron tools for maintaining
synchronised copies of the EMBL database updates can be found.
Note: This version of SynCron will work only with the new ID line
format. Please switch to it now that EMBL release 87 is public.
Done
April 2006
Changes to the Feature Table Document: Chapter 3.5 "Location"
The use of the range (.) descriptor within location spans will no longer be legal
from April 2006.
Done
From October 2005
Changes to the Feature Table Document: Chapter 3.5 "Location"
Combinations of "join" and "order" operators in one location will be illegal
from October 2005.
The use of two identical location construction operators within one
location will be illegal from October 2005.
The usage of '^' will be restricted to adjacent nucleotides from October
2005.
The use of the range (.) descriptor within location spans will no longer be legal
from April 2006.
Attention: the date for this change has been changed from December 2005 to
March 2006.
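The four restrictions can be approximated with simple string checks, as in the Python sketch below. This is a rough illustration and not a location parser: it treats only "join" and "order" as location construction operators, ignores the special case of a '^' site spanning the origin of a circular molecule, and the function name and sample locations are assumptions.

```python
import re


def location_problems(location):
    """Return a list of rule violations found in a location string."""
    problems = []
    # Drop remote-entry prefixes such as 'J00194.1:' so that the version
    # dot is not mistaken for a range (.) descriptor.
    local = re.sub(r"[A-Za-z_]+\d+\.\d+:", "", location)
    joins = local.count("join(")
    orders = local.count("order(")
    if joins and orders:
        problems.append("join and order combined")
    if joins > 1 or orders > 1:
        problems.append("identical operator repeated")
    for left, right in re.findall(r"(\d+)\^(\d+)", local):
        if int(right) != int(left) + 1:
            problems.append("'^' between non-adjacent nucleotides")
    # A single dot between digits is the range descriptor; '..' is a span.
    if re.search(r"\d\.\d", local):
        problems.append("range (.) descriptor used")
    return problems


for example in ("join(1..5,order(7..9,12..20))", "123^130",
                "(23.45)..600", "join(12..78,134..202)"):
    print(example, location_problems(example))
```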
Done
March 2006
Release indices to be discontinued : March 2006 release of EMBL database
All release indices (files with names like *.ndx) apart from division.ndx will be discontinued starting from the March release of the EMBL database. Feedback is sought from users (http://www.ebi.ac.uk/support/).
Done
March 2006
Qualifier order change : March 2006 release of EMBL database
Changes - "source" feature to be added to all entries in EMBLCDS dataset
Shortly after release 86 of the EMBL database in March 2006, all
EMBLCDS entries will include the "source" feature with all the relevant
biological source information derived from the parent EMBL entry.
The EMBLCDS dataset can be found at ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/ and
is accessible via EBI SRS and web services.
Done
Dec 2005
ORG division to be dissolved
From Dec 2005, the ORG division of the EMBL database will be dissolved. The entries that currently form
the ORG division will be directed into the appropriate taxonomic divisions.
For continuity, an org.dat file will be created and placed into ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/ after each release.
Done
December 2005
The following new qualifiers were introduced into the "Feature Table document" in October 2005:
A new prefix (misc) will be introduced to mark those citations where no ISSN is assigned to
the publication, such as proceedings and abstracts.
Example:
RL (misc) Proc. 7th Int. Symp. Biolumin. Chemilumin. 7:142-145(1993).
Done
June release of EMBL database
Weekly index files to be dropped
After the June 2005 release of the EMBL database, weekly index files (files with
names like DD-MMM-YYYY.NEW and the file newentries.ndx in the directory
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/) will no longer be produced.
If you think that your work is going to be
affected by this, please write to us using the feedback form at
http://www.ebi.ac.uk/support/ (please select "EMBL" from
the list of options).
Done
June release of EMBL database
MEDLINE identifiers to be dropped
Starting from the June 2005 release of the EMBL database, MEDLINE identifiers
will only be printed in the flatfiles when no corresponding PUBMED id is
available.
In the majority of cases this means that there will be no MEDLINE identifiers in
the flatfiles, only PUBMED ids.
Done
June release of EMBL database
A new division, "ENV" (Environmental), will be created in the EMBL database.
Entries that have "environmental samples" in the taxonomic lineage and/or
the "/environmental_sample" qualifier will be placed in this division.
Done
March release of EMBL database
InterPro cross-references will be added to all of EMBL database.
Cross-referencing to InterPro is done via UniProt and Uniparc.
Uniparc is used to check for 100% protein sequence identity between the contents of /translation qualifier in EMBL entry and the sequence in UniProt entry; cross-references to InterPro are only inherited from those UniProt entries where the sequences are identical.
Done
March release of EMBL database
Secondary accession number ranges in AC line
Starting from the next release, consecutive secondary accession numbers in EMBL database flatfiles will be shown in the form of accession number ranges.
Example:
AC line that now appears:
AC Y00001; X00001; X00002; X00003; X00004; X00005;
will appear:
AC Y00001; X00001-X00005;
A mixture of ranges and single accession numbers will be possible.
AC Y00001; X00001-X00005; X00008; Z00001-Z00005;
The first item in the AC line is the primary accession number; the primary accession number of a given entry will not be displayed as a part of a range.
Note: lists of accession numbers will continue to be syntactically legal in EMBL flatfiles
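Parsers that expect one accession number per item will need to expand the ranges. The Python sketch below shows one way to do this; the function name is hypothetical, and it assumes that both ends of a range share the same letter prefix and the same number of digits, as in the examples above.

```python
import re

_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_ac_line(line):
    """Return all accession numbers on an AC line, expanding any ranges."""
    body = re.sub(r"^AC\s+", "", line.strip())
    accessions = []
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue
        if "-" not in item:
            accessions.append(item)
            continue
        first, last = item.split("-")
        m_first = _ACCESSION.fullmatch(first)
        m_last = _ACCESSION.fullmatch(last)
        if (not m_first or not m_last
                or m_first.group(1) != m_last.group(1)
                or len(m_first.group(2)) != len(m_last.group(2))):
            raise ValueError(f"cannot expand range: {item!r}")
        prefix, width = m_first.group(1), len(m_first.group(2))
        for number in range(int(m_first.group(2)), int(m_last.group(2)) + 1):
            accessions.append(f"{prefix}{number:0{width}d}")
    return accessions


print(expand_ac_line("AC Y00001; X00001-X00005; X00008; Z00001-Z00005;"))
```

The first element of the returned list is the primary accession number, since it is never displayed as part of a range.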