Downloading assembled and annotated sequences

Assembled and annotated sequences can be downloaded through the ENA Browser or using FTP.

This document provides instructions for FTP downloads only. The data classes referred to in this document are described here.

Downloading sequences via FTP

Assembled and annotated sequences are available for download in flat file format through FTP at: ftp://ftp.ebi.ac.uk/pub/databases/embl/. The directory structure and the file name conventions are described below.
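As a minimal sketch of how a download URL can be put together from the base address above, the following Python helper joins the base URL, a directory and a file name. The function names and the example file name are illustrative assumptions, not part of the ENA service; the download function needs network access and is shown only for orientation.

```python
import posixpath
import urllib.request

# Base address taken from the text above.
BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/embl/"


def build_url(directory, file_name, base_url=BASE_URL):
    """Join the FTP base URL, a directory and a file name into one URL."""
    path = posixpath.join(directory.strip("/"), file_name)
    return base_url.rstrip("/") + "/" + path


def download(directory, file_name, destination):
    """Fetch one file over FTP and save it locally (requires network access)."""
    urllib.request.urlretrieve(build_url(directory, file_name), destination)


# Hypothetical file name, used only to illustrate the URL layout.
print(build_url("release", "rel_std_hum_01_r120.dat.gz"))
```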

Directory Definition
release

A full release of entries is made every March, June, September and December. Genomic and transcriptomic contigs are available in their own subdirectories (see below).

The data files in this directory use the following naming convention:
rel_<data class>_<taxonomic division>_<number>_r<release number>.dat.gz
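The naming convention above can be split into its fields with a regular expression. This is a sketch under two assumptions that the text does not state: the data class and taxonomic division contain no underscores, and the file number and release number are plain digits. The file name in the example is hypothetical.

```python
import re

# Assumption: <data class> and <taxonomic division> contain no underscores,
# and <number> and <release number> consist of digits only.
RELEASE_NAME = re.compile(
    r"^rel_(?P<data_class>[^_]+)_(?P<division>[^_]+)_(?P<number>\d+)"
    r"_r(?P<release>\d+)\.dat\.gz$"
)


def parse_release_name(file_name):
    """Return the fields of a release file name, or None if it does not match."""
    match = RELEASE_NAME.match(file_name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    fields["release"] = int(fields["release"])
    return fields


# Hypothetical file name for illustration.
print(parse_release_name("rel_std_hum_01_r120.dat.gz"))
```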

release/tsa

Transcriptomic contigs included in the release are in the release/tsa directory.

The data files in this directory use the following naming convention:
tsa_<tsa prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large sets into several smaller files.
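Because the trailing number is optional, code that reads these names has to accept both forms. A possible sketch, assuming the TSA prefix and taxonomic division contain no underscores (the example names are hypothetical):

```python
import re

# Assumption: <tsa prefix> and <taxonomic division> contain no underscores.
TSA_NAME = re.compile(
    r"^tsa_(?P<prefix>[^_]+)_(?P<division>[^_]+)(?:_(?P<number>\d+))?\.dat\.gz$"
)


def parse_tsa_name(file_name):
    """Return the fields of a TSA file name; 'number' is None when absent."""
    match = TSA_NAME.match(file_name)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["number"] is not None:
        fields["number"] = int(fields["number"])
    return fields


# Hypothetical names: one undivided set and one part of a divided set.
print(parse_tsa_name("tsa_gaaa_inv.dat.gz"))
print(parse_tsa_name("tsa_gaaa_inv_2.dat.gz"))
```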

release/wgs

Genomic contigs included in the release are organised into subdirectories using the first two letters of the accession prefix under the release/wgs directory.

The data files in this directory use the following naming convention:
wgs_<accession prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large sets into several smaller files.
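The subdirectory rule and the naming convention for genomic contigs can be combined into a path. The sketch below keeps the accession prefix exactly as given, since the text does not say whether directory names are upper or lower case; the function names and the example prefix are illustrative assumptions.

```python
def wgs_directory(accession_prefix, root="release/wgs"):
    """Return the subdirectory named after the first two letters of the prefix."""
    if len(accession_prefix) < 2:
        raise ValueError("accession prefix must have at least two letters")
    # Assumption: the directory name uses the same case as the prefix.
    return f"{root}/{accession_prefix[:2]}"


def wgs_file_name(accession_prefix, division, number=None):
    """Build a file name; the number is appended only for divided sets."""
    suffix = "" if number is None else f"_{number}"
    return f"wgs_{accession_prefix}_{division}{suffix}.dat.gz"


# Hypothetical prefix and division for illustration.
print(wgs_directory("aaaa") + "/" + wgs_file_name("aaaa", "pro"))
```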

new

This directory contains entries created or updated after the latest release. Please note the symbolic links to the wgs directory.

The main data files in this directory use the following naming convention:
cum_<data class>_<taxonomic division>_<number>_r<release number>.dat.gz

wgs

This directory contains genomic contigs created or updated after the latest release.

The data files in this directory use the following naming convention:
wgs_<wgs prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large WGS sets into several smaller files.

wgs/masters

This directory contains all whole genomic or transcriptomic assembly master entries in a single file:

wgs_masters.dat.gz

con

This directory contains scaffolds (built from genomic or transcriptomic contigs) created or updated after the latest release.

The data files in this directory use the following naming convention:
cum_con_<taxonomic division>_<number>_r<release number>.dat.gz

expanded_con/release

This directory contains all scaffolds (built from genomic or transcriptomic contigs, with sequences and annotation extracted from the contigs) included in the latest release.

The data files in this directory use the following naming convention:
rel_con_<taxonomic division>_<number>_r<release number>.dat.gz

expanded_con/new

This directory contains all scaffolds (built from genomic or transcriptomic contigs, with sequences and annotation extracted from the contigs) created or updated after the latest release.

The data files in this directory use the following naming convention:
cum_exp_con_<taxonomic division>_<number>_r<release number>.dat.gz

Downloading coding, non-coding and rRNA sequences using FTP

Coding, non-coding and rRNA sequences are available for download in flat file and fasta formats through FTP at: ftp://ftp.ebi.ac.uk/pub/databases/ena/. The directory structure and the file name conventions are described below.

Directory Definition
coding/release

This directory contains all protein-coding features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.cds.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.cds.fasta.gz
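Once a gzipped fasta file has been downloaded, it can be read without unpacking it to disk first. The following is a minimal sketch using only the Python standard library; it assumes the usual fasta layout (a header line starting with ">" followed by sequence lines), and the function names are illustrative.

```python
import gzip


def read_fasta(stream):
    """Yield (header, sequence) pairs from an iterable of fasta text lines."""
    header, chunks = None, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path):
    """Open a .fasta.gz file in text mode and yield its records."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

For example, `for header, sequence in read_gzipped_fasta(path): ...` iterates over the records of a local copy of one of the fasta files.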

coding/update

This directory contains all protein coding features created or updated after the latest release. Same file naming conventions apply as above.

non-coding/release

This directory contains all non-coding features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.ncr.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.ncr.fasta.gz

non-coding/update

This directory contains all non-coding features created or updated after the latest release. Same file naming conventions apply as above.

rRNA/release

This directory contains all rRNA features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.rRNA.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.rRNA.fasta.gz

rRNA/update

This directory contains all rRNA features created or updated after the latest release. Same file naming conventions apply as above.

spacer/release

This directory contains all spacer (ITS, IGS, ETS) features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.spacer.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.spacer.fasta.gz

spacer/update

This directory contains all spacer features created or updated after the latest release. Same file naming conventions apply as above.
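The file name suffixes used in this section (.cds, .ncr, .rRNA and .spacer, each optionally followed by .fasta) are enough to tell the feature type and format of a file. A small sketch of that mapping follows; the function name and the example file names are hypothetical.

```python
# Suffix-to-directory mapping taken from the naming conventions above.
FEATURE_SUFFIXES = {
    "cds": "coding",
    "ncr": "non-coding",
    "rRNA": "rRNA",
    "spacer": "spacer",
}


def describe_feature_file(file_name):
    """Return (feature type, format) for a feature file name, or None."""
    if not file_name.endswith(".gz"):
        return None
    stem = file_name[: -len(".gz")]
    is_fasta = stem.endswith(".fasta")
    if is_fasta:
        stem = stem[: -len(".fasta")]
    name, _, suffix = stem.rpartition(".")
    if not name or suffix not in FEATURE_SUFFIXES:
        return None
    return FEATURE_SUFFIXES[suffix], "fasta" if is_fasta else "flat file"


# Hypothetical file names for illustration.
print(describe_feature_file("rel_std_hum_01_r120.cds.fasta.gz"))
print(describe_feature_file("rel_std_hum_01_r120.rRNA.gz"))
```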

FTP mirror sites

Country URL
UK ftp://ftp.ebi.ac.uk/pub/databases/embl/release
France ftp://ftp-bips.u-strasbg.fr/pub/ebi/pub/databases/embl/release
Finland ftp://ftp.funet.fi/pub/sci/molbio/embl_release
USA ftp://bio-mirror.net/biomirror/embl/release
Japan ftp://bio-mirror.jp.apan.net/pub/biomirror/embl/release
China ftp://ftp.cbi.pku.edu.cn./pub/databases/embl/release
Australia ftp://biomirror.aarnet.edu.au/biomirror/embl/release
