Downloading assembled and annotated sequences

Assembled and annotated sequences can be downloaded through the ENA Browser or using FTP.

This document provides instructions for FTP downloads only. The data classes referred to in this document are described here.

Downloading sequences via FTP

Assembled and annotated sequences are available for download in flat file format through FTP at: ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence. The directory structure and the file name conventions are described below.
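
As a minimal sketch, the directory listing could be retrieved with Python's standard ftplib module. The host and path come from the URL above. The function names are illustrative only, and anonymous login is an assumption about the server configuration.

```python
from ftplib import FTP
from urllib.parse import urlparse

BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/ena/sequence"


def split_ftp_url(url):
    """Return (host, path) for an ftp:// URL."""
    parts = urlparse(url)
    return parts.hostname, parts.path


def list_directory(url):
    """Connect, log in anonymously (assumed to be allowed) and list names."""
    host, path = split_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login()      # no arguments means anonymous login
        ftp.cwd(path)
        return ftp.nlst()
```

Calling list_directory(BASE_URL) opens a network connection and returns the names found in the sequence directory.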

Directory Definition
release

A full release of entries is made every March, June, September and December. This directory consists of 8 subdirectories that contain all sequences and documentation for the latest release.

release/std

This directory contains all sequences included in the release that are not WGS/TSA sets, patent sequences, or scaffold (CON) sequences.

The data files in this directory use the following naming convention:
rel_<data class>_<taxonomic division>_<number>_r<release number>.dat.gz
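
The convention above can be sketched as a regular expression that splits a file name into its parts. This is an illustration only: it assumes that data class and taxonomic division codes are lowercase letters and that both numbers are plain digits, and the file name in the usage note is hypothetical.

```python
import re

# rel_<data class>_<taxonomic division>_<number>_r<release number>.dat.gz
REL_STD = re.compile(
    r"^rel_(?P<data_class>[a-z]+)_(?P<division>[a-z]+)_(?P<number>\d+)"
    r"_r(?P<release>\d+)\.dat\.gz$"
)


def parse_release_name(filename):
    """Split a release/std file name into its parts, or return None."""
    match = REL_STD.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    parts["release"] = int(parts["release"])
    return parts
```

For a hypothetical name such as rel_std_hum_01_r133.dat.gz, the function returns the data class "std", the division "hum", the file number 1 and the release number 133.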

release/con

This directory contains scaffolds (built from genomic or transcriptomic contigs) included in the release.

The data files in this directory use the following naming convention:
rel_con_<taxonomic division>_<number>_r<release number>.dat.gz

release/expanded_con

This directory contains the expanded version of all scaffolds (built from genomic or transcriptomic contigs with sequences and annotation extracted from the contigs) included in the latest release.

The data files in this directory use the following naming convention:
rel_con_<taxonomic division>_<number>_r<release number>.dat.gz

release/tsa

Transcriptomic contigs included in the release are in the release/tsa directory.

The data files in this directory use the following naming convention:
tsa_<tsa prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large sets into several smaller files.
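
A short sketch of how the optional <number> part might be handled when parsing these names. The character classes used for the prefix and division are assumptions, and the file names in the usage note are hypothetical.

```python
import re

# tsa_<tsa prefix>_<taxonomic division>[_<number>].dat.gz
TSA_NAME = re.compile(
    r"^tsa_(?P<prefix>[A-Za-z0-9]+)_(?P<division>[A-Za-z]+)"
    r"(?:_(?P<number>\d+))?\.dat\.gz$"
)


def parse_tsa_name(filename):
    """Return (prefix, division, number); number is None when absent."""
    match = TSA_NAME.match(filename)
    if match is None:
        return None
    number = match.group("number")
    return (
        match.group("prefix"),
        match.group("division"),
        int(number) if number is not None else None,
    )
```

With the hypothetical names tsa_gaaa01_inv.dat.gz and tsa_gaaa01_inv_2.dat.gz, the third element of the result is None for the first and 2 for the second.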

release/wgs

Genomic contigs included in the release are organised into subdirectories using the first two letters of the accession prefix under the release/wgs directory.

The data files in this directory use the following naming convention:
wgs_<accession prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large sets into several smaller files.
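
The subdirectory rule described above (first two letters of the accession prefix) can be sketched as follows. The function name is hypothetical, and the prefix is used exactly as given: whether the server uses lower or upper case for these subdirectories is not stated here and should be checked against the actual listing.

```python
import posixpath


def wgs_directory(accession_prefix, root="release/wgs"):
    """Return the subdirectory expected to hold a WGS set.

    The first two characters of the accession prefix name the
    subdirectory; letter case is passed through unchanged (assumption).
    """
    if len(accession_prefix) < 2:
        raise ValueError("accession prefix must have at least two characters")
    return posixpath.join(root, accession_prefix[:2])
```

For a hypothetical prefix abcd01, this gives release/wgs/ab.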

release/patent

This directory contains all patent sequences included in the release.

The data files in this directory use the following naming convention:

rel_pat_<taxonomic division>_<number>_r<release number>.dat.gz

release/doc

This directory contains all documentation for the release, including the release notes and the latest feature table document.

update

This directory holds 6 subdirectories containing all sequences added or updated since the last release.

update/std

This directory contains all new or updated sequences that are not WGS/TSA sets or scaffold (CON) entries.

The main data files in this directory use the following naming convention:
cum_<data class>_<taxonomic division>_<number>_r<release number>.dat.gz

There are also files containing all sequences for a given update which use the following naming convention:
<release number>_r<update number>.dat.gz

update/con

This directory contains scaffolds (built from genomic or transcriptomic contigs) created or updated after the latest release.

The data files in this directory use the following naming convention:
cum_con_<taxonomic division>_<number>_r<release number>.dat.gz

update/expanded_con

This directory contains the expanded version of all scaffolds (built from genomic or transcriptomic contigs with sequences and annotation extracted from the contigs) created or updated after the latest release.

The data files in this directory use the following naming convention:
cum_exp_con_<taxonomic division>_<number>_r<release number>.dat.gz

update/tsa

This directory contains transcriptomic contigs created or updated after the latest release.

The data files in this directory use the following naming convention:
tsa_<tsa prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large TSA sets into several smaller files.

update/wgs

This directory contains genomic contigs created or updated after the latest release.

The data files in this directory use the following naming convention:
wgs_<wgs prefix>_<taxonomic division>[_<number>].dat.gz

The optional <number> is used to divide large WGS sets into several smaller files.

Downloading coding, non-coding, rRNA and spacer sequences using FTP

Coding, non-coding, rRNA and spacer sequences are available for download in flat file and FASTA formats through FTP at: ftp://ftp.ebi.ac.uk/pub/databases/ena/. The directory structure and the file name conventions are described below.

Directory Definition
coding/release

This directory contains all protein coding features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.cds.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.cds.fasta.gz

coding/update

This directory contains all protein coding features created or updated after the latest release. Same file naming conventions apply as above.

non-coding/release

This directory contains all non-coding features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.ncr.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.ncr.fasta.gz

non-coding/update

This directory contains all non-coding features created or updated after the latest release. Same file naming conventions apply as above.

rRNA/release

This directory contains all rRNA features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.rRNA.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.rRNA.fasta.gz

rRNA/update

This directory contains all rRNA features created or updated after the latest release. Same file naming conventions apply as above.

spacer/release

This directory contains all spacer (ITS, IGS, ETS) features that are part of the latest release. Entries are available in both flat file and fasta formats.

Flat files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.spacer.gz

Fasta files use the following naming convention:
rel_<dataclass>_<taxonomic division>_<number>_r<release number>.spacer.fasta.gz

spacer/update

This directory contains all spacer features created or updated after the latest release. Same file naming conventions apply as above.
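
The four feature directories above differ only in the suffix placed before .gz or .fasta.gz. As a rough sketch, the release file names could be built like this. The function and dictionary names are hypothetical, and the file number is passed as a string because any zero padding is not specified here.

```python
# Suffix used in each feature directory, taken from the conventions above.
SUFFIXES = {
    "coding": "cds",
    "non-coding": "ncr",
    "rRNA": "rRNA",
    "spacer": "spacer",
}


def feature_file_names(directory, data_class, division, number, release):
    """Return (flat file name, fasta file name) for one release file."""
    suffix = SUFFIXES[directory]
    stem = f"rel_{data_class}_{division}_{number}_r{release}.{suffix}"
    return f"{stem}.gz", f"{stem}.fasta.gz"
```

For example, feature_file_names("coding", "std", "hum", "01", 133) gives rel_std_hum_01_r133.cds.gz and rel_std_hum_01_r133.cds.fasta.gz; the argument values are made up for illustration.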

FTP mirror sites

Country URL
UK ftp://ftp.ebi.ac.uk/pub/databases/embl/release
France ftp://ftp-bips.u-strasbg.fr/pub/ebi/pub/databases/embl/release
Finland ftp://ftp.funet.fi/pub/sci/molbio/embl_release
USA ftp://bio-mirror.net/biomirror/embl/release
Japan ftp://bio-mirror.jp.apan.net/pub/biomirror/embl/release
China ftp://ftp.cbi.pku.edu.cn./pub/databases/embl/release
Australia ftp://biomirror.aarnet.edu.au/biomirror/embl/release
