How to export sequences and download data

Exporting sequences and annotation

You can download single or multiple sequences, with or without their annotation, from any of the ENA databases. This section covers:

  • Downloading a single EMBL-Bank sequence or full entry;

  • Downloading multiple EMBL-Bank sequences or full entries;

  • Downloading sequences or full entries from the taxonomy portal;

  • Downloading SRA sequences and data using SRA-DataDownloader;

  • Bulk downloads using ftp.

 

Help

More information on exporting data from ENA can be obtained from the help pages.

 

Exporting single EMBL-Bank sequences and annotation

Once you have found the EMBL-Bank entry you want, you can use the download links (Figure 44) at the top right of every entry page to easily download either:

  • the sequence in FASTA format;

  • the full entry in either TEXT or XML format.

Figure 44. EMBL-Bank entry for BN000065 showing the download links at the top of every ENA entry page.

Notes

[A] TEXT enables you to download the full entry in flat file format.

[B] FASTA enables you to download the sequence in FASTA format.

[C] XML enables you to download the full entry in XML format.
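
Once a FASTA download has been saved, it can be read with a few lines of code. The following is a minimal sketch, assuming the standard FASTA layout (a header line starting with '>' followed by one or more sequence lines). The function name parse_fasta and the sample record are illustrative only and are not part of ENA.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are joined without separators.
            records[header] += line
    return records


# Made-up record for illustration only; this is not real ENA data.
sample = """>EXAMPLE01 hypothetical entry
ACGTACGT
TTGA
"""
records = parse_fasta(sample.splitlines())
```

With a saved download, the same function could be called on an open file handle, for example parse_fasta(open("my_entry.fasta")), where the file name is a placeholder.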

Exporting multiple EMBL-Bank sequences and annotation

Alternatively, you can download multiple EMBL-Bank sequences or full entries either by:

  • Following the links from the search page results;
  • Uploading your file of accessions.

You have the option of selecting the range of entries to download, which is particularly useful if your search query returns a large number of results (Figure 45).

Figure 45. Results page from a text search on 'human' displaying the download options.

Notes

[A] TEXT, XML and FASTA download options are available for all the sequences or full entries displayed in the results.

  • Note that there are separate download capabilities for Assembled Nucleotide Sequences, Raw Nucleotide Sequences, Projects and Taxa.

[B] From-to range enables you to select which range of sequences or entries you wish to download.

  • In this figure there are 140885 results, so you may want to download only a subset of them.
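
When a result set is this large, it can help to plan the from-to ranges before downloading. The sketch below splits a result count into consecutive ranges. It assumes that ranges are 1-based and inclusive, which this text does not state, and the batch size of 50000 is an arbitrary illustrative value. The function name result_ranges is hypothetical.

```python
from typing import Iterator, Tuple


def result_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (from, to) pairs that cover `total` results."""
    if total < 0 or batch_size < 1:
        raise ValueError("total must be >= 0 and batch_size must be >= 1")
    start = 1
    while start <= total:
        # The last range is shortened so it never runs past the final result.
        end = min(start + batch_size - 1, total)
        yield start, end
        start = end + 1


# 140885 is the result count shown in Figure 45; 50000 is an arbitrary choice.
ranges = list(result_ranges(140885, 50000))
```

Each pair can then be entered in turn in the from-to fields described in [B].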

[C] Upload file of accessions enables you to upload a file containing a list of accessions. The matching entries are then displayed as a results list, from which you can use the download links [A] and the from-to range [B] described above.
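
A file of accessions can be prepared programmatically. The sketch below assumes a plain-text file with one accession per line. The exact format expected by the upload form is not given here, so check the help pages before relying on it. The function names, and the placeholder accession XX000001, are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def clean_accessions(accessions: Iterable[str]) -> List[str]:
    """Strip whitespace, upper-case, and drop empties and duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    cleaned: List[str] = []
    for acc in accessions:
        acc = acc.strip().upper()
        if acc and acc not in seen:
            seen.add(acc)
            cleaned.append(acc)
    return cleaned


def write_accession_file(accessions: Iterable[str], path: Path) -> int:
    """Write one accession per line (assumed format); return the count written."""
    cleaned = clean_accessions(accessions)
    text = ("\n".join(cleaned) + "\n") if cleaned else ""
    path.write_text(text)
    return len(cleaned)


# BN000065 is the entry shown in Figure 44; XX000001 is a made-up placeholder.
cleaned = clean_accessions(["BN000065", " bn000065 ", "", "XX000001"])
```

Calling write_accession_file(cleaned, Path("accessions.txt")) would then produce a file ready for upload, where the file name is a placeholder.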

Exporting sequences using the taxonomy portal

The ENA browser enables taxonomy data from any node of the taxonomic tree to be downloaded in XML format (Figure 46). In addition, bulk taxonomy downloads are available from the ftp site.

Figure 46. Taxonomy Portal displaying the download options.

Notes

[A] XML enables you to download all the taxonomy data in XML format.

[B] Navigation enables you to move through the taxonomy tree; data from each node can be downloaded using the XML link in [A].

Information

Taxonomy data can also be downloaded in bulk from the ftp site. More information about the taxonomy data available for ftp download can be viewed on the ENA help pages.

Exporting SRA sequences and data

The ENA browser offers a range of options for downloading raw sequence data from the Sequence Read Archive (SRA) (Figure 47). SRA data can be downloaded in normalised fastq format or in the original format submitted by the author. The data can be grouped by study, sample, experiment, run or submission, and each group of sequences can be downloaded separately. Because these sequence files are often very large, the ENA browser supports downloading with the high-speed file transfer software Aspera in addition to ftp. Alternatively, you can upload SRA data into the Galaxy platform.
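
If you are working from a mixed list of SRA accessions, it can be useful to sort them into the same groups (study, sample, experiment, run, submission) before searching. The sketch below is not taken from this text. It assumes the common SRA accession convention in which the third letter of the prefix indicates the object type (for example ERR for a run and SRX for an experiment). Other accession styles are not handled, the function names are hypothetical, and the accessions in the example are made-up placeholders.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Assumed convention: [S|E|D] + R + type letter + at least six digits.
_KIND_BY_LETTER = {
    "A": "submission",
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
}
_SRA_ACCESSION = re.compile(r"^[SED]R([APSXR])\d{6,}$")


def sra_kind(accession: str) -> Optional[str]:
    """Return the SRA object type for an accession, or None if unrecognised."""
    match = _SRA_ACCESSION.match(accession.strip().upper())
    return _KIND_BY_LETTER[match.group(1)] if match else None


def group_by_kind(accessions: Iterable[str]) -> Dict[str, List[str]]:
    """Group accessions by object type; unknown ones go under 'unrecognised'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for acc in accessions:
        groups[sra_kind(acc) or "unrecognised"].append(acc)
    return dict(groups)


# Placeholder accessions for illustration; BN000065 is an EMBL-Bank entry.
groups = group_by_kind(
    ["ERR000001", "SRX000002", "DRP000003", "ERS000004", "SRA000005", "BN000065"]
)
```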

Figure 47. ENA browser displaying SRA search results and the download capabilities.

Notes

[A] Bulk download Fastq/Submitted files provides the ability to select and download multiple files at once.

NOTE: this is the recommended route for downloading SRA sequences, as you can choose which files to download (see Figure 48).

[B] Fastq Files provide SRA sequences in normalised fastq format.

[C] Submitted Files are the unaltered SRA sequence files submitted by the author.

[D] ftp, Aspera, Galaxy download links provide the ability to:

  • use FTP or Aspera to download individual submitted files and Fastq files;
  • upload individual Fastq files to Galaxy.

ADVICE: please consider using 'Bulk download Fastq/Submitted files' to download multiple files at once.

Figure 48. SRA-Filedownloader allows you to select the files you want to download.

Steps

1. Select SRA entry: type the accession number into box [A] and click [B] Search.

2. Select data format: you can select either normalised fastq files [C] or unaltered files as submitted by the author [D].

3. Select files: you can either select specific files by checking the appropriate boxes [E], or choose to select all files [F].

4. Download files [G].

Help

Note: the Aspera link on this page is only active once Aspera is installed. If you do not have Aspera, a link to the Aspera download site is provided at the bottom of the ENA browser page.

Information

For information on SRA formats, see the ENA help pages.

For more information on downloading SRA data, see the ENA help pages.

There is a tutorial on retrieving Sequence Read Archive data.

Bulk downloads using ftp

ENA data can be downloaded in bulk by ftp, including EMBL-Bank, SRA, Trace Archive and taxonomy data (Figure 49).

For example, EMBL-Bank sequences that can be directly downloaded from the ftp site include:

  • the entire EMBL-Bank release;
  • new and updated entries made available after the latest release;
  • specific data classes, such as Coding Sequence (CDS), Whole Genome Shotgun (WGS), Mass Genome Annotation (MGA) or Construct (CON).

Figure 49. ftp site for the download of data from the EMBL-Bank full release.
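
Bulk flat files usually hold many entries one after another. The sketch below splits such a file into individual entries and pulls out each primary accession. It assumes the standard EMBL flat-file layout, in which each line starts with a two-letter code (such as ID or AC) and each entry ends with a line containing only '//'. The function names and the sample entries are made up for illustration.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, split on the '//' terminator."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a final entry that lacks its terminator
        yield entry


def primary_accession(entry: List[str]) -> str:
    """Return the first accession on the first AC line, or '' if none is found."""
    for line in entry:
        if line.startswith("AC "):
            return line[2:].strip().split(";")[0].strip()
    return ""


# Made-up entries for illustration only; this is not real EMBL-Bank data.
sample = """ID   EXAMPLE1; made-up entry
AC   EXAMPLE1;
//
ID   EXAMPLE2; made-up entry
AC   EXAMPLE2; EXAMPLE3;
//
"""
accessions = [primary_accession(e) for e in iter_entries(sample.splitlines())]
```

For a gzip-compressed download, the lines could be read with gzip.open(path, "rt") from the Python standard library and passed to iter_entries in the same way.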

Help

More information on the EMBL-Bank data available through ftp can be obtained from the help pages.

 

Information

The main directories (in flat-file format) for EMBL-Bank are:

EMBL-Bank full release:

ftp://ftp.ebi.ac.uk/pub/databases/embl/release/

EMBL-Bank entries released/updated after release:

ftp://ftp.ebi.ac.uk/pub/databases/embl/new/

EMBL-Bank Whole Genome Shotgun (WGS) entries:

ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs/

EMBL-Bank Construct (CON) entries:

ftp://ftp.ebi.ac.uk/pub/databases/embl/con/

EMBL-CDS (coding sequence) entries:

ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/
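
The directories listed above can also be browsed from a script. The following is a minimal sketch using Python's standard ftplib module. It assumes that the server accepts anonymous logins and that the directories are still available at these addresses, neither of which is confirmed here. The names EMBL_DIRS, split_ftp_url and list_directory are hypothetical.

```python
from ftplib import FTP
from typing import List, Tuple
from urllib.parse import urlparse

# Directory URLs copied from the list above.
EMBL_DIRS = {
    "release": "ftp://ftp.ebi.ac.uk/pub/databases/embl/release/",
    "new": "ftp://ftp.ebi.ac.uk/pub/databases/embl/new/",
    "wgs": "ftp://ftp.ebi.ac.uk/pub/databases/embl/wgs/",
    "con": "ftp://ftp.ebi.ac.uk/pub/databases/embl/con/",
    "cds": "ftp://ftp.ebi.ac.uk/pub/databases/embl/cds/",
}


def split_ftp_url(url: str) -> Tuple[str, str]:
    """Split an ftp:// URL into (host, path)."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp" or not parsed.hostname:
        raise ValueError(f"not an ftp URL: {url!r}")
    return parsed.hostname, parsed.path


def list_directory(url: str, timeout: float = 30.0) -> List[str]:
    """Return the names in an ftp directory (needs network access)."""
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()  # anonymous login, assumed to be permitted
        ftp.cwd(path)
        return ftp.nlst()


# No connection is made here; this only splits the URL.
host, path = split_ftp_url(EMBL_DIRS["release"])
```

Calling list_directory(EMBL_DIRS["new"]) would then return the file names in that directory, provided the server is reachable.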