How to Access ENA Programmatically

Several REST APIs provide programmatic access to the European Nucleotide Archive (ENA). They expose the functionality of the ENA Advanced Search and allow direct download of ENA records and associated files.

Please see the relevant guides below for examples and tutorials on ENA programmatic data access and retrieval.

Perform Searches

All functionality of the ENA Advanced Search is available programmatically through a combination of the ENA Portal API and the ENA Browser API. You can download the API docs for the Portal API here and the Browser API here.
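As a rough illustration of a programmatic search, the sketch below builds a Portal API search URL. The base URL, the parameter names (`result`, `query`, `fields`, `format`, `limit`) and the example query `tax_eq(9606)` are assumptions made for this example and should be checked against the Portal API docs; `build_search_url` is a hypothetical helper, not part of any ENA client.

```python
from urllib.parse import urlencode

# Assumed base URL of the Portal API search endpoint; confirm in the API docs.
PORTAL_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(result, query, fields, fmt="tsv", limit=10):
    """Return a search URL for the given result type, query and fields."""
    params = {
        "result": result,            # assumed parameter: type of record to search
        "query": query,              # assumed parameter: search expression
        "fields": ",".join(fields),  # assumed parameter: columns to return
        "format": fmt,
        "limit": limit,
    }
    # urlencode percent-encodes characters such as parentheses and commas.
    return f"{PORTAL_SEARCH_URL}?{urlencode(params)}"


url = build_search_url(
    result="read_run",
    query="tax_eq(9606)",
    fields=["run_accession", "fastq_ftp"],
)
print(url)
```

The resulting URL can be fetched with any HTTP client. Keeping URL construction in one function makes it easier to adjust if the documented parameters differ from the ones assumed here.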

You can further explore related records outside of the European Nucleotide Archive by programmatically accessing the ENA Cross Reference Service.

For examples and tutorials on how to use these APIs, please see the guidelines below:

Retrieve and Download Records

All public records in ENA can be retrieved through the ENA Browser API, so they can be downloaded programmatically. Associated files can be downloaded using FTP or Aspera.
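A minimal sketch of record retrieval follows. It assumes the Browser API serves records at a path of the form `<base>/<format>/<accession>` with formats such as `xml` or `fasta`; the base URL, the path layout and the helper names are assumptions for illustration, so consult the Browser API docs for the actual endpoints.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the Browser API; confirm in the API docs.
BROWSER_API_URL = "https://www.ebi.ac.uk/ena/browser/api"


def record_url(accession, record_format="xml"):
    """Build the (assumed) URL of a record in the requested format."""
    return f"{BROWSER_API_URL}/{quote(record_format)}/{quote(accession)}"


def fetch_record(accession, record_format="xml", timeout=30):
    """Download one record and return its content as text."""
    with urlopen(record_url(accession, record_format), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_record` with a public accession would return the record text, provided the endpoint layout matches the assumption above.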

For a quick summary of metadata and file retrieval locations of records, you can use the ENA file reports.
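To show how a file report might be consumed, the sketch below parses a tab-separated report into a mapping from run accession to file locations. It assumes the report has `run_accession` and `fastq_ftp` columns and that multiple files for one run are separated by semicolons. The sample text is invented for the example and is not real ENA data.

```python
import csv
import io


def parse_file_report(tsv_text, column="fastq_ftp"):
    """Map each run accession to the list of file locations in `column`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    files = {}
    for row in reader:
        # Several files for one run are assumed to be ';'-separated.
        files[row["run_accession"]] = [p for p in row[column].split(";") if p]
    return files


# Invented sample in the assumed report layout; not real ENA data.
sample = (
    "run_accession\tfastq_ftp\n"
    "RUN0001\tftp.example.org/RUN0001_1.fastq.gz;ftp.example.org/RUN0001_2.fastq.gz\n"
    "RUN0002\t\n"
)
report = parse_file_report(sample)
print(report["RUN0001"])
```

Runs with an empty column value map to an empty list, which lets callers skip records that have no files of the requested type.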

For further simplicity, enaBrowserTools can be downloaded and run locally on the command line to fetch files associated with records by accession. It can also be used to bulk download records related to a specified Sample or Study.

For examples and tutorials on how to use the Browser API, file reports and enaBrowserTools, please see the guidelines below:

Simplify Queries by Using the Tags on Samples, Taxonomy and Other Records

Tags are controlled textual annotations on objects. They are created programmatically by the ENA team from appropriate metadata property values.

The purpose of these is to make searching and filtering much easier.

Access the CRAM Reference Registry

The CRAM reference registry provides access to reference sequences used in CRAM files. Retrieval of reference sequences from the CRAM reference registry is provided by MD5 or SHA1 checksum through the endpoints documented in the CRAM reference registry API.
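The sketch below illustrates lookup by checksum: it computes the MD5 of a reference sequence and builds a registry URL from it. The base URL and the `/md5/<checksum>` and `/sha1/<checksum>` path layout are assumptions to be verified against the CRAM reference registry API. The normalization used here (remove whitespace, convert to upper case) is a simplification; see the SAM specification for the exact rules.

```python
import hashlib

# Assumed base URL of the CRAM reference registry; confirm in the API docs.
CRAM_REGISTRY_URL = "https://www.ebi.ac.uk/ena/cram"


def sequence_md5(sequence):
    """Return the MD5 hex digest of a sequence after simple normalization."""
    # Simplified normalization: drop all whitespace, then upper-case.
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def reference_url(checksum, algorithm="md5"):
    """Build the (assumed) registry URL for a sequence checksum."""
    if algorithm not in ("md5", "sha1"):
        raise ValueError("algorithm must be 'md5' or 'sha1'")
    return f"{CRAM_REGISTRY_URL}/{algorithm}/{checksum}"


checksum = sequence_md5("acgt\nACGT\n")
print(reference_url(checksum))
```

Because the checksum is computed on the normalized sequence, line wrapping and letter case in the input file do not change the identifier.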

CRAM Format

CRAM is a sequencing read file format that achieves high space efficiency through reference-based compression of sequence data, and it offers both lossless and lossy compression modes. The format specification for CRAM is maintained by the Global Alliance for Genomics and Health (GA4GH), whose members provide multiple implementations and coordinate future specification changes.

The CRAM reference registry is used by GA4GH Samtools.

CRAM Reference Registry Reverse Proxy

To reduce network traffic originating from the use of the CRAM Reference Registry we recommend using locally cached reference sequences. In addition to local caches supported by Samtools, it is possible to cache sequences using an HTTP proxy.

In the tutorial below, Squid is used as a reverse proxy to cache reference sequences retrieved from the CRAM Reference Registry:

Rate Limits

In order to ensure a smooth and fair user experience, we have implemented rate limits on our data discovery and retrieval RESTful APIs.

These limits help us maintain optimal performance and prevent overload on our servers. By regulating the number of requests from individual users, we can ensure that everyone gets a consistent and responsive experience. The limits also act as a protective measure against malicious activities such as DDoS attacks and brute-force attempts.

At present the upper limit is set at 50 requests per second, which we think should be sufficient for most use cases. Requests that exceed this limit are rejected with the error “Too Many Requests” (HTTP status code 429).
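One way a client might stay within this limit is sketched below: requests are spaced at least 1/50 of a second apart, and any request answered with 429 is retried after an exponentially growing delay. The `Throttle` class, the `call_with_retry` function and the backoff values are illustrative choices for this example, not part of any ENA API. `send` stands for any callable that performs one request and returns its HTTP status code.

```python
import time

MAX_REQUESTS_PER_SECOND = 50  # limit stated above


class Throttle:
    """Space calls at least 1/rate seconds apart."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        now = self.clock()
        if now < self.next_allowed:
            # Too early: pause until the next permitted start time.
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def call_with_retry(send, throttle, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `send()` until it returns a status other than 429."""
    for attempt in range(max_attempts):
        throttle.wait()
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Back off: 1x, 2x, 4x ... the base delay between retries.
            sleep(base_delay * 2 ** attempt)
    return 429
```

A caller would create one shared `Throttle(MAX_REQUESTS_PER_SECOND)` and pass it to every `call_with_retry` invocation, so that all requests from the same client count against a single budget.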