REST URLs to search ENA data
This page describes how to use free text and advanced search programmatically to search data in ENA.
While this documentation focuses on the full functionality that is offered by the RESTful interface to the ENA Advanced Search, it can also serve to provide assistance to users of the Query Builder web interface to this service, which offers cut-down functionality (such as supporting only Boolean 'AND' operations). We will provide more specific documentation for the Query Builder interface soon.
Searches in the ENA browser are performed and displayed by domain, but programmatic access is only available when a result is declared. A domain is a partition of ENA content based on the conceptual nature of the content (e.g. raw sequence reads vs. annotated assembled sequences). Each domain comprises a number of results, which are finer partitions that also take into account the structure of the underlying content. Because ENA uses diverse structures to manage different data, display/download format and pagination options are only available for queries based on these more granular results.
Free text search
The URL syntax for retrieving records from ENA via free text search is:
http://www.ebi.ac.uk/ena/data/search?query=<query string>&result=<result>[Pagination options][Display options][Download options]
The query string is made up of terms joined with "+". For example, to search for human kinase sequences, the search query would be "kinase+homo+sapiens". To fetch these sequences in FASTA format, the following URL could be used:
http://www.ebi.ac.uk/ena/data/search?query=kinase+homo+sapiens&result=sequence_release&display=fasta
By default, the first 100,000 records are returned. If you wish to download more than this, you will need to use the pagination options. To determine how many results are available for your search, add the resultcount parameter to your query:
http://www.ebi.ac.uk/ena/data/search?query=<query string>&result=<result>&resultcount
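The free text URLs above can be assembled in a few lines of Python. This is a minimal sketch using only the standard library; the helper names `free_text_url` and `count_url` are illustrative and not part of the ENA service, and the code only builds URL strings without sending any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL syntax above.
SEARCH_URL = "http://www.ebi.ac.uk/ena/data/search"


def free_text_url(terms, result, **options):
    """Build a free text search URL (hypothetical helper)."""
    params = {"query": "+".join(terms), "result": result, **options}
    # safe="+" keeps the "+" separators literal instead of encoding them as %2B.
    return SEARCH_URL + "?" + urlencode(params, safe="+")


def count_url(terms, result):
    """Build the URL that reports how many records match the search."""
    # resultcount is a bare flag with no value, so it is appended by hand.
    return free_text_url(terms, result) + "&resultcount"


terms = ["kinase", "homo", "sapiens"]
fasta_url = free_text_url(terms, "sequence_release", display="fasta")
print(fasta_url)
print(count_url(terms, "sequence_release"))
```

The first printed URL reproduces the human kinase example; the second adds the resultcount flag to the same search.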
Using the data warehouse
The URL syntax for retrieving records from the ENA data warehouse programmatically is:
http://www.ebi.ac.uk/ena/data/warehouse/search?query=<query string>&result=<result>[Pagination options][Display options][Download options]
By default, the first 100,000 records are returned. If you wish to download more than this, you will need to use the pagination options. To determine how many results are available for your search, add the resultcount parameter to your query:
http://www.ebi.ac.uk/ena/data/warehouse/search?query=<query string>&result=<result>&resultcount
Examples
Return coding sequences, in FASTA format, from the STD dataclass for all members of the order Diptera (Taxon ID 7147):
http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22tax_tree(7147)%20AND%20dataclass=%22STD%22%22&result=coding_release&display=fasta
http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22tax_tree(7147)%20AND%20dataclass=%22STD%22%22&result=coding_update&display=fasta
Note that both the coding_release and coding_update results must be queried to retrieve all coding sequences.
Download a compressed flat file representing sequences from in and around the Galapagos Islands:
http://www.ebi.ac.uk/ena/data/warehouse/search?query="geo_circ(-0.587,-90.5713,170)"&result=sequence_release&display=text&download=gzip
Download all paired RNA-seq reads from HiSeq platforms in XML format:
http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22(instrument_model=%22Illumina%20HiSeq%202000%22%20OR%20instrument_model=%22Illumina%20HiSeq%201000%22%20OR%20instrument_model=%22Illumina%20HiSeq%202500%22)%20AND%20library_layout=%22PAIRED%22%20AND%20library_source=%22TRANSCRIPTOMIC%22%22&result=read_run&display=xml&download=xml
Retrieve genome assemblies for the house mouse (Mus musculus, Taxon ID 10090):
http://www.ebi.ac.uk/ena/data/warehouse/search?query="tax_eq(10090)"&result=assembly&display=xml
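The percent-encoded warehouse URLs in these examples can be generated rather than typed by hand. The sketch below assumes Python's standard `urllib.parse.quote`; `warehouse_url` is a hypothetical helper name, and the choice to leave parentheses and "=" unencoded simply mirrors the encoded examples above. No request is sent.

```python
from urllib.parse import quote

# Endpoint taken from the URL syntax above.
WAREHOUSE_URL = "http://www.ebi.ac.uk/ena/data/warehouse/search"


def warehouse_url(query, result, **options):
    """Build a data warehouse search URL (hypothetical helper)."""
    # Wrap the query in double quotes, then percent-encode quotes and spaces.
    # Parentheses and "=" stay literal, as in the encoded examples above.
    encoded = quote('"' + query + '"', safe="()=")
    url = f"{WAREHOUSE_URL}?query={encoded}&result={result}"
    for name, value in options.items():
        url += f"&{name}={quote(str(value))}"
    return url


query = 'tax_tree(7147) AND dataclass="STD"'
# Both coding results are queried so that no coding sequences are missed.
urls = [
    warehouse_url(query, result, display="fasta")
    for result in ("coding_release", "coding_update")
]
for url in urls:
    print(url)
```

The two printed URLs match the Diptera coding sequence example, one per coding result.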
Domains and results
The available domains and results are listed here.
Query string
The query string is made up of filtering conditions, joined by logical ANDs, ORs and NOTs and bound by double quotes. The use of parentheses is also supported. For example, the following query string could be used: query="<filter1> AND (<filter2> OR <filter3>) OR NOT <filter4>"
For ease of reading, query strings have not been URL encoded in the examples below.
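Longer query strings are easier to keep correct when they are composed from parts. The following is a rough sketch with three hypothetical helpers (`all_of`, `any_of`, `negate`) that join filtering conditions with AND, OR and NOT; the bounding double quotes and URL encoding are assumed to be applied later, when the URL is built.

```python
def all_of(*filters):
    """Join filtering conditions with logical AND."""
    return " AND ".join(filters)


def any_of(*filters):
    """Join filtering conditions with logical OR, grouped in parentheses."""
    return "(" + " OR ".join(filters) + ")"


def negate(condition):
    """Prefix a filtering condition with NOT."""
    return "NOT " + condition


# Rebuild the paired RNA-seq HiSeq query from the examples section.
hiseq = any_of(
    *(f'instrument_model="Illumina HiSeq {model}"' for model in (2000, 1000, 2500))
)
query = all_of(hiseq, 'library_layout="PAIRED"', 'library_source="TRANSCRIPTOMIC"')
print(query)
```

Grouping every OR clause in parentheses avoids relying on operator precedence, which this page does not specify.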
Filter types
The following filter types are supported:
- boolean filter
- controlled vocabulary filter
- date filter
- number filter
- text filter
- geospatial filter
- taxonomy filter
Boolean filter
Operator | = |
---|---|
Value | yes, true, no, false |
Example | environmental_sample=true |
Controlled vocabulary filter
Operator | =, != |
---|---|
Value | A text value from the controlled vocabulary enclosed in double quotes |
Example | library_source="GENOMIC" |
Date filter
Operator | =, !=, <, <=, >, >= |
---|---|
Value | A date in the format YYYY-MM-DD |
Example | first_public > 2012-01-01 |
Number filter
Operator | =, !=, <, <=, >, >= |
---|---|
Value | Any integer |
Example | base_count > 4000000 |
Text filter
Operator | =, != |
---|---|
Value | Any text value enclosed in double quotes. Wildcard (*) can be used at the start and/or end of the text value. |
Example | library_name="*HUM*" |
Geospatial filter
Function | Description | Parameters | Example |
---|---|---|---|
geo_box1 | All locations within a box defined by the lower left (SW) and upper right (NE) points. | south-west latitude, south-west longitude, north-east latitude, north-east longitude | geo_box1(-20, 10, 20, 50) |
geo_box2 | All locations within a box defined by a centre point and a radius in km. | latitude, longitude, radius (km) | geo_box2(35, 100, 300) |
geo_circ | All locations within a circle defined by a centre point and a radius in km. | latitude, longitude, radius (km) | geo_circ(35, 100, 300) |
geo_lat | All locations within a latitude range given by a latitude and a radius in km. | latitude, radius (km) | geo_lat(0, 100) |
geo_north | All locations north of a given latitude (inclusive). | latitude | geo_north(80) |
geo_south | All locations south of a given latitude (inclusive). | latitude | geo_south(-80) |
geo_point | An exact lat/lon position | latitude, longitude | geo_point(9.12,-79.7) |
Taxonomy filter
Function | Description | Parameters | Example |
---|---|---|---|
tax_eq | All records that match the given NCBI taxonomy identifier | NCBI taxonomy identifier | tax_eq(9606) |
tax_tree | All records that match the given NCBI taxonomy identifier or are descendants of it | NCBI taxonomy identifier | tax_tree(2759) |
tax_name | All records that match the given NCBI scientific name | NCBI scientific name | tax_name("Homo sapiens") |
Filter conditions
The geospatial and taxonomy filters are function based. All other filters use the following syntax: <filter column> <operator> <value>
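As an illustration of the two condition styles, the sketch below formats a value according to its filter type and builds function-based filters. The helper names `condition` and `function_filter` are hypothetical. The examples on this page show operators both with and without surrounding spaces; this sketch writes them without, on the assumption that the service accepts either form.

```python
from datetime import date


def condition(column, operator, value):
    """Format '<filter column><operator><value>' for a non-function filter."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        text = "true" if value else "false"
    elif isinstance(value, date):
        text = value.isoformat()  # YYYY-MM-DD
    elif isinstance(value, int):
        text = str(value)
    else:
        # Text and controlled vocabulary values are enclosed in double quotes.
        text = f'"{value}"'
    return f"{column}{operator}{text}"


def function_filter(name, *args):
    """Format a geospatial or taxonomy filter such as tax_eq(9606)."""
    return f"{name}({','.join(str(arg) for arg in args)})"


print(condition("environmental_sample", "=", True))
print(condition("first_public", ">", date(2012, 1, 1)))
print(condition("library_name", "=", "*HUM*"))
print(function_filter("geo_circ", -0.587, -90.5713, 170))
```

Note that `function_filter` does not quote its arguments, so a tax_name value would need its double quotes supplied by the caller.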
Filter columns
A full list of filter columns is available here.
Retrieve tabulated data from the data warehouse
In addition to the formats listed above, a tab-separated report of data can be returned for each result (this report cannot be returned when searching by domain rather than by result). The URL format for retrieving these reports is:
http://www.ebi.ac.uk/ena/data/warehouse/search?query=<query string>&result=<result>&fields=<fields>&display=report[&sortfields=<sortfields>][&download=txt][Pagination options]
Each result has a default accession column, which is returned as the first column of the report whether or not it is listed in the fields to be retrieved. The report is sorted by this accession column unless sortfields is provided, in which case it is sorted by the listed columns in the order given. The additional fields that can be returned in these reports are listed here. Many of these are the same as the searchable fields listed above, but there are generally more returnable than searchable fields for a given result. Taxonomic information can be returned via tax_id or scientific_name, and geospatial information is returned using location.
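A report URL can be put together in the same way as the other warehouse URLs. In this sketch `report_url` is a hypothetical helper, and joining multiple fields with commas is an assumption, since the separator is not stated here. The code only builds the URL string.

```python
from urllib.parse import quote

# Endpoint taken from the report URL format above.
WAREHOUSE_URL = "http://www.ebi.ac.uk/ena/data/warehouse/search"


def report_url(query, result, fields, sortfields=None, download=False):
    """Build a tabulated report URL (hypothetical helper)."""
    encoded = quote('"' + query + '"', safe="()=")
    # Assumption: multiple fields are separated by commas.
    url = (
        f"{WAREHOUSE_URL}?query={encoded}&result={result}"
        f"&fields={','.join(fields)}&display=report"
    )
    if sortfields:
        url += f"&sortfields={','.join(sortfields)}"
    if download:
        url += "&download=txt"
    return url


url = report_url("tax_eq(10090)", "read_run", ["tax_id", "scientific_name"])
print(url)
```

The accession column is not listed in `fields` because it is returned as the first column in any case.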
Note: as of 17th June 2014, the format of the date in the tabulated report changed to ISO format. We support single dates (YYYY-MM-DD) and date ranges (YYYY-MM-DD/YYYY-MM-DD).
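When reading a report, a date cell may therefore hold either a single date or a range. A minimal sketch of handling both forms, assuming cells contain exactly one of the two formats given above; `parse_report_date` is a hypothetical helper name.

```python
from datetime import date


def parse_report_date(value):
    """Return (start, end) for 'YYYY-MM-DD' or 'YYYY-MM-DD/YYYY-MM-DD'."""
    start_text, _, end_text = value.partition("/")
    start = date.fromisoformat(start_text)
    # A single date is treated as a range that starts and ends on the same day.
    end = date.fromisoformat(end_text) if end_text else start
    return start, end


print(parse_report_date("2014-06-17"))
print(parse_report_date("2014-06-17/2014-06-30"))
```

Returning a pair in both cases lets downstream code treat single dates and ranges uniformly.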