EBI Dbfetch FAQ
- Why do I sometimes get a different header format when requesting fasta format data?
- Why is the order of entries retrieved from a batch request not the same as the order of the identifiers requested?
- Why do I sometimes get multiple entries for a single identifier?
- How can I tell which entry identifiers are not found when performing a batch request?
- How can I find out more about the available databases?
- How can I find out more about the available data formats?
- How can I determine which database an identifier comes from?
- How does dbfetch use multiple data sources?
- How do I cite dbfetch?
- How do I cite WSDbfetch?
- How can I get help with using dbfetch/WSDbfetch?
Why do I sometimes get a different header format when requesting fasta format data?
Databases in dbfetch are commonly configured with multiple data sources from which the data can be retrieved (see How does dbfetch use multiple data sources?). For example, data for the UniProtKB database can be obtained from UniProt.org or SRS@EBI (see the list of databases for details of the data sources used for each database). When fetching data, dbfetch asks the data source to provide the data in the selected format. In the case of the fasta sequence format, different data sources use different conventions for the header line. For example:
ENA Sequence fasta sequence format data can come from:
- ENA Browser:
>ENA|M10051|M10051.1 Human insulin receptor mRNA, complete cds.
- NCBI BLAST blastdbcmd:
>EM_HUM:M10051 M10051.1 Human insulin receptor mRNA, complete cds.
UniProtKB fasta sequence format data can come from:
- UniProt.org:
>sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial (Fragment) OS=Oryctolagus cuniculus GN=GOT2 PE=1 SV=1
- NCBI BLAST blastdbcmd:
>SP:AATM_RABIT P12345 Aspartate aminotransferase, mitochondrial (Fragment) OS=Oryctolagus cuniculus GN=GOT2 PE=1 SV=1
Since these are all valid fasta sequence format headers, dbfetch returns the data obtained from the data source.
If you need a specific fasta header format, consider:
- Using a specific data source directly. See the list of databases for details of the data sources used by dbfetch to access data from a particular database.
- Retrieving the full entry data and processing it to obtain the required format. For the UniProtKB example above, the "uniprot" format entry for P12345 would be retrieved and a reformatting tool (e.g. EMBOSS seqret or Readseq) or library (e.g. BioPerl or BioJava) used to produce the required format.
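When only the header line differs, a lighter alternative may be to rewrite the headers after retrieval. The following is a minimal sketch, assuming the four header styles quoted above are the only ones encountered; the patterns are inferred from those examples alone and are not a specification of any data source's format. The function names are hypothetical.

```python
import re

# Patterns inferred only from the four example headers quoted above.
# They are illustrative assumptions, not a complete description of what
# each data source can return.
_HEADER_PATTERNS = [
    # >ENA|M10051|M10051.1 description
    re.compile(r"^>ENA\|(?P<acc>[^|]+)\|\S+\s*(?P<desc>.*)$"),
    # >sp|P12345|AATM_RABIT description
    re.compile(r"^>sp\|(?P<acc>[^|]+)\|\S+\s*(?P<desc>.*)$"),
    # >EM_HUM:M10051 M10051.1 description
    re.compile(r"^>EM_\w+:(?P<acc>\S+)\s+\S+\s*(?P<desc>.*)$"),
    # >SP:AATM_RABIT P12345 description
    re.compile(r"^>SP:\S+\s+(?P<acc>\S+)\s*(?P<desc>.*)$"),
]


def parse_header(line):
    """Return (accession, description) for a recognised header, else None."""
    line = line.rstrip("\n")
    for pattern in _HEADER_PATTERNS:
        match = pattern.match(line)
        if match:
            return match.group("acc"), match.group("desc")
    return None


def normalize_header(line):
    """Rewrite a recognised header as '>ACCESSION description'.

    Unrecognised headers are returned unchanged (minus the newline).
    """
    parsed = parse_header(line)
    if parsed is None:
        return line.rstrip("\n")
    accession, description = parsed
    return f">{accession} {description}".rstrip()
```

Applied to the two ENA Sequence headers above, both come out as ">M10051 Human insulin receptor mRNA, complete cds."; headers in any other style pass through untouched, so new styles need a new pattern.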
Why is the order of entries retrieved from a batch request not the same as the order of the identifiers requested?
To provide the best performance dbfetch makes a single request to the back-end data source to fetch all the entries in one go. This means that the specific back-end data source used determines the entry order.
If entries need to be associated with specific identifiers, the best approach is to use a single identifier for each request and make multiple requests. This approach also allows detection of cases where an identifier returns no entries or multiple entries. Alternatively, batch retrieval can be used and the entry data processed to identify the entries associated with specific identifiers.
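The batch alternative can be sketched as follows for fasta format data: split the response into entries, key each entry by an identifier taken from its header, and then walk the requested identifiers in order. This is a rough illustration only. The function names are hypothetical, and how to extract the identifier from a header (the accession_of argument) depends on the data source that answered, so it is left to the caller.

```python
def split_fasta(text):
    """Split fasta text into a list of entries (header plus sequence lines)."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                entries.append("\n".join(current))
            current = [line]
        elif current:
            current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


def group_by_identifier(ids, fasta_text, accession_of):
    """Return [(identifier, [entries...]), ...] in the order of `ids`.

    `accession_of` is a caller-supplied function mapping a header line to
    the identifier it should be matched on. An identifier with no entries
    gets an empty list; one with several entries gets all of them.
    """
    found = {}
    for entry in split_fasta(fasta_text):
        header = entry.split("\n", 1)[0]
        found.setdefault(accession_of(header), []).append(entry)
    return [(identifier, found.get(identifier, [])) for identifier in ids]
```

For headers in the ">sp|ACCESSION|NAME ..." style, accession_of could be as simple as lambda header: header.split("|")[1]. Identifiers requested in a versioned or secondary form may not match the header text exactly, which is one reason the one-identifier-per-request approach is more robust.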
Why do I sometimes get multiple entries for a single identifier?
Some databases use persistent identifiers which are maintained throughout the life of the data associated with the identifier. Thus, as entries are split and/or merged, identifiers can become associated with multiple entries. When dbfetch/WSDbfetch requests entry data from the database it uses the requested identifier(s), and thus may receive multiple entries for one (or more) identifiers in the request.
When making batch requests with large sets of identifiers against databases which use this type of identifier (e.g. UniProtKB or ENA Sequence), this behavior should be taken into account: the maximum number of entries returned is limited, so the number of identifiers per request should be kept low enough to leave room for identifiers that return multiple entries.
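One simple way to keep each request below the limit is to split the identifier list into fixed-size batches. The sketch below assumes nothing about the actual limit; the batch size is a value the caller must choose with some headroom, and the function name is hypothetical.

```python
def batches(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

For example, list(batches(["a", "b", "c", "d", "e"], 2)) gives [["a", "b"], ["c", "d"], ["e"]]. Each batch would then be sent as its own request.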
How can I tell which entry identifiers are not found when performing a batch request?
To provide the best performance, dbfetch makes a single request to the back-end data source to fetch all the entries in one go. For most data sources the response contains only the data found, with no information describing the identifiers which were not found.
To identify identifiers which are not associated with entries, either use a single identifier for each request and make multiple requests, or retrieve the entries in a batch and process the data obtained to identify the entries associated with specific identifiers.
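The one-identifier-per-request approach can be sketched as below for fasta format data. The fetch_one argument is a hypothetical caller-supplied function that performs the actual dbfetch request for a single identifier and returns the response text. The sketch assumes that a response with no line starting with ">" means no entry was found, which also covers the case where an error message is returned instead of data.

```python
def classify_identifiers(ids, fetch_one):
    """Fetch each identifier separately and sort them by entry count.

    `fetch_one(identifier)` is assumed to return the response text for one
    identifier in fasta format. Entries are counted by header lines.
    """
    report = {"missing": [], "single": [], "multiple": []}
    for identifier in ids:
        text = fetch_one(identifier)
        count = sum(1 for line in text.splitlines() if line.startswith(">"))
        if count == 0:
            report["missing"].append(identifier)
        elif count == 1:
            report["single"].append(identifier)
        else:
            report["multiple"].append(identifier)
    return report
```

This trades performance for certainty: it makes one request per identifier, so it is best reserved for identifier sets small enough that the extra requests are acceptable.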
How can I find out more about the available databases?
Details of the databases available via dbfetch can be found on the databases page and in the meta-information available via the web services (see URL Syntax, WSDbfetch (REST) and/or WSDbfetch (SOAP)). This includes a brief description of the database, a pointer to the database's web site, details of the data formats available via dbfetch and semantic annotations for the database and data formats using DRCAT (Data Resource Catalogue), EDAM (EMBRACE Data and Methods) ontology and MIRIAM (Minimal Information Required In the Annotation of Models) Registry terms.
How can I find out more about the available data formats?
For the primary formats used by a database, see the documentation provided by the database; links to the main web site for each database can be found on the databases page. For example, the documentation for ENA Sequence, UniProt and InterPro details their main data formats:
- ENA data formats, includes details of the ENA Sequence, ENA Read and Trace Archive data formats.
- UniProt.org Technical corner, has details of the UniProt data formats.
- InterPro Documentation, details the InterPro XML format.
For generic data formats, such as the fasta sequence format, there are many sites which provide descriptions, for example:
- Sequence Formats in the 2Can Support Portal
- File Format Reference in the EMBOSS Users Guide
- Molecular biology and bioinformatics file formats in Wikipedia
How can I determine which database an identifier comes from?
While an entry identifier should be accompanied by provenance information detailing the database from which it comes, this is not always the case. It then becomes necessary to work out, from the identifier and any context provided, which database the identifier comes from. In some cases this can be done relatively easily using search engines such as EBI Search, NCBI Entrez or even Google. However, some identifiers are ambiguous: they could have come from a number of data sources, or are common terms in other contexts.
One approach is to compare the identifier against a collection of identifiers from various sources and pick out those with a similar format. The example identifiers provided for each database available via dbfetch (see the databases page) may prove useful in this context.
Alternatively, data resources which describe databases, such as DRCAT (Data Resource Catalogue), the MIRIAM (Minimal Information Required In the Annotation of Models) Registry, and MetaBase, often include details of the identifier formats used for entries in each database, and may provide a service to identify an identifier.
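Format-based matching can be sketched with regular expressions. The patterns below are illustrative assumptions only: the UniProtKB pattern follows the accession format published by UniProt, the ENA pattern is a deliberate simplification, and the dictionary keys are labels chosen for this example rather than dbfetch database names. Authoritative patterns should be taken from the databases page or from one of the registries mentioned above.

```python
import re

# Illustrative patterns only; see the caveats in the text above.
PATTERNS = {
    "uniprotkb": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    # Simplified: one letter + five digits, or two letters + six digits,
    # with an optional ".version" suffix.
    "ena_sequence": re.compile(r"(?:[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})(?:\.[0-9]+)?"),
    "interpro": re.compile(r"IPR[0-9]{6}"),
}


def candidate_databases(identifier):
    """Return the labels of all patterns that match the whole identifier."""
    return [
        label for label, pattern in PATTERNS.items()
        if pattern.fullmatch(identifier)
    ]
```

With these patterns, candidate_databases("M10051") returns ["ena_sequence"], while candidate_databases("P12345") returns ["uniprotkb", "ena_sequence"]. The second result shows the ambiguity described above: format alone cannot always settle which database an identifier belongs to.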
How does dbfetch use multiple data sources?
Dbfetch uses multiple data sources in order to provide a wider range of data formats than is available from a single source, and to mitigate the effects of a single data source being unavailable due to maintenance or a problem with the data source.
To do this, when a request is made to dbfetch, the configuration data for each data source is checked to determine whether the data source can provide the data requested. If the data source can provide the requested data format and result style, dbfetch requests the data from that source. The response from the data source is checked to see if it is in the expected format; if it is not, the next data source capable of providing the data is tried. If all the data sources fail to provide the data in the requested format, an appropriate message is returned. The specific message returned depends on how the data sources responded; see the dbfetch URL syntax guide for details of the possible messages.
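The selection and fallback logic described above can be sketched as follows. This is a minimal illustration, not dbfetch's actual implementation: the DataSource class, the looks_like check and all field names are hypothetical, and the format check is deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class DataSource:
    """Hypothetical description of one configured data source."""
    name: str
    formats: FrozenSet[str]                 # data formats it can provide
    styles: FrozenSet[str]                  # result styles it can provide
    fetch: Callable[[str, str, str], str]   # (identifier, format, style) -> text


def looks_like(fmt, data):
    """Very rough format check; only fasta is handled specifically here."""
    if fmt == "fasta":
        return data.startswith(">")
    return bool(data.strip())


def fetch_with_fallback(sources, identifier, fmt, style):
    """Try each capable source in order; return (source name, data)."""
    tried = []
    for source in sources:
        # Skip sources whose configuration says they cannot serve the request.
        if fmt not in source.formats or style not in source.styles:
            continue
        tried.append(source.name)
        try:
            data = source.fetch(identifier, fmt, style)
        except Exception:
            continue  # source unavailable: move on to the next one
        if looks_like(fmt, data):
            return source.name, data
    raise LookupError(
        f"no data source returned {fmt!r} data for {identifier!r}; tried: {tried}"
    )
```

Returning the name of the source that answered is a choice made for this sketch; it makes visible why, for example, fasta headers can differ between otherwise identical requests.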
How do I cite dbfetch?
The EMBL-EBI's dbfetch service can be cited using:
How do I cite WSDbfetch?
The EMBL-EBI's WSDbfetch services can be cited using:
How can I get help with using dbfetch/WSDbfetch?
If you are having problems, or just have a question about dbfetch or WSDbfetch, please let us know via EMBL-EBI Support.