Help & Documentation

What is EBI Search?

EBI Search [1], [2], [3], also known as 'EB-eye', is a scalable search engine that:

  • provides text search functionality and uniform access to resources and services hosted at the European Bioinformatics Institute (EMBL-EBI);
  • is based on the well-established Apache Lucene technology;
  • exposes both a Web interface and SOAP / RESTful Web Services interfaces;
  • provides inter-domain navigation via a network of cross-references.

How to cite

To cite EBI Search, please refer to the following publication:

Squizzato S., Park Y.M., Buso N., Gur T., Cowley A., Li W., Uludag M., Pundir S., Cham J.A., McWilliam H., Lopez R. (2015)
The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI
Nucleic Acids Research, April 8, 2015; doi: 10.1093/nar/gkv316

If you use EBI Search APIs in your projects, please add 'Powered by EBI Search' to your search pages or any other appropriate place.

What can you search for?

EMBL-EBI hosts a vast amount of molecular data and other information that is indexed by EBI Search. This includes gene and protein sequences, protein families, structures, gene expression data, protein interactions, pathways and small molecules, to name a few. You can also search across the academic literature and patents, as well as information about our institute and staff members. In EBI Search boxes you can enter any meaningful term to find relevant information, for example accession numbers/identifiers (such as VAV_HUMAN), gene symbols (for instance tpi1), species or keywords. For more complex queries, see the EBI Search query syntax.

Definition of a domain

A domain is a data resource in EBI Search.

  • Domains are organised in a hierarchy.
  • A category is a first-level domain such as Genomes, Nucleotide sequences, Protein sequences, Gene expression, Macromolecular structures etc.
  • A leaf domain is a data resource node with no children.
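
The hierarchy described above can be pictured as a tree. The following is a minimal Python sketch of that idea only: the category names come from the text, while the nested dictionary layout and the child names are hypothetical placeholders, not the actual EBI Search domain tree.

```python
# Hypothetical domain tree: the category names appear in the text above,
# the child names are placeholders invented for illustration.
DOMAIN_TREE = {
    "Genomes": {"genomes_example_a": {}, "genomes_example_b": {}},
    "Protein sequences": {"proteins_example": {}},
}


def categories(tree):
    """Return the first-level domains of the tree."""
    return list(tree)


def leaf_domains(tree):
    """Return the names of all nodes that have no children, depth first."""
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_domains(children))
        else:
            leaves.append(name)
    return leaves
```

Calling categories(DOMAIN_TREE) lists the first-level domains, and leaf_domains(DOMAIN_TREE) walks down to the nodes with no children.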

Search results page

When you enter some text in an EBI Search box, you get back a results page for the query just executed. This page is organised into three main columns: on the left there is a summary of the hits per category/domain, with available facets displayed below; in the middle there is the actual list of search results; on the right, related data and alternative views are shown. By clicking on a filter on the left side, you can narrow down search results to the level of a category or domain. Together with the search results for a given category/domain, a list of buttons to save results or launch a tool and an RSS link (Create alert using RSS) are displayed.

Summary

The navigation summary on the left gives users a compact view of, and easy navigation across, the different categories and domains. It provides a means of exploring the search results grouped in relevant subsets and of narrowing the scope of the results.

Facets

Vertical faceted menus, if available, are shown on the left side below the navigation summary. Values applied across facets are normally applied conjunctively, whereas values applied within a given facet are applied disjunctively.
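
The combination rule for facets (AND across facets, OR within a facet) can be sketched in Python as follows. This is an illustration of the logic only, not the EBI Search implementation; the function names and the dictionary-based data layout are assumptions made for the example.

```python
def matches_facets(entry, selected):
    """Return True if the entry satisfies every selected facet.

    entry:    dict mapping a facet name to the entry's value for it
    selected: dict mapping a facet name to the set of values chosen by the user
    """
    for facet, wanted in selected.items():
        if not wanted:
            continue  # nothing chosen in this facet, so it does not filter
        # OR within a facet: any one of the chosen values is enough.
        if entry.get(facet) not in wanted:
            return False  # AND across facets: every facet must match
    return True


def filter_entries(entries, selected):
    """Keep only the entries that satisfy the facet selection."""
    return [entry for entry in entries if matches_facets(entry, selected)]
```

For example, choosing two values in one facet and one value in another keeps the entries that carry either of the first two values and also the value chosen in the second facet.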

Actions on search results

You can save search results in various formats (e.g. XML, JSON, CSV and TSV) that can be used programmatically for further analysis. It is also possible to launch tools such as BLAST on search results by simply clicking on buttons labelled with tool names. On the right-hand side there is a link to create query alerts via RSS feeds, so that users can be notified of new or updated data.

List of search results

This is a list of the search results found by EBI Search with direct URLs to the data entries in the original portals. If your search query was for a gene or protein, links to summaries are presented above the main search results in the section titled Gene & protein summaries.

Gene & protein summaries

These summaries are a useful way to explore the data at EMBL-EBI from the perspective of a gene or protein, for certain key species. A summary collates data from several EMBL-EBI resources and is arranged along the central dogma of molecular biology. The summary page has a stable URL and can be exported/printed as a report. It incorporates information about the gene and its genomic context, its expression within an organism and in response to experimental factors, a wide range of functional information about the protein along with its interaction partners and folded 3D structure. Peer-reviewed publications and patents relevant to the gene or protein are also included. For each gene/protein, a summary comprises five individual sections that you can switch between. These are: gene, expression, protein, protein structure, and literature. You can also switch to another species in order to display equivalent information for a gene's orthologues.

Related data and alternative views

If you click on Related data for a particular entry, you can explore its cross-references to other EMBL-EBI resources. Some data entries displayed in the search results can be viewed in more than one format or through more than one viewing application; where available, these are listed in the Views menu. From the Views menu you can also, where appropriate, launch Sequence Similarity Searching tools such as NCBI BLAST or FASTA, and Protein Functional Analysis tools such as InterProScan.

Query syntax

Overview

When the user types any text in an EBI Search box or specifies a string in the query parameter of a SOAP Web Services interface call, the input is translated into an Apache Lucene query, which is then executed to get the search results. The query follows the standard Apache Lucene query syntax, which provides a generic approach and avoids complex query rearrangements.

Multiple search terms separated by white space are combined with AND logic by default. An input text such as glutathione transferase is therefore treated as glutathione AND transferase, and only entries containing both terms are found.
The default order of results is based on their relevance, i.e. the proximity of the terms in the entries.

In the table below an overview of some useful query syntax elements is presented.

Element | Meaning               | Usage           | Example                                    | Notes
--------|-----------------------|-----------------|--------------------------------------------|------
AND     | In addition to        | term1 AND term2 | glutathione AND transferase                | Matches entries where both glutathione and transferase occur.
OR      | Equivalence           | term1 OR term2  | glutathione OR transferase                 | Matches entries where either glutathione or transferase occurs.
NOT     | Exclusion             | term1 NOT term2 | coding NOT fragment                        | Matches entries containing coding but not fragment.
*       | Wildcard              | partialTerm*    | gluta*                                     | Matches, for instance, glutathione, glutamate, glutamic.
" "     | Exact match           | "quoted text"   | "x-ray diffraction"                        | Exact matching for entries containing x-ray diffraction.
( )     | Grouping              | (text)          | (reductase OR transferase) AND glutathione |
Field:  | Field-specific search | fieldId:term    | description:dopamine                       | Matches entries whose description field contains dopamine.
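
When queries are generated by a program rather than typed by hand, the syntax elements above can be composed with a few small string helpers. The Python sketch below is only an illustration; the helper names are made up for this example, and explicit_default is a naive rendering of the default AND rule that does not handle quoted phrases or operators already present in the text.

```python
def phrase(text):
    """Wrap text in double quotes for exact matching."""
    return '"' + text + '"'


def field(field_id, term):
    """Restrict a term to one field, e.g. description:dopamine."""
    return field_id + ":" + term


def any_of(*terms):
    """Group alternatives with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Require every term by joining them with AND."""
    return " AND ".join(terms)


def explicit_default(text):
    """Spell out the default rule: terms separated by white space are ANDed."""
    return " AND ".join(text.split())
```

For instance, all_of(any_of("reductase", "transferase"), "glutathione") yields the grouping example from the table.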

Other search engines may provide similar capabilities in their query languages; however, the results obtained can differ from those of EBI Search. These differences are usually related to the way data are searched and the nature of the query systems.

Escaping special characters

The following characters must be escaped within queries (by placing a ' \ ' before the character) in order to be interpreted correctly:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Since Apache Lucene supports regular expression searches (matching a pattern between forward slashes), the forward slash ' / ' is also a special character that must be escaped. For example, to search for cancer/testis use the query cancer\/testis. If special characters are not escaped, the query actually performed may differ from what was expected.
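
Client code that builds queries from user input may want to apply this escaping automatically. A minimal Python sketch is shown below; the function name is an assumption, and escaping each ' & ' and ' | ' individually (rather than only the pairs ' && ' and ' || ') is a simplification.

```python
# Single characters taken from the list above. "&&" and "||" are covered by
# escaping every "&" and "|" on its own, which is a simplification.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_term(term):
    """Prefix every special character in term with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)
```

With this helper, escape_term("cancer/testis") produces the escaped form used in the example above.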

Query examples

Following the aforementioned query syntax, users can easily search and filter results according to data content and characteristics.
A few examples of queries that can be performed using EBI Search are listed below.

Identifiers containing colons

As mentioned above, colons are special characters. Some data resources, however, such as Gene Ontology (GO), have colons ' : ' in their main identifiers. When the format [PREFIX]:[number] is used in a search field, query parsing issues may arise because colons are interpreted as separators by default. Although an implicit escaping mechanism is in place, we advise quoting or escaping such search terms whenever in doubt.

For instance to search for all the cross-references called GO that refer to the entry identifier GO:0005730 you have two equivalent options:

  • GO:GO\:0005730
  • GO:"GO:0005730"
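
Both options can be produced programmatically. The Python sketch below uses hypothetical helper names and handles only the characters relevant to this case (colons, plus quotes and backslashes inside a quoted value); it is not a general-purpose escaper.

```python
def field_query_escaped(field_id, value):
    """Build fieldId:value with the colons inside the value backslash-escaped."""
    escaped = value.replace(":", "\\:")
    return field_id + ":" + escaped


def field_query_quoted(field_id, value):
    """Build fieldId:"value", protecting backslashes and quotes in the value."""
    inner = value.replace("\\", "\\\\").replace('"', '\\"')
    return field_id + ':"' + inner + '"'
```

Applied to the field GO and the identifier GO:0005730, the two helpers return the two forms listed above.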

Query parsing error

If a query cannot be parsed, the original search text is placed in quotes. For instance, if the user searches for an expression with an unclosed parenthesis, such as gene(, the query actually performed will be "gene(".
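
The fallback behaviour can be mimicked on the client side to predict what will be executed. In the Python sketch below, looks_parseable is a deliberately rough stand-in for the real query parser: it only checks that parentheses balance and ignores escaped characters and quotes. Both function names are assumptions made for this example.

```python
def looks_parseable(query):
    """Rough stand-in for a real parser: check that parentheses balance."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a closing parenthesis came first
    return depth == 0


def effective_query(query):
    """Return the query unchanged if it parses, otherwise the text in quotes."""
    if looks_parseable(query):
        return query
    return '"' + query + '"'
```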

Notes

Please consider the following notes:

  • Fuzzy queries are deactivated (e.g. the query gene~0.8 is executed as quoted text).
  • Regular expression queries are deactivated (e.g. the query /gene/ is executed as quoted text).
  • Prefix and wildcard queries need at least 3 characters, as in hum*; otherwise they are executed as quoted text.
  • Range queries can only be applied to a specific field (e.g. publication_date:[2010 TO 2011]).
  • Range queries without a field specified are executed as quoted text (e.g. [2010 TO 2011]).
  • If no field is explicitly indicated, the actual query is executed through an expansion of the search text to all fields for each domain.
  • The execution time for a given query depends on its complexity and scope.
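
Some of these rules can be checked before a query is sent. The Python sketch below covers single-term queries only and reads the three-character rule as "at least three characters before the wildcard", which is an interpretation of the note rather than a documented definition. The function name is hypothetical.

```python
import re


def executed_as_quoted(query):
    """Guess whether a single-term query would be run as quoted text."""
    q = query.strip()
    # Fuzzy queries such as gene~0.8 are deactivated.
    if re.fullmatch(r"[^\s~]+~[0-9.]*", q):
        return True
    # Regular expression queries such as /gene/ are deactivated.
    if len(q) >= 2 and q.startswith("/") and q.endswith("/"):
        return True
    # Range queries with no field, such as [2010 TO 2011].
    if re.fullmatch(r"[\[{].+ TO .+[\]}]", q):
        return True
    # Prefix and wildcard terms with fewer than 3 leading characters.
    match = re.fullmatch(r"([^\s*?:]*)[*?]", q)
    if match and len(match.group(1)) < 3:
        return True
    return False
```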

Analyzers

Apache Lucene default analyzers work very well on English text. The EBI Search default analyzer is essentially an Apache Lucene StandardAnalyzer that does not remove stopwords and can recognise special expressions such as email addresses.

Some specialised analyzers are also available for:

  • Chemical formulas:  H2O  ,  C16H28N2O11
  • Chemical names:  16-hydroxyestrone
  • Reactions:  "cyclohexylamine + sulfate => cyclohexylsulfamate + H2O"
  • Dates:  18-MAY-2012

In general, the same analyzers are used during both the indexing and the searching phases.

Relevance

The order of the results presented on the web pages for a search is mainly based on the Apache Lucene scoring system: hits that match the query more closely are considered more relevant.

Although EBI Search can be configured to boost particular domains and/or individual fields, it is recommended to use a boosting factor at search time whenever possible. To boost a term at search time, append the caret symbol ' ^ ' and a boost factor (a number) to the term you are searching for. The higher the boost factor, the more weight the term carries in the ranking. For instance, to give more weight to the first term in the query prostate AND cancer, you can reformulate the query as prostate^4 AND cancer.
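
When queries are assembled in code, the boost can be attached with a small helper. The Python sketch below is illustrative; the function name is an assumption, and rejecting non-positive factors is a precaution chosen for this example rather than a documented EBI Search rule.

```python
def boost(term, factor):
    """Append a search-time boost to a term using the caret syntax."""
    if factor <= 0:
        raise ValueError("boost factor must be positive")
    # The "g" format drops a trailing ".0", so 4 and 4.0 both render as "4".
    return f"{term}^{factor:g}"
```

For example, boost("prostate", 4) + " AND cancer" gives the query prostate^4 AND cancer used above.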

SOAP / RESTful Web Services API

EBI Search resources can be accessed programmatically using SOAP / RESTful Web Services interfaces.

  • You can generate your own client from the public WSDL / WADL or take the SOAP sample clients / RESTful sample clients as a reference implementation.
  • The Web Services API covers almost everything users can do on the Web interface.
  • Users should design Web Services workflows carefully, avoiding unnecessary calls, which can degrade performance.
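
One simple way to avoid unnecessary calls is to remember responses on the client side so that identical requests are sent only once. The Python sketch below illustrates that idea. The base URL, the parameter names (query, size, format) and the class and function names are all assumptions made for this example and should be checked against the published WADL. The actual HTTP call is injected as a function, so the sketch does not depend on any particular HTTP library.

```python
from urllib.parse import quote, urlencode

# Assumed base URL; verify it against the published WADL before relying on it.
BASE_URL = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_search_url(domain, query, size=10, fmt="json"):
    """Compose a search URL for one domain (parameter names are assumptions)."""
    params = urlencode({"query": query, "size": size, "format": fmt})
    return f"{BASE_URL}/{quote(domain, safe='')}?{params}"


class CachedSearch:
    """Remember responses so that identical requests are sent only once."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a URL and returning the body
        self._cache = {}

    def search(self, domain, query, size=10):
        url = build_search_url(domain, query, size)
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

A workflow that repeats the same search several times then triggers a single remote call; the cache should be cleared when fresh results are needed.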

Query alerts via RSS feeds

EBI Search allows users to subscribe to query results via RSS feeds. In search result pages query alert links are shown for each category and domain.

What are alerts for?

Query alerts enable users to stay up-to-date with information and data in particular areas of interest, providing means of monitoring new or updated content.

How are alerts set up?

Alerting systems usually send notifications by email. EBI Search alerts are instead based on the RSS format. Various RSS readers can be used, and modern browsers can also handle and render RSS content. To set up an alert from a results page, click the Create alert button and bookmark the resulting URL or save it in an RSS client.

How do I check for updates?

It is possible to check for updates using: 

  • a browser: the RSS feed URL can be stored as a bookmark in a browser. Going back to that bookmark will re-run the query against the EBI Search server.
  • an RSS client: stored feeds get re-run and updated every time the user requests to view them.
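
A script can play the role of the RSS client. The Python sketch below parses an RSS 2.0 document and reports the items not seen on a previous check. The sample feed is made up purely to exercise the parser, and the function names are assumptions; a real feed would be downloaded from the alert URL and may carry additional elements.

```python
import xml.etree.ElementTree as ET

# Made-up feed used only to exercise the parser; real feeds will differ.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>example alert</title>
  <item><title>Entry one</title><link>http://example.org/1</link></item>
  <item><title>Entry two</title><link>http://example.org/2</link></item>
</channel></rss>"""


def feed_items(rss_text):
    """Return (title, link) pairs for every item of an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


def new_items(rss_text, seen_links):
    """Return the items whose link was not seen on a previous check."""
    return [(title, link) for title, link in feed_items(rss_text)
            if link not in seen_links]
```

Storing the links returned by each check and passing them as seen_links on the next one gives a simple notification of new entries.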

Examples of alerts

Query alerts can be useful to retrieve the latest publications on a particular topic in the literature resources, to obtain lists of the latest reviewed proteins in UniProtKB, or to get the latest new or updated macromolecular structures in PDBe.
Example feeds:

References

[1] Squizzato S., Park Y.M., Buso N., Gur T., Cowley A., Li W., Uludag M., Pundir S., Cham J.A., McWilliam H., Lopez R. (2015)
The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI.
Nucleic Acids Research, published online on April 8, 2015.
DOI: 10.1093/nar/gkv316

[2] Valentin F., Squizzato S., Goujon M., McWilliam H., Paern J. and Lopez R. (2010) 
Fast and efficient searching of biological data resources — using EB-eye.  
Briefings in Bioinformatics, Advance Access published online on February 11, 2010.
DOI: 10.1093/bib/bbp065

[3] Goujon M., Valentin F., Miyar T., McWilliam H. and Lopez R. (2008)
The EB-eye.
EMBnet.news 13.4: 18-21, December 2007.