Help & Documentation

What is the EBI Search?

The EBI Search engine, also known as EB-eye [1], [2]:

  • Provides text search functionality and uniform access to resources and services hosted at the EMBL-EBI.
  • Is based on the well-established Apache Lucene technology.
  • Exposes both a Web interface and SOAP / RESTful Web Services interfaces.
  • Provides inter-domain navigation via a network of cross-references.

What can you search for?

The EMBL-EBI hosts a vast amount of molecular data and other information that is indexed by the EBI Search engine. This includes gene and protein sequences, protein families, structures, gene expression data, protein interactions, pathways and small molecules, to name a few. You can also search across the academic literature and patents, as well as information about our institute and staff members. In the EBI Search box you can enter any meaningful term to find relevant information, for example accession numbers/identifiers (such as VAV_HUMAN), gene symbols (for instance tpi1), species or keywords. For more complex queries, see the EBI Search query syntax.

Definition of a domain

A domain is a data resource under the EBI Search engine.

  • Domains are organised in a hierarchy.
  • A category is a first-level domain such as Genomes, Nucleotide sequences etc.
  • A leaf domain is a dataset node with no children.
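
As an illustration of the hierarchy described above, here is a minimal Python sketch that models domains as a nested dictionary and collects the leaf domains. The category names are taken from the text; every child domain name is invented for the example and does not correspond to a real EBI Search domain identifier.

```python
# Hypothetical hierarchy: the two category names come from the text above,
# all child names are placeholders invented for this sketch.
HIERARCHY = {
    "Genomes": {"example_genomes_a": {}, "example_genomes_b": {}},
    "Nucleotide sequences": {
        "example_group": {"example_leaf_1": {}, "example_leaf_2": {}},
    },
}

def leaf_domains(tree):
    """Return the names of all nodes without children, depth first."""
    leaves = []
    for name, children in tree.items():
        if children:
            leaves.extend(leaf_domains(children))
        else:
            leaves.append(name)
    return leaves

categories = list(HIERARCHY)       # first-level domains
leaves = leaf_domains(HIERARCHY)   # dataset nodes with no children
```

Here categories holds the first-level domains, while leaves holds only the nodes at the bottom of the tree, whatever their depth.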

Search results page

When you enter some text in the EBI Search box you get back a results web page for the query just executed. This page is organised into three main column sections: on the left there is a summary of the hits per category/domain with available facets displayed below; in the middle there is the actual list of search results; on the right related data and alternative views are shown.

Summary

The navigation summary on the left gives users a compact view of the different categories and domains and an easy way to navigate across them. It provides a means of exploring the search results grouped into relevant subsets and of narrowing the scope of the results.

Facets

Vertical faceted menus, if available, are shown on the left side below the navigation summary. Values applied across facets are normally applied conjunctively, whereas values applied within a given facet are applied disjunctively.
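
The facet logic above (AND across facets, OR within a facet) can be sketched in Python as follows. This is only an illustration of the filtering rule, not the engine's implementation; the facet names and values are hypothetical.

```python
def matches_facets(entry, selections):
    """Return True if an entry passes the selected facet values.

    entry:      dict mapping a facet name to the set of values the entry has
    selections: dict mapping a facet name to the set of values the user ticked
    """
    for facet, chosen in selections.items():
        if not chosen:
            continue  # nothing selected in this facet, so it does not filter
        if not (entry.get(facet, set()) & chosen):
            return False  # none of the values selected in this facet match
    return True  # every facet with a selection had at least one match

# Hypothetical entry and facet names, for illustration only.
entry = {"organism": {"Homo sapiens"}, "source": {"UniProt"}}
```

Selecting both "Homo sapiens" and "Mus musculus" under one facet keeps this entry, because one match within the facet is enough. Adding a second facet with a value the entry lacks removes it, because every facet with a selection must match.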

List of search results

This is a list of the search results found by the EBI Search engine with direct URLs to the data entries in the original portals. If your search query was for a gene or protein, links to summaries are presented above the main search results in the section titled Gene & protein summaries.

Gene & protein summaries

These summaries are a useful way to explore the data at the EMBL-EBI from the perspective of a gene or protein, for certain key species. A summary collates data from several EMBL-EBI resources and is arranged along the central dogma of molecular biology. The summary page has a stable URL and can be exported/printed as a report. It incorporates information about the gene and its genomic context, its expression within an organism and in response to experimental factors, a wide range of functional information about the protein along with its interaction partners and folded 3D structure. Peer-reviewed publications and patents relevant to the gene or protein are also included. For each gene/protein, a summary comprises five individual sections that you can switch between. These are: gene, expression, protein, protein structure, and literature. You can also switch to another species in order to display equivalent information for a gene's orthologues.

Related data and alternative views

If you click on Related data for a particular entry you can explore its cross-references to other EMBL-EBI resources. Where available, the Views menu lets you view a data entry in more than one format or through more than one viewing application. From the Views menu, when appropriate for further analysis, you can also launch Sequence Similarity Searching tools, such as NCBI BLAST or FASTA, and Protein Functional Analysis tools such as InterProScan.

Query syntax

Overview

When the user types any text in the EBI Search box or specifies a string in the query parameter of a  SOAP Web Services interface  call, the input is translated into an Apache Lucene query that is then executed to get the search results. The actual query executed is generated following the typical  Apache Lucene query syntax  in order to provide a generic approach avoiding complex query rearrangements.

Multiple search terms separated by white space are combined with AND logic by default. An input such as glutathione transferase is therefore treated as glutathione AND transferase, and only entries containing both terms will be found.
The default order of results is based on their relevance, i.e. the proximity of the terms in the entries.

In the table below an overview of some useful query syntax elements is presented.

Element | Meaning | Usage | Example | Notes
AND | In addition to | term1 AND term2 | glutathione AND transferase | Matches entries where both glutathione and transferase occur.
OR | Equivalence | term1 OR term2 | glutathione OR transferase | Matches entries where either glutathione or transferase occur.
NOT | Exclusion | term1 NOT term2 | coding NOT fragment | Matches entries containing coding but not fragment.
* | Wildcard | partialTerm* | gluta* | Matches for instance glutathione, glutamate, glutamic.
" " | Exact match | "quoted text" | "x-ray diffraction" | Exact matching for entries containing x-ray diffraction.
( ) | Grouping | (text) | (reductase OR transferase) AND glutathione |
Field: | Field-specific search | fieldId:term | description:dopamine | Matches for a field description containing dopamine.
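
The syntax elements in the table above can be assembled programmatically. The following Python sketch uses small helper functions that only build query strings; the helper names are invented for this example and are not part of any EBI Search client.

```python
def all_of(*clauses):
    """Combine clauses so that every one must match (AND)."""
    return " AND ".join(clauses)

def any_of(*clauses):
    """Combine clauses so that at least one must match (OR), in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

def phrase(text):
    """Wrap text in double quotes for exact matching."""
    return f'"{text}"'

def in_field(field_id, term):
    """Restrict a term to one field, using the fieldId:term form."""
    return f"{field_id}:{term}"

# Reproduces the grouping example from the table.
query = all_of(any_of("reductase", "transferase"), "glutathione")
```

Wrapping every OR group in parentheses keeps the grouping explicit when it is later combined with AND, so the result does not depend on operator precedence.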

Other search engines may provide similar capabilities in their query languages; however, the results obtained can differ from those of the EBI Search engine. These differences are usually related to the way data are searched and the nature of the query systems.
 

Escaping special characters

The following characters must be escaped within queries (by placing a backslash ' \ ' before the character) in order to be interpreted correctly:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Since Apache Lucene supports regular expression searches (matching a pattern between forward slashes), the forward slash ' / ' has become a special character that must be escaped. For example, to search for cancer/testis use the query cancer\/testis. If special characters are not escaped, the query actually performed may differ from what was expected.
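
A minimal Python sketch of the escaping rule described above: it places a backslash before every character from the list. Escaping each & and | on its own, rather than only the two-character && and || operators, is an assumption made to keep the sketch simple.

```python
# Characters from the list above; & and | are treated individually (assumption).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_term(term):
    """Prefix every special character in a search term with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)

escaped = escape_term("cancer/testis")  # the query text cancer\/testis
```

Apply the function to individual search terms only, not to a whole query, otherwise intended operators such as parentheses or field colons would be escaped as well.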

Identifiers containing colons

As mentioned above, colons are special characters. Some data resources, however, such as Gene Ontology (GO), have colons ' : ' in their main identifiers. When the format [PREFIX]:[number] is used for a search field, query parsing issues may arise, since colons are interpreted as special separators by default. Although some implicit escaping is in place, the advice is to either quote or escape the search terms when in doubt.

For instance to search for all the cross-references called GO that refer to the entry identifier GO:0005730 you have two equivalent options:

  • GO:GO\:0005730
  • GO:"GO:0005730"
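
The two options above can be generated with a pair of small Python helpers. This is a sketch that handles only colons in the value; the function names are invented for the example.

```python
def field_query_escaped(field, value):
    """Build field:value with every colon inside the value escaped."""
    escaped = value.replace(":", "\\:")
    return f"{field}:{escaped}"

def field_query_quoted(field, value):
    """Build field:"value" with the whole value wrapped in double quotes."""
    return f'{field}:"{value}"'

option_1 = field_query_escaped("GO", "GO:0005730")  # GO:GO\:0005730
option_2 = field_query_quoted("GO", "GO:0005730")   # GO:"GO:0005730"
```

Quoting is usually the simpler choice when the value may contain other special characters, although a value that itself contains double quotes would need extra handling that this sketch does not provide.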

Query parsing error

If a query cannot be parsed, the original search text is placed in quotes. For instance, if the user searches for an expression with an unclosed parenthesis, such as gene(, the query actually performed will be "gene(".
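
The fallback described above can be illustrated with a toy Python check. It only detects unbalanced parentheses and ignores escaped characters, whereas the real parser can reject a query for many other reasons; it is an illustration of the behaviour, not the engine's code.

```python
def parentheses_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' appeared before its '('
    return depth == 0

def effective_query(text):
    """Return the text unchanged if it passes the toy check, else quote it."""
    if parentheses_balanced(text):
        return text
    return f'"{text}"'

fallback = effective_query("gene(")  # the quoted text "gene("
```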

Notes

Please consider the following notes:

  • Fuzzy queries are deactivated (e.g. the query gene~0.8 is executed as quoted text).
  • Regular expression queries are deactivated (e.g. the query /gene/ is executed as quoted text).
  • Prefix and wildcard queries need at least 3 characters, as in hum*, otherwise they are executed as quoted text.
  • Range queries can only be applied to a specific field (e.g. publication_date:[2010 TO 2011]).
  • Range queries without a field specified are executed as quoted text (e.g. [2010 TO 2011]).
  • If no field is explicitly indicated, the actual query is executed through an expansion of the search text to all fields for each domain.
  • The execution time for a given query depends on its complexity and scope.
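
The notes above can be turned into a rough client-side check that predicts whether a single-clause query will be executed as quoted text. The Python sketch below is a heuristic built on regular expressions, not the engine's parser, and it assumes that the three required characters are counted before the first wildcard.

```python
import re

def executed_as_quoted(query):
    """Guess whether a single-clause query falls back to quoted text."""
    q = query.strip()
    if re.fullmatch(r"\S+~[\d.]*", q):           # fuzzy, e.g. gene~0.8
        return True
    if re.fullmatch(r"/.*/", q):                  # regular expression, e.g. /gene/
        return True
    if re.fullmatch(r"[\[{].* TO .*[\]}]", q):    # range without a field
        return True
    wildcard = re.fullmatch(r"(?:\w+:)?([^*?\s]*)[*?]\S*", q)
    if wildcard and len(wildcard.group(1)) < 3:   # fewer than 3 leading characters
        return True
    return False
```

For example, hum* and publication_date:[2010 TO 2011] pass the check, while hu*, gene~0.8, /gene/ and [2010 TO 2011] are flagged as falling back to quoted text.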

Analyzers

The default Apache Lucene analyzers work well on English text. The EBI Search default analyzer is essentially an Apache Lucene StandardAnalyzer that does not remove stopwords and can recognise special expressions such as email addresses.

Some specialised analyzers are also available for:

  • Chemical formulas:  H2O  ,  C16H28N2O11
  • Chemical names:  16-hydroxyestrone
  • Reactions:  cyclohexylamine + sulfate => cyclohexylsulfamate + H2O
  • Dates:  18-MAY-2012

In general the same analyzers are used during both the indexing and the searching phases.

Relevance

The order of the results listed on the web pages for a search is mainly based on the Apache Lucene scoring system: hits with closer matches are more relevant.

Although the EBI Search can be configured to boost particular domains and/or individual fields, it is recommended to use a boosting factor at search time whenever possible. To boost a term at search time, append the caret symbol ' ^ ' and a boost factor (a number) to the term you are searching for. The higher the boost factor, the more relevant the term will be. For instance, to give more weight to the first term in the query prostate AND cancer, you can reformulate the query as prostate^4 AND cancer.
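
Search-time boosting as described above only requires appending the caret and a number to a term. A minimal Python sketch follows; the helper name is invented, and rejecting non-positive factors is a precaution chosen for this example rather than a documented rule.

```python
def boost(term, factor):
    """Append a search-time boost factor to a term, e.g. prostate^4."""
    if factor <= 0:
        raise ValueError("boost factor must be positive")
    return f"{term}^{factor:g}"

# Give more weight to the first term of the query prostate AND cancer.
query = " AND ".join([boost("prostate", 4), "cancer"])
```

The :g format keeps whole numbers free of a trailing ".0", so a factor of 4 is written as ^4 and a factor of 1.5 as ^1.5.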

SOAP / RESTful Web Services API

The EBI Search engine resources can be accessed programmatically using SOAP / RESTful Web Services interfaces.

  • You can generate your own client from the public  WSDL / WADL  or take  SOAP sample clients / RESTful sample clients  as a reference implementation.
  • The Web Services API covers almost everything users can do on the Web interface.
  • Users should design Web Services workflows carefully, avoiding unnecessary calls that can affect performance.
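
As a starting point for a RESTful client, the Python sketch below only builds a request URL and sends nothing. The base URL, the domain identifier "uniprot" and the parameter names query, size and format are assumptions that do not come from this page and should be checked against the published WADL before use.

```python
from urllib.parse import quote, urlencode

# Assumption: base URL and parameter names are placeholders to be verified
# against the public WADL; they are not stated in this documentation.
BASE_URL = "https://www.ebi.ac.uk/ebisearch/ws/rest"

def build_search_url(domain, query, size=10, fmt="json"):
    """Build a search URL for one domain; the query text is URL-encoded."""
    params = urlencode({"query": query, "size": size, "format": fmt})
    return f"{BASE_URL}/{quote(domain)}?{params}"

# "uniprot" is a hypothetical domain identifier used for illustration.
url = build_search_url("uniprot", "glutathione AND transferase")
```

Building the URL in one place makes it easier to cache or deduplicate requests before they are sent, which helps to avoid unnecessary calls.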

References

[1] Valentin F., Squizzato S., Goujon M., McWilliam H., Paern J. and Lopez R. (2010) 
Fast and efficient searching of biological data resources — using EB-eye.  
Briefings in Bioinformatics Advance Access published online on February 11, 2010. 
DOI: 10.1093/bib/bbp065.

[2] Goujon M., Valentin F., Miyar T., McWilliam H. and Lopez R. (2008)
The EB-eye  
EMBnet.news 13.4: 18-21 December 2007. 