How to search ENA with a sequence

Sequence search

Only the EMBL-Bank database of ENA can be searched with a nucleotide sequence; the SRA and TA databases contain raw data composed of very short redundant sequences that make them unsuitable for sequence searching.

The ENA browser allows you to search the entire EMBL-Bank database using either a DNA or RNA query sequence. Searching with a sequence is useful if you:

  • have a sequence but are not sure of the gene name;
  • have an unknown sequence you want to identify;
  • want to find orthologues of a gene in other species, or paralogues within a species;
  • want to identify sequence variants for a gene, including disease or mutant alleles;
  • want to check whether you have identified a novel sequence.

The sequence search box on the ENA browser accepts either an EMBL-Bank accession number (in which case the sequence from that accession is inserted automatically) or a nucleotide sequence.

The nucleotide sequence can be in plain text (i.e. straight sequence with no header) or FASTA format.
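
As a rough illustration of the two accepted input styles, the following Python sketch reduces either a plain-text or a single-record FASTA query to a bare sequence string. The function name `read_query` and the example sequences are hypothetical; this is not code used by the ENA browser.

```python
def read_query(text):
    """Return (header, sequence) from plain-text or single-record FASTA input.

    Hypothetical helper for illustration; not part of any ENA tool.
    The header is None when the input is plain sequence.
    """
    header = None
    parts = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A second header, or a header after sequence lines, means
            # the input holds more than one record.
            if header is not None or parts:
                raise ValueError("expected a single sequence")
            header = line[1:].strip()
        else:
            # Drop internal whitespace so wrapped lines join cleanly.
            parts.append("".join(line.split()))
    return header, "".join(parts)


fasta = ">example made-up query\nACGTACGT\nTTGA\n"
plain = "acgtacgt ttga"
print(read_query(fasta))
print(read_query(plain))
```

The case of the letters is left untouched, since lower case can carry meaning when soft masking is used.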

This simple sequence search will query EMBL-Bank, Ensembl and Ensembl Genomes for similar sequences.

Accessing the advanced sequence search

There is a link from the ENA browser to an advanced sequence search page (Figure 17), which allows you to refine your search to a specific section of EMBL-Bank, Ensembl or Ensembl Genomes, as well as to change search method.

Figure 17. Link from the ENA browser to the advanced sequence search page.

Notes

[A] Advanced Search link on the ENA browser will direct you to the advanced sequence search page. 

Steps

1. Open the ENA browser in a new window.

2. Click on the ‘Advanced Search’ option [A].

Advanced sequence search

There are several search options available on the advanced search page that allow you to refine your search (Figure 18). 

Figure 18. Advanced sequence search page; at the foot of the page you can change the search method.

Notes

[A] Search box will take a sequence or an EMBL-Bank accession number (same as for the simple sequence search).

[B] File Upload allows you to search on sequences held in a file (see help box below for allowable sequence formats).

[C] Search modes allow you to select different types of searches.

[D] Collection allows you to restrict your search to a section of EMBL-Bank, Ensembl or Ensembl Genomes (for example, an EMBL-Bank taxonomic division or a species in Ensembl). If nothing is selected, the search is run against all of EMBL-Bank, Ensembl and Ensembl Genomes, which can take a long time.

  • When an EMBL-Bank taxonomic division is selected, the search automatically includes the more specific taxonomic divisions that fall under it.
  • For example, searching the Vertebrate division will include sequences in the Human, Mouse, Rodent and Mammal divisions, so that all vertebrate species are included in the search.

[E] Masking gives you the option of:

  • no masking (default);
  • soft masking, where you put repetitive sequences in lower case to exclude them from the search; the masked regions remain visible in the resulting alignments.

[F] Further options allows you to search new EMBL-Bank sequences only.
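
To make the soft masking option in note [E] concrete, the Python sketch below lower-cases chosen regions of a query so that they can be treated as masked. The function name `soft_mask`, the coordinates and the sequence are made up for illustration; how repeats are identified in the first place is outside the scope of this sketch.

```python
def soft_mask(sequence, regions):
    """Lower-case each (start, end) region; 0-based, end-exclusive.

    Everything outside the regions is upper-cased so that case carries
    only the masking information.
    """
    chars = list(sequence.upper())
    for start, end in regions:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"region out of range: {(start, end)}")
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


query = "ACGTACACACACACGTTGCA"
# Mask the made-up dinucleotide repeat at positions 4 to 13.
print(soft_mask(query, [(4, 14)]))  # ACGTacacacacacGTTGCA
```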

Steps

1. Enter the EMBL-Bank accession number 'AAA62278' into search box [A].

2. In the drop-down menu for Collection [D], select 'All Sequences'.

3. Click 'Submit Query'.

Help

Sequences can be in any of the EMBOSS sequence formats, which include FASTA format. For a list of these formats, please see the EMBOSS User Manual.

Warning

CAUTION: Microsoft Word format is NOT a sequence format.

Sequence search results page

The results of a sequence search are listed in tables, where the matches for EMBL-Bank and Ensembl are separated so they can be dealt with independently (Figure 19).

Figure 19. Example of a results page for a sequence search using EMBL-Bank CDS accession ‘AAB07223.1’ against both EMBL-Bank and Ensembl.

Notes

[A] Search Completion Bar goes green as the search proceeds.

[B] Query Sequence Details provides the name, description and length of the search sequence.

[C] Ensembl Results are separated from [D] EMBL-Bank Results so they can be dealt with independently.

[E] Filter allows you to reduce search results using a text-based filter.

• For example, filtering using the term ‘Homo sapiens’ will return only human entries.
• Note that there are separate filters for Ensembl and EMBL-Bank results.

[F] Show alignments displays the alignments for the matches in the table (Note: click on 'Next' to view further results).
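
Conceptually, the text filter in note [E] behaves like a substring match over the rows of a results table. The following sketch assumes a case-insensitive match on a description field; the row layout, the identifiers and the function `filter_hits` are invented for illustration and do not reflect how the ENA browser implements its filter.

```python
def filter_hits(hits, term):
    """Keep only hits whose description contains the term (any case)."""
    term = term.lower()
    return [h for h in hits if term in h["description"].lower()]


# Made-up rows standing in for one results table.
hits = [
    {"id": "hit1", "description": "Homo sapiens mRNA, example entry"},
    {"id": "hit2", "description": "Mus musculus mRNA, example entry"},
]
print([h["id"] for h in filter_hits(hits, "Homo sapiens")])  # ['hit1']
```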

A closer look at the sequence search results page

Taking a closer look at the search results, you will see that they are ordered by e-value, starting with the lowest e-value, which corresponds to the most significant match (Figure 20).

Figure 20. A close-up of the Ensembl table in the advanced search results. 

Notes

[A] Select columns allows you to hide/show columns in the table.

[B] Alignment Length column shows the length of the matching region between the query and target sequences.

[C] Target Length column shows the length of the target sequence.

[D] Identity (%) column shows the % of nucleotides that are identical in the query and target sequences.

[E] e-value column displays a calculated estimate of the significance of the match.

Information

It is better to use the e-value as a measure of how significant a match is rather than % identity, because the e-value takes account of the database size and the length of the query sequence, in addition to the number of matching nucleotides.
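
To see why database size and query length matter, consider the e-value model used in BLAST-style statistics, where the expected number of chance matches is E = K * m * n * exp(-lambda * S) for query length m, database length n and raw score S. This is offered only as an illustration: the ENA browser search may calculate its e-values differently, and the values of K and lambda below are placeholders rather than parameters of any particular program.

```python
import math


def expected_hits(score, query_len, db_len, k=0.7, lam=1.3):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


# Same score and query, two hypothetical database sizes.
small = expected_hits(score=40, query_len=500, db_len=1e9)
large = expected_hits(score=40, query_len=500, db_len=1e11)
print(f"small database: {small:.3g}")
print(f"large database: {large:.3g}")
```

Under this model the same alignment score is 100 times less significant in a database that is 100 times larger, which a percentage identity alone cannot show.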

Information

Partial and full-length matches: to determine whether a match covers the full length of the query and/or target sequence, or only part of the sequence (partial match), take a look at the Query Sequence Length (Figure 19 [B]), the Target Length (Figure 20 [C]) and the length of the match between them (Alignment Length, Figure 20 [B]). In addition, viewing the alignments (by expanding the description line of a match in the results table, Figure 20, or by using ‘Show alignments’, Figure 19 [F]) will show you where the query and target sequences match.
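
The comparison described above can be expressed as two simple ratios: the alignment length divided by the query length, and the alignment length divided by the target length. The sketch below is a simplification that ignores gaps in the alignment; the function name, the 0.95 threshold and the example lengths are arbitrary choices for illustration.

```python
def match_coverage(alignment_len, query_len, target_len, full=0.95):
    """Report how much of the query and target a match spans.

    'full' is an arbitrary cut-off for calling a match full-length.
    """
    query_cov = alignment_len / query_len
    target_cov = alignment_len / target_len
    return {
        "query_coverage": query_cov,
        "target_coverage": target_cov,
        "full_length_query": query_cov >= full,
        "full_length_target": target_cov >= full,
    }


# A 1000 nt query aligned over 980 nt to a 4500 nt target.
print(match_coverage(alignment_len=980, query_len=1000, target_len=4500))
```

In this made-up case the match spans nearly the whole query but only part of the target, so it is full-length for the query and partial for the target.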

Additional sequence search tools

Additional search engines are available on the Sequence Similarity Searching page. These include BLAST, PSI-BLAST, FASTA, SSEARCH and specialised search programs. If you wish to search the entire EMBL-Bank and/or Ensembl database, then the fast Exonerate search on the ENA browser is your best option.

However, you might want to consider trying one of the search programs on the Sequence Similarity Searching page if you want to:

  • search a subsection of EMBL-Bank not available through the ENA browser search (Figure 21);
  • search other databases with nucleotide information, for example patent, structure or immunoglobulin databases;
  • carry out a specialised type of search, for example searching a protein database with a DNA sequence (FASTX or BLASTX), or searching with a set of short oligonucleotide sequences (FASTM);
  • search with a very short nucleotide sequence, where a true Smith-Waterman program such as SSEARCH would perform better;
  • be able to adjust search parameters to fine-tune your search query.

Figure 21. Sequence Similarity & Analysis search page detailing the selection of databases available, including EMBL-Bank subdivisions.

Notes

[A] Databank selection includes all the taxonomic divisions and data classes from EMBL-Bank, plus additional databases such as:

  • Immunoglobulin databases IMGT/LIGM-DB, IMGT/HLA, IPD-KIR & IPD-MHC;
  • Human genome variation database HGVBASE;
  • Patent sequence databases NR Patent DNAs Level-1 & Level-2;
  • Nucleotide Structure Sequences.

[B] Databank Selection (close-up) highlights the different subdivisions of EMBL-Bank that can be queried.

[C] EMBL-Bank Taxonomic Divisions allows you to restrict your search to a specific taxonomic division, for example 'EMBL VRT' (EMBL-Bank vertebrate sequences).

[D] EMBL-Bank Divisions/Classes Cross-section allows you to restrict your search to a specific taxonomic division AND a specific data class, for example 'EMBL GSS VRT' (EMBL-Bank GSS class of vertebrate sequences).

  • The taxonomic division/data class cross-sections are only available through Sequence Similarity & Analysis tools.

[E] EMBL-Bank Data Classes allows you to restrict your search to a specific data class, for example 'EMBL GSS' (EMBL-Bank GSS class).

Information

A full course on Sequence Search Strategies will be here soon.