Sequence Search

In March 2015, ENA introduced a new sequence search service built on EBI's central BLAST search service. Our interface lets users select which subset of INSDC sequences to search against, and limit searches by dataclass or taxonomic division.

Programmatic users should use the central EBI BLAST SOAP or REST web services. For guidance on which databases to use to perform searches similar to those offered through our interface, please contact datasubs@ebi.ac.uk.
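
As a rough illustration of programmatic use, the sketch below builds (but does not send) a job submission request for the EBI BLAST REST service. The base URL, the form field names, the "dna" sequence type value, and the "out" result type are assumptions based on the general shape of the EBI tools REST services and should be checked against the current EBI web services documentation. The database name is a placeholder, since the appropriate database depends on the search you want to reproduce.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL of the EBI NCBI BLAST REST service; verify before use.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest/ncbiblast"


def build_run_request(email, program, stype, database, sequence):
    """Build, without sending, the POST request that submits a BLAST job."""
    # Field names are assumptions; confirm them in the service documentation.
    fields = {
        "email": email,
        "program": program,
        "stype": stype,
        "database": database,
        "sequence": sequence,
    }
    data = urlencode(fields).encode("ascii")
    return Request(BASE_URL + "/run", data=data, method="POST")


def status_url(job_id):
    """URL assumed to report the status of a submitted job."""
    return f"{BASE_URL}/status/{job_id}"


def result_url(job_id, result_type="out"):
    """URL assumed to return one result type for a finished job."""
    return f"{BASE_URL}/result/{job_id}/{result_type}"


if __name__ == "__main__":
    request = build_run_request(
        email="user@example.org",
        program="blastn",
        stype="dna",
        database="your_database_name",  # placeholder, not a real database name
        sequence=">query\nACGTACGTACGT",
    )
    print(request.get_method(), request.full_url)
```

Passing the returned request to urllib.request.urlopen would submit the job. The job identifier in the response can then be used with status_url and result_url.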

Query sequence

The query sequence can be pasted into the text area provided or uploaded from a file. Alternatively, if the sequence has a public INSDC accession, you can enter the accession and the sequence will be fetched from ENA. By default, the entire length of the query sequence is used in the search, but there is also an option to limit the search to a fragment of the sequence provided.
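
If you prepare queries outside the interface, the same fragment restriction can be applied before submission. The following is a minimal sketch that reads the first record from FASTA-formatted text and extracts a fragment. The helper names are hypothetical, and the 1-based, inclusive coordinates are an assumption rather than a documented behaviour of the search form.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header, parts = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of the second record
            header = line[1:]
        elif line:
            parts.append(line)
    return header, "".join(parts)


def fragment(sequence, start, end):
    """Return the fragment from start to end (1-based, inclusive; assumed)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("fragment coordinates fall outside the sequence")
    return sequence[start - 1:end]


if __name__ == "__main__":
    header, sequence = read_fasta(">q1 example\nACGT\nTTGA\n")
    print(header, fragment(sequence, 3, 6))
```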

Sequences to search against

Option | Description
Assembled and annotated sequences | Records which have been generated from raw sequencing data and contain functional annotation. They have usually undergone various steps of quality control and may have the functional annotation experimentally validated through wet lab work. Note that WGS sequences are excluded from the searchable set, as are all CON sequences that are not prokaryotic.
Geo-referenced sequences | A subset of assembled and annotated sequences consisting of records with latitude and longitude coordinates.
Barcode sequences | A subset of assembled and annotated sequences consisting of records which conform to Consortium for the Barcode of Life (CBoL) standards.
Non-coding sequences | Annotated non-coding sequences derived from assembled and annotated sequences consisting of records containing specific non-coding features.
Coding sequences | Annotated coding sequences derived from assembled and annotated sequences consisting of records containing coding features.
Vectors (Emvec) | A reference set of plasmid vectors, tag sequences, etc., that can be used for screening and filtering of data for analysis and submission.

Searches against assembled and annotated sequences, coding sequences and non-coding sequences can be limited by taxonomic division or dataclass. However, applying either limit also restricts the search to sequences included in the most recent release, because these limits cannot currently be offered for new and updated sequences.

Programs

There are three different BLAST programs that can be run. Please note which sequence type each program requires, as using a program that does not match the type of the query sequence will result in an error.

Program | Description | Sequence type
blastn | Compares a nucleotide sequence (DNA/RNA) to a nucleotide sequence database. | DNA/RNA
tblastx | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and CPU-intensive. | DNA/RNA
tblastn | Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. | protein
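
To avoid submitting a query with the wrong program, the pairing in the table above can be checked before submission. The sketch below is illustrative only. The function names, the 90% threshold and the simple letter-counting heuristic are assumptions, not the check performed by the service.

```python
# Query type required by each program, taken from the table above.
REQUIRED_QUERY_TYPE = {
    "blastn": "nucleotide",
    "tblastx": "nucleotide",
    "tblastn": "protein",
}

# Letters treated as nucleotide codes by this simple heuristic.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the fraction of A/C/G/T/U/N letters."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


def check_program(program, sequence):
    """Raise ValueError if the guessed query type does not suit the program."""
    required = REQUIRED_QUERY_TYPE[program]
    actual = guess_sequence_type(sequence)
    if actual != required:
        raise ValueError(
            f"{program} expects a {required} query but the sequence looks like {actual}"
        )
    return actual


if __name__ == "__main__":
    print(check_program("blastn", "ACGTACGTNNACGT"))
    print(check_program("tblastn", "MKTAYIAKQRQISFVKSHFSRQ"))
```

A heuristic like this can misclassify short or unusual sequences, so treat it as a convenience check rather than a guarantee.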

Parameters

Once the required program has been selected, the search can be submitted using all of the default settings for that program. To tailor the number of search results returned or how the alignments are scored, click the "More options" link under the program selection and alter any of the parameters listed in the following table.

Parameter | Description | Program(s)
Result options
Maximum scores | Maximum number of match score summaries reported in the result output. | blastn, tblastx, tblastn
Maximum alignments | Maximum number of match alignments reported in the result output. | blastn, tblastx, tblastn
Expect threshold | Limits the number of scores and alignments reported based on the expectation value. This is the maximum number of times the match is expected to occur by chance. | blastn, tblastx, tblastn
Alignment views | Formatting for the alignments. See the Alignment views table below for more information on the options available. | blastn, tblastx, tblastn
Filter low complexity regions | Filter regions of low sequence complexity. This can avoid matches that are due to composition rather than meaningful sequence similarity. However, in some cases filtering also masks regions of interest, so it should be used with caution. | blastn, tblastx, tblastn
Scoring options
Match/mismatch scores | The match score is the bonus added to the alignment score when two aligned bases are identical. The mismatch score is the penalty applied when they differ. | blastn
Drop off | The amount a score can drop before gapped extension of word hits is halted. | blastn, tblastx, tblastn
Gap existence cost | Penalty taken away from the score when a gap is created in a sequence. Increasing the gap existence cost will decrease the number of gaps in the final alignment. | blastn, tblastx, tblastn
Gap extension cost | Penalty taken away from the score for each base or residue in the gap. Increasing the gap extension cost favours short gaps in the final alignment; conversely, decreasing it favours long gaps. | blastn, tblastx, tblastn
Matrix | The substitution matrix used for scoring alignments when searching the database. | tblastx, tblastn
Composition-based adjustments | Whether to use composition-based adjustments, and if so which kind. | tblastn
General options
Align using gaps | If selected, the program will perform an alignment using gaps. Otherwise, it will report only individual high-scoring segment pairs (HSPs) where two sequences match each other, and thus will not produce alignments with gaps. | blastn, tblastx, tblastn
Translation table | Genetic code to use in translation of the query sequence. | tblastx
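
To see how the scoring options interact, the sketch below computes a raw score for an already aligned pair of nucleotide sequences. It charges the gap existence cost once per gap plus the gap extension cost for every gapped position. The default values are illustrative assumptions rather than the defaults of this service, and the function is a simplified model that ignores drop-off and statistical corrections.

```python
def raw_score(query, subject, match=1, mismatch=-3,
              gap_existence=5, gap_extension=2):
    """Score two aligned, equal-length strings in which '-' marks a gap."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence currently carries an open gap, if any
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:
                score -= gap_existence  # charged once when a gap opens
                gap_side = side
            score -= gap_extension  # charged for every gapped position
        else:
            gap_side = None
            score += match if q == s else mismatch
    return score


if __name__ == "__main__":
    print(raw_score("ACGTACGT", "ACGTACGT"))    # identical sequences
    print(raw_score("ACGTACGT", "ACGAACGT"))    # one mismatch
    print(raw_score("ACGTTACGT", "ACG--ACGT"))  # one gap of length two
```

Raising gap_existence penalises every gap equally regardless of its length, whereas raising gap_extension penalises long gaps more than short ones. This is why the two costs shift the alignment in the different ways described in the table.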

Alignment views

There are several options for presenting the alignments in the search result. Each of these is described in the table below.

Option | Description
Pairwise | The query and match are output as a pairwise alignment with a consensus line between the two sequences. In the consensus line, an identical match is shown as the base/residue, a similarity as a '+' and a mismatch as a space.
Query-anchored identities | The matches found are shown relative to the ungapped query sequence as differences from the query. Identities appear as dots (.), similarities in upper case, mismatches in lower case and gaps as dashes (-). Insertions are indicated with a line pointing to the insertion site, with the inserted sequence on another line.
Query-anchored non-identities | The matches found are shown relative to the ungapped query sequence as differences from the query. Identities and similarities appear in upper case, mismatches in lower case and gaps as dashes (-). Insertions are indicated with a line pointing to the insertion site, with the inserted sequence on another line.
Flat query-anchored identities | The matches found are shown relative to the gapped query sequence as differences from the query. Identities appear as dots (.), similarities in upper case, mismatches in lower case and gaps as dashes (-).
Flat query-anchored non-identities | The matches found are shown relative to the gapped query sequence as differences from the query. Identities and similarities appear in upper case, mismatches in lower case and gaps as dashes (-).
BLASTXML | Output NCBI BLAST XML instead of a plain text report.
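
The conventions in the table above can be illustrated with a short sketch that renders the pairwise consensus line and a flat query-anchored identities line from two aligned strings. This is a simplified imitation of the layouts described, not the code BLAST uses. The is_similar argument is a hypothetical hook for a substitution-matrix lookup and defaults to treating nothing as similar, which suits nucleotide alignments.

```python
def consensus_line(query, match, is_similar=lambda a, b: False):
    """Pairwise view: identical residue, '+' for similarity, space otherwise."""
    out = []
    for q, m in zip(query, match):
        if q == m and q != "-":
            out.append(q)
        elif q != "-" and m != "-" and is_similar(q, m):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


def flat_identities(query, match, is_similar=lambda a, b: False):
    """Flat query-anchored identities: '.', upper case, lower case or '-'."""
    out = []
    for q, m in zip(query, match):
        if m == "-":
            out.append("-")        # gap in the matched sequence
        elif q == m:
            out.append(".")        # identity
        elif q != "-" and is_similar(q, m):
            out.append(m.upper())  # similarity
        else:
            out.append(m.lower())  # mismatch
    return "".join(out)


if __name__ == "__main__":
    query, match = "ACGTTA", "AC-TCA"
    print(query)
    print(consensus_line(query, match))
    print(match)
    print(flat_identities(query, match))
```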

Submitting your search

Some searches can take several minutes, especially if you are searching against a large set of sequences (e.g. all assembled and annotated sequences). As a visual cue that a search is still running, we display an ENA "loading" icon in place of the submit button and block the sequence search form from editing. Once the search is complete, this icon will disappear, the submit button will return and a new window will open with the search results. If your browser's pop-up blocker is enabled, you will need to disable it to get the results window. Alternatively, you can select the option to receive an email with a link to the results once the search is complete.

Results

Your results will either be opened in a new window or emailed to you as a link once the search has completed. Your search results will be available for 7 days, so we suggest you download your BLAST results if you think you might need them for longer. The results page is divided into several tabs, each of which is described below.

Summary Table

Column | Description
Align | Checkboxes to select results for further actions, e.g. viewing in an alignment, downloading or sending to multiple sequence alignment.
DB:ID | The BLAST "database" and INSDC accession of the database sequence. The former can be ignored, as it is an internal BLAST service representation of the sequence grouping; the accession links to the record in ENA.
Source | Information about the record, including references to other resources across EBI.
Length | The length of the sequence.
Score | The literal score of the alignment.
Identities % | The percentage of identical bases aligned between the query and database sequences.
Positives % | The percentage of aligned bases that score positively in the substitution matrix (similar bases).
E() | The Expect value (E-value) for the alignment. This is a measure of how likely you are to find the alignment by chance.
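
As background on the E() column, the expected number of chance alignments scoring at least S is commonly described by the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The sketch below evaluates that relation. The values of lam and k in the example are illustrative assumptions, and BLAST itself applies further corrections (such as effective lengths), so the numbers will not reproduce the reported E-values exactly.

```python
import math


def expect_value(raw_score, query_length, database_length, lam, k):
    """Evaluate E = K * m * n * exp(-lambda * S) for one raw alignment score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def passes_threshold(e_value, expect_threshold=10.0):
    """Return True if a hit would be kept under the given expect threshold."""
    return e_value <= expect_threshold


if __name__ == "__main__":
    # lam and k are illustrative; real values depend on the scoring scheme.
    for score in (20, 40, 60):
        e = expect_value(score, 100, 1_000_000, lam=1.0, k=0.5)
        print(score, e, passes_threshold(e))
```

The relation shows why the same alignment receives a larger E-value when searched against a larger set of sequences: E grows in proportion to the database length and falls exponentially with the score.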

Tool output

This tab displays the raw output from BLAST and can be downloaded as either a text or XML file.
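
If you download the XML file, the columns of the summary table can be recomputed locally. The sketch below parses NCBI BLAST XML with the Python standard library and derives the identity and positive percentages for each alignment. The element names follow the NCBI BLAST XML format as commonly documented, and the sample document, including its accession and values, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up example document; the accession and numbers are not real results.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_accession>XX000001</Hit_accession>
          <Hit_len>1200</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-50</Hsp_evalue>
              <Hsp_identity>95</Hsp_identity>
              <Hsp_positive>95</Hsp_positive>
              <Hsp_align-len>100</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def summarize_hits(xml_text):
    """Return one summary row (a dict) per alignment found in BLAST XML text."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        accession = hit.findtext("Hit_accession")
        length = int(hit.findtext("Hit_len"))
        for hsp in hit.iter("Hsp"):
            align_len = int(hsp.findtext("Hsp_align-len"))
            rows.append({
                "accession": accession,
                "length": length,
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "identities_pct": 100.0 * int(hsp.findtext("Hsp_identity")) / align_len,
                "positives_pct": 100.0 * int(hsp.findtext("Hsp_positive")) / align_len,
            })
    return rows


if __name__ == "__main__":
    for row in summarize_hits(SAMPLE_XML):
        print(row)
```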

Visual output

This tab gives a visual representation of the sequence alignments, showing which portions of the sequences align and colour coding them by E-value. Clicking any accession on the left-hand side will take you to that alignment in the tool output tab.

Result summary

While you can select your preferred alignment format before performing your BLAST search, all of the different NCBI BLAST outputs are available to download from the result summary tab.

Submission details

All of the original parameters used for your search are available here, as are links to the query sequence used and the output results. This information is useful if you wish to run the exact query again, run the query programmatically or via the command line tool, or contact EBI for help with your sequence search.
