How to search ENA with taxonomy

How to use the taxonomy portal

To find out what sequence is available for a species, use the taxonomy portal. It lets you navigate a taxonomic tree and shows a summary of the sequence available at each taxonomic node (see 'Navigating the taxonomy tree'), so you can look at the total coverage for any organism or for any node in the tree.

When doing a text search using taxonomy, it is best to use the scientific name because it is more precise. You can also use a common name, but you are more likely to get a range of different taxa. For example, Figure 22 shows a query on the common name 'honey bee', which returns results for four different taxa.

Figure 22. Results of an ENA browser text search on 'Honey bee'; taxonomy results are found in the 'Other' section.

Notes

[A] The 'Taxa results' section provides the taxonomy portal summaries.

[B] Expanding the 'Taxa results' section gives a list of the closest matching taxa.
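The difference between searching by common name and by scientific name can be sketched in a few lines of Python. This is only an illustration: the name table below is a small hand-made example, not the result set shown in Figure 22, and the function names are hypothetical rather than part of any ENA interface.

```python
# Illustrative name table (scientific name, common name); not ENA output.
TAXA = [
    ("Apis mellifera", "honey bee"),
    ("Apis cerana", "Asiatic honey bee"),
    ("Apis dorsata", "giant honey bee"),
    ("Apis florea", "dwarf honey bee"),
]

def search_by_common_name(term):
    """Return every taxon whose common name contains the search term."""
    term = term.lower()
    return [sci for sci, common in TAXA if term in common.lower()]

def search_by_scientific_name(name):
    """Return only the taxon whose scientific name matches exactly."""
    name = name.lower()
    return [sci for sci, _ in TAXA if sci.lower() == name]

common_hits = search_by_common_name("honey bee")
scientific_hits = search_by_scientific_name("Apis mellifera")
```

With this table, the common-name query matches every entry, while the scientific-name query matches a single taxon.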

A closer look at the taxonomy portal

By expanding the 'Taxa results' section, you can see a summary of the nucleotide information available for a taxon (Figure 23):

Figure 23. Taxonomy portal detailing the nucleotide information available for Apis mellifera (honey bee).

Notes

[A] Taxonomy for which information is displayed.

[B] Taxonomy Portal tab (current view) provides a summary of the nucleotide information available for [A].

[C] Navigation tab provides a taxonomy tree so you can navigate between taxonomy nodes.

[D] Genetic code tab provides the translation tables used to translate the coding sequences for this species.

[E] Summary of the nucleotide information available.

Navigating the taxonomy tree

The taxonomy tree also provides an easy way to explore what nucleotide information is available for related taxonomic groups (Figure 24):

Figure 24. Navigation tab showing the taxonomic tree for Apis mellifera (honey bee).

Notes

[A] The taxonomic tree displays the complete lineage of a taxon. A summary of nucleotide data is available for each node in the tree.

[B] Each node can be expanded in order to navigate to related taxa. In this example, the node for Apis [genus] has been expanded.

[C] The current position in the taxonomic tree is highlighted in black.

Troubleshooting

Restricting your search by taxonomy is a good way of cutting out unwanted data, especially if all you need are sequences from one or a few related species. However, you need to be careful that you don't exclude relevant data from your search. There are several points to consider (Figures 25-28):

How specific is the taxonomy you require?

ENA contains information on strains, varieties and breeds for many taxonomic groups, whether or not the sequences vary between them.

Figure 25. There are several dog sequences in EMBL-Bank; this one is for the Alsatian breed.

What if the sequence you require has no taxonomy associated with it?

ENA contains sequences for which no species has been identified, such as those from environmental studies, synthetic constructs, transgenics and patents. These are found under special taxonomic divisions (see 'How is the data structured').

Figure 26. EMBL-Bank entry displaying the source information for an unknown bacterium.

Could your sequence be classified in a different way?

Some sequences are difficult to classify, so search with care to avoid missing valuable data. For example, endogenous viruses are usually classified under the host organism if they were sequenced as part of the host genome, or as viral if they were isolated and sequenced. Therefore, endogenous viruses should be searched under both the VRL (virus) division and the taxonomic division of the host organism.

Figure 27. EMBL-Bank entry of the endogenous virus gamma-3, which is classified as being from the organism Canis lupus familiaris (dog) in the MAM (mammal) division, because it was sequenced as part of the dog genome.
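The effect of searching only one division can be sketched as follows. This is a minimal illustration over an in-memory list: the `Entry` records, their accessions and descriptions are invented for the example, and `search_divisions` is a hypothetical helper, not an ENA function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    accession: str    # placeholder accession, not a real entry
    division: str     # taxonomic division code, e.g. VRL or MAM
    description: str

# Invented example entries.
ENTRIES = [
    Entry("HYPO0001", "VRL", "Endogenous retrovirus, isolated and sequenced"),
    Entry("HYPO0002", "MAM", "Canis lupus familiaris endogenous retrovirus locus"),
    Entry("HYPO0003", "MAM", "Canis lupus familiaris mitochondrial DNA"),
]

def search_divisions(entries, keyword, divisions):
    """Return entries in the given divisions whose description contains the keyword."""
    wanted = set(divisions)
    keyword = keyword.lower()
    return [e for e in entries
            if e.division in wanted and keyword in e.description.lower()]

# Searching VRL alone misses the copy classified under the host.
virus_only = search_divisions(ENTRIES, "endogenous", ["VRL"])
# Searching VRL plus the host division finds both.
both = search_divisions(ENTRIES, "endogenous", ["VRL", "MAM"])
```

Here `virus_only` contains one entry while `both` contains two, which is the reason for searching the host division as well.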

Are you looking to compare sequences from a group of related organisms?

Taxonomic divisions help divide the data into manageable chunks, but be careful which search engine you use:

    • The ENA browser merges divisions together to provide results that are complete for a taxonomic group; for example, if you search 'Rodents' you will include ROD + MUS divisions.

    • The Sequence Search & Analysis tools keep the divisions separate to allow more flexibility when searching; for example, if you search 'EMBL Rodent' you will only include the ROD division (to search ROD + MUS you must select both).
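The difference between the two behaviours can be sketched as a small check on which divisions a tool selection leaves out. The mapping below contains only the rodent example from the text; `BROWSER_GROUPS` and `missing_divisions` are hypothetical names used for illustration.

```python
# Divisions the ENA browser merges for a group; only the example from the text.
BROWSER_GROUPS = {
    "rodents": ["ROD", "MUS"],
}

def missing_divisions(group, selected):
    """List divisions the browser would include for `group` that are
    absent from a manual selection in the search tools."""
    merged = BROWSER_GROUPS[group.lower()]
    return [division for division in merged if division not in selected]

# Selecting only the ROD division leaves out MUS.
only_rod = missing_divisions("Rodents", ["ROD"])
# Selecting both divisions matches the browser's coverage.
rod_and_mus = missing_divisions("Rodents", ["ROD", "MUS"])
```

An empty result means the manual selection covers the same divisions as the browser search.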

Does the organism you are interested in have any alternative names?

Organisms can sometimes have different accepted names, or synonyms, because they were referenced in the literature differently or because different resources use different taxonomic classifications; this is important when you link out to external resources.

Are you certain the taxonomic name for your organism is unique?

Sometimes different organisms can share the same taxonomic name (homonyms).

Figure 28. ENA taxonomy search reveals two organisms with the same genus species name: Agathis montana de Laub is a conifer tree, while Agathis montana Shest is a wasp.
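If you have a list of taxon names to hand, a quick check for homonyms could look like the sketch below. It assumes the genus and species are the first two words of each name and that any authority follows them; the taxon identifiers are placeholders, not real ENA tax IDs.

```python
from collections import defaultdict

# Example records: (placeholder id, name with authority, description).
TAXA = [
    ("taxon-A", "Agathis montana de Laub", "conifer tree"),
    ("taxon-B", "Agathis montana Shest", "wasp"),
    ("taxon-C", "Apis mellifera", "honey bee"),
]

def binomial(name):
    """Return the genus and species words, dropping any authority."""
    return " ".join(name.split()[:2])

def find_homonyms(taxa):
    """Group taxa by genus and species name; keep names shared by several taxa."""
    groups = defaultdict(list)
    for taxon_id, name, description in taxa:
        groups[binomial(name)].append((taxon_id, description))
    return {name: members for name, members in groups.items() if len(members) > 1}

homonyms = find_homonyms(TAXA)
```

Any name that appears in `homonyms` refers to more than one organism, so check the lineage of each match before using the results.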

Where is the taxonomic information in an entry?

When viewing an EMBL-Bank entry, it is good practice to check the Source Feature(s) section to ensure the sequence you are looking at is what you expect it to be (Figure 29). This section also contains valuable additional information, such as sample information, the origin of each region of a transgenic sequence, notes on how a sequence was isolated, or where it occurs in the genome.
 
Figure 29. Source Feature(s) section of EMBL-Bank entry FR695060, which provides detailed information on where the sequence was isolated.
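If you work with entries in EMBL flat-file format, the same check can be scripted. The sketch below assumes the usual flat-file layout, in which feature table lines start with `FT`, the feature key begins in column 6, and qualifiers such as `/organism` follow on their own lines. The sample text is invented for illustration and is not the content of FR695060; `source_features` is a hypothetical helper that does not handle qualifier values wrapped over several lines.

```python
# Invented sample in EMBL flat-file style; not a real entry.
SAMPLE = """\
ID   HYPO0001; SV 1; linear; genomic DNA; STD; ENV; 120 BP.
FT   source          1..120
FT                   /organism="uncultured bacterium"
FT                   /mol_type="genomic DNA"
FT                   /isolation_source="soil"
FT   gene            1..120
FT                   /gene="example"
"""

def source_features(flatfile_text):
    """Return one dict of qualifiers per source feature in the text."""
    sources = []
    current = None  # qualifiers of the source feature being read, if any
    for line in flatfile_text.splitlines():
        if not line.startswith("FT"):
            current = None
            continue
        if len(line) > 5 and line[5] != " ":
            # A new feature starts here; track it only if it is a source.
            key = line[5:].split()[0]
            current = {} if key == "source" else None
            if current is not None:
                sources.append(current)
            continue
        text = line[2:].strip()
        if current is not None and text.startswith("/"):
            name, _, value = text[1:].partition("=")
            current[name] = value.strip('"')
    return sources

sources = source_features(SAMPLE)
organism = sources[0]["organism"]
```

Returning a list rather than a single dictionary allows for entries with more than one source feature, such as transgenic sequences where each region has its own origin.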