How is the data structured?
Data classes and taxonomic divisions
We have learnt that ENA consists of three very large databases: EMBL-Bank, SRA and TA. Because EMBL-Bank is the only one with assembled data, it is the most complex of the three databases. EMBL-Bank contains multiple types of data, everything from whole genome shotgun assemblies to cDNA libraries and patent data. To make the data easier to access, EMBL-Bank has structured the data in two ways:
- By Data Class, which divides entries according to the type of data or the method used to obtain it. For example, WGS (whole genome shotgun) is a data class.
- By Taxonomic Division, which groups entries by the organism they come from. For example, HUM (human) is a taxonomic division.
A sequence can only belong to one class and to one division. For data class, a sequence will belong to the class that best describes that sequence. For taxonomic division, the most specific taxonomy is used to categorise a sequence.
EMBL-Bank is unique in using data classes and taxonomic divisions to create intersecting slices of data. In other words, the database is first divided into data classes, then the data is subdivided by taxonomic division (Figure 8). This allows you to search or download smaller volumes of data. By contrast, most other databases will allow you to access data divided by data class or by taxonomic division, but not by both.
Figure 8. Data is first split into classes, then it is split into intersecting slices by taxonomy.
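The intersecting slices can be pictured as a two-level index, keyed first by data class and then by taxonomic division. The following is a minimal sketch in Python. The accessions are invented placeholders and `slice_entries` is a hypothetical helper; it illustrates the idea only and does not describe how EMBL-Bank is implemented.

```python
from collections import defaultdict

# Hypothetical entries: (accession, data class, taxonomic division).
# The accessions are invented placeholders, not real ENA records.
ENTRIES = [
    ("EX000001", "STD", "HUM"),
    ("EX000002", "STD", "MUS"),
    ("EX000003", "WGS", "HUM"),
    ("EX000004", "WGS", "PLN"),
    ("EX000005", "EST", "HUM"),
]


def slice_entries(entries):
    """Group entries by data class first, then by taxonomic division."""
    slices = defaultdict(lambda: defaultdict(list))
    for accession, data_class, division in entries:
        slices[data_class][division].append(accession)
    return slices


slices = slice_entries(ENTRIES)

# Each entry sits in exactly one (class, division) slice.
print(slices["WGS"]["HUM"])
print(sorted(slices["STD"]))
```

Because every entry has exactly one data class and one taxonomic division, each accession appears in exactly one slice, which is what makes the smaller downloads possible.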
The major data classes include STD (standard), CON (constructed) and WGS (whole genome shotgun).
In ENA, transcript information is found in both the EMBL-Bank and the SRA databases. Within EMBL-Bank, transcript information can be found listed under several data classes, depending upon how the sequence was obtained:
- EST class contains raw expressed sequence tag sequences of variable quality (single-pass reads).
- HTC class contains high-throughput assembled transcript sequences.
- TSA class contains transcriptome shotgun assembly sequences derived from data in the SRA or TA databases.
- STD (standard) class can contain transcript information, which can be searched using the 'mol_type' field, filtering for 'mRNA'.
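The rules in this list can be expressed as a simple filter. The sketch below assumes each entry is a Python dictionary with hypothetical `data_class` and `mol_type` keys and uses invented accessions; it is not an ENA client and makes no calls to any ENA service.

```python
# Data classes that the list above describes as dedicated to transcript data.
TRANSCRIPT_CLASSES = {"EST", "HTC", "TSA"}


def is_transcript(entry):
    """Return True if an entry holds transcript information."""
    if entry["data_class"] in TRANSCRIPT_CLASSES:
        return True
    # STD entries count only when the molecule type is mRNA.
    return entry["data_class"] == "STD" and entry.get("mol_type") == "mRNA"


# Invented example entries; the field names are assumptions for illustration.
entries = [
    {"accession": "EX000001", "data_class": "EST", "mol_type": "mRNA"},
    {"accession": "EX000002", "data_class": "STD", "mol_type": "mRNA"},
    {"accession": "EX000003", "data_class": "STD", "mol_type": "genomic DNA"},
    {"accession": "EX000004", "data_class": "WGS", "mol_type": "genomic DNA"},
]

transcripts = [e["accession"] for e in entries if is_transcript(e)]
print(transcripts)
```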
Sequences from related species are grouped together by taxonomic divisions. In all, there are 15 taxonomic divisions that are used in ENA:
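| Code | Division | Code | Division | Code | Division |
|---|---|---|---|---|---|
| HUM | human | FUN | fungi | VRL | viral |
| MUS | mouse | INV | invertebrate | ENV | environmental |
| ROD | rodent | PLN | plant | SYN | synthetic |
| MAM | mammal | PRO | prokaryote | TGN | transgenic |
| VRT | vertebrate | PHG | phage | UNC | unclassified |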
Each sequence is only assigned to one taxonomic division (otherwise the sequence would be duplicated in different parts of the database). However, as you can see from the list above, some taxonomic divisions overlap. Therefore, sequences are classified according to the most specific division. For example, a mouse sequence could belong to MUS, ROD, MAM or VRT divisions, but it is classified as MUS as this is the most specific category (lowest taxonomic node).
Once a sequence is placed in the most specific taxonomic division, it is then excluded from all remaining taxonomic divisions so as not to duplicate data. For example, the mouse sequence is found in the MUS division and is therefore excluded from the ROD, MAM and VRT divisions, even though a mouse is a rodent, a mammal and a vertebrate (Figure 9).
Figure 9. Sequences are assigned to the most specific taxonomic division.
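The "most specific division wins" rule can be sketched as a lookup over an ordered list. The example below models only the overlapping vertebrate divisions mentioned in the text. The ordering list and the `assign_division` function are illustrative assumptions, not ENA code.

```python
# Overlapping divisions, ordered from most specific to least specific.
# HUM and MUS never overlap each other, so their relative order is arbitrary.
SPECIFICITY_ORDER = ["HUM", "MUS", "ROD", "MAM", "VRT"]


def assign_division(candidate_divisions):
    """Pick the single most specific division a sequence could belong to."""
    for division in SPECIFICITY_ORDER:
        if division in candidate_divisions:
            return division
    raise ValueError("no known division among the candidates")


# A mouse sequence could sit in any of these four divisions...
mouse = assign_division({"MUS", "ROD", "MAM", "VRT"})
# ...whereas a rat sequence has no division more specific than ROD.
rat = assign_division({"ROD", "MAM", "VRT"})
print(mouse, rat)
```

Returning a single value, rather than a list, mirrors the rule that a sequence is stored in one division only.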
These exclusions will become important when we look at 'How to search and browse ENA'. For example, if you want mouse sequences, you must be careful not to select ROD (rodent), as mouse sequences are excluded from this division. Later we will learn that there are exceptions to this rule when searching with the ENA browser, which merges taxonomic divisions together to make searching simpler.
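One way to picture the consequence of these exclusions is that collecting every rodent sequence means combining ROD with the more specific division carved out of it (MUS). The sketch below assumes a nesting of divisions inferred from the mouse example and from basic taxonomy; the `CARVED_OUT` table and `expand_division` function are hypothetical and are not a description of how the ENA browser merges divisions.

```python
# Assumed nesting: each key maps to the more specific divisions carved out
# of it. Inferred for illustration; not an official ENA data structure.
CARVED_OUT = {
    "VRT": ["MAM"],
    "MAM": ["HUM", "ROD"],
    "ROD": ["MUS"],
}


def expand_division(division):
    """Return the division plus every more specific division nested in it."""
    expanded = [division]
    for child in CARVED_OUT.get(division, []):
        expanded.extend(expand_division(child))
    return expanded


# Searching ROD alone misses mouse; the expanded list includes MUS.
rodent_divisions = expand_division("ROD")
print(rodent_divisions)
```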
What if no taxonomy is associated with a sequence?
All EMBL-Bank entries are assigned to a taxonomic division. However, some sequences are not associated with any formal taxonomy (e.g. synthetic sequences), while other sequences are derived from organisms whose taxonomy could not be determined (e.g. metagenomic sequences). ENA has special taxonomic divisions to overcome this problem:
- ENV contains metagenomic and environmental sequences where the taxonomy is unknown or is from a high taxonomic node such as a kingdom (Figure 10).
  - The taxonomy is displayed as: Organism = uncultured bacterium, uncultured eukaryote, metagenome...
- SYN contains synthetic or experimentally altered sequences.
  - The taxonomy is displayed as: Organism = synthetic construct.
- TGN contains transgenic sequences. Transgenics occur when genetic material from one species is transferred to another species, either naturally or through genetic engineering techniques.
  - Taxonomy is provided for both donor and recipient organisms.
- UNC contains unclassified sequences typically obtained from patents, for which taxonomic data is not always available.
  - The taxonomy is displayed as: Organism = unidentified.
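As a rough illustration of how the displayed organism names relate to these special divisions, the sketch below maps an organism string back to a division. The matching rules are assumptions based only on the example names given above, and `special_division` is a hypothetical helper. TGN is left out because transgenic entries carry taxonomy for both donor and recipient rather than a single placeholder name.

```python
def special_division(organism):
    """Guess the special division from a displayed organism name.

    Returns None when the name does not look like one of the placeholders.
    """
    name = organism.strip().lower()
    if name == "synthetic construct":
        return "SYN"
    if name == "unidentified":
        return "UNC"
    # Assumed pattern: environmental names start with "uncultured"
    # or mention "metagenome".
    if name.startswith("uncultured") or "metagenome" in name:
        return "ENV"
    return None


for organism in ["uncultured bacterium", "synthetic construct",
                 "unidentified", "Mus musculus"]:
    print(organism, "->", special_division(organism))
```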