How is the data structured?

Data classes and taxonomic divisions

We have learnt that ENA consists of three very large databases: EMBL-Bank, SRA and TA. Because EMBL-Bank is the only one with assembled data, it is the most complex of the three databases. EMBL-Bank contains multiple types of data, everything from whole genome shotgun assemblies to cDNA libraries and patent data. To make the data easier to access, EMBL-Bank has structured the data in two ways:

  • By Data Class, which divides entries according to the type of data or method used to obtain it. For example, the WGS (whole genome shotgun) data class.

  • By Taxonomic Division. For example, the HUM (human) taxonomic division.

A sequence can only belong to one class and to one division. For data class, a sequence will belong to the class that best describes that sequence. For taxonomic division, the most specific taxonomy is used to categorise a sequence.

EMBL-Bank is unique in using data classes and taxonomic divisions to create intersecting slices of data. In other words, the database is first divided into data classes, then the data is subdivided by taxonomic division (Figure 8). This allows you to search or download smaller volumes of data. By contrast, most other databases will allow you to access data divided by data class or by taxonomic division, but not by both.

Figure 8. Data is first split into classes, then it is split into intersecting slices by taxonomy.

Notes

[A] Intersecting slice of data consisting of ‘Mouse’ + ‘EST’ gives a reduced search set (other INSDC databases only provide parallel data slices, e.g. ‘EST’ for all taxonomy or ‘Mouse’ for all data classes).
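The difference between a parallel slice and an intersecting slice can be pictured as a two-level grouping. The following is a minimal Python sketch, not a description of how ENA stores its data: the records, the accession strings and the `build_slices` helper are all hypothetical and invented for illustration. Only the class and division codes (EST, WGS, STD, MUS, HUM) come from the text.

```python
from collections import defaultdict

# Hypothetical toy records: (accession, data_class, taxonomic_division).
# The accessions are made up; only the class/division codes come from the text.
RECORDS = [
    ("X0001", "EST", "MUS"),
    ("X0002", "EST", "HUM"),
    ("X0003", "WGS", "MUS"),
    ("X0004", "EST", "MUS"),
    ("X0005", "STD", "HUM"),
]

def build_slices(records):
    """Group accessions first by data class, then by taxonomic division."""
    slices = defaultdict(lambda: defaultdict(list))
    for accession, data_class, division in records:
        slices[data_class][division].append(accession)
    return slices

slices = build_slices(RECORDS)

# Parallel slice: every EST entry, regardless of taxonomy.
all_est = [acc for accs in slices["EST"].values() for acc in accs]

# Intersecting slice: only the entries that are both EST and mouse.
mouse_est = slices["EST"]["MUS"]

print(sorted(all_est))
print(mouse_est)
```

The intersecting slice is always a subset of either parallel slice, which is why it gives a smaller search set.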

  

Data classes

For assembled data, which is found in the EMBL-Bank database, each sequence is assigned to a single data class.

 

Help

A full list of data classes can be found in the help pages.

 

The major data classes include STD (standard), CON (constructed) and WGS (whole genome shotgun).

Transcript data

In ENA, transcript information is found in both the EMBL-Bank and the SRA databases. Within EMBL-Bank, transcript information can be found listed under several data classes, depending upon how the sequence was obtained:

  • The HTC class contains high-throughput assembled transcript sequences.
  • The TSA class contains transcriptome shotgun assembly sequences derived from the SRA or TA databases.
  • The STD (standard) class can contain transcript information, and can be searched using the 'mol_type' field, filtering for ‘mRNA’.

 

 

Information

A good way to search for coding transcript data is to query the EMBL-CDS (coding sequence) dataset, which is derived from the different data classes in EMBL-Bank. Because this dataset contains coding regions derived from both genomic and transcript records, you would need to filter the results using 'mol_type = mRNA' to keep only those derived from transcripts.
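The effect of the 'mol_type' filter described above can be sketched on a handful of in-memory records. This is an illustration only: the record identifiers, the dictionary layout and the `filter_by_mol_type` helper are assumptions made for the example and do not reflect the actual ENA search interface.

```python
# Hypothetical coding-sequence records; the identifiers are made up.
CDS_RECORDS = [
    {"id": "CDS0001", "mol_type": "mRNA", "data_class": "STD"},
    {"id": "CDS0002", "mol_type": "genomic DNA", "data_class": "STD"},
    {"id": "CDS0003", "mol_type": "mRNA", "data_class": "HTC"},
    {"id": "CDS0004", "mol_type": "genomic DNA", "data_class": "WGS"},
]

def filter_by_mol_type(records, mol_type):
    """Keep only the records whose mol_type matches exactly."""
    return [record for record in records if record["mol_type"] == mol_type]

transcript_cds = filter_by_mol_type(CDS_RECORDS, "mRNA")
print([record["id"] for record in transcript_cds])
```

Without the filter, coding regions annotated on genomic records would be returned alongside those from transcripts.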

 

 

Taxonomic divisions

Sequences from related species are grouped together by taxonomic divisions. In all, there are 15 taxonomic divisions that are used in ENA:

HUM  human        FUN  fungi          VRL  viral
MUS  mouse        INV  invertebrate   ENV  environmental
ROD  rodent       PLN  plant          SYN  synthetic
MAM  mammal       PRO  prokaryote     TGN  transgenic
VRT  vertebrate   PHG  phage          UNC  unclassified

Each sequence is only assigned to one taxonomic division (otherwise the sequence would be duplicated in different parts of the database). However, as you can see from the list above, some taxonomic divisions overlap. Therefore, sequences are classified according to the most specific division. For example, a mouse sequence could belong to MUS, ROD, MAM or VRT divisions, but it is classified as MUS as this is the most specific category (lowest taxonomic node).
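The "most specific division wins" rule can be expressed as a short Python sketch. It is a simplified illustration rather than ENA's actual assignment procedure: the `CANDIDATE_DIVISIONS` lookup and the `assign_division` function are hypothetical, and only the five overlapping divisions from the example are ranked.

```python
# Overlapping divisions from the mouse example, most specific first.
# Only these five codes are ranked here; the remaining divisions are omitted.
SPECIFICITY_ORDER = ["HUM", "MUS", "ROD", "MAM", "VRT"]

# Hypothetical lookup: every division an organism could, in principle, fall under.
CANDIDATE_DIVISIONS = {
    "mouse": {"MUS", "ROD", "MAM", "VRT"},
    "rat": {"ROD", "MAM", "VRT"},
    "human": {"HUM", "MAM", "VRT"},
}

def assign_division(organism):
    """Return the single most specific division for an organism."""
    candidates = CANDIDATE_DIVISIONS[organism]
    for division in SPECIFICITY_ORDER:
        if division in candidates:
            return division
    raise ValueError(f"no ranked division for {organism!r}")

for name in ("mouse", "rat", "human"):
    print(name, "->", assign_division(name))
```

Because the function returns exactly one code, each sequence lands in one division only, which matches the rule stated above.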

Taxonomic exclusions

Once a sequence is placed in the most specific taxonomic division, it is then excluded from all remaining taxonomic divisions so as not to duplicate data. For example, a mouse sequence is found in the MUS division, therefore it is excluded from the ROD, MAM and VRT divisions, even though a mouse is a rodent, a mammal and a vertebrate (Figure 9).

Figure 9. Sequences are assigned to the most specific taxonomic division.

Notes

[A] MUS (mouse) division contains only mouse sequences.

[B] ROD (rodent), MAM (mammal) and VRT (vertebrate) all exclude mouse sequences because a sequence can only occur in one taxonomic division.

 

These exclusions will become important when we look at 'How to search and browse ENA'. For example, if you want mouse sequences, you must be careful not to select ROD (rodent), as mouse sequences are excluded from this division. Later we will learn that there are exceptions to this rule when searching with the ENA browser, which merges taxonomic divisions together to make searching simpler.
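The practical consequence of these exclusions can be shown with a small Python sketch. The division contents, the accession strings and the `select` helper are hypothetical and exist only to illustrate the point. The sketch assumes that retrieving every rodent sequence, mouse included, means combining the ROD and MUS divisions yourself.

```python
# Hypothetical contents of three divisions; the accessions are made up.
# Each accession appears in exactly one division, as described above.
DIVISIONS = {
    "MUS": {"M0001", "M0002"},  # mouse only
    "ROD": {"R0001"},           # other rodents; mouse is excluded
    "MAM": {"A0001", "A0002"},  # other mammals; mouse and rodents excluded
}

def select(*division_codes):
    """Return the union of the accessions in the chosen divisions."""
    selected = set()
    for code in division_codes:
        selected |= DIVISIONS[code]
    return selected

rodent_only = select("ROD")         # contains no mouse sequences
all_rodents = select("ROD", "MUS")  # mouse sequences must be added explicitly

print(sorted(rodent_only))
print(sorted(all_rodents))
```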

 


What if no taxonomy is associated with a sequence?

All EMBL-Bank entries are assigned to a taxonomic division. However, some sequences are not associated with any formal taxonomy (e.g. synthetic sequences), while other sequences are derived from organisms whose taxonomy could not be determined (e.g. metagenomic sequences). ENA has special taxonomic divisions to overcome this problem:

Environmental sequences

ENV contains metagenomic and environmental sequences where the taxonomy is unknown or is known only to a high taxonomic node such as a kingdom (Figure 10).

  • The taxonomy is displayed as: Organism = uncultured bacterium, uncultured eukaryote, metagenome...
Figure 10. ENA entry containing metagenomic data where no species was identified.
Notes

[A] Organism is only described in general terms ('termite gut metagenome') because little is known about the species the sequence came from.

[B] Taxonomic Division classified as ENV because the precise organism could not be identified.

 

Synthetic sequences

SYN contains synthetic or experimentally altered sequences.

  • The taxonomy is displayed as: Organism = synthetic construct.

Transgenic sequences

TGN contains transgenic sequences. Transgenics occur when genetic material from one species is transferred to another species, either naturally or through genetic engineering techniques.

  • Taxonomy is provided for both donor and recipient organisms.

Unclassified sequences

UNC contains unclassified sequences typically obtained from patents, for which taxonomic data is not always available.

  • The taxonomy is displayed as: Organism = unidentified
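
When reading entries, the displayed organism name hints at which special division applies. The Python sketch below is a rough heuristic built only from the example names quoted above. The `special_division` function is hypothetical and is not how ENA assigns divisions. TGN is left out because those entries carry taxonomy for both donor and recipient organisms rather than a placeholder name.

```python
def special_division(organism):
    """Guess the special division suggested by a displayed organism name.

    Illustrative heuristic only: it pattern-matches the example names quoted
    in the text and is not how ENA assigns divisions.
    """
    name = organism.strip().lower()
    if name == "synthetic construct":
        return "SYN"
    if name == "unidentified":
        return "UNC"
    if name.startswith("uncultured") or name.endswith("metagenome"):
        return "ENV"
    return None  # not recognisably one of the special cases

for organism in ("termite gut metagenome", "uncultured bacterium",
                 "synthetic construct", "unidentified", "Mus musculus"):
    print(organism, "->", special_division(organism))
```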