What are metadata and why are they so important?

 

Metadata are the detailed, controlled description of the sample that your sequence was taken from. In essence, they record the 'what, where, how, and when' of your study, from collection to sequence generation, together with contextual data such as environmental conditions (latitude, longitude, temperature) or clinical observations.

Describing your samples with such data is essential for meaningful comparison with other samples or projects. For that comparison to work, all submitters must draw on a common, constrained set of terms when defining their metadata.

Using a controlled vocabulary to describe metadata

To help describe data, ENA uses controlled vocabularies such as the Environment Ontology (ENVO), which can be browsed on the ENVO page (Figure 1).

Figure 1. Looking for the correct ENVO term to describe the biome environment on the Environment Ontology website.

Surely there is only one way to describe where a sample of lake water comes from!

Here are some of the terms that could be used to describe it:

- fresh water lake
- lake
- inland sea
- pond
- lentic water
- millpond
- inland body of water
- loch
- lentic habitat
- lakelet
- still water
- mere ...

Can you see the confusion that could arise without a controlled vocabulary? Using the ENVO term 'freshwater lake' (ENVO:00000021) instead removes the ambiguity.
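
To illustrate the point, here is a minimal Python sketch of how free-text descriptions might be mapped onto a single controlled term. The lookup table and the function name `normalise` are hypothetical and hand-made for this example; they are not ENVO's own synonym list, and deciding that a given label really means 'freshwater lake' for a particular sample remains a curator's judgement.

```python
# Illustrative only: this table is an assumption made for the example,
# not a list of synonyms published by ENVO.
FRESHWATER_LAKE = ("ENVO:00000021", "freshwater lake")

FREE_TEXT_TO_TERM = {
    "fresh water lake": FRESHWATER_LAKE,
    "lake": FRESHWATER_LAKE,
    "loch": FRESHWATER_LAKE,
    "lakelet": FRESHWATER_LAKE,
    "mere": FRESHWATER_LAKE,
}


def normalise(label):
    """Return (term_id, term_label) for a free-text label, or None if unknown."""
    # Lower-case and collapse runs of whitespace so that
    # "  Fresh  Water LAKE " and "fresh water lake" are treated alike.
    key = " ".join(label.lower().split())
    return FREE_TEXT_TO_TERM.get(key)


if __name__ == "__main__":
    for raw in ["Loch", "  Fresh  Water LAKE ", "river"]:
        print(repr(raw), "->", normalise(raw))
```

Labels that are not in the table come back as `None`, which is a reasonable cue to look the term up in the ontology rather than guess.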

The classification is hierarchical. For these samples it would be:

'Environmental: Geographic feature: Hydrographic feature: Water body: Lake: Freshwater lake'
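
One practical benefit of a hierarchy is that samples can be grouped at any level, for example all water bodies rather than only freshwater lakes. The sketch below is a simplified assumption: it encodes just the path quoted above as a child-to-parent table using the labels from the text, and the helper names `lineage` and `is_a` are hypothetical. The real ontology is far larger and its terms are referenced by identifier.

```python
# Child -> parent links, taken only from the path quoted in the text.
PARENT = {
    "Freshwater lake": "Lake",
    "Lake": "Water body",
    "Water body": "Hydrographic feature",
    "Hydrographic feature": "Geographic feature",
    "Geographic feature": "Environmental",
}


def lineage(term):
    """Return the path from the root of the hierarchy down to `term`."""
    path = [term]
    # Walk upwards until we reach a term that has no recorded parent.
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def is_a(term, ancestor):
    """True if `ancestor` is `term` itself or lies above it in the hierarchy."""
    return ancestor in lineage(term)


if __name__ == "__main__":
    print(": ".join(lineage("Freshwater lake")))
    print(is_a("Freshwater lake", "Water body"))
```

A query for 'Water body' would therefore also pick up samples annotated with the more specific 'Freshwater lake', while the reverse does not hold.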