Describing data consistently
What is metadata?
To be useful, data need to be set in context. One way of doing this is to associate them with metadata, which is essentially data about the data. For instance, if you are sequencing samples from the environment, perhaps to understand biodiversity under different conditions or to investigate associations between crop yield and differences in soil flora, it is useful to know when and where your samples were collected. Standardised descriptors of collection time and geographical location can then be associated with every sequence derived from each sample.
In this short video, Sarah Morgan, previously Scientific Training Coordinator at EMBL-EBI, discusses what metadata is and why it is important to keep track of this information in biological experiments.
The importance of metadata
There are databases dedicated to metadata organisation and storage. For example, the BioSamples database contains metadata on samples used to generate data stored in ENA, PRIDE and ArrayExpress. Storing metadata in this way ensures that a specific sample is referred to consistently in several data resources.
Let’s imagine that the same germplasm sample stored in a seed bank has been used for genomic sequencing, proteomics and RNA-seq. These three experiments can be linked to one another because they all point back to the same record in the BioSamples database. It’s then possible to look at patterns of gene expression and protein production in this sample and compare them with others to learn how the seed is adapted to a specific environment. Storing the metadata in just one database, rather than as part of the records in three or more separate ones, also reduces the amount of data that has to be stored, an issue that has to be taken extremely seriously in the age of big data.
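The idea of several experiments pointing back to one sample record can be sketched in a few lines of Python. This is only an illustration: the sample identifiers (`SAMPLE_0001`, `SAMPLE_0002`), the field names and the `group_by_sample` helper are made up for this example and are not real BioSamples accessions or part of any EMBL-EBI interface.

```python
from collections import defaultdict

# Hypothetical experiment records. The "sample" values are placeholder
# identifiers invented for this sketch, not real BioSamples accessions.
experiments = [
    {"archive": "ENA", "experiment": "genome sequencing", "sample": "SAMPLE_0001"},
    {"archive": "PRIDE", "experiment": "proteomics", "sample": "SAMPLE_0001"},
    {"archive": "ArrayExpress", "experiment": "RNA-seq", "sample": "SAMPLE_0001"},
    {"archive": "ENA", "experiment": "genome sequencing", "sample": "SAMPLE_0002"},
]


def group_by_sample(records):
    """Group experiments by the sample identifier they all refer to."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["sample"]].append((record["archive"], record["experiment"]))
    return dict(grouped)


linked = group_by_sample(experiments)
for sample, related in linked.items():
    print(sample, related)
```

Because each experiment stores only the shared identifier, the sample's description lives in one place and the three experiments on `SAMPLE_0001` can be found together with a single lookup.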

Describing data and metadata consistently
It is vital that both the data and the metadata are represented in a consistent manner. To take a simple example, let’s imagine that two groups have been working on the effect of antidepressants on gene expression in primary cell cultures of neurones. One of them uses the generic names of the drugs to describe their experiments; the other uses proprietary names. Furthermore, despite isolating their cells from the same tissue using very similar methods, they have different names for their cell lines and use these in their database submissions. A computer would treat these two experiments as completely unrelated, and even a human searching for one experiment would be unlikely to find the other. This is why there are agreed standards to describe data, and why databases like those at EMBL-EBI require researchers to annotate their data using these standards when submitting to us.
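To see why a shared vocabulary matters to a computer, here is a minimal sketch of name normalisation. It assumes a small, hand-written lookup table (`GENERIC_NAME`) that maps two proprietary names to their generic equivalents. The table and both function names are illustrative only; a real resource would draw on an agreed standard vocabulary rather than a hard-coded dictionary.

```python
# Illustrative lookup table: proprietary and generic names mapped to one
# agreed generic name. Hand-written for this sketch, not a real standard.
GENERIC_NAME = {
    "prozac": "fluoxetine",
    "fluoxetine": "fluoxetine",
    "zoloft": "sertraline",
    "sertraline": "sertraline",
}


def normalise_drug(name):
    """Return the agreed generic name, or None if the name is unknown."""
    return GENERIC_NAME.get(name.strip().lower())


def same_treatment(name_a, name_b):
    """True only when both names resolve to the same known generic name."""
    a, b = normalise_drug(name_a), normalise_drug(name_b)
    return a is not None and a == b


# Without normalisation the two strings simply differ.
print("Prozac" == "fluoxetine")
# With normalisation the two descriptions can be recognised as the same drug.
print(same_treatment("Prozac", "fluoxetine"))
```

A plain string comparison sees two different labels, whereas comparing the normalised names shows that both groups used the same drug.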
For many types of metadata there are accepted international standards applicable to many fields; for example, if we want to represent location, we can use the standard notation for longitude and latitude. However, as new areas of biology have emerged, and as new technologies have been developed to study them, the research community has had to develop and agree on new standards.
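As a small illustration of applying an existing standard, the sketch below checks that a sampling location is given as latitude and longitude within the valid ranges. It assumes coordinates are recorded in decimal degrees; the function name `validate_coordinates` and the returned field names are hypothetical and chosen only for this example.

```python
def validate_coordinates(latitude, longitude):
    """Check a location given in decimal degrees (an assumption of this sketch).

    Latitude must lie between -90 and 90, longitude between -180 and 180.
    Returns the values as floats in a small dictionary.
    """
    lat = float(latitude)
    lon = float(longitude)
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return {"latitude": lat, "longitude": lon}


# Values supplied as text or as numbers end up in the same representation.
print(validate_coordinates("52.0797", 0.1855))
```

Checking values at the point of entry means that every record describes location in the same way, so locations from different submissions can be compared directly.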