Describing data consistently

The importance of metadata

To be useful, data need to be set in context. One way of doing this is to associate them with metadata - essentially data about the data. For example, if you’re involved in sequencing samples from the environment, perhaps to understand biodiversity in different conditions, or to investigate associations between crop yield and differences in soil flora, it would be useful to know when and where your samples were collected. Standardised descriptors of collection time and geographical location can then be associated with any sequence derived from each sample. Indeed, metadata are so important that we create databases dedicated to organising them. For example, the BioSamples database contains metadata on samples used to generate data stored in ENA, PRIDE and ArrayExpress. Storing metadata in this way ensures that a specific sample is referred to consistently across several data resources.

Let’s imagine, for example, that the same germplasm sample stored in a seed bank has been used for genomic sequencing, proteomics and RNA-seq; these three experiments can be linked to each other because they all point back to the same record in the BioSamples database. It would then be possible to look at patterns of gene expression and protein production in this sample and compare them with others to learn how the seed is adapted to a specific environment. Storing the metadata in just one database, rather than as part of the records in three or more separate ones, is also more cost-effective in terms of data storage - an issue that has to be taken extremely seriously in the age of big data.
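The idea of several experiments pointing back to one sample record can be sketched in a few lines of Python. This is only an illustration: the identifiers (SAMPLE-0001, EXP-A and so on), the class names and the field names are hypothetical and do not reflect real BioSamples accessions or the actual BioSamples data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Sample metadata, stored once (hypothetical structure)."""
    sample_id: str      # placeholder identifier, not a real accession
    description: str


@dataclass(frozen=True)
class Experiment:
    """An experiment record that refers to its sample by identifier only."""
    experiment_id: str  # placeholder identifier
    assay: str
    sample_id: str      # pointer back to the shared sample record


# The sample metadata live in exactly one place.
samples = {
    "SAMPLE-0001": Sample("SAMPLE-0001", "germplasm sample from a seed bank"),
}

# Three experiments in three different resources all cite the same sample.
experiments = [
    Experiment("EXP-A", "genomic sequencing", "SAMPLE-0001"),
    Experiment("EXP-B", "proteomics", "SAMPLE-0001"),
    Experiment("EXP-C", "RNA-seq", "SAMPLE-0001"),
]


def experiments_for_sample(sample_id, all_experiments):
    """Return every experiment that points back to the given sample."""
    return [e for e in all_experiments if e.sample_id == sample_id]


related = experiments_for_sample("SAMPLE-0001", experiments)
assays = sorted(e.assay for e in related)
```

Because each experiment holds only the sample identifier, the sample description is stored once and any correction to it is automatically seen by all three experiments.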

Figure 6. Relationships between different types of data standard. Figure modified from a slide provided by Sandra Orchard.

Describing data and metadata consistently

What is absolutely vital here is that both the data and the metadata are represented in a consistent manner. To take a simple example, let’s imagine that two groups have been working on the effect of antidepressants on gene expression in primary cell cultures of neurones. One of them uses the generic names of the drugs to describe their experiments; the other uses proprietary names. Furthermore, despite isolating their cells from the same tissue using very similar methods, they have different names for their cell lines and use these in their database submission. A computer would think that these two experiments were completely unrelated; furthermore, a human searching for one experiment would be unlikely to find the other. This is why we use agreed standards to describe data - and why, when researchers are submitting their data to us, we ask them to annotate their own data using these standards.
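To see why a computer treats the two descriptions as unrelated, and how an agreed vocabulary fixes this, consider the minimal sketch below. The synonym table, the function names and the two example records are assumptions made for illustration; a real submission would use terms from an agreed ontology or controlled vocabulary rather than a hand-written dictionary.

```python
# Illustrative synonym table: proprietary name -> generic name.
# Hand-written for this example; not taken from any real ontology.
DRUG_SYNONYMS = {
    "prozac": "fluoxetine",
    "zoloft": "sertraline",
}


def standard_drug_name(name):
    """Map a drug name to its generic form, ignoring case and spacing."""
    key = name.strip().lower()
    return DRUG_SYNONYMS.get(key, key)


def same_treatment(record_a, record_b):
    """Compare two experiment records after standardising the drug name."""
    return standard_drug_name(record_a["drug"]) == standard_drug_name(record_b["drug"])


# Hypothetical annotations from the two groups.
group_one = {"drug": "fluoxetine"}  # generic name
group_two = {"drug": "Prozac"}      # proprietary name

# A naive string comparison sees two unrelated experiments...
naive_match = group_one["drug"] == group_two["drug"]

# ...whereas comparing the standardised terms reveals the link.
standardised_match = same_treatment(group_one, group_two)
```

Here naive_match is False while standardised_match is True, which is the difference between an experiment that can be found and one that cannot.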

For many types of metadata there are accepted international standards applicable to many fields; for example, if we want to represent location, we can use the standard notation for longitude and latitude. However, as new areas of biology have emerged, and as new technologies have been developed to study them, the research community has had to develop and agree on new standards.
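As a small illustration of what using a standard notation for location means in practice, the sketch below checks a latitude and longitude pair expressed in decimal degrees. The "latitude, longitude" text layout and the function name are assumptions made for this example; only the valid ranges (latitude from -90 to 90, longitude from -180 to 180) are fixed by convention.

```python
def parse_coordinate_pair(text):
    """Parse 'latitude, longitude' in decimal degrees and check the ranges.

    The comma-separated input layout is an assumption for this example.
    Raises ValueError if the text is malformed or a value is out of range.
    """
    lat_text, lon_text = text.split(",")
    latitude, longitude = float(lat_text), float(lon_text)
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    return latitude, longitude


# A made-up collection site, used only to exercise the function.
site = parse_coordinate_pair("52.08, 0.19")
```

A check like this, applied at submission time, catches swapped or mistyped coordinates before they reach the database.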