Data processing

Data processing aims to extract biologically relevant information from the acquired data. Many of its steps are similar for MS and NMR, and a good understanding of them is important to minimise the risk of skewed or false results. Typically, the endpoint of an MS or NMR metabolomics study is an (annotated) feature matrix (Figure 8). A feature is typically a peak or signal that represents a chemical compound. A feature matrix therefore contains the intensities or (relative) abundances of the relevant signals for every sample, describing the metabolomics fingerprint. Ultimately, the features are annotated so that the matrix becomes a list of identified metabolites with semi-quantified or quantified values. The transposed layout of the matrix is also common.

Figure 8 Example of an MS feature matrix
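
As a rough illustration of the layout described above, the sketch below holds a small feature matrix in plain Python and transposes it. The feature labels, sample names, and intensities are invented for illustration and are not taken from Figure 8.

```python
# Hypothetical feature matrix: rows are samples, columns are features.
# All labels and intensity values below are invented for illustration.
features = ["m/z 180.063", "m/z 132.077", "m/z 90.055"]
samples = ["sample_1", "sample_2", "sample_3"]
intensities = [
    [1520.0, 310.0, 0.0],
    [1490.0, 295.0, 12.0],
    [1710.0, 0.0, 15.0],
]


def transpose(matrix):
    """Swap rows and columns (samples x features -> features x samples)."""
    return [list(column) for column in zip(*matrix)]


def print_matrix(row_labels, column_labels, matrix):
    """Print the matrix as tab-separated text with a header row."""
    print("\t".join([""] + column_labels))
    for label, row in zip(row_labels, matrix):
        print("\t".join([label] + [f"{value:.1f}" for value in row]))


by_feature = transpose(intensities)
print_matrix(samples, features, intensities)   # samples as rows
print_matrix(features, samples, by_feature)    # features as rows
```

Both layouts carry the same information; which one is used depends on what the downstream software expects.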

To compile a feature matrix, noise reduction and background correction are essential before features can be extracted via peak picking, because they remove variation that is not part of the signal. The features extracted from individual samples are then aligned across samples to compensate for drifts in chemical shift (NMR) or retention time (MS), as seen in Figure 9. The aligned features are then aggregated into a feature matrix: each feature has a characteristic chemical shift (NMR) or mass (MS) that can be used as the column header, and each row represents an individual sample.
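
The alignment and aggregation steps can be sketched as follows. This is a minimal illustration, not the method of any particular software package: it assumes each sample has already been reduced to a list of peak positions (here, retention times in minutes) with intensities, and it groups peaks whose positions lie within a hypothetical `tolerance` of the previous peak. The function name `align_features` and all values are assumptions made for this example.

```python
def align_features(peak_lists, tolerance):
    """Group peaks from several samples into shared features.

    peak_lists: one dict per sample, mapping peak position -> intensity.
    tolerance: maximum gap between neighbouring peak positions in a group.
    Returns (feature_positions, matrix); matrix has one row per sample.
    """
    # Collect every peak as (position, sample_index, intensity), sorted by position.
    peaks = sorted(
        (position, index, intensity)
        for index, sample in enumerate(peak_lists)
        for position, intensity in sample.items()
    )
    groups = []
    for peak in peaks:
        # Extend the current group if this peak is close to the group's last peak;
        # otherwise start a new group (a new feature).
        if groups and peak[0] - groups[-1][-1][0] <= tolerance:
            groups[-1].append(peak)
        else:
            groups.append([peak])
    # Label each feature with the mean position of its member peaks.
    positions = [round(sum(p[0] for p in g) / len(g), 4) for g in groups]
    # Missing peaks stay at 0.0 in the matrix.
    matrix = [[0.0] * len(groups) for _ in peak_lists]
    for column, group in enumerate(groups):
        for _, index, intensity in group:
            matrix[index][column] += intensity
    return positions, matrix


# Invented retention times (min) and intensities for three samples.
sample_a = {2.50: 1000.0, 5.10: 400.0}
sample_b = {2.54: 950.0, 5.07: 420.0, 7.80: 60.0}
sample_c = {2.47: 1100.0, 7.83: 55.0}

positions, matrix = align_features([sample_a, sample_b, sample_c], tolerance=0.1)
print(positions)
for row in matrix:
    print(row)
```

Grouping by a fixed tolerance is the simplest possible choice; it breaks down when drifts are larger than the spacing between neighbouring peaks, which is why dedicated alignment algorithms exist.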

Signal distortion image
Figure 9 Combining extracted features together

A summary of components contributing to signal distortions.

  • a – Random noise adds variation to a signal around the mean (zero)
  • b – Systematic noise, e.g. baseline drift, introduces a bias in the data that needs to be removed before data analysis; it can heavily affect signal intensities and derived signal areas
  • c – The actual signal follows – in theory – a Gaussian distribution; deviations from this distribution reflect external factors
  • d – Overlay of components (a), (b), and (c), and the resulting ‘measured’ signal (black line)
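
The four components listed above can be reproduced with a small simulation. The sketch below assumes a linear baseline drift, Gaussian random noise, and a single Gaussian peak; the axis range, peak position, width, height, and noise level are arbitrary values chosen for illustration.

```python
import math
import random


def gaussian_peak(x, centre, width, height):
    """Ideal Gaussian peak shape evaluated at x."""
    return height * math.exp(-((x - centre) ** 2) / (2 * width ** 2))


def simulate_signal(n_points=200, seed=0):
    """Return the axis and the components (a), (b), (c) plus their sum (d)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    xs = [i / (n_points - 1) * 10 for i in range(n_points)]  # arbitrary axis, 0 to 10
    noise = [rng.gauss(0.0, 0.5) for _ in xs]                # (a) random noise around zero
    baseline = [2.0 + 0.3 * x for x in xs]                   # (b) assumed linear drift
    peak = [gaussian_peak(x, 5.0, 0.4, 20.0) for x in xs]    # (c) the actual signal
    measured = [a + b + c for a, b, c in zip(noise, baseline, peak)]  # (d) overlay
    return xs, noise, baseline, peak, measured


xs, noise, baseline, peak, measured = simulate_signal()
print(f"true peak height:     {max(peak):.2f}")
print(f"measured peak height: {max(measured):.2f}")
```

Comparing the two printed heights shows how the baseline and the noise together shift the apparent intensity away from the true value.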

Noise (or error) must be taken into account because it distorts the signals in your data. There are two types of noise:

Random noise

This results from contaminants and general technological limitations. It produces signal spikes and discontinuous data that could be mistaken for meaningful data.
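
One simple way to suppress isolated spikes is a running median, sketched below. This is an illustrative choice rather than a recommended workflow: the `window` size is a hypothetical parameter, the trace values are invented, and a window wider than a real peak would remove that peak as well.

```python
from statistics import median


def median_filter(values, window=3):
    """Replace each point by the median of a window centred on it.

    Windows are truncated at the edges of the trace.
    """
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Invented trace with a single spike at index 3.
trace = [1.0, 1.1, 0.9, 9.0, 1.0, 1.2, 1.1, 0.8, 1.0]
smoothed = median_filter(trace, window=3)
print(smoothed)
```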

Systematic noise

This results from external factors that are not relevant for the study. Baseline drift is one example of systematic noise and is a common problem in liquid chromatography-mass spectrometry (LC-MS), where the gradient of the mobile phase causes an irregular chromatographic baseline (Figure 9d).
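
A minimal baseline-correction sketch follows. It assumes the drift is approximately linear and that the first and last `n_edge` points of the trace contain no signal; both are simplifying assumptions, and real LC-MS baselines are often curved and need more flexible models. The function name and the synthetic trace are hypothetical.

```python
import math


def subtract_linear_baseline(xs, ys, n_edge=10):
    """Subtract a straight line drawn through the mean of each edge region.

    Assumes the first and last `n_edge` points are free of signal.
    """
    x0 = sum(xs[:n_edge]) / n_edge
    y0 = sum(ys[:n_edge]) / n_edge
    x1 = sum(xs[-n_edge:]) / n_edge
    y1 = sum(ys[-n_edge:]) / n_edge
    slope = (y1 - y0) / (x1 - x0)
    return [y - (y0 + slope * (x - x0)) for x, y in zip(xs, ys)]


# Synthetic trace: a Gaussian peak (height 20) on a drifting baseline.
xs = [i / 10 for i in range(101)]
ys = [
    2.0 + 0.3 * x + 20.0 * math.exp(-((x - 5.0) ** 2) / (2 * 0.4 ** 2))
    for x in xs
]

corrected = subtract_linear_baseline(xs, ys)
print(f"peak height before correction: {max(ys):.2f}")
print(f"peak height after correction:  {max(corrected):.2f}")
```

In this synthetic case the correction recovers the true peak height, because the simulated drift matches the linear assumption exactly.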