Proteomics data formats

Proteomics data is available in a variety of formats. The ones used by PRIDE and ProteomeXchange are
defined here:

File name / File content

Mass spectrometry output files (‘Raw’ data)

This is the data and metadata generated by mass spectrometers.  The data may be the original profile mode scans or may already have had some basic processing, such as centroiding, applied.

They may be available as mass spectrometer binary output files or as peak list spectra in a standardised format (see "Standardised MS data formats" below), but not as processed peak lists (see "Processed peak lists" below).

It is important that all the scans generated contain applicable metadata.

Standardised MS data formats

Three MS data formats are used in proteomics:

mzXML - developed at the Institute for Systems Biology (ISB)

mzData - (now obsolete) originally developed by the HUPO Proteomics Standards Initiative (PSI)

mzML - successor to the others (developed by the ISB and PSI).

These data formats can be used to represent processed peak lists, as well as raw data. In addition to the mass spectra, they contain detailed metadata that gives context to the information.
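
As an illustration of how spectra are carried inside such a file: mzML stores each m/z or intensity array as base64-encoded binary, optionally zlib-compressed, with the precision and compression declared alongside the array. The following is a minimal sketch of that encoding step only, assuming little-endian floats; the function names are hypothetical, and a real reader would also parse the surrounding XML and metadata.

```python
import base64
import struct
import zlib


def encode_binary_array(values, precision_bits=64, compressed=False):
    """Pack floats as little-endian binary, optionally compress, then base64-encode."""
    fmt_char = {32: "f", 64: "d"}[precision_bits]
    raw = struct.pack(f"<{len(values)}{fmt_char}", *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_array(encoded, precision_bits=64, compressed=False):
    """Reverse the steps above: base64-decode, optionally decompress, unpack floats."""
    fmt_char = {32: "f", 64: "d"}[precision_bits]
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // (precision_bits // 8)
    return list(struct.unpack(f"<{count}{fmt_char}", raw))


# Illustrative m/z values only, not taken from a real spectrum.
example_mz = [445.12, 446.12, 447.13]
example_encoded = encode_binary_array(example_mz, compressed=True)
example_decoded = decode_binary_array(example_encoded, compressed=True)
```

The caller has to pass the same precision and compression settings that were used when the array was written, which is why the formats record them as metadata next to the data.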

Processed peak lists

A heavily processed form of mass spectrometry data, usually derived from raw data files via various
(semi-)automatic steps, e.g. centroiding, deisotoping and charge deconvolution. These files are formatted in plain text, with typical formats such as dta, pkl, ms2 or mgf.
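
To show what such a plain-text peak list looks like in practice, here is a minimal sketch of a reader for the mgf layout, where each spectrum sits between BEGIN IONS and END IONS, with KEY=value parameter lines followed by "m/z intensity" peak lines. The function name and the example spectrum are hypothetical, and the sketch ignores less common features of the format.

```python
def parse_mgf(text):
    """Return a list of spectra, each a dict with 'params' and 'peaks'."""
    spectra = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z, intensity, and possibly further columns
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up spectrum for illustration only.
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=example spectrum 1
PEPMASS=500.25
CHARGE=2+
100.1 1500.0
200.2 320.5
END IONS
"""

example_spectra = parse_mgf(EXAMPLE_MGF)
```

Each parsed spectrum keeps its parameters as strings, so values such as the charge state ("2+") still need interpreting by the caller.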

Search engine output files

These files contain the data and metadata generated by the software (called search engines) used for performing the identification and quantification of peptides and proteins. Each search engine has its own specific output file format. The outputs are typically formatted in either plain text or XML.

mzIdentML - provides a common format for the export of identification results from any search engine.

mzQuantML - provides a common format for the export of quantification results from any search engine.

mzTab - represents both identification and basic quantification results.
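
As a rough illustration of the last of these formats: mzTab is a tab-separated text file in which the first field of each line names its section, for example MTD for metadata, PRH/PRT for the protein header and rows, PEH/PEP for peptides and PSH/PSM for peptide-spectrum matches. The sketch below splits a file along those prefixes. The function name is hypothetical, and the example content is heavily truncated, so it is not a valid complete mzTab file.

```python
def read_mztab(text):
    """Split mzTab text into a metadata dict and per-section lists of row dicts."""
    header_for = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}
    metadata = {}
    headers = {}
    tables = {"PRT": [], "PEP": [], "PSM": [], "SML": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in header_for:
            # Header line: remember the column names for the matching row type
            headers[header_for[prefix]] = fields[1:]
        elif prefix in tables and prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, tables


# Truncated, made-up example: real files carry many more columns and rows.
EXAMPLE_MZTAB = "\n".join([
    "\t".join(["MTD", "mzTab-version", "1.0.0"]),
    "\t".join(["PSH", "sequence", "PSM_ID", "accession"]),
    "\t".join(["PSM", "PEPTIDEK", "1", "P12345"]),
])

example_metadata, example_tables = read_mztab(EXAMPLE_MZTAB)
```

Lines with other prefixes, such as comment lines, are simply ignored by this sketch.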

To allow a full representation of the processed results in the PRIDE database and in the PX tool, the search engine output files need to be converted to PRIDE XML. PRIDE Converter and PRIDE Converter 2 are the two tools developed by the PRIDE team to make this conversion possible.

Protein/peptide identifications

Proteomics mass spectra can be matched to peptides or proteins, resulting in identifications for those spectra. Typically a spectrum is considered to have been identified if the score attributed to a peptide or protein match qualifies against an a priori or a posteriori defined threshold. In the case of fragmentation spectra, the initial identification will consist of a peptide sequence; subsequent steps will derive a list of proteins from the identified peptides. The protein assembly step can be a discernible process with its own input and output files, or it can be implicit in the overall identification software.
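
The two steps described above, thresholding peptide matches and then deriving proteins from the accepted peptides, can be sketched as follows. This is a deliberately naive illustration, not how any particular search engine works: it assumes that a higher score is better, that the threshold is already known, and that every protein containing an accepted peptide is reported. The function name, field names and data are all hypothetical.

```python
from collections import defaultdict


def identify(psms, peptide_to_proteins, score_threshold):
    """Keep matches scoring at or above the threshold, then group peptides by protein."""
    accepted = [psm for psm in psms if psm["score"] >= score_threshold]
    proteins = defaultdict(set)
    for psm in accepted:
        # Naive protein assembly: a peptide supports every protein it maps to
        for protein in peptide_to_proteins.get(psm["peptide"], ()):
            proteins[protein].add(psm["peptide"])
    return accepted, {name: sorted(peps) for name, peps in proteins.items()}


# Made-up peptide-spectrum matches and peptide-to-protein mapping.
EXAMPLE_PSMS = [
    {"spectrum": "s1", "peptide": "PEPTIDEK", "score": 45.0},
    {"spectrum": "s2", "peptide": "SAMPLER", "score": 12.0},
    {"spectrum": "s3", "peptide": "ANOTHERK", "score": 30.0},
]
EXAMPLE_MAPPING = {
    "PEPTIDEK": ["P1"],
    "SAMPLER": ["P2"],
    "ANOTHERK": ["P1", "P3"],
}

example_accepted, example_proteins = identify(EXAMPLE_PSMS, EXAMPLE_MAPPING, 20.0)
```

In this example the peptide ANOTHERK maps to two proteins, so both are reported. Deciding how to treat such shared peptides is one reason the protein assembly step can be a separate process with its own inputs and outputs.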

Protein/peptide quantification

Protein/peptide expression values can also be obtained from an MS-based proteomics experiment. The resulting data and metadata are then used to perform the quantification analysis of peptides and proteins.

Metadata

A term used to describe data that provides additional information about a particular data set. This information can include how, when and where the data set was generated and what standards were used. In the proteomics context the addition of metadata such as peptide and protein identifications and quantification of their expression values gives meaning to a simple collection of mass spectra output files.