Guide to generating PRIDE XML files

Index

  1. Introduction
  2. The data content of PRIDE XML files
    1. Core data
    2. Metadata
    3. Additional proteomics data: 2D gel data and quantitative data
  3. Creating PRIDE XML files
  4. Checking PRIDE XML files
  5. Metadata
  6. Recommended ontologies

1. Introduction

The default PRIDE submission is a ProteomeXchange (PX) Complete Submission in which the result files are PRIDE XML files. For the submission process itself, please consult our Submission Guidelines. This page covers the most important information for generating PRIDE XML files; please read it before you start. If you still have problems or questions, please e-mail us at pride-support@ebi.ac.uk.

 

2. The data content of PRIDE XML files

For a Complete PX Submission, the PRIDE XML files should contain the following data types: mass spectra, peptide identifications and protein identifications. In addition, they should carry as much metadata as possible, so that the biological and technical context of the submitted data is properly documented.

2.1 Core data: spectra and peptide/protein identifications, post-translational modifications (PTMs)

2.1.1- Spectra: The most important core information for MS/MS data is the spectra themselves. Metadata can also be added to individual spectra using PRIDE Converter 2.

2.1.2- Protein identifications: Protein identifications must be supported by peptide identifications; lists of proteins without peptide assignments are not accepted. Protein accession numbers are reported together with the searched protein database and its version.

Our PICR mapping tool will map your protein accessions to some of the most popular protein sequence databases automatically.

2.1.3- Peptide identifications: Peptide sequences are reported together with their start and end positions within the identified proteins.

2.1.4- PTMs: Post-translational modifications (both natural and artifactual) are listed for each peptide, together with their location in the peptide. All PRIDE-related tools currently support the PSI-MOD ontology. UniMod should no longer be used to report modifications in PRIDE XML files.
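
As an illustration of the core-data rules in sections 2.1.2 to 2.1.4, the following minimal Python sketch checks an in-memory list of identifications before conversion. The Protein and Peptide classes and the check_protein function are hypothetical helpers, not part of any PRIDE tool. The sketch assumes 1-based, inclusive start and end positions and PSI-MOD accessions that begin with "MOD:"; the accession and database values in the usage note are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Peptide:
    sequence: str
    start: int  # assumed: 1-based position of the first residue in the protein
    end: int    # assumed: 1-based position of the last residue, inclusive
    modification_accessions: list = field(default_factory=list)


@dataclass
class Protein:
    accession: str
    database: str
    database_version: str
    peptides: list = field(default_factory=list)


def check_protein(protein):
    """Return a list of human-readable problems found for one protein."""
    problems = []
    # 2.1.2: a protein without peptide assignments is not acceptable.
    if not protein.peptides:
        problems.append(f"{protein.accession}: no peptide assignments")
    for pep in protein.peptides:
        # 2.1.3: start/end should span exactly the peptide sequence.
        if pep.end - pep.start + 1 != len(pep.sequence):
            problems.append(
                f"{protein.accession}/{pep.sequence}: start/end do not match length"
            )
        # 2.1.4: modifications should be reported with PSI-MOD, not UniMod.
        for acc in pep.modification_accessions:
            if not acc.startswith("MOD:"):
                problems.append(
                    f"{protein.accession}/{pep.sequence}: non PSI-MOD term {acc}"
                )
    return problems
```

For example, check_protein(Protein("P12345", "UniProtKB", "2012_01")) reports the missing peptide assignments, while a protein whose peptides have consistent positions and only "MOD:" accessions yields an empty list.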

2.2 Metadata

Metadata refers to additional descriptive information about the project and experiments, the biological samples under examination, protocols, instrumentation, software used to generate the data, and references to associated data files, amongst others. For details, please check the comprehensive metadata table in section 5.

2.3 Additional proteomics data: 2D gel data and quantitative data

Proper storage of 2D gel and quantitative proteomics data (labelled and label-free methods) is not currently supported. However, there is experimental support for labelled methods when Mascot files are used.

 

3. Creating PRIDE XML files

Submitters can create PRIDE XML files from MS/MS output data using the PRIDE Converter 2 tool, which is completely free and open source. PRIDE Converter 2 converts MS/MS data from the most common data formats into valid PRIDE XML files. For a Complete PX Submission, it can generate PRIDE XML files containing protein identifications, peptide assignments and spectra from the following input formats:

Input data format | extension | mass spectra data | peptide identifications | protein identifications
Crux | .txt and .dta or .mgf or .pkl or .mzData or .mzXML | + | + | +
Mascot DAT File | .dat | + | + | +
MSGF | .msgf and .mzXML or .mgf or .ms2 or .pkl | + | + | +
OMSSA | .csv and .dta or .mgf or .pkl or .xml or .ms2 | + | + | +
Proteome Discoverer | .msf | + | + | +
SpectraST | .txt and .dta or .mgf or .pkl or .mzData or .mzXML | + | + | +
X! Tandem Result File + spectra files | .xml and .dta or .mgf or .pkl or .mzData or .mzXML | + | + | +
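
The extension combinations in the table above can be used for a quick pre-check of which input formats a set of files might correspond to. The Python sketch below is only an illustration: candidate_formats and the SUPPORTED mapping are hypothetical and not part of PRIDE Converter 2. It reads "A and B or C" in the table as "one result file with extension A plus at least one spectra file with extension B or C", which is an assumption, and it compares extensions case-insensitively.

```python
import os

# Result-file extension and accepted spectra extensions per input format,
# transcribed from the table above (all lowercased for comparison).
SUPPORTED = {
    "Crux": (".txt", {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
    "Mascot DAT File": (".dat", set()),
    "MSGF": (".msgf", {".mzxml", ".mgf", ".ms2", ".pkl"}),
    "OMSSA": (".csv", {".dta", ".mgf", ".pkl", ".xml", ".ms2"}),
    "Proteome Discoverer": (".msf", set()),
    "SpectraST": (".txt", {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
    "X! Tandem": (".xml", {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
}


def candidate_formats(filenames):
    """Return the format names whose extension requirements are met."""
    extensions = {os.path.splitext(name)[1].lower() for name in filenames}
    matches = []
    for fmt, (result_ext, spectra_exts) in SUPPORTED.items():
        if result_ext not in extensions:
            continue
        # Formats with an empty spectra set carry spectra in the result file.
        if spectra_exts and not (spectra_exts & extensions):
            continue
        matches.append(fmt)
    return matches
```

Because several formats share extensions (for example Crux and SpectraST both use .txt), the function returns every candidate rather than a single answer; the extension alone cannot identify the search engine.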

Support for some data formats is more up to date than for others; this is usually driven by the number of submitters using a particular format.

Input formats that are not supported by PRIDE Converter 2 but are supported by PRIDE Converter 1 (for instance ms_lims, SEQUEST Result Files and Spectrum Mill) can still be converted and used for submission.

Using third-party libraries/software tools: In general, PRIDE does not support PRIDE XML files generated by third-party libraries or tools (for instance: Phenyx, GPMD, Proteios...). Nevertheless, if you believe that the file you created is a valid PRIDE XML file containing all the core data and metadata and is ready for submission, please contact PRIDE support.

 

4. Checking PRIDE XML files

PRIDE XML schema

PRIDE XML is both the internal data format and the submission format of PRIDE. You can find the schema documentation here.

Check the content of the generated PRIDE XML files with PRIDE Inspector

Before uploading your files as part of a PX Complete Submission, we recommend that you check them with our tool PRIDE Inspector. PRIDE Inspector has been developed to visualize and perform an initial quality assessment of the PRIDE XML files generated using PRIDE Converter 2. PRIDE Inspector is completely free and open source. You can download it from here. Installation details, requirements and troubleshooting can be found here.
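
Before opening a large file in PRIDE Inspector, it can be useful to confirm that it is at least well-formed XML. The Python sketch below does only that, using the standard library; it does not validate against the PRIDE XML schema and is not a substitute for PRIDE Inspector. The function name summarize_xml is hypothetical, and no PRIDE element names are assumed: the function simply counts whatever tags it finds.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(path):
    """Stream through an XML file and count elements by tag name.

    Raises xml.etree.ElementTree.ParseError if the file is not well-formed.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(path, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # release the element's content to keep memory use low
    return counts
```

A ParseError at this stage usually points to a truncated or corrupted file, which is worth fixing before any further checks.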

 

5. Metadata

Metadata refers to additional descriptive information about the project and experiments, the biological samples under examination, protocols, instrumentation, software used to generate the data and references to associated data files amongst others.

The following table presents metadata information by name, description, requirement level, recommendation and example.

Requirement levels: required, recommended (optional).

name | description | requirement level | recommendation | example
Title | description of the particular experiment | required | unique title recommended | T.forsythia LC-MALDI - Band 1
Project | comprehensive name of the project | required | PRIDE:0000097 like manuscript title | A high-density, organ-specific proteome map for Arabidopsis thaliana
Reference | links to any literature citations for which this experiment provides supporting evidence. | it must be given once the article has been accepted and the experiments made public | 1. PMID (2. DOI) | 19663511
ShortLabel | grouping/organising experiments in meaningful ways | required | cannot be an empty string | Control Exp II
ProtocolName | The protocol element defines the sample processing steps that have been performed | required | as many details as possible | In Gel Protein digestion - Chymotrypsin, Reduction - DTT, Alkylation - iodoacetamide, Enzyme - Chymotrypsin
ProtocolSteps and StepsDescription | The protocol element defines the sample processing steps that have been performed | recommended | PRIDE cv terms | <cvParam cvLabel="PRIDE" accession="PRIDE:0000026" name="Alkylation" value="iodoacetamide" />
Experiment description | description of the goals and objectives of this study, summary of the abstract but it can be as long as the abstract | recommended | 2-3 sentences | Identification of ubiquitin remnant peptides using a diglycyl-lysyl monoclonal antibody for immunoprecipitation
SampleName | biological sample used to generate the dataset | required | A short label that is referable to the sample used to generate the dataset | Mouse embryonic stem cells
sampleDescription | cvParam: Expansible description of the sample used to generate the dataset | recommended | NEWT and BTO, see Recommended Ontologies below | <cvParam cvLabel="NEWT" accession="10090" name="Mus musculus (Mouse)" />
Contact | name, institution, contactInfo | required | valid email address minimally provided as contact info | Joe Poster, European Bioinformatics Institute, jposter@ebi.ac.uk
InstrumentName | Descriptive name of the instrument (make, model, significant customisations) | required | manufacturer name for model | LTQ-Orbitrap
Source | ion source information | required | child of term MS:1000008 | MS:1000398 nanoelectrospray
Analyzer | single or multiple components of the mass analyzer | required | children terms of MS:1000443 | MS:1000081 quadrupole
Detector | detector type used | required | children terms of MS:1000026 | MS:1000114: microchannel plate detector
SoftwareName | list of any kind of software used during data acquisition and data processing, the software that produced the peak list | required | list all software that's been used during data processing | Mascot Distiller
SoftwareVersion | version of the software used | recommended | | MDRO 2.3.2.0
ProcessingMethod | Description of the default peak processing method | required | children terms of MS:1000452 | MS:1000033 deisotoping
SearchEngine | name of the protein search engine used | required | version can be given here | Mascot 2.2.1
XML generation software | the software used to generate the PRIDE xml file | recommended | PRIDE:0000175 | PRIDE Converter Toolsuite 2.0
Original MS data file format | Original format of the file containing MS data | recommended | PRIDE:0000218 | Mascot DAT File
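
Several examples in the table are cvParam elements. For submitters who assemble such annotations programmatically, the following Python sketch builds one with the standard library. The helper name make_cv_param is hypothetical; the attribute names (cvLabel, accession, name, value) are taken from the examples in the table, and the sketch assumes that value is optional because the NEWT example has none.

```python
import xml.etree.ElementTree as ET


def make_cv_param(cv_label, accession, name, value=None):
    """Build a cvParam element with the attributes used in the table above."""
    attributes = {"cvLabel": cv_label, "accession": accession, "name": name}
    if value is not None:
        # Only add the value attribute when one is given (assumed optional).
        attributes["value"] = value
    return ET.Element("cvParam", attributes)


# Reproduce the two cvParam examples from the metadata table.
alkylation = make_cv_param("PRIDE", "PRIDE:0000026", "Alkylation", "iodoacetamide")
species = make_cv_param("NEWT", "10090", "Mus musculus (Mouse)")

print(ET.tostring(alkylation, encoding="unicode"))
print(ET.tostring(species, encoding="unicode"))
```

Building the element through an XML library rather than string concatenation ensures that special characters in names or values are escaped correctly.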

 

6. Recommended ontologies

We recommend using the following ontologies for annotating data and metadata.

ID | name | source link | description & use
MS | PSI Mass Spectrometry Ontology | source | A structured controlled vocabulary for the annotation of mass spectrometry experiments. Developed by the HUPO Proteomics Standards Initiative.
PSI-MOD | PSI protein modification ontology | http://psidev.sourceforge.net/mod/data/PSI-MOD.obo | protein chemical modifications, classifying protein modifications either by the molecular structure of the modification, or by the amino acid residue that is modified
NEWT | new taxonomy portal | http://www.uniprot.org/taxonomy/ http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=NEWT | organisms are classified in a hierarchical tree structure, to specify the sample species used in the experiments
BTO | BRENDA tissue / enzyme source | source | A structured controlled vocabulary for the source of an enzyme. It comprises terms of tissues, cell lines, cell types and cell cultures from uni- and multicellular organisms
GO | Gene Ontology | source | gene product characteristics and gene product annotation data
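
When annotating with these ontologies, a simple format check can catch mistyped accessions early. The Python sketch below is an illustration only: accession_looks_valid and the ACCESSION_PATTERNS mapping are hypothetical, and the patterns are assumptions based on commonly seen accession shapes (for example MS:1000081 and the NEWT taxonomy identifier 10090 used on this page). A matching pattern does not prove that the term exists in the ontology.

```python
import re

# Assumed accession shapes per ontology ID from the table above.
ACCESSION_PATTERNS = {
    "MS": re.compile(r"MS:\d{7}"),
    "PSI-MOD": re.compile(r"MOD:\d{5}"),
    "NEWT": re.compile(r"\d+"),  # plain numeric taxonomy identifier
    "BTO": re.compile(r"BTO:\d{7}"),
    "GO": re.compile(r"GO:\d{7}"),
}


def accession_looks_valid(ontology_id, accession):
    """Return True if the accession matches the assumed shape for the ontology."""
    pattern = ACCESSION_PATTERNS.get(ontology_id)
    return bool(pattern and pattern.fullmatch(accession))
```

For instance, accession_looks_valid("MS", "MS:1000081") passes, while a truncated accession such as "GO:12" or an unknown ontology ID is rejected. Whether a well-shaped term actually exists still has to be checked against the ontology itself.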