Guide to generate PRIDE XML files
- The submission content
- Creating PRIDE XML files
- Checking PRIDE XML files
- Recommended ontologies
The default PRIDE submission is a ProteomeXchange Complete Submission in which the result files are PRIDE XML files. For the submission process, please consult our Submission Guidelines. This page contains the most important information for generating PRIDE XML files; please read it first. If you still have problems or questions, please e-mail us at email@example.com.
For a Complete PX Submission, the PRIDE XML files should contain the following data types: mass spectra, peptide identifications and protein identifications. In addition, they should include as much metadata as possible, so that the submitted data carry the proper biological and technical context.
2.1.1- Spectra: The spectra themselves are the core information for MS/MS data. Metadata can also be added to individual spectra using PRIDE Converter 2.
2.1.2- Protein identifications: Protein identifications must be supported by peptide identifications; lists of proteins without peptide assignments are not accepted. Protein accession numbers are given together with the searched protein database and its version.
Our PICR mapping tool will map your protein accessions to some of the most popular protein sequence databases automatically.
2.1.3- Peptide identifications: Peptide sequences are reported together with their start and end positions in the identified proteins.
2.1.4- PTMs: Post-translational modifications (both natural and artifactual) are listed for each peptide, together with their location in the peptide. All PRIDE-related tools currently support the PSI-MOD ontology. UniMod should no longer be used to report modifications in PRIDE XML files.
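The core-data rules above (every protein needs peptide assignments, peptides carry start and end positions, modifications use PSI-MOD) can be sketched as a small pre-submission check. This is only an illustration: the element names (`GelFreeIdentification`, `PeptideItem`, `ModificationItem` and their children) are assumed from the PRIDE XML schema and should be verified against the schema documentation, and the sample fragment and its accessions are hypothetical. It is not a replacement for PRIDE Inspector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed fragment. Element names are an assumption
# based on the PRIDE XML schema; all values are illustrative only.
SAMPLE = """
<Experiment>
  <GelFreeIdentification>
    <Accession>P12345</Accession>
    <Database>ExampleDB</Database>
    <DatabaseVersion>1.0</DatabaseVersion>
    <PeptideItem>
      <Sequence>LSEGDMAAR</Sequence>
      <Start>10</Start>
      <End>18</End>
      <ModificationItem>
        <ModLocation>6</ModLocation>
        <ModAccession>MOD:00719</ModAccession>
      </ModificationItem>
    </PeptideItem>
  </GelFreeIdentification>
  <GelFreeIdentification>
    <Accession>Q99999</Accession>
    <Database>ExampleDB</Database>
  </GelFreeIdentification>
</Experiment>
"""


def check_identifications(xml_text):
    """Return a list of human-readable problems found in the identifications."""
    root = ET.fromstring(xml_text)
    problems = []
    for ident in root.iter("GelFreeIdentification"):
        accession = ident.findtext("Accession", default="?")
        peptides = ident.findall("PeptideItem")
        # Proteins without any peptide assignment are not accepted.
        if not peptides:
            problems.append(f"{accession}: no peptide assignments")
        for pep in peptides:
            seq = pep.findtext("Sequence", default="?")
            # Start and end positions within the protein should be present.
            if pep.findtext("Start") is None or pep.findtext("End") is None:
                problems.append(f"{accession}/{seq}: missing start or end position")
            # Modifications should be reported with PSI-MOD accessions (MOD:NNNNN).
            for mod in pep.findall("ModificationItem"):
                mod_acc = mod.findtext("ModAccession", default="")
                if not mod_acc.startswith("MOD:"):
                    problems.append(f"{accession}/{seq}: {mod_acc!r} is not a PSI-MOD accession")
    return problems


if __name__ == "__main__":
    for problem in check_identifications(SAMPLE):
        print(problem)
```

In this sample the second identification has no `PeptideItem` children, so it is the one reported.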
Metadata refers to additional descriptive information about the project and experiments, the biological samples under examination, protocols, instrumentation, software used to generate the data and references to associated data files, amongst others. For details, please see our comprehensive metadata table.
Proper storage of 2D gel and quantitative proteomics data (labelled and label-free methods) is not currently supported. However, there is experimental support for labelled methods when using Mascot files.
Submitters can create PRIDE XML files from their MS/MS output data using the PRIDE Converter 2 tool. PRIDE Converter 2 is completely free and open source. It converts MS/MS data from the most common data formats into valid PRIDE XML files. For a Complete PX Submission, PRIDE Converter 2 can generate PRIDE XML files containing protein identifications and peptide assignments together with spectra information from the following formats:
|Input data format||extension||mass spectra data||peptide identifications||protein identifications|
|Crux||.txt and .dta or .mgf or .pkl or .mzData or .mzXML||+||+||+|
|Mascot DAT File||.dat||+||+||+|
|MSGF||.msgf and .mzXML or .mgf or .ms2 or .pkl||+||+||+|
|OMSSA||.csv and .dta or .mgf or .pkl or .xml or .ms2||+||+||+|
|SpectraST||.txt and .dta or .mgf or .pkl or .mzData or .mzXML||+||+||+|
|X! Tandem Result File + spectra files||.xml and .dta or .mgf or .pkl or .mzData or .mzXML||+||+||+|
Support for some data formats is more up to date than for others. This is usually driven by the number of submitters using a particular format.
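As a rough illustration, the table above can be transcribed into a lookup that reports which inputs are still missing for a given search engine before running the conversion. The function name `missing_inputs` and the data layout are hypothetical, and the reading of "X and A or B" as "one result file plus one spectra file" is an assumption about how the extension column is meant.

```python
from pathlib import Path

# Transcribed from the table above: (result-file extensions, spectra-file extensions).
# An empty spectra set means the result file alone is expected (assumption).
SUPPORTED_INPUTS = {
    "Crux": ({".txt"}, {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
    "Mascot DAT File": ({".dat"}, set()),
    "MSGF": ({".msgf"}, {".mzxml", ".mgf", ".ms2", ".pkl"}),
    "OMSSA": ({".csv"}, {".dta", ".mgf", ".pkl", ".xml", ".ms2"}),
    "SpectraST": ({".txt"}, {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
    "X! Tandem": ({".xml"}, {".dta", ".mgf", ".pkl", ".mzdata", ".mzxml"}),
}


def missing_inputs(search_engine, filenames):
    """Return which kinds of input ('result file', 'spectra file') are absent."""
    result_exts, spectra_exts = SUPPORTED_INPUTS[search_engine]
    # Compare extensions case-insensitively (e.g. .mzXML vs .mzxml).
    exts = [Path(name).suffix.lower() for name in filenames]
    missing = []
    if not any(ext in result_exts for ext in exts):
        missing.append("result file")
    if spectra_exts and not any(ext in spectra_exts for ext in exts):
        missing.append("spectra file")
    return missing


if __name__ == "__main__":
    print(missing_inputs("OMSSA", ["run1.csv"]))
    print(missing_inputs("X! Tandem", ["out.xml", "run1.mgf"]))
```

The check only looks at file extensions; it says nothing about whether the file contents are valid for the named format.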
Input formats that are not supported by PRIDE Converter 2 but are supported by PRIDE Converter 1 (for instance ms_lims, SEQUEST Result Files and Spectrum Mill) can still be converted and used for submission.
Using third-party libraries/software tools: In general, PRIDE does not support third-party libraries that generate PRIDE XML files (for instance Phenyx, GPMD, Proteios...). Nevertheless, if you believe that the file you created is a valid PRIDE XML file containing all the core data and metadata required for submission, please contact PRIDE support.
PRIDE XML is the internal data format and submission format of PRIDE. You can find the schema documentation here.
Before uploading your files as part of a PX Complete Submission, we recommend checking them with our tool PRIDE Inspector. PRIDE Inspector was developed to visualize and perform an initial quality assessment of PRIDE XML files generated with PRIDE Converter 2. It is completely free and open source. You can download it from here. Installation details, requirements and troubleshooting can be found here.
Metadata refers to additional descriptive information about the project and experiments, the biological samples under examination, protocols, instrumentation, software used to generate the data and references to associated data files amongst others.
Requirement levels: required, recommended (optional).
|Field||description||requirement level||recommended content||example|
|Title||description of the particular experiment||required||unique title recommended||T.forsythia LC-MALDI - Band 1|
|Project||comprehensive name of the project||required||PRIDE:0000097 like manuscript title||A high-density, organ-specific proteome map for Arabidopsis thaliana|
|Reference||links to any literature citations for which this experiment provides supporting evidence.||it must be given once the article has been accepted and the experiments made public||1. PMID (2. DOI)||19663511|
|ShortLabel||grouping/organising experiments in meaningful ways||required||cannot be an empty string||Control Exp II|
|ProtocolName||The protocol element defines the sample processing steps that have been performed||required||as many details as possible||In Gel Protein digestion - Chymotrypsin, Reduction - DTT, Alkylation - iodoacetamide, Enzyme - Chymotrypsin|
|ProtocolSteps and StepsDescription||The protocol element defines the sample processing steps that have been performed||recommended||PRIDE cv terms||<cvParam cvLabel="PRIDE" accession="PRIDE:0000026" name="Alkylation" value="iodoacetamide" />|
|Experiment description||description of the goals and objectives of this study, summary of the abstract but it can be as long as the abstract||recommended||2-3 sentences||Identification of ubiquitin remnant peptides using a diglycyl-lysyl monoclonal antibody for immunoprecipitation|
|SampleName||biological sample used to generate the dataset||required||A short label that is referable to the sample used to generate the dataset||Mouse embryonic stem cells|
|sampleDescription||cvParam: Expansible description of the sample used to generate the dataset||recommended||NEWT and BTO, see Recommended Ontologies below||<cvParam cvLabel="NEWT" accession="10090" name="Mus musculus (Mouse)" />|
|Contact||name, institution, contactInfo||required||valid email address minimally provided as contact info||Joe Poster, European Bioinformatics Institute, firstname.lastname@example.org|
|InstrumentName||Descriptive name of the instrument (make, model, significant customisations)||required||manufacturer name for model||LTQ-Orbitrap|
|Source||ion source information||required||child of term MS:1000008||MS:1000398 nanoelectrospray|
|Analyzer||single or multiple components of the mass analyzer||required||children terms of MS:1000443||MS:1000081 quadrupole|
|Detector||detector type used||required||children terms of MS:1000026||MS:1000114 microchannel plate detector|
|SoftwareName||list of any kind of software used during data acquisition and data processing, the software that produced the peak list||required||list all software that's been used during data processing||Mascot Distiller|
|SoftwareVersion||version of the software used||recommended||||MDRO 220.127.116.11|
|ProcessingMethod||Description of the default peak processing method||required||children terms of MS:1000452||MS:1000033 deisotoping|
|SearchEngine||name of the protein search engine used||required||version can be given here||Mascot 2.2.1|
|XML generation software||the software used to generate the PRIDE xml file||recommended||PRIDE:0000175||PRIDE Converter Toolsuite 2.0|
|Original MS data file format||Original format of the file containing MS data||recommended||PRIDE:0000218||Mascot DAT File|
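The requirement levels in the table above can be turned into a simple completeness check before conversion. The sketch below is an assumption-laden illustration: the field names are taken from the first column of the table, the list of required fields is read from its requirement column (Reference is left out because it only becomes mandatory after acceptance of the article), and the `missing_required` helper and the flat dictionary layout are hypothetical rather than anything PRIDE Converter 2 exposes.

```python
# Fields marked "required" in the metadata table above.
REQUIRED_FIELDS = [
    "Title", "Project", "ShortLabel", "ProtocolName", "SampleName",
    "Contact", "InstrumentName", "Source", "Analyzer", "Detector",
    "SoftwareName", "ProcessingMethod", "SearchEngine",
]


def missing_required(metadata):
    """Return required field names that are absent or empty in `metadata`."""
    return [
        field for field in REQUIRED_FIELDS
        # Empty or whitespace-only strings count as missing.
        if not str(metadata.get(field, "")).strip()
    ]


if __name__ == "__main__":
    # Example values are copied from the example column of the table.
    metadata = {
        "Title": "T.forsythia LC-MALDI - Band 1",
        "ShortLabel": "",
        "SampleName": "Mouse embryonic stem cells",
        "InstrumentName": "LTQ-Orbitrap",
        "SearchEngine": "Mascot 2.2.1",
    }
    print(missing_required(metadata))
```

An empty `ShortLabel` is reported as missing, matching the table's note that it cannot be an empty string.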
We recommend using the following ontologies for annotating data and metadata.
|ID||name||source link||description & use|
|MS||PSI Mass Spectrometry Ontology||source||A structured controlled vocabulary for the annotation of mass spectrometry experiments. Developed by the HUPO Proteomics Standards Initiative.|
|PSI-MOD||PSI protein modification ontology||http://psidev.sourceforge.net/mod/data/PSI-MOD.obo||protein chemical modifications, classifying protein modifications either by the molecular structure of the modification, or by the amino acid residue that is modified|
|NEWT||new taxonomy portal||http://www.uniprot.org/taxonomy/ http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=NEWT||organisms are classified in a hierarchical tree structure, to specify the sample species used in the experiments|
|BTO||BRENDA tissue / enzyme source||source||A structured controlled vocabulary for the source of an enzyme. It comprises terms of tissues, cell lines, cell types and cell cultures from uni- and multicellular organisms|
|GO||Gene Ontology||source||gene product characteristics and gene product annotation data|
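Annotations with these ontologies are written as `cvParam` elements, as in the examples of the metadata table. A minimal sketch of building such an element with Python's standard library follows. The `cv_param` helper is hypothetical, and the set of accepted labels is an assumption built from the ID column above plus the `PRIDE` label used in the metadata examples; the exact `cvLabel` strings expected by the schema (for PSI-MOD in particular) should be checked against the schema documentation.

```python
import xml.etree.ElementTree as ET

# IDs from the ontology table above, plus the PRIDE label seen in the
# metadata examples. Exact cvLabel spellings are an assumption.
RECOMMENDED_CV_LABELS = {"MS", "PSI-MOD", "NEWT", "BTO", "GO", "PRIDE"}


def cv_param(cv_label, accession, name, value=None):
    """Serialize one cvParam element; reject labels outside the recommended set."""
    if cv_label not in RECOMMENDED_CV_LABELS:
        raise ValueError(f"{cv_label!r} is not one of the recommended ontologies")
    attributes = {"cvLabel": cv_label, "accession": accession, "name": name}
    # The value attribute is optional, as in the NEWT example of the metadata table.
    if value is not None:
        attributes["value"] = value
    return ET.tostring(ET.Element("cvParam", attributes), encoding="unicode")


if __name__ == "__main__":
    print(cv_param("NEWT", "10090", "Mus musculus (Mouse)"))
    print(cv_param("PRIDE", "PRIDE:0000026", "Alkylation", "iodoacetamide"))
```

Using the XML library instead of string concatenation means special characters in names or values are escaped automatically.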