How to submit data to PRIDE as a ProteomeXchange Submission

Submitting your data to PRIDE:

PRIDE welcomes direct user submissions of protein and peptide identification/quantification data with the accompanying mass spectral evidence to be published in peer-reviewed publications. The main focus of PRIDE is to support the deposition of shotgun MS/MS proteomics datasets.

Data are currently submitted to PRIDE following the ProteomeXchange (PX) consortium guidelines. This page gives a summarized introduction to the two main PX submission modes. If you need more information, a detailed tutorial is available at the ProteomeXchange web site. Alternatively, please contact pride-support@ebi.ac.uk for assistance or advice.

The PX/PRIDE submission process involves the elements shown in the figure below.

Aspera enabled PX workflow

Figure 1: PX/PRIDE Submission Workflow.

Follow these steps to submit your data:

1. Register

2. Choose submission type

3. Submit the dataset

4. Accessing private data

5. Post-submission steps


1. Register

If you do not already have a PRIDE account, create one here. Currently we don't send out automatic emails upon successful registration. Please contact pride-support@ebi.ac.uk if your login information is still not valid 24 hours after registration.

 

2. Choose submission type

You can choose between two main submission types, depending on whether mzIdentML or PRIDE XML files are available as "Result" files for a Complete Submission. A Complete Submission is the recommended type, but Partial Submissions are accepted as well.

2A. Complete Submission: mzIdentML- or PRIDE XML-based

The 2 subtypes of Complete Submissions are either mzIdentML- or PRIDE XML-based. Complete Submissions mixing the 2 types of ‘RESULT’ files are not allowed. 

An mzIdentML-based Complete Submission requires 3 types of files:

  • Result files: mzIdentML 1.1 files with identifications provided. In the submission tool they should be tagged as “RESULT”. It is also recommended to check your mzIdentML files before submission using the PRIDE Inspector tool (the mzIdentML supporting version will be out in early January, 2014). mzIdentML version 1.0 files are not supported.
  • Peak list files: Since the mzIdentML files themselves do not contain the spectra, it is mandatory to provide the peak list files (e.g. mgf files) that were used for the original search and are referenced in the mzIdentML file. These are distinct from the mandatory raw files. In the submission tool they should be tagged as “PEAK”, and the tool will try to automatically map the peak files to the mzIdentML file in which they are listed.
  • Raw files: the MS instrument output files, for instance Thermo RAW files. As an alternative, lightly processed mzML, mzXML or mzData files are acceptable if MS1 level spectra information is available and the different peak processing steps are known. In the submission tool they should be tagged as “RAW”.
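
Because the peak list files must match those referenced in the mzIdentML file, it can help to check this before upload. The following is a minimal Python sketch, not part of the PX Submission Tool. It assumes that the mzIdentML 1.1 file lists its spectra sources in SpectraData elements with a location attribute, and that comparing bare file names is sufficient. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def referenced_spectra_files(mzid_path):
    """Return the file names referenced by SpectraData elements in an mzIdentML file."""
    names = []
    # iterparse reads the file incrementally, which keeps memory use modest.
    for _event, elem in ET.iterparse(mzid_path, events=("end",)):
        # Drop the XML namespace prefix, e.g. "{http://...}SpectraData".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "SpectraData":
            location = elem.get("location", "")
            # Keep only the last path component; locations are often absolute paths or URIs.
            names.append(PurePosixPath(location.replace("\\", "/")).name)
        elem.clear()
    return names


def missing_peak_files(mzid_path, peak_file_names):
    """List referenced spectra files that are absent from the PEAK files to be uploaded."""
    available = set(peak_file_names)
    return [name for name in referenced_spectra_files(mzid_path) if name not in available]
```

Calling missing_peak_files with the path of a result file and the names of the peak list files you intend to upload returns the referenced files that are still missing; an empty list suggests the two sets match by name.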

Please check our Guide to generate mzIdentML files. Your pipeline or search engine may already offer mzIdentML among its native output formats. Numerous tools can already create or export mzIdentML files; please see a list here.

Besides the three mandatory file types above, there are optional and recommended file types that can be prepared and uploaded as well:

  • Search engine result files: The original output from your search engine or analysis pipeline that you used for further post-processing, such as Mascot .dat files or Trans Proteomics Pipeline (TPP) pep.xml and/or prot.xml files, among many others. If your search engine generates mzIdentML files by default, you have already provided them as "Result" files. The search engine files should contain peptide/protein identification results. In the submission tool they should be tagged as “SEARCH”.
  • Quantification result files: Many current mass spectrometry proteomics studies involve a quantitative analysis of the peptides/proteins present in the samples. Quantification-related files reporting peptide/protein quantitative values/ratios can be provided and tagged as "QUANT" in the submission tool.
  • Sequence database files: Sequence database file (usually in FASTA format) that was used to perform the mass spectral search. Sequence database files (both protein and DNA) are labelled as ‘FASTA’ in the tool.
  • Spectrum libraries: Spectral library file that was used for performing the mass spectrometry search. In the PX Submission Tool they should be tagged as ‘SPECTRUM_LIBRARY’.
  • Gel image files: In case two-dimensional gel electrophoresis has been used as a separation method the gel image files can be provided. In the submission tool they should be tagged as ‘GEL’.
  • Other files: Everything else that does not fit into the categories above, for instance protein inference files generated by post-processing of the search engine results or R scripts used for data analysis. If you have used custom search databases you can provide those as well. In the submission tool they should be tagged as ‘OTHER’.

A PRIDE XML-based Complete Submission requires 2 types of files:

  • Result files: PRIDE XML files (fully supported by PRIDE) with identifications provided. In the submission tool they should be tagged as “RESULT”. It is also recommended to check your PRIDE XML files before submission using the PRIDE Inspector tool.
  • Raw files: the MS instrument output files, for instance Thermo RAW files. As an alternative, lightly processed mzML, mzXML or mzData files are acceptable if MS1 level spectra information is available and the different peak processing steps are known. In the submission tool they should be tagged as “RAW” and mapped to the corresponding "RESULT" files.

Try to create PRIDE XML files using the PRIDE Converter 2 tool. Please take a moment to review our Guide to generate PRIDE XML files, which covers the input files you can use for PRIDE XML generation. Please also note that although PRIDE Converter 2 can in theory convert mzIdentML files to PRIDE XML files, in practice this option should not be used; mzIdentML files should be submitted natively for Complete Submissions. Other tools, not maintained by the PRIDE team, can also produce PRIDE XML files, such as PeptideShaker, Waters PLGS, Proteios, EasyProt, MIAPE Extractor (ProteoRed), or the original PRIDE Converter (no longer developed).

Besides the two mandatory file types above, there are optional and recommended file types that can be prepared and uploaded as well:

  • Peak list files: It is strongly recommended to provide the peak list files (e.g. mgf files) that were used for the original search. These are distinct from the mandatory raw files. In the submission tool they should be tagged as “PEAK”.
  • Search engine result files: the original output from your search engine or your analysis pipeline, such as Mascot .dat files, Trans Proteomics Pipeline (TPP) pep.xml and/or prot.xml files or mzIdentML files, among many others. They should contain the peptide/protein identifications. In the submission tool they should be tagged as “SEARCH”.
  • Quantification result files: Many current mass spectrometry proteomics studies involve a quantitative analysis of the peptides/proteins present in the samples. Quantification-related files reporting peptide/protein quantitative values/ratios can be provided and tagged as "QUANT" in the submission tool.
  • Sequence database files: Sequence database file (usually in FASTA format) that was used to perform the mass spectral search. Sequence database files (both protein and DNA) are labelled as ‘FASTA’ in the tool.
  • Spectrum libraries: Spectral library file that was used for performing the mass spectrometry search. In the PX Submission Tool they should be tagged as ‘SPECTRUM_LIBRARY’.
  • Gel image files: In case two-dimensional gel electrophoresis has been used as a separation method the gel image files can be provided. In the submission tool they should be tagged as ‘GEL’.
  • Other files: Everything else that does not fit into the 6 categories above, for instance protein inference files generated by post-processing of the search engine results or R scripts used for data analysis. If you have used custom search databases you can provide those as well. In the submission tool they should be tagged as ‘OTHER’.

In case of a Complete Submission a DOI (Digital Object Identifier) will be assigned to your dataset and its transparency level will be higher. That is good for your data and good for the community.

2B. Partial Submission

A Partial Submission requires 2 types of files:

  • Search engine result files: the original output from your search engine or your analysis pipeline, such as Trans Proteomics Pipeline (TPP) pep.xml and/or prot.xml files among many others. They should contain the peptide/protein identifications. In the submission tool they should be tagged as “SEARCH”.
  • Raw files: the MS instrument output files, for instance Thermo RAW files. As an alternative, lightly processed mzML, mzXML or mzData files are acceptable if MS1 level spectra information is available and the different peak processing steps are known. In the submission tool they should be tagged as “RAW”.

As a result of a Partial Submission, all the files will be available to download, but identification/quantification results will not be fully searchable in PRIDE. However, it will still be possible to search in PRIDE based on the project metadata.

Besides the two mandatory file types above, there are optional and recommended file types that can be prepared and uploaded as well:

  • Peak list files: It is strongly recommended to provide the peak list files (e.g. mgf files) that were used for the original search. These are distinct from the mandatory raw files. In the submission tool they should be tagged as “PEAK”.
  • Quantification result files: Many current mass spectrometry proteomics studies involve a quantitative analysis of the peptides/proteins present in the samples. Quantification-related files reporting peptide/protein quantitative values/ratios can be provided and tagged as "QUANT" in the submission tool.
  • Sequence database files: Sequence database file (usually in FASTA format) that was used to perform the mass spectral search. Sequence database files (both protein and DNA) are labelled as ‘FASTA’ in the tool.
  • Spectrum libraries: Spectral library file that was used for performing the mass spectrometry search. In the PX Submission Tool they should be tagged as ‘SPECTRUM_LIBRARY’.
  • Gel image files: In case two-dimensional gel electrophoresis has been used as a separation method the gel image files can be provided. In the submission tool they should be tagged as ‘GEL’.
  • Other files: Everything else that does not fit into the 5 categories above, for instance protein inference files generated by post-processing of the search engine results or R scripts used for data analysis. If you have used custom search databases you can provide those as well. In the submission tool they should be tagged as ‘OTHER’.

Special support is provided for mass spectrometry imaging datasets, which can be submitted as Partial Submissions. Please see the details here.

Partial Submission means that a PX accession number will be assigned to your files but PRIDE experiment accession numbers won't be issued. Also, you won't have a DOI assigned to your dataset.
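
The mandatory file types of the three submission modes described in this section can be summarized as a small lookup table. The Python sketch below is only an illustration of that summary: the mode names ("complete_mzidentml", "complete_pride_xml", "partial") and the function name are hypothetical labels chosen here, and the check is not the validation that PRIDE performs internally.

```python
# Mandatory file tags per submission mode, as listed in the text above.
# The dictionary keys are hypothetical labels, not names used by PRIDE.
REQUIRED_TAGS = {
    "complete_mzidentml": {"RESULT", "PEAK", "RAW"},
    "complete_pride_xml": {"RESULT", "RAW"},
    "partial": {"SEARCH", "RAW"},
}


def missing_mandatory_tags(mode, tagged_files):
    """Return the mandatory tags that no file in `tagged_files` carries.

    `tagged_files` maps a file name to its tag, e.g. {"run01.raw": "RAW"}.
    """
    present = set(tagged_files.values())
    return sorted(REQUIRED_TAGS[mode] - present)
```

For example, a file set holding only one RESULT file and one RAW file would lack "PEAK" for an mzIdentML-based Complete Submission and "SEARCH" for a Partial Submission, while satisfying the mandatory types of a PRIDE XML-based Complete Submission.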

3. Submit the dataset

In addition to the files that are part of the submission, a submission summary file (submission.px) needs to be generated for every PRIDE/PX submission. The format is described here. A submission summary file contains two types of crucial information needed for any PX submission:

  • Metadata: experimental metadata like experiment description, sample taxonomy information, instruments and modifications used.
  • Mapping between the uploaded files: for instance between the raw files and the corresponding result or search engine output files.

If you are using the PX submission tool, the file will be created automatically for you.

If you are using the Aspera command line submission mode, you will need to create one yourself. You can do so either with the PX submission tool (saving the submission.px file independently, without actually uploading the data) or with scripting.
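
When preparing the summary by scripting, it can be useful to hold the two kinds of information (metadata and file mappings) in memory and check them before writing anything out. The Python sketch below does only that. It deliberately does not reproduce the actual submission.px syntax, which is defined in the format description referenced above; all class, field and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    file_id: int
    tag: str    # e.g. "RESULT", "SEARCH", "RAW", "PEAK"
    path: str
    # Ids of the files this entry refers to (e.g. a result file listing its raw files).
    mapped_ids: list = field(default_factory=list)


@dataclass
class SubmissionSummary:
    metadata: dict    # e.g. description, taxonomy, instruments, modifications
    files: list       # list of FileEntry objects


def unmapped_raw_files(summary):
    """Return paths of RAW files that no RESULT or SEARCH entry maps to."""
    mapped = {
        target
        for entry in summary.files
        if entry.tag in {"RESULT", "SEARCH"}
        for target in entry.mapped_ids
    }
    return [e.path for e in summary.files if e.tag == "RAW" and e.file_id not in mapped]
```

A non-empty result from unmapped_raw_files points to raw files that still need to be linked to a result or search engine output file before the summary is written.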

By now you have assembled the files belonging to your dataset and generated a submission summary file. You are ready to upload the data associated with your project/manuscript!

3A. Data upload via the PX Submission Tool

Requirements: The PX Submission tool is a desktop application.

  • Java: Java JRE 1.6 (or above), which you can download here (Note: most computers should have Java installed already).
  • Operating System: The current version has been tested on Windows 7, Windows Vista, Linux and Mac OS X; it should also work on other platforms. If you come across any problems on your platform, please contact the PRIDE Help Desk.

From here you can download the PX Submission tool. By default the PX Submission Tool is using the fast Aspera upload transfer protocol with which terabytes can be transferred within a day. Should you encounter any problems with Aspera you can switch to FTP mode.

Once you have launched the tool click the submission type (Complete or Partial) you have chosen earlier.

 

PX Submission Tool 2.0 Welcome Screen

Figure 2: Welcome screen showing the two default submission types.

 

In case you need more information on the different steps of the data submission process please use this tutorial.

If you are using optional compression for big files then please use gzip.
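
As an illustration, a large file can be gzip-compressed with the Python standard library. This is a minimal sketch, assuming that one compressed copy per file, written next to the original with a ".gz" suffix, is what you want; the function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source, chunk_size=1024 * 1024):
    """Compress `source` into a sibling file named `source` + ".gz" and return its path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        # Copy in chunks so that very large files are never loaded into memory at once.
        shutil.copyfileobj(src, dst, chunk_size)
    return target
```

The original file is left untouched, so you can verify the compressed copy before deciding which of the two to upload.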

3B. Data upload via Aspera command line

Requirements: Please download the Aspera Connect Web Browser Plug-in. Although you download a Browser Plug-in you will be using the 'ascp' command line transfer program distributed with it. That means command line skills are needed in order to use the Aspera upload option.

  • Operating System: Windows XP / 2003 / Vista / 2008 / 7 / 8, Mac OS Intel 10.5 / 10.6 / 10.7 / 10.8

You don't have to register in order to download the Browser Plug-in and the download is free of charge.

For the actual how-to please consult our Aspera page.

Internal validation and submission after successful dataset upload

Upon successful submission of the data with the PX Submission Tool, a submission reference ticket will be generated and emailed to you. This ticket is not the same as the PX Identifier issued at the end of the internal submission process and cannot be used to reference PX datasets in a manuscript.

In case of the command line upload, no tickets will be generated for you. All datasets first go through a validation process. Validation is more in-depth for Complete Submissions, since mzIdentML and PRIDE XML files can be checked in detail. If validation fails, you will be notified of the problem and asked to make corrections or upload additional data. Upon successful validation your dataset will be submitted to PRIDE. Once submission has been completed, you will be sent an automatic e-mail with the necessary details, followed by a manual e-mail from a curator. The details will contain the new PX Identifier, a PX reviewer account (see next point) for private data access and our recommendation on how to reference the PX dataset in your manuscript. In case of Complete Submissions, additional PRIDE Experiment Accessions will be assigned to the mzIdentML and PRIDE XML Result files.

4. Accessing private data

Submitted datasets are 'private' by default, which means you need to be logged in to view your data. We will, however, also create a PX reviewer account for your submission, which you can include in your manuscript. The PX reviewer account gives access to all of the files belonging to your submission. You can access the private dataset files in two ways: via the PRIDE Archive web page or via PRIDE Inspector.

4A. PRIDE Archive web page

The PRIDE Archive web site is available at http://www.ebi.ac.uk/pride/archive. Registered submitters can use their personal accounts or the reviewer accounts to access and download the individual PX datasets. A separate reviewer account is generated for every submission.

Please navigate first to the login page available at http://www.ebi.ac.uk/pride/archive/login (see Figure 3):

PRIDE Archive Login Page

Figure 3: PRIDE Archive Login Page.

Once you have logged in with your registered user (the e-mail account you used to register in PRIDE) or an issued reviewer account, you will see the private dataset(s) listed.

 

4B. PRIDE Inspector

For this you need to use the new version of PRIDE Inspector.

The following applies for both Complete and Partial Submissions:

Open PRIDE Inspector by clicking on the pride-inspector-<version-number>.jar file in the tool's working directory -> Private Download -> ProteomeXchange -> PX reviewer account details. You can open the PRIDE XML and mzIdentML result files with PRIDE Inspector or just download all the files that you wish to investigate.

Inspector's Private Download

Figure 4: Downloading data with the reviewer account using PRIDE Inspector's Private Download option.

In case of Complete Submissions you can alternatively launch PRIDE Inspector with a WebStart URL provided in the automatic "Submission Complete" e-mail. This option only downloads the PRIDE XML and mzIdentML files into a target folder. Please note that it can take up to one day after you receive the automatic "Submission Complete" e-mail before the PRIDE Inspector Java Web Start option can display your data.

5. Post-submission steps

The particular post-submission steps include the following: modifying the original dataset, referencing the dataset in your manuscript, and making the dataset public.

1. Modifying the original dataset

If you need to add a small number of additional "other files" (like csv files, plain text files, spreadsheets or scripts), we can provide you with FTP details for the upload and add these to the original dataset without you resubmitting the whole dataset. If you have used the PX Submission Tool and need to add further raw files and the accompanying result or search files, you need to resubmit the whole dataset. Please follow the procedure here. In case of an Aspera bulk submission, you have to upload the modified and missing files into a new subdirectory within your target directory and regenerate the submission summary file so that it includes all the old, modified and new files.

2. Referencing the dataset in the paper

By default we recommend adding the following statement to your manuscript (typically in "Materials and Methods" or just before/in the Acknowledgements):

"The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository [1] with the dataset identifier <PXD000xxx>."

[1] and also for general PRIDE reference, please use: Vizcaino JA, Cote RG, Csordas A, Dianes JA, Fabregat A, Foster JM, Griss J, Alpi E, Birim M, Contell J, O'Kelly G, Schoenegger A, Ovelleiro D, Perez-Riverol Y, Reisinger F, Rios D, Wang R, Hermjakob H. The Proteomics Identifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013 Jan 1;41(D1):D1063-9. doi: 10.1093/nar/gks1262. Epub 2012 Nov 29. PubMed PMID:23203882.

Additionally we'd like to ask you to also put this information in a much abridged form into the abstract itself, like this: "The data have been deposited to the ProteomeXchange with identifier <PXD000xxx>." See for example this Chromosome-Centric Human Proteome Project dataset and paper: http://www.ncbi.nlm.nih.gov/pubmed/?term=23312004 and other examples on PubMed. A PX Identifier in the abstract makes the dataset much more visible and accessible.
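
If you reference several datasets, the two statements quoted above can be filled in programmatically. The Python sketch below is a convenience illustration only. It assumes that a PX Identifier consists of "PXD" followed by six digits, in line with the <PXD000xxx> placeholder; the function name is hypothetical.

```python
import re

# Assumption: identifiers look like "PXD" followed by six digits (cf. <PXD000xxx>).
PX_ID_PATTERN = re.compile(r"PXD\d{6}")


def data_availability_statements(px_identifier):
    """Return the (full, abridged) statements for the given PX Identifier."""
    if not PX_ID_PATTERN.fullmatch(px_identifier):
        raise ValueError(f"unexpected PX identifier: {px_identifier!r}")
    full = (
        "The mass spectrometry proteomics data have been deposited to the "
        "ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) "
        "via the PRIDE partner repository [1] with the dataset identifier "
        f"{px_identifier}."
    )
    abridged = (
        "The data have been deposited to the ProteomeXchange with identifier "
        f"{px_identifier}."
    )
    return full, abridged
```

The first returned string is meant for the main text and the second for the abstract; an identifier that does not match the assumed pattern raises a ValueError instead of producing a malformed sentence.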

3. Releasing the dataset

By default, your data will be made publicly available after your manuscript has been accepted, or when we have your instructions to do so. While we may also receive acceptance notifications from some journals, we would like to ask all submitters to kindly notify us separately. Otherwise, we may not know that your manuscript has already been published. You can notify us in two ways:

 

A) Via the PRIDE Archive web site (http://www.ebi.ac.uk/pride/archive). Once you have logged in with your user account at http://www.ebi.ac.uk/pride/archive/login, you can click the green “Publish” button located next to each of your unpublished datasets. There you can provide details for your dataset and submit a web form (see Figure 5).

Public Release

B) By contacting pride-support@ebi.ac.uk.

 

Upon making the project public, a project page will be released over at ProteomeCentral (http://proteomecentral.proteomexchange.org) and from a particular dataset page an FTP location will be available.