Project PXD014003

Summary

Title

YPIC challenge 2018: A case study in characterizing an unknown protein sample

Description

For the YPIC challenge 2018, contestants were invited to decipher two unknown English questions encoded by a synthetic protein expressed in E. coli. We present how we analyzed this unknown sample using a tryptic digest with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules. Subsequently, spectral clustering was used to generate high-quality consensus spectra and condense the acquired MS/MS spectral data. De novo spectrum identification was used to determine the English questions encoded by the synthetic protein, and any post-translational modifications introduced by E. coli on the synthetic protein were detected using spectral networking. Although the synthetic protein sample for the YPIC challenge 2018 is not of biological interest, the experimental and computational strategy presented here can be directly used to analyze samples for which no protein sequence information is available. All software and code to perform the bioinformatics analysis are available as open source, and a self-contained Jupyter notebook is provided to fully recreate the analysis.

Sample Processing Protocol

We received a sample vial containing 12.5ug of an unknown protein via mail from the organizers of the YPIC challenge. The sample was reconstituted with 125uL of 0.1% formic acid (final concentration 0.1ug/uL protein). An aliquot (1ug; 10uL) of reconstituted sample was reduced (50uM dithiothreitol), alkylated (150uM iodoacetamide), and digested with Promega trypsin (1:50 enzyme:substrate ratio; 0.02ug trypsin) for 4h at 37°C with shaking. Digested peptides were concentrated via speed-vac to a final concentration of 0.33fmol/uL. Peptides were separated with a Waters NanoAcquity UPLC and emitted into a Thermo Q-Exactive HF tandem mass spectrometer. Pulled tip columns were created in-house from 75um inner diameter fused silica capillary using a laser pulling device and packed with 2.1um C18 beads (Dr. Maisch GmbH) to 300mm.
Trap columns were created from 150um inner diameter fused silica capillary fritted with Kasil on one end and packed with the same C18 beads to 25mm. Buffer A was water with 0.1% formic acid, while buffer B was 98% acetonitrile with 0.1% formic acid. For each injection, 3uL of sample was loaded with 5uL 2% B and eluted using the following program: 0min to 90min 2% to 35% B, 90min to 100min 35% to 60% B, followed by a 35min washing gradient. The Thermo Q-Exactive HF was set to positive mode in a top-20 configuration. Precursor scans (300m/z to 2000m/z) were collected at 60,000 resolution with an AGC target of 3e6 and a maximum inject time of 100ms. Fragment scans were collected at 30,000 resolution with an AGC target of 1e5 and a maximum inject time of 55ms. The isolation width was set to 1.6m/z with a normalized collision energy of 27. Precursors with charge up to +6 that achieved a minimum AGC of 5e3 were selected for fragmentation. Dynamic exclusion was disabled. The digested sample was acquired using this method in technical triplicate.
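
As an illustration of the elution program described above, the following minimal Python sketch computes the percentage of buffer B at a given time. It assumes that each segment is a linear ramp, which the protocol does not state explicitly. The 35min washing gradient is left out because its composition is not specified. The function name percent_b is hypothetical.

```python
def percent_b(t_min):
    """Return the percentage of buffer B at time t_min (minutes).

    Assumption: each segment of the elution program is a linear ramp.
    Only the two analytical segments are modeled; the washing gradient
    is not specified in the protocol and is therefore not covered.
    """
    # (start time, end time, %B at start, %B at end)
    segments = [
        (0.0, 90.0, 2.0, 35.0),
        (90.0, 100.0, 35.0, 60.0),
    ]
    for t0, t1, b0, b1 in segments:
        if t0 <= t_min <= t1:
            # Linear interpolation within the segment.
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise ValueError("time is outside the 0min to 100min elution program")
```

Under the linear-ramp assumption, calling percent_b(45) gives the buffer composition halfway through the first segment.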

Data Processing Protocol

Raw files were converted to the MGF format using msconvert for further processing. During conversion, MS/MS spectra were centroided using the vendor algorithm, and the precursor m/z and charge were recalculated based on the preceding MS scan. Next, MS/MS spectra were clustered and consensus spectra were generated using MaRaCluster with a similarity p-value threshold of 1e-5, a precursor mass tolerance of 50ppm, and at least 3 MS/MS spectra per cluster. After spectral clustering, low-quality clusters were removed by retaining only the clusters that represent at least 10 original spectra and whose consensus spectra have precursor charge 2 or 3. The high-quality consensus spectra were used for de novo spectrum identification and spectral networking. DeNovoGUI was used as a unified interface to the Novor, DirecTag, and PepNovo+ de novo search engines. Settings for de novo spectrum identification were precursor mass tolerance 20ppm; fragment mass tolerance 0.02Da; and cysteine carbamidomethylation, methionine oxidation, and acetylation of the peptide N-terminus as variable modifications. PSMs were visualized and manually investigated using DeNovoGUI. A spectral network was constructed using the high-quality consensus spectra. Prior to matching spectra to each other, they were preprocessed by removing noise peaks with an intensity below 5% of the base peak intensity and retaining at most the 150 most intense peaks. Next, peak intensities were scaled by their square root and then normalized so that each spectrum's intensity vector had a norm of one. The shifted dot product was used to match modified spectra to each other with fragment mass tolerance 0.02Da. Each consensus spectrum formed a node in the spectral network, with an edge between two nodes if the shifted dot product between the two corresponding spectra was greater than or equal to 0.8.
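
The spectrum preprocessing and matching steps described above can be sketched in Python as follows. This is a simplified illustration, not the code used for the submitted analysis. It assumes that the norm is Euclidean, that the mass shift is supplied by the caller as the precursor mass difference between the two spectra, and that peaks are paired greedily by decreasing intensity product. The function names preprocess and shifted_dot_product are hypothetical.

```python
import math


def preprocess(peaks, min_rel_intensity=0.05, max_peaks=150):
    """Filter, scale, and normalize a list of (mz, intensity) peaks."""
    if not peaks:
        return []
    base_peak = max(intensity for _, intensity in peaks)
    # Remove noise peaks below 5% of the base peak intensity.
    kept = [(mz, i) for mz, i in peaks if i >= min_rel_intensity * base_peak]
    # Retain at most the 150 most intense peaks.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:max_peaks]
    # Scale intensities by their square root.
    kept = [(mz, math.sqrt(i)) for mz, i in kept]
    # Normalize to unit norm (assumption: Euclidean norm).
    norm = math.sqrt(sum(i * i for _, i in kept))
    if norm == 0:
        return []
    return sorted((mz, i / norm) for mz, i in kept)


def shifted_dot_product(peaks_a, peaks_b, mass_shift, tol=0.02):
    """Greedy shifted dot product between two preprocessed spectra.

    mass_shift is the mass of spectrum A minus the mass of spectrum B.
    A peak pair may match directly or after applying the shift; each
    peak is used at most once.
    """
    candidates = []
    for idx_a, (mz_a, int_a) in enumerate(peaks_a):
        for idx_b, (mz_b, int_b) in enumerate(peaks_b):
            delta = mz_a - mz_b
            if abs(delta) <= tol or abs(delta - mass_shift) <= tol:
                candidates.append((int_a * int_b, idx_a, idx_b))
    # Pair the highest-scoring candidates first.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    score = 0.0
    for product, idx_a, idx_b in candidates:
        if idx_a not in used_a and idx_b not in used_b:
            used_a.add(idx_a)
            used_b.add(idx_b)
            score += product
    return score
```

Greedy pairing is chosen here for brevity. An optimal assignment of peak pairs could give slightly different scores for spectra in which one peak has several possible partners.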
Peptide sequences were assigned to nodes in the spectral network if the corresponding consensus spectra could be identified by Novor with a minimum score of 70. Only subgraphs in the spectral network consisting of at least three nodes were considered.
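
The network construction and filtering described above can be illustrated with the following Python sketch. It is a minimal example under stated assumptions, not the submitted analysis code: pairwise shifted dot products are assumed to be precomputed and supplied as a dictionary, and Novor results are assumed to be available as a mapping from node to (sequence, score). All function and parameter names are hypothetical.

```python
from collections import defaultdict


def build_network(scores, min_score=0.8):
    """Build an adjacency map from {(node_u, node_v): score} pairs."""
    adjacency = defaultdict(set)
    for (u, v), score in scores.items():
        # Connect two nodes if their shifted dot product is >= 0.8.
        if u != v and score >= min_score:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def connected_components(adjacency):
    """Return the connected components of the network as sets of nodes."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def annotated_subgraphs(scores, novor_hits, min_score=0.8,
                        min_novor_score=70, min_nodes=3):
    """Keep subgraphs with at least three nodes and annotate their nodes.

    novor_hits maps a node to a (sequence, score) tuple. Nodes without a
    hit, or with a score below the minimum, are annotated with None.
    """
    result = []
    for component in connected_components(build_network(scores, min_score)):
        if len(component) < min_nodes:
            continue
        annotation = {}
        for node in component:
            sequence, score = novor_hits.get(node, (None, 0.0))
            annotation[node] = sequence if score >= min_novor_score else None
        result.append(annotation)
    return result
```

Unannotated nodes are kept in the output because they are the ones a spectral network helps interpret: a node without a confident sequence can be examined as a possibly modified form of an identified neighbor.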

Contact

Wout Bittremieux, University of Antwerp
William Stafford Noble, Department of Genome Sciences, University of Washington, Seattle, WA, USA (lab head)

Submission Date

24/05/2019

Publication Date

11/06/2019

Tissue

Not available

Instrument

Q Exactive

Software

Not available

Experiment Type

Shotgun proteomics

Publication

Publication pending