Machine Learning Based Classification of Diffuse Large B-cell Lymphoma Patients by their Protein Expression Profiles
Characterization of tumors at the molecular level has improved our knowledge of cancer causation and progression. Proteomic analysis of their signaling pathways promises to enhance our understanding of cancer aberrations at the functional level, but this requires accurate and robust tools. Here, we develop a state-of-the-art quantitative mass spectrometric pipeline to characterize formalin-fixed paraffin-embedded (FFPE) tissues of patients with closely related subtypes of diffuse large B-cell lymphoma (DLBCL). We combined a super-SILAC approach with label-free quantification (hybrid LFQ) to address situations where a protein is absent in the super-SILAC standard yet present in the patient samples. Shotgun proteomic analysis on a quadrupole Orbitrap quantified almost 9000 tumor proteins in 20 patients. The quantitative accuracy of our approach allowed the segregation of DLBCL patients according to their cell-of-origin, using both their global protein expression patterns and the 55-protein signature obtained previously from patient-derived cell lines (Deeb et al. MCP 2012 PMID 22442255). Expression levels of individual segregation-driving proteins, as well as of categories such as extracellular matrix proteins, behaved consistently with known trends between the subtypes. We employed machine learning (support vector machines) to extract candidate proteins with the highest segregating power. A panel of four proteins (PALD1, MME, TNFAIP8 and TBC1D4) classified the patients with very low error rates. Highly ranked proteins from the support vector analysis revealed differential expression of core signaling molecules between the subtypes, elucidating aspects of their pathobiology.
Sample Processing Protocol
Protein extraction from FFPE DLBCL tissues – For each patient sample, two FFPE slices of macro-dissected tissue were collected. They were processed for mass-spectrometry-based proteome analysis by extraction and digestion according to the Filter Aided Sample Preparation (FASP) protocol (FFPE-FASP) (17, 21). In short, FFPE tissue slices were incubated in 1 ml xylene (2x) with gentle agitation for 5 min at room temperature. After removing the paraffin, the samples were dried by incubating them in 1 ml absolute ethanol (2x). The dried samples were then lysed in a buffer consisting of 0.1 M Tris-HCl (pH 8.0), 0.1 M DTT and 4% SDS. After homogenization using a disperser, they were boiled at 99 °C for 60 min in a heating block with agitation (600 rpm). The samples were then cleared by centrifugation.
Protein digestion and peptide fractionation – On a 30 kDa filter (Millipore, Billerica, MA, USA), 100 µg of each of the patient samples and the super-SILAC mix were mixed. The samples were further processed by the FASP method, in which the SDS buffer is exchanged with a urea buffer (21). This was followed by alkylation with iodoacetamide and overnight digestion by trypsin at 37 °C in 50 mM ammonium bicarbonate. The tryptic peptides were collected by centrifugation and elution with water (2x). Strong anion exchange (SAX) chromatography was used to fractionate 40 µg of peptides from each patient sample (22). It was performed in tip-based columns made from 200 µl micropipette tips stacked with 6 layers of a 3M Empore anion exchange disk (1214-5012; Varian, Palo Alto, CA). For the fractionation, a Britton & Robinson universal buffer (20 mM acetic acid, 20 mM phosphoric acid, and 20 mM boric acid) was used and titrated with NaOH to give six buffers with the desired pH values (pH 11, 8, 6, 5, 4, and 3). Subsequently, six fractions from each sample were collected, followed by desalting of the eluted fractions on reversed phase C18 Empore disc StageTips (23). The peptides were eluted from the StageTips using 20 µl of buffer B composed of 80% ACN in 0.5% acetic acid (2x). The organic solvents were then removed in a SpeedVac concentrator to prepare the samples for MS analysis.
Data Processing Protocol
Data analysis – We used the MaxQuant software environment (version 184.108.40.206) to analyze MS raw data. The MS/MS spectra were searched against the Uniprot database (81,213 entries, release 2012) using the Andromeda search engine incorporated in the MaxQuant framework (24, 25). Cysteine carbamidomethylation was set as a fixed modification, and N-terminal acetylation and methionine oxidation as variable modifications. The maximum false discovery rate for both peptide and protein identifications was set to 0.01. Strict specificity for trypsin cleavage was required, allowing cleavage N-terminal to proline. The minimum required peptide length was seven amino acids, with a maximum of two miscleavages allowed. The initial precursor mass tolerance was 4.5 ppm, and the fragment mass tolerance was up to 20 ppm. The time-dependent recalibration algorithm of MaxQuant was used to improve the mass accuracy of the precursor ions. The “match between runs” option was enabled, allowing the matching of identifications across measurements. Relative quantification of the peptides against their SILAC-labeled counterparts was performed with MaxQuant using a minimum ratio count of 1. We combined SILAC with label-free analysis (‘hybrid algorithm’), employing a minimum count of 1 (see RESULTS AND DISCUSSION). Our in-house statistical software Perseus (Perseus-framework.org) was used for further statistical and bioinformatic analysis of the MaxQuant output data. Missing values were filled in by ‘data imputation’, which simulates signals of low-abundance proteins under the assumption that missing values are biased toward the detection limit of the MS measurement (18).
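The imputation step described above can be illustrated with a minimal Python sketch. It is not the Perseus implementation: it only shows the general idea of replacing missing log-intensities with random draws from a normal distribution that is shifted toward low values and narrowed relative to the observed data. The function name `impute_low_abundance`, the parameters `shift=1.8` and `width=0.3` (both in units of the observed standard deviation), the fixed seed, and the example intensities are all assumptions made for illustration and are not taken from this record.

```python
import random
import statistics


def impute_low_abundance(values, shift=1.8, width=0.3, seed=0):
    """Replace None entries with draws from a down-shifted, narrowed normal.

    values : log-transformed intensities of one sample, None marks a missing value
    shift  : how far the imputation mean lies below the observed mean,
             in units of the observed standard deviation (assumed value)
    width  : standard deviation of the imputation distribution,
             as a fraction of the observed standard deviation (assumed value)
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        v if v is not None else rng.gauss(mu - shift * sigma, width * sigma)
        for v in values
    ]


# Hypothetical log2 intensities for one sample; None marks proteins not quantified.
log2_intensities = [25.1, 27.3, None, 26.0, 24.4, None, 28.2]
imputed = impute_low_abundance(log2_intensities)
```

Observed values pass through unchanged, and only the missing entries receive simulated low-abundance signals. Drawing per sample rather than from the pooled data is a design choice of this sketch, so that the imputed values follow each sample's own intensity distribution.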
Deeb SJ, Tyanova S, Hummel M, Schmidt-Supprian M, Cox J, Mann M. Machine Learning Based Classification of Diffuse Large B-cell Lymphoma Patients by their Protein Expression Profiles. Mol Cell Proteomics. 2015 Aug 26. pii: mcp.M115.050245 PubMed: 26311899