Critical decisions in metaproteomics: achieving high confidence protein annotations in a sea of unknowns.
Environmental meta-omics is rapidly expanding as sequencing capabilities improve, computing technologies become more accessible, and associated costs fall. The in situ snapshots of marine microbial life afforded by these data provide a growing knowledge of the functional roles of communities in ecosystem processes. Metaproteomics allows for the characterization of the dynamic proteome of a complex microbial community. It has the potential to reveal impacts of microbial metabolism on biogeochemical transport, storage and cycling (for example, Hawley et al., 2014), while additionally clarifying which taxonomic groups perform these roles. Previous work illuminated many of the important functions and interactions within marine microbial communities (for example, Morris et al., 2010), but a review of ocean metaproteomics literature revealed little standardization in bioinformatics pipelines for detecting peptides and inferring and annotating proteins. As the prevalence of these data sets grows, there is a critical need to develop standardized approaches for mass spectrometry (MS) proteomic spectrum identification and annotation to maximize the scientific value of the data obtained. Here, we demonstrate that bioinformatics decisions made throughout the peptide identification process are as important for data interpretation as the choice of sampling protocol and the experimental design of bacterial community manipulations. Our analysis offers a best practices guide for environmental metaproteomics.
Sample Processing Protocol
Our study followed traditional procedures currently employed in ocean metaproteomics (details in Supplementary Information 1). Water samples were collected and selectively filtered from the Bering Strait as described in May et al. (2016) and incubated shipboard over 10 days (T0=day 0, T10=day 10). Bacterial community proteomes from the incubations were analyzed on a Q-Exactive-HF (Thermo Fisher Scientific, Waltham, MA, USA) and the resulting data were searched against four different peptide identification databases (Supplementary Information 2): (1) a site/time-specific metagenome collected concurrently with the incubated water; (2) NCBI’s env_NR database; (3) an Arctic-bacterial database of NCBI protein sequences from known polar taxonomic groups (Supplementary Information 3); and (4) a North Pacific database derived from a subset of the Ocean Microbiome sequencing project (Sunagawa et al., 2015; Supplementary Information 4). Peptides were identified and proteins were inferred using Comet v. 2015.01 rev. 2 (Eng et al., 2012, 2015), followed by peptide and protein match scoring (Pedrioli, 2010; Deutsch et al., 2015) at a false discovery rate threshold of 0.01 (Supplementary Information 5). Proteins from all databases were annotated using BLASTp (Altschul et al., 1990; Camacho et al., 2009) against the UniProtKB TrEMBL database (downloaded April 28, 2015) with an e-value cutoff of 1E-10 (Supplementary Information 6). Shifts in community biological functions over the 10-day incubation were quantified using a Gene Ontology (GO) analysis in which peptide spectrum matches were associated with GO terms. Additionally, the sensitivity of peptide scores to database size was investigated by searching the site/time-specific metagenome database with increasing numbers of decoy peptides.
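To make the 0.01 false discovery rate threshold concrete, the following is a minimal sketch of a plain target-decoy estimate. It is not the model-based peptide and protein scoring cited above; it is a simpler stand-in chosen for illustration. The function name `score_threshold_at_fdr`, the `(score, is_decoy)` input format, and the toy scores are all assumptions and do not come from the study.

```python
from typing import List, Tuple


def score_threshold_at_fdr(psms: List[Tuple[float, bool]],
                           max_fdr: float = 0.01) -> float:
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: hypothetical (score, is_decoy) pairs; a higher score is a better match.
    The FDR at a cutoff is estimated as (#decoys / #targets) among all
    matches scoring at or above that cutoff.
    Returns float('inf') when no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    cutoff = float("inf")
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Evaluate only after the last match of a group of tied scores.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score  # keep extending the accepted set downward
    return cutoff


# Toy data (invented): 150 target matches and 11 decoy matches.
toy_psms = (
    [(5.0, False)] * 100
    + [(4.0, True)]
    + [(3.0, False)] * 50
    + [(2.0, True)] * 10
)
accepted_cutoff = score_threshold_at_fdr(toy_psms, max_fdr=0.01)
```

In this toy set, accepting everything scoring 3.0 or higher keeps 1 decoy among 150 targets, while lowering the cutoff to 2.0 would admit 10 more decoys and exceed the limit.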
Data Processing Protocol
The number of experimental peptide spectra that yielded spectrum matches differed markedly among databases. The highest number of confidently scored unique peptide matches and protein inferences resulted from the search against the site/time-specific metagenome database. This number of peptide matches increased 1.5-fold when the same data were searched against unassembled reads. This ‘metapeptide’ approach (May et al., 2016) avoids sequence loss and potential noise introduced by read assembly (for example, Cantarel et al., 2011). The peptides identified by the four assembled databases overlapped relatively little, suggesting that the different databases cover different parts of the acquired metaproteome (May et al., 2016). In a direct comparison of the unassembled metagenome peptides and env_NR, the metagenome contained more peptides from the metaproteome (May et al., 2016). Additionally, database size, especially in the cases of env_NR and North Pacific, had a substantial impact on search sensitivity, making statistically confident detection of peptides difficult (Supplementary Information 7; May et al., 2016). In agreement with others, we found that large database searches suffer from a loss of statistical power caused by multiple hypothesis testing against the vast number of sequences unrepresented in the expressed metaproteome (Nesvizhskii, 2010; Jagtap et al., 2013; Tanca et al., 2013). This paradox of too many sequences resulting in too few identifications will become increasingly problematic with the availability of more sequence data. Our results point to the success obtained by searching a metaproteome-specific database that excludes non-specific sequences, while balancing the need to retain a sufficient amount of sequence variation.
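The loss of statistical power with database size can be illustrated with a short calculation. This is a simplified sketch that assumes every candidate peptide is an independent random match with the same chance of exceeding a score cutoff, which real searches only approximate. The function names and the example probabilities are hypothetical and are not taken from the study.

```python
import math


def prob_random_hit(p_single: float, n_candidates: int) -> float:
    """Chance that at least one of n independent random candidates
    scores above the cutoff: 1 - (1 - p)^n, for 0 <= p_single < 1.
    Uses log1p/expm1 to stay accurate for very small p.
    """
    return -math.expm1(n_candidates * math.log1p(-p_single))


def per_candidate_p_needed(alpha: float, n_candidates: int) -> float:
    """Per-candidate probability that keeps the overall chance of a
    random hit at alpha (Sidak-style correction, independence assumed).
    """
    return -math.expm1(math.log1p(-alpha) / n_candidates)


if __name__ == "__main__":
    # Hypothetical database sizes (candidate peptides per spectrum).
    for n in (10**3, 10**5, 10**7):
        print(n, prob_random_hit(1e-6, n), per_candidate_p_needed(0.01, n))
```

Under these assumptions, the per-candidate probability required to hold a fixed overall error level shrinks roughly in proportion to the number of candidates, so a true match must score ever higher to remain confident as unrelated sequences are added.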
Timmins-Schiffman E, May DH, Mikan M, Riffle M, Frazar C, Harvey HR, Noble WS, Nunn BL. Critical decisions in metaproteomics: achieving high confidence protein annotations in a sea of unknowns. ISME J. 2017 Feb;11(2):309-314 PubMed: 27824341