MGnify Proteins

Protein sequences are derived from the analysis of publicly available metagenomic assemblies within MGnify using our combined gene caller (which uses both Prodigal and FragGeneScan). Each sequence is assigned an MGYP accession. MGYPs are non-redundant, so proteins with the same sequence are assigned the same MGYP identifier. Each sequence is mapped to the assembly (ERZ) and contig (MGYC) locations where it is found. Biome assignments are taken from the biome assigned to the corresponding assembly in MGnify. The sequences are then clustered using MMseqs/linclust, with the coverage and sequence identity thresholds both set to 90%.
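
The non-redundancy rule above can be illustrated with a minimal sketch. This is not the MGnify pipeline: the function name, the sequential numbering scheme, and the 12-digit zero padding are assumptions made for illustration only. It shows how identical sequences end up sharing one accession.

```python
def assign_accessions(sequences, prefix="MGYP", width=12):
    """Give each distinct sequence one accession; duplicates share it.

    The numbering scheme and padding width are illustrative assumptions,
    not the way MGnify actually mints MGYP accessions.
    """
    accession_of = {}   # sequence -> accession
    assigned = []       # accession for each input sequence, in order
    for seq in sequences:
        key = seq.upper()
        if key not in accession_of:
            # First time this sequence is seen: mint the next accession.
            accession_of[key] = f"{prefix}{len(accession_of) + 1:0{width}d}"
        assigned.append(accession_of[key])
    return assigned


# Toy input: the first and third proteins have the same sequence.
seqs = ["MKTAYIAK", "MSLLTEVE", "MKTAYIAK"]
ids = assign_accessions(seqs)
```

Here the first and third entries of `ids` are the same identifier, while the second is different.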

Functional annotations are provided in two different ways. Pfam annotations are produced by running HMMER with the Pfam gathering (significance) thresholds, i.e. using the --cut_ga option in HMMER. We also provide annotations produced by ProtENN2, which uses convolutional neural networks to label each residue in the database with a Pfam family (or clan); these per-residue labels are then converted into domain calls.
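
As a rough illustration of the HMMER step, the sketch below assembles a command line that applies the Pfam gathering thresholds (HMMER spells the option `--cut_ga`). The choice of `hmmsearch` rather than another HMMER program, the `--domtblout` output, and all file names are assumptions; the text above only states that HMMER is run with the gathering thresholds.

```python
import shlex


def build_hmmsearch_command(hmm_db, fasta, domtblout):
    """Return an hmmsearch argument list that uses Pfam gathering thresholds.

    File names are placeholders supplied by the caller.
    """
    return [
        "hmmsearch",
        "--cut_ga",                 # use each model's GA (gathering) cutoffs
        "--domtblout", domtblout,   # per-domain tabular output
        hmm_db,                     # profile HMM database, e.g. Pfam-A.hmm
        fasta,                      # protein sequences to annotate
    ]


cmd = build_hmmsearch_command("Pfam-A.hmm", "proteins.fasta", "pfam_hits.domtblout")
print(shlex.join(cmd))
```

With HMMER installed and the input files present, the list could be passed to `subprocess.run(cmd, check=True)`.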

In the FASTA files, each header line includes the fields FL (1 = full-length sequence, 0 = partial sequence) and CR (1 = cluster representative).
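
The FL and CR fields can be read with a few lines of Python. This sketch assumes the header is the accession followed by whitespace-separated `KEY=value` pairs; that layout and the example header are hypothetical, so check a real file before relying on it.

```python
def parse_header(line):
    """Split a FASTA header into (accession, {field: value}).

    Assumes '>ACCESSION KEY=value KEY=value ...'; tokens without '=' are ignored.
    """
    tokens = line.strip().lstrip(">").split()
    accession = tokens[0]
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return accession, fields


def is_full_length_representative(line):
    """True if the header marks a full-length cluster representative."""
    _, fields = parse_header(line)
    return fields.get("FL") == "1" and fields.get("CR") == "1"


# Hypothetical header, for illustration only.
example = ">MGYP000000000001 FL=1 CR=0"
accession, fields = parse_header(example)
```

A filter such as `is_full_length_representative` could be applied to each header while streaming a file to keep only full-length cluster representatives.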
