Methods and protocols

Antibiotic lookup

Each compound is queried through the Ontology Lookup Service (OLS), first against the Antibiotic Resistance Ontology (ARO) and then, if no record is found, against Chemical Entities of Biological Interest (ChEBI). ARO queries are restricted to all children of ARO:1000003 (antibiotic molecule). ChEBI queries are similarly restricted to all children of CHEBI:33281 (antimicrobial agent). If neither ontology returns a record, we omit the ontology details for that compound.
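The fallback order described above can be sketched as follows. This is a minimal illustration, not the portal's implementation: the `search` callable, its signature, the `fake_search` stand-in, and the placeholder term identifiers are all hypothetical, and a real version would wrap a call to the Ontology Lookup Service.

```python
from typing import Callable, Optional

# Ontologies and root terms quoted in the text, in query order.
ONTOLOGY_ROOTS = [
    ("aro", "ARO:1000003"),    # antibiotic molecule
    ("chebi", "CHEBI:33281"),  # antimicrobial agent
]

# Hypothetical signature: search(name, ontology, root) -> term ID or None.
SearchFn = Callable[[str, str, str], Optional[str]]


def lookup_antibiotic(name: str, search: SearchFn) -> Optional[dict]:
    """Return the first ontology hit for `name`, trying ARO before ChEBI."""
    for ontology, root in ONTOLOGY_ROOTS:
        term_id = search(name, ontology, root)
        if term_id is not None:
            return {"name": name, "ontology": ontology, "term_id": term_id}
    return None  # no record in either ontology: details are omitted


def fake_search(name: str, ontology: str, root: str) -> Optional[str]:
    """Stand-in for a real OLS query; the IDs below are placeholders."""
    table = {
        ("aro", "drug_a"): "ARO:PLACEHOLDER",
        ("chebi", "drug_a"): "CHEBI:PLACEHOLDER_A",
        ("chebi", "drug_b"): "CHEBI:PLACEHOLDER_B",
    }
    return table.get((ontology, name))


hit_aro = lookup_antibiotic("drug_a", fake_search)    # found in ARO first
hit_chebi = lookup_antibiotic("drug_b", fake_search)  # falls back to ChEBI
miss = lookup_antibiotic("drug_c", fake_search)       # not found anywhere
```

Passing the search function in as an argument keeps the fallback logic separate from the network call, so the ordering can be checked without contacting any service.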

Phenotype data generation

Literature data sets reporting minimum inhibitory concentration (MIC) values were curated by the Comprehensive Assessment of Bacterial-Based AMR prediction from Genotypes (CABBAGE) project, based at Imperial College London. These data were further harmonised by the Samples, Phenotypes, and Ontologies Team at EMBL-EBI, which normalised the following:

Antibiotic names and ontological terminology
Accessions linking to BioSamples and ENA
Controlled vocabulary for term restricted fields
Column naming conventions

Where possible, the original CABBAGE data have been retained. In addition to this normalisation, these records will be brokered back to BioSamples, ensuring that the data persist beyond the portal.

Genotype data generation

Genomes from the Comprehensive Assessment of Bacterial-Based AMR prediction from Genotypes (CABBAGE) project were annotated with mettannotator, which provides exhaustive annotation of prokaryotic genomes. AMRFinderPlus and UniFire (the UniProt Functional annotation Inference Rule Engine) are executed on these annotations to provide predictions of AMR.

Annotation using mettannotator

mettannotator is a bioinformatics pipeline that generates an exhaustive annotation of prokaryotic genomes using existing tools. The output is a GFF file that integrates the results of all pipeline components. Version 4.0.23 of AMRFinderPlus was used alongside database version 4.0 2025-07-16.1. The mettannotator version used to annotate each genome in the portal is shown in the "Annotation tool version" column.

Mettannotator can be executed in two modes: Fast or Full. Fast mode skips InterProScan, UniFire, and SanntiS. The mode that was used for each genome is shown in the "Annotation tool mode" column. Over time, more genomes will be reannotated in Full mode.

Parsing of results

The GFF and AMRFinderPlus outputs are parsed using our Python tool. We select records whose element type is AMR and exclude those annotated as STRESS or VIRULENCE. If a genome is processed but no AMR annotations are identified, its genotype is not displayed in the portal and is not included in the Parquet files; however, the full annotation remains available for download via FTP. For records that are supported by a hidden Markov model (HMM), the HMM accession is recorded.
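The selection step can be illustrated with a short sketch. This is not the portal's parser: the column names `Type` and `HMM accession`, the use of `NA` for missing values, and the sample rows are assumptions made for the example and should be checked against the header of the AMRFinderPlus release in use.

```python
import csv
import io


def select_amr_records(tsv_text, type_col="Type", hmm_col="HMM accession"):
    """Keep rows typed AMR; note the HMM accession when one is present."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        # STRESS, VIRULENCE and any other non-AMR types are skipped.
        if row.get(type_col) != "AMR":
            continue
        hmm = row.get(hmm_col) or ""
        # Assumption: an empty value or "NA" means no HMM support.
        row["hmm_accession"] = None if hmm in ("", "NA") else hmm
        selected.append(row)
    return selected


# Illustrative rows only; gene symbols and the HMM accession are made up.
header = ["Element symbol", "Type", "Class", "Subclass", "HMM accession"]
rows = [
    ["geneA", "AMR", "AMINOGLYCOSIDE", "STREPTOMYCIN", "HMM_EXAMPLE_1"],
    ["geneB", "AMR", "BETA-LACTAM", "BETA-LACTAM", "NA"],
    ["geneC", "STRESS", "COPPER", "COPPER", "NA"],
]
sample_tsv = "\n".join("\t".join(fields) for fields in [header] + rows)

amr_records = select_amr_records(sample_tsv)
```

A genome for which `select_amr_records` returns an empty list corresponds to the case above where no genotype is displayed in the portal.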

Normalisation

Additional antibiotic processing

Antibiotic records are taken from the class and subclass output from AMRFinderPlus. Where these outputs are the same, we indicate that the annotation is at the level of a class of antimicrobial compound. Where they differ, we assume the subclass refers to a specific compound. Where the class is set to a value such as AMINOGLYCOSIDE and the subclass is set to SPECTINOMYCIN/STREPTOMYCIN, we interpret this as two separate AMR calls and represent it as two records in our resource. We then use the previously described algorithm to retrieve the ontological term.
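The class and subclass rules above can be sketched as a small function. The function name, the record fields, and the `level` labels are hypothetical choices for this illustration; only the cases described in the text are handled.

```python
from __future__ import annotations


def expand_antibiotic_calls(amr_class: str, amr_subclass: str) -> list[dict]:
    """Turn one class/subclass pair into one record per antibiotic call."""
    if amr_class == amr_subclass:
        # Identical values: the annotation is at the level of a drug class.
        return [{"antibiotic": amr_class, "level": "class"}]
    # Differing values: each "/"-separated subclass entry is treated as a
    # separate call on a specific compound.
    return [
        {"antibiotic": part.strip(), "level": "compound"}
        for part in amr_subclass.split("/")
        if part.strip()
    ]


two_calls = expand_antibiotic_calls("AMINOGLYCOSIDE", "SPECTINOMYCIN/STREPTOMYCIN")
class_call = expand_antibiotic_calls("BETA-LACTAM", "BETA-LACTAM")
```

Each returned record would then be passed to the ontology lookup to retrieve its term.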

Archive identifiers

We ensure records are linked to the following archives:

assembly_ID: Genome Collection Accession (GCA) or Third Party Annotation (ERZ), made available by the European Nucleotide Archive (ENA), the National Center for Biotechnology Information (NCBI), and the International Nucleotide Sequence Database Collaboration (INSDC)
BioSample_ID: BioSample identifier available from EMBL-EBI and GenBank. The BioSample is linked to a genotype record via the GCA and retrieved via the European Bioinformatics Institute's (EMBL-EBI) web API.
taxon_id: Node identifier for the genome in the taxonomic tree

Additional data information

Protein identifiers

Protein identifiers are based on the GCA for a genome and an auto-incremented number starting at 00001. Identifiers are consistent and unique within a genome.
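As an illustration of this scheme, the sketch below builds identifiers from an assembly accession and a zero-padded counter. The underscore separator and the fixed width of five digits are assumptions inferred from the "00001" starting value, and the accession shown is a placeholder rather than a real assembly.

```python
def protein_ids(assembly_id: str, n_proteins: int, sep: str = "_") -> list:
    """Generate per-genome protein identifiers numbered from 00001."""
    # The counter restarts for each genome, so identifiers are unique
    # within a genome because the assembly accession is the prefix.
    return [f"{assembly_id}{sep}{i:05d}" for i in range(1, n_proteins + 1)]


# Placeholder accession, used for illustration only.
example_ids = protein_ids("GCA_000000000.1", 3)
```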