Gene annotation
Following assembly, the genome needs to be annotated in order to understand the structure and functions of the genetic sequence.
Gene prediction
The first step in annotation is gene prediction, the process of identifying coding regions within the genome, i.e. sequences that encode proteins or functional RNA molecules. Genes can be predicted from the statistical properties of the genetic sequence itself, for example:
- Open reading frames (ORFs) – The regions between start and stop codons that can be translated into protein
- Codon usage bias – Coding regions (i.e. genes) favour certain synonymous codons, so their codon frequencies differ from those seen when non-coding sequence is read in triplets
- GC content – The percentage of guanine (G) and cytosine (C) nucleotides differs between coding and non-coding regions
- Splice site recognition – Intron-exon boundaries in eukaryotic genomes
Alternatively, genes can be predicted from external evidence, for example homology with known genes or available experimental data such as RNA or protein sequences.
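Two of the sequence-based signals listed above, open reading frames and GC content, are simple enough to sketch in a few lines of Python. The example below is a minimal illustration only, not how production gene predictors work: it scans the forward strand only, assumes ATG as the sole start codon and the standard stop codons, and the names `gc_content`, `find_orfs` and `min_codons` are hypothetical choices made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def gc_content(seq: str) -> float:
    """Return the percentage of G and C nucleotides in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Find ATG...stop ORFs in the three forward reading frames.

    Returns (start, end, frame) tuples with 0-based start and exclusive end,
    so seq[start:end] is the ORF including its stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon counted).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


example = "CCATGAAATTTTAGCC"
print(find_orfs(example))   # [(2, 14, 2)]
print(gc_content("ATGC"))   # 50.0
```

A real gene finder would also search the reverse complement, allow alternative start codons, and combine these signals with codon usage statistics rather than relying on ORF length alone.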
Annotation approaches
Once the genes have been predicted, they can be assigned a biological function. Functional annotation is a complex process that combines several approaches, such as:
- Sequence similarity – Similar nucleotide or protein sequences often imply similar functions
- Protein domains – Conserved protein domains can be indicative of specific functions
- Motifs and signal peptides – Short conserved sequences can be associated with specific functions
- Experimental data – Functions can be elucidated from data such as RNA-seq, protein interactions, or gene knock-out experiments
- Pathway analysis – Mapping genes to biological pathways allows their functions in metabolic or regulatory pathways to be inferred
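As a small illustration of the motif-based approach in the list above, the sketch below scans a protein sequence for a short conserved pattern using a regular expression. It is a toy example under stated assumptions: the pattern used is the PROSITE-style N-glycosylation consensus N-{P}-[ST]-{P}, chosen purely because it is compact, and the function name `find_motif` and the example sequence are hypothetical. A match only suggests a candidate site; it is not evidence of function on its own.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead (?=...) lets overlapping occurrences all be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str, pattern: re.Pattern = N_GLYCOSYLATION) -> list[tuple[int, str]]:
    """Return (0-based position, matched residues) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]


# Hypothetical peptide: the site at position 2 matches, while the
# asparagine at position 7 is followed by proline and is rejected.
print(find_motif("MKNGSALNPTQ"))  # [(2, 'NGSA')]
```

Dedicated annotation tools use curated pattern and profile databases rather than a single hand-written expression, but the underlying idea of matching conserved sequence signatures is the same.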
The development of automated annotation pipelines (e.g. Ensembl) and bioinformatic tools (e.g. Bakta, Prokka) makes high-throughput functional annotation of large datasets possible. Nevertheless, manual curation by experts remains an important practice because it helps to ensure the accuracy of annotations.