Gene annotation

Following assembly, the genome needs to be annotated in order to understand the structure and functions of the genetic sequence.

Gene prediction

The first step in annotation is gene prediction: identifying the genes within the genome, i.e. the sequences that encode proteins or functional RNA molecules. Genes can be predicted from the statistical properties of the genetic sequence itself, for example:

  • Open reading frames (ORFs) – The regions between start and stop codons that can be translated into protein
  • Codon usage bias – Coding regions (i.e. genes) use certain codons at different frequencies compared to non-coding regions 
  • GC content – The percentage of guanine (G) and cytosine (C) nucleotides often differs between coding and non-coding regions
  • Splice site recognition – Conserved sequence signals mark the intron-exon boundaries in eukaryotic genomes
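
As a rough illustration of the first and third properties above, the sketch below scans the three forward reading frames of a DNA string for ORFs and computes GC content. It is a minimal example under stated assumptions, not how production gene predictors work: the function names, the `min_codons` threshold, and the test sequence are made up for illustration, only ATG is treated as a start codon, and the reverse strand is ignored.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def gc_content(seq):
    """Return the fraction of G and C nucleotides in seq (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the start codon, end is the index just
    past the stop codon. min_codons (an illustrative threshold) counts the
    start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == START:
                    orf_start = i
            elif codon in STOPS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None
    return orfs


# Made-up sequence with one short ORF (ATG AAA TTT GGG TAA) in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
print(gc_content("ATGC"))                # 0.5
```

Real genomes contain many short ORFs that arise by chance, which is one reason predictors combine ORF detection with the other signals listed above.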

Alternatively, genes can be predicted from external evidence, for example homology with known genes or experimental data such as RNA or protein sequences.

Annotation approaches

Once the genes have been predicted, they can be assigned a biological function. Functional annotation is a complex process that draws on several approaches, such as:

  • Sequence similarity – similar nucleotide or protein sequences often imply similar functions
  • Protein domains – conserved protein domains can be indicative of specific functions
  • Motifs and signal peptides – short conserved sequences can have designated functions 
  • Experimental data – functions can be elucidated from data such as RNA-seq, protein interactions, or gene knock-out experiments  
  • Pathway analysis – mapping genes to known biological pathways can suggest their roles in metabolic or regulatory processes
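
To make the motif-based approach more concrete, the sketch below searches a protein sequence for the N-glycosylation site pattern N-{P}-[ST]-{P} (PROSITE entry PS00001), written here as a regular expression. The function name and the example sequence are made up for illustration, and a match only flags a candidate site: such short patterns also occur by chance, so hits would normally be weighed against other evidence.

```python
import re

# N-glycosylation pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro. The lookahead lets overlapping
# matches be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return (position, matched_text) for every match, using 0-based positions."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein.upper())]


# Made-up sequence: "NVSA" matches, while "NPTQ" is rejected because of the proline.
print(find_motif("MKNVSALNPTQ"))  # [(2, 'NVSA')]
```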

Automated annotation pipelines (e.g. Ensembl) and bioinformatic tools (e.g. Bakta, Prokka) make high-throughput functional annotation of large datasets possible. Nevertheless, manual curation by experts remains important because it helps ensure the accuracy of annotations.
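
One way automated output can feed into manual curation is sketched below: it reads feature lines in GFF3 format (nine tab-separated columns, with `key=value` attributes in the last one) and lists coding sequences whose product is missing or labelled "hypothetical protein". This is a minimal sketch under stated assumptions. The function names and the example records are made up for illustration, it relies on the tool writing a `product` attribute, and it does not decode URL-escaped attribute values.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict.

    Returns None for blank lines, comments/directives, and any line that
    does not have the nine tab-separated GFF3 columns (e.g. embedded FASTA).
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        return None
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }


def needs_review(lines):
    """Return IDs of CDS features with no product or a 'hypothetical protein' product."""
    flagged = []
    for line in lines:
        feature = parse_gff3_line(line)
        if feature is None or feature["type"] != "CDS":
            continue
        product = feature["attributes"].get("product", "")
        if product.lower() in ("", "hypothetical protein"):
            flagged.append(feature["attributes"].get("ID", "unknown"))
    return flagged


# Made-up records for illustration only.
example = [
    "##gff-version 3",
    "contig1\texample\tCDS\t100\t400\t.\t+\t0\tID=gene_0001;product=DNA polymerase III subunit beta",
    "contig1\texample\tCDS\t500\t800\t.\t-\t0\tID=gene_0002;product=hypothetical protein",
]
print(needs_review(example))  # ['gene_0002']
```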