What is variant calling?

Variant calling is the process by which we identify variants from sequence data (Figure 11).

For whole-genome or whole-exome variant calling, we follow a three-step process:

  1. Carry out whole genome or whole exome sequencing to create FASTQ files.
  2. Align the sequences to a reference genome, creating BAM or CRAM files.
  3. Identify where the aligned reads differ from the reference genome and write to a VCF file.
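To make step 3 concrete, here is a minimal sketch of comparing aligned reads with a reference. It assumes a toy representation that is not a real file format: each read is a (start, sequence) pair aligned without gaps, and the function names, the `min_alt_reads` threshold and the `chr_toy` label are hypothetical. Real variant callers work on BAM or CRAM files and also handle base quality, mapping quality and indels.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Collect the bases observed at each reference position.

    `reads` is a list of (start, sequence) pairs with 0-based starts and
    gap-free alignments (a simplification of a BAM/CRAM record).
    """
    columns = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return columns


def call_differences(reference, reads, min_alt_reads=2):
    """Return (pos, ref, alt, alt_count, depth) where reads differ from the reference.

    `min_alt_reads` is an illustrative threshold, not a recommended value.
    """
    calls = []
    for pos, counts in sorted(pileup(reads).items()):
        ref_base = reference[pos]
        depth = sum(counts.values())
        for alt, n in sorted(counts.items()):
            if alt != ref_base and n >= min_alt_reads:
                calls.append((pos + 1, ref_base, alt, n, depth))  # 1-based position
    return calls


reference = "ACGTACGTAC"
reads = [
    (0, "ACGTAC"),
    (2, "GTTCGT"),   # carries T instead of A at reference position 5
    (3, "TTCGTAC"),  # carries the same T
    (4, "ACGTAC"),
]

# Print simplified records loosely modelled on VCF columns (not valid VCF).
for pos, ref, alt, n, depth in call_differences(reference, reads):
    print(f"chr_toy\t{pos}\t{ref}\t{alt}\talt_reads={n}\tdepth={depth}")
```

With the toy data above, the sketch reports one difference: an A to T change at position 5, supported by 2 of the 4 reads covering that position.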

Figure 11 A CRAM file aligned to a reference genomic region as visualised in Ensembl. Differences are highlighted in red in the reads, and will be called as variants.

Somatic versus germline variant calling

In germline variant calling, the reference genome is the standard for the species of interest. Comparing reads against it allows us to identify genotypes. As the genomes of many organisms, including humans, are diploid, we expect that at any given locus either all reads carry the same base, indicating homozygosity, or approximately half of the reads carry one base and half carry another, indicating heterozygosity. An exception is the sex chromosomes in male mammals, where most of the X and Y chromosomes are present as a single copy.
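The diploid expectation can be sketched as a simple rule on read counts. This is an illustration only: the function name and the `het_low` and `het_high` cut-offs are assumptions chosen for the example, and production callers generally use probabilistic models rather than fixed thresholds.

```python
def naive_genotype(ref_reads, alt_reads, het_low=0.2, het_high=0.8):
    """Guess a diploid genotype from read counts at one locus.

    Returns a VCF-style genotype string: "0/0" (homozygous reference),
    "0/1" (heterozygous), "1/1" (homozygous alternate) or "./." (no data).
    The cut-offs are illustrative assumptions, not recommended values.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        return "./."
    alt_fraction = alt_reads / depth
    if alt_fraction < het_low:
        return "0/0"
    if alt_fraction > het_high:
        return "1/1"
    return "0/1"


# Roughly half the reads carry the alternate base: heterozygous.
print(naive_genotype(ref_reads=14, alt_reads=16))
# All reads carry the alternate base: homozygous alternate.
print(naive_genotype(ref_reads=0, alt_reads=28))
```

A rule like this does not apply as written to single-copy regions such as most of the X and Y chromosomes in male mammals, where only one allele is expected.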

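In somatic variant calling, the reference is a related tissue from the same individual. Here, we expect to see mosaicism between cells, so a variant may be carried by only a fraction of the reads rather than by roughly half or all of them.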
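As a rough sketch of this comparison, the snippet below flags a position as a candidate somatic variant when the alternate allele is seen in the sample of interest but is essentially absent from the related tissue. The function names and both thresholds are hypothetical and chosen only for illustration; real somatic callers model sequencing error, contamination and sample purity.

```python
def allele_fraction(alt_reads, depth):
    """Fraction of reads at a locus that carry the alternate base."""
    return alt_reads / depth if depth else 0.0


def is_candidate_somatic(sample_alt, sample_depth, matched_alt, matched_depth,
                         min_sample_vaf=0.05, max_matched_vaf=0.01):
    """Flag a locus where the sample differs from the related tissue.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if sample_depth == 0 or matched_depth == 0:
        return False  # no coverage in one tissue, so no comparison is possible
    sample_vaf = allele_fraction(sample_alt, sample_depth)
    matched_vaf = allele_fraction(matched_alt, matched_depth)
    return sample_vaf >= min_sample_vaf and matched_vaf <= max_matched_vaf


# 12% of reads in the sample carry the variant, none in the related tissue.
print(is_candidate_somatic(12, 100, 0, 90))
# About half the reads in both tissues carry it: consistent with a germline heterozygote.
print(is_candidate_somatic(48, 100, 45, 90))
```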