Variant identification and analysis

A common workflow in human genetic variation studies is to identify and analyse variants associated with a specific trait or population. Bioinformatics is key to each stage of this process and is essential for handling genome-scale data. It also provides a standardised framework for describing variants.

In this section we will learn about the major steps in the process of variant calling, the VCF file format and variant identifiers. We will also examine the value of prediction in determining the impact of variation on protein function and structure.

What is variant calling?

Variant calling is the process by which we identify variants from sequence data (Figure 11). It has three main steps:

  1. Carry out whole genome or whole exome sequencing to create FASTQ files.
  2. Align the sequences to a reference genome, creating BAM or CRAM files.
  3. Identify where the aligned reads differ from the reference genome and write to a VCF file.
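The third step can be illustrated with a toy sketch in Python. This is not how production variant callers work: it assumes ungapped reads given as hypothetical (start, sequence) pairs instead of BAM or CRAM records, it uses an arbitrary minimum of two supporting reads, and the function names `pileup` and `call_mismatches` are invented for this example. It only shows the idea of comparing aligned bases with the reference and writing each difference as a VCF-style line.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Count the bases observed at each 1-based reference position.

    `reads` is a list of (start, sequence) pairs, where `start` is the
    1-based position at which an ungapped read aligns to the reference.
    """
    counts = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            counts[start + offset][base] += 1
    return counts


def call_mismatches(reference, reads, chrom="chr1", min_alt_reads=2):
    """Return VCF-style data lines where reads differ from the reference."""
    lines = []
    counts = pileup(reads)
    for pos in sorted(counts):
        ref_base = reference[pos - 1]
        depth = sum(counts[pos].values())
        for base, n_reads in sorted(counts[pos].items()):
            # Require a minimum number of supporting reads (an arbitrary
            # cut-off chosen for this sketch) to ignore isolated errors.
            if base != ref_base and n_reads >= min_alt_reads:
                # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
                fields = [chrom, str(pos), ".", ref_base, base, ".", ".", f"DP={depth}"]
                lines.append("\t".join(fields))
    return lines


# Toy data: four short reads covering an eight-base reference.
reference = "ACGTACGT"
reads = [(1, "ACGTA"), (2, "CGAAC"), (3, "GAACG"), (4, "TACGT")]
for line in call_mismatches(reference, reads):
    print(line)
```

With this toy input, two of the four reads covering position 4 carry an A where the reference has a T, so a single line is reported for that position.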

Figure 11 Reads from a CRAM file aligned to a reference genomic region, as visualised in Ensembl. Differences from the reference are highlighted in red in the reads and will be called as variants.

Somatic versus germline variant calling

In germline variant calling, reads are compared with the standard reference genome for the species of interest. This allows us to identify genotypes. As the human genome is diploid, we expect that at any given locus either all reads have the same base, indicating homozygosity, or approximately half of the reads have one base and half have another, indicating heterozygosity. An exception is the sex chromosomes in male mammals, where most of the X and Y chromosomes are present as a single copy.
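The read-fraction reasoning for a diploid locus can be sketched as a small Python function. The function name `genotype_from_counts` and the cut-offs of 0.2 and 0.8 are assumptions made for illustration only; real callers generally use statistical models rather than fixed thresholds. Genotypes are written in the VCF style, where 0 is the reference allele and 1 is the alternate allele.

```python
def genotype_from_counts(ref_reads, alt_reads, low=0.2, high=0.8):
    """Guess a diploid genotype from read counts at one locus.

    `low` and `high` are illustrative cut-offs on the fraction of reads
    carrying the alternate base, not values from any real caller.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        return "./."  # no coverage, so the genotype is unknown
    alt_fraction = alt_reads / depth
    if alt_fraction < low:
        return "0/0"  # (almost) all reads match the reference: homozygous reference
    if alt_fraction > high:
        return "1/1"  # (almost) all reads carry the alternate base: homozygous alternate
    return "0/1"  # roughly half and half: heterozygous


print(genotype_from_counts(30, 0))
print(genotype_from_counts(14, 16))
print(genotype_from_counts(0, 30))
```

The wide middle band allows for the sampling noise that makes a heterozygous site rarely show an exact 50:50 split.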

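In somatic variant calling, the reference for comparison is a related tissue from the same individual. Here, we expect to see mosaicism between cells, so a variant may appear in only a fraction of the reads rather than in all or about half of them.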
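As a rough sketch of this comparison, the Python below contrasts the fraction of reads supporting a variant in the sample of interest with the fraction in the matched tissue from the same individual. The function names and both thresholds are hypothetical and chosen only for illustration; they are not taken from any particular somatic caller.

```python
def variant_allele_fraction(alt_reads, depth):
    """Fraction of reads at a locus that carry the alternate base."""
    return alt_reads / depth if depth > 0 else None


def is_candidate_somatic(sample_alt, sample_depth, matched_alt, matched_depth,
                         min_sample_vaf=0.05, max_matched_vaf=0.01):
    """Flag a variant seen in the sample but (nearly) absent from the matched tissue.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    sample_vaf = variant_allele_fraction(sample_alt, sample_depth)
    matched_vaf = variant_allele_fraction(matched_alt, matched_depth)
    if sample_vaf is None or matched_vaf is None:
        return False  # cannot decide without coverage in both tissues
    return sample_vaf >= min_sample_vaf and matched_vaf <= max_matched_vaf


# Present in 10% of sample reads and absent from the matched tissue.
print(is_candidate_somatic(5, 50, 0, 40))
# Present at about 50% in both tissues, as expected for a germline heterozygote.
print(is_candidate_somatic(25, 50, 20, 40))
```

Because of mosaicism, the fraction in the sample can be far below one half, which is why the sketch does not apply the diploid expectation used in germline calling.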