Quantification

The simplest approach to quantifying gene expression by RNA-seq is to count the number of reads that map (i.e. align) to each gene (read count) using programs such as HTSeq-count. This gene-level quantification approach utilises a gene transfer format (GTF) file containing gene models, with each model representing the structure of transcripts produced by a given gene.

Raw read counts are affected by factors such as transcript length (longer transcripts have higher read counts, at the same expression level) and total number of reads. Thus, if we want to compare expression levels between samples, we need to normalise the raw read counts. The measure RPKM (reads per kilobase of exon model per million reads) and its derivative FPKM (fragments per kilobase of exon model per million reads mapped) account for both gene length and library size effects

Correcting for gene length is not necessary when comparing changes in gene expression within the same gene across samples. However, it is necessary for correctly ranking gene expression levels within the sample to account for the fact that longer genes accumulate more reads (at the same expression level).