Read mapping or alignment
Once pre-processing has produced high-quality data, the next step is read mapping, also called alignment. There are two main options, depending on whether a genome sequence is available (Figure 10):
- When studying an organism with a reference genome, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome (genome mapping) or transcriptome (transcriptome mapping). Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. Genome mapping therefore allows the discovery of new, unannotated transcripts.
- When working on an organism without a reference genome, reads need to be assembled first into longer contigs (de novo assembly). These contigs can then be considered as the expressed transcriptome to which reads are re-mapped for quantification.
There are many bioinformatics tools available to perform the alignment of short reads. One of the most popular RNA-seq mappers is TopHat, which aligns reads in two steps:
- unspliced reads are mapped to locate exons (with Bowtie)
- unmapped reads are then split and aligned independently to identify exon junctions (9)
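The two steps above can be illustrated with a toy sketch. It is not TopHat's implementation: exact string matching stands in for Bowtie, each unmapped read is split into only two pieces, and the sequences, function names and `min_anchor` parameter are made up for illustration.

```python
def find_all(genome, seq):
    """Return every 0-based position where seq occurs exactly in genome."""
    hits, start = [], genome.find(seq)
    while start != -1:
        hits.append(start)
        start = genome.find(seq, start + 1)
    return hits


def split_align(genome, read, min_anchor):
    """Step 2: split the read in two and place each piece independently.

    Returns (left_start, left_length, right_start) for the first split whose
    right piece lies downstream of the left piece with a gap in between.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        for left_start in find_all(genome, left):
            for right_start in find_all(genome, right):
                if right_start > left_start + cut:
                    return left_start, cut, right_start
    return None


def two_step_align(genome, reads, min_anchor=4):
    """Map whole reads first, then try to split the reads left unmapped."""
    unspliced, spliced = {}, {}
    for read in reads:
        hits = find_all(genome, read)
        if hits:  # Step 1: the read lies entirely within one exon
            unspliced[read] = hits[0]
            continue
        placement = split_align(genome, read, min_anchor)
        if placement is not None:
            spliced[read] = placement
    return unspliced, spliced


# Hypothetical two-exon gene: exon 1, intron, exon 2
exon1, intron, exon2 = "ATGGCCATTC", "GTAAGTTTTTCAG", "CTGAAGGCTA"
genome = exon1 + intron + exon2

reads = [
    "GGCCATT",                 # falls inside exon 1
    exon1[-6:] + exon2[:6],    # spans the exon 1 / exon 2 junction
]
unspliced, spliced = two_step_align(genome, reads)
print(unspliced)
print(spliced)
```

In this sketch, the gap between the two placed pieces of a split read corresponds to the intron, which is how a junction is inferred without any annotation.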
The Expression Atlas pipeline currently uses HISAT2 for RNA-seq read alignment. HISAT2, which stands for “hierarchical indexing for spliced alignment of transcripts 2”, provides accurate results with fast and sensitive alignment. It indexes the reference genome using a graph-based approach and builds on the Bowtie2 algorithm for alignment (11).
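As a rough illustration of how HISAT2 is typically run, the sketch below assembles the two commands involved: building an index with `hisat2-build` and aligning paired-end reads with `hisat2`. The file names, index prefix and thread count are placeholders, and these are not the settings used by the Expression Atlas pipeline. The commands are only printed, not executed.

```python
import shlex


def hisat2_commands(genome_fasta, index_prefix, reads_1, reads_2, out_sam, threads=4):
    """Return (build, align) argument lists for a basic paired-end HISAT2 run."""
    # hisat2-build <reference FASTA> <index prefix>
    build = ["hisat2-build", genome_fasta, index_prefix]
    # -p threads, -x index prefix, -1/-2 paired read files, -S SAM output
    align = [
        "hisat2",
        "-p", str(threads),
        "-x", index_prefix,
        "-1", reads_1,
        "-2", reads_2,
        "-S", out_sam,
    ]
    return build, align


# All paths below are hypothetical placeholders.
build_cmd, align_cmd = hisat2_commands(
    "genome.fa",
    "genome_index",
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "sample.sam",
)
print(shlex.join(build_cmd))
print(shlex.join(align_cmd))
```

With HISAT2 installed, each list could be passed to `subprocess.run(cmd, check=True)` to run the step.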
It is important to check the quality of the mapping process. The percentage of mapped reads is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA: a low percentage suggests that one of the two may be a problem. Tools such as Picard can be used for quality control of the mapping.
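To make the percentage of mapped reads concrete, the following minimal sketch computes it from the FLAG column of SAM records. It is not how Picard works internally. It only shows the arithmetic, and the SAM records are invented for the example. Secondary and supplementary alignments are skipped so that each read is counted once.

```python
UNMAPPED = 0x4         # SAM flag bit: read is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def mapping_rate(sam_lines):
    """Return the percentage of primary records that are mapped."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra alignments of a read that is already counted
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Invented single-end records: three mapped reads, one unmapped read,
# and one secondary alignment of read r1.
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t16\tchr1\t200\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r1\t256\tchr1\t500\t0\t8M\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t300\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
rate = mapping_rate(toy_sam)
print(f"{rate:.1f}% of reads mapped")  # 3 of 4 reads -> 75.0%
```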
Whether a reference or de novo assembly is used, the complete reconstruction of transcriptomes from short reads is challenging. For example, short reads can sometimes align equally well to multiple locations (multi-mapped reads or multi-reads). Paired-end reads reduce the problem of multi-mapping, because the two reads of a pair must map within a certain distance of each other and in a certain order (Figure 8). Finally, long-read technologies, such as SMRT from Pacific Biosciences, provide reads that are long enough to sequence complete transcripts for most genes and are a promising alternative.
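The effect of pairing on multi-mapped reads can be shown with a small sketch. It assumes exact matching, a made-up genome containing one repeated sequence, and a hypothetical `max_fragment` limit. The first read is taken from the forward strand and its mate from the reverse strand further downstream.

```python
def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(genome, seq):
    """Return every 0-based position where seq occurs exactly in genome."""
    hits, start = [], genome.find(seq)
    while start != -1:
        hits.append(start)
        start = genome.find(seq, start + 1)
    return hits


def pair_placements(genome, read1, read2, max_fragment=30):
    """Return (read1_start, mate_start) placements consistent with pairing.

    The mate must lie at or downstream of read1 on the opposite strand, and
    the fragment from the start of read1 to the end of the mate must not
    exceed max_fragment bases.
    """
    mate = revcomp(read2)  # forward-strand sequence covered by the mate
    placements = []
    for p1 in find_all(genome, read1):
        for p2 in find_all(genome, mate):
            fragment = p2 + len(mate) - p1
            if p2 >= p1 and fragment <= max_fragment:
                placements.append((p1, p2))
    return placements


# Made-up genome: the same 8-base repeat occurs twice, 48 bases apart.
repeat = "GATTACAG"
anchor = "ATGCGTCA"  # unique sequence just downstream of the second copy
genome = "CCTCTCCT" + repeat + "T" * 40 + repeat + "CC" + anchor + "GG"

read1 = repeat           # on its own this read is a multi-read
read2 = revcomp(anchor)  # its mate, sequenced from the opposite strand

single_hits = find_all(genome, read1)
paired_hits = pair_placements(genome, read1, read2)
print(single_hits)   # two candidate positions for read1 alone
print(paired_hits)   # one placement once the mate is considered
```

Taken alone, the first read fits both copies of the repeat. Only the second copy has the mate close enough downstream, so the pair has a single placement.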