The basic workflow for analysing NGS raw data depends on the type of experiment (e.g. RNA-seq or ChIP-seq) but generally involves the following steps:
- Mapping of sequence reads to a reference genome;
- Normalisation and statistical analysis.
Table 2 lists some of the typical file formats that are associated with sequencing experiments in ArrayExpress. FASTQ is the most frequent format in which raw sequencing data are submitted, and ENA converts other sequence data files into FASTQ format.
Table 2 Common raw and processed data formats in NGS experiments
|FASTQ||Text file storing raw sequence output together with the quality score for each nucleotide in ASCII code.||FastQC, Fastx toolkit (for quality control)|
|SAM||“Sequence Alignment/Map”. Output of short-read sequence aligners, contains information about the sequence and its alignment to the reference.||SAMtools|
|BAM||“Binary Alignment/Map”. See SAM (BAM is more widely used due to smaller file size).||SAMtools|
|BED||“Browser Extensible Data”. Used for viewing alignments in a genome browser as an annotation track.||Genome browsers, e.g. Ensembl (17), UCSC (18), IGV (19)|
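To illustrate what "quality score for each nucleotide in ASCII code" means in a FASTQ file, the following is a minimal Python sketch that reads records and decodes the quality string into numeric scores. It assumes the common four-lines-per-record layout and a Phred+33 offset (the encoding used by Sanger and recent Illumina output); older files may use a different offset. The function name `parse_fastq` and the example read are hypothetical, and dedicated tools such as FastQC remain the practical choice for real quality control.

```python
from io import StringIO


def parse_fastq(handle, offset=33):
    """Yield (read_id, sequence, quality_scores) from a FASTQ file handle.

    Assumes four lines per record and a Phred+33 quality offset by default.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"Malformed FASTQ header: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one score: ASCII code minus the offset.
        scores = [ord(char) - offset for char in quality]
        yield header[1:], sequence, scores


# Hypothetical single-read example held in memory instead of a file on disk.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

In this example the characters `I`, `5` and `!` decode to Phred scores of 40, 20 and 0 respectively, so the last base of the read would be considered unreliable.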
Many of the analysis platforms mentioned earlier allow you to perform gene expression analysis starting with raw sequence data. These platforms include R/Bioconductor (20), Galaxy (10) and GenePattern (11). Processed data matrices containing RPKM or similar normalised read count values can be analysed analogously to microarray data. Popular R/Bioconductor packages for differential expression analysis of RNA-sequencing data are DESeq2 (21) and edgeR (22).
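As a rough illustration of how an RPKM value (reads per kilobase of transcript per million mapped reads) relates to a raw read count, the sketch below applies the standard definition in Python. The function name `rpkm` and the numbers in the example are hypothetical and serve only to show the arithmetic; they are not taken from any ArrayExpress dataset.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Return reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 10^9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("Gene length and total mapped reads must be positive")
    return read_count * 1_000_000_000 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(value)
```

Here 500 reads on a 2 kb gene in a library of 10 million mapped reads gives an RPKM of 25. Dividing by gene length and library size is what makes values comparable across genes and samples. Note that DESeq2 and edgeR take raw read counts as input and perform their own normalisation, so RPKM matrices are not a suitable starting point for those packages.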