Data curation and analysis

All data are manually curated

Expression Atlas contains thousands of selected microarray and RNA-sequencing (RNA-seq) experiments from public repositories such as ArrayExpress and the European Nucleotide Archive (ENA) at EMBL-EBI, and Gene Expression Omnibus (GEO) at NCBI. Controlled-access datasets from the European Genome-phenome Archive (EGA) and the database of Genotypes and Phenotypes (dbGaP) are also selected and included in Expression Atlas. The current criteria for selecting and including a gene expression dataset in Expression Atlas are:

  • the study must be of general interest
  • it must be performed on a species for which a good-quality reference genome build is available
  • for microarray data, it must be possible to re-annotate the array design against Ensembl
  • the study must include at least three biological replicates
  • clear experimental variables must be available
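
The criteria above can be read as a simple checklist. The following is a minimal sketch of such a check, not the curators' actual tooling: the `Dataset` class, its field names and the `failed_criteria` function are all hypothetical, and judgements such as "general interest" or "good quality" are reduced to yes/no flags that a curator would set by hand. How replicates are counted (per condition or overall) is also left as an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical summary of a candidate dataset (field names are illustrative)."""
    technology: str                # "microarray" or "rnaseq"
    general_interest: bool         # curator's judgement
    has_reference_genome: bool     # good-quality reference genome build exists
    array_reannotatable: bool      # only relevant for microarray data
    biological_replicates: int     # assumed to be the smallest replicate count
    experimental_variables: list[str] = field(default_factory=list)


def failed_criteria(ds: Dataset) -> list[str]:
    """Return the names of the selection criteria that the dataset does not meet."""
    failures = []
    if not ds.general_interest:
        failures.append("general interest")
    if not ds.has_reference_genome:
        failures.append("reference genome")
    # The re-annotation requirement applies to microarray data only.
    if ds.technology == "microarray" and not ds.array_reannotatable:
        failures.append("array re-annotation")
    if ds.biological_replicates < 3:
        failures.append("biological replicates")
    if not ds.experimental_variables:
        failures.append("experimental variables")
    return failures
```

An empty list from `failed_criteria` means the dataset passes every check in this simplified model. In practice the decision is made by curators, as described below.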

The selected datasets are manually curated by PhD biologists. Curation in Expression Atlas involves a critical review of each dataset to provide a comprehensive representation of gene expression data. We extract and structure information from the literature to enrich the annotation of each sample by adding more metadata.

All data are re-analysed using standardised methods

Expression Atlas has re-analysed more than 3,000 experiments. Microarray raw data are analysed using different packages from Bioconductor depending on the array platform used to perform the experiment.

More than 500 RNA-seq experiments have been re-analysed by Expression Atlas. RNA-seq data are analysed using the open-source iRAP pipeline, which is available on GitHub. RNA-seq experiments in Expression Atlas include large landmark studies such as GTEx, CCLE, ENCODE and HipSci (Figure 2).

Figure 2. Expression Atlas includes large landmark RNA-seq studies.