Data curation and analysis
All data are manually curated
Expression Atlas contains thousands of selected microarray and RNA-sequencing (RNA-seq) experiments from public repositories such as ArrayExpress and the European Nucleotide Archive (ENA) at EMBL-EBI, and the Gene Expression Omnibus (GEO) at NCBI. Controlled-access datasets from the European Genome-phenome Archive (EGA) and the database of Genotypes and Phenotypes (dbGaP) are also selected and included in Expression Atlas. The current criteria for selecting and including a gene expression dataset in Expression Atlas are:
- the study must be of general interest
- it must be performed on a species for which a good-quality reference genome build is available
- for microarray data, it must be possible to re-annotate the array design against Ensembl
- the study must include at least three biological replicates
- clear experimental variables must be available
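The criteria above can be read as a simple checklist. The following is a minimal illustrative sketch only, not the curators' actual tooling: the `Dataset` class, its field names and the `failed_criteria` helper are all hypothetical, and judgements such as "general interest" are made by curators, so they appear here as pre-filled flags. How replicates are counted (per study or per experimental group) is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Hypothetical summary of a candidate dataset (all field names are assumptions)."""
    accession: str
    technology: str              # "microarray" or "rnaseq"
    general_interest: bool       # curator judgement
    has_reference_genome: bool   # good-quality genome build available for the species
    array_reannotatable: bool    # array design can be re-annotated against Ensembl
    biological_replicates: int   # counted as the curator sees fit (assumption)
    has_clear_variables: bool    # clear experimental variables are available


def failed_criteria(ds: Dataset) -> List[str]:
    """Return the list of inclusion criteria that the dataset does not meet."""
    failures = []
    if not ds.general_interest:
        failures.append("not of general interest")
    if not ds.has_reference_genome:
        failures.append("no good-quality reference genome build")
    # The re-annotation requirement only applies to microarray data.
    if ds.technology == "microarray" and not ds.array_reannotatable:
        failures.append("array design cannot be re-annotated against Ensembl")
    if ds.biological_replicates < 3:
        failures.append("fewer than three biological replicates")
    if not ds.has_clear_variables:
        failures.append("no clear experimental variables")
    return failures


def is_eligible(ds: Dataset) -> bool:
    """A dataset is eligible only if it meets every criterion."""
    return not failed_criteria(ds)
```

Returning the list of unmet criteria, rather than a bare yes/no, keeps the reason for a rejection visible, which is useful when a dataset is reviewed again later.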
The selected datasets are manually curated by PhD biologists. Curation in Expression Atlas involves a critical review of each dataset to provide a comprehensive representation of gene expression data. We extract and structure information from the literature to enrich the annotation of each sample by adding more metadata.
All data are re-analysed using standardised methods
Expression Atlas has re-analysed more than 3,000 experiments. Raw microarray data are analysed using Bioconductor packages, with the choice of package depending on the array platform used in the experiment.
More than 500 RNA-seq experiments have been re-analysed by Expression Atlas. RNA-seq data are analysed using the open-source iRAP pipeline, which is available on GitHub. RNA-seq experiments in Expression Atlas include large landmark studies such as GTEx, CCLE, ENCODE and HipSci (Figure 2).