Frequently Asked Questions

General questions


What is Expression Atlas

Expression Atlas is a resource to query gene and protein expression data across species and biological conditions and to visualise down-stream analysis results to explore co-expression. Queries can be either in a baseline context, e.g. find genes expressed in the macaque brain, or in a differential context, e.g. find genes that are up or downregulated in response to auxin in Arabidopsis. Expression Atlas contains thousands of selected microarray and RNA-sequencing data that are manually curated and annotated with ontology terms, checked for high quality and re-analysed using standardised methods.

How are experiments chosen to be in Expression Atlas?

Experiments in Expression Atlas are selected from ArrayExpress that is an archive of functional genomics data. Different criteria are used for including experiments in Baseline Atlas or in Differential Atlas. All experiments in Baseline Atlas are RNA-seq data with good quality raw files and a broad coverage across conditions (e.g. tissues, cell types, developmental stages). A good quality reference genome build available is also required for RNA-seq data. All experiments in Differential Atlas have enough number of replicates to allow robust statistical analysis. Apart from that, microarray data must be gene expression data (rather than ChiP-chip or CGH for example) and it should be possible to re-annotate the array design against Ensembl.

How can I obtain the original data for an experiment?

The original raw and processed data files for experiments in Expression Atlas can be found by viewing the experiment in ArrayExpress. On any Experiment page, e.g. RNA-seq of 934 human cancer cell lines from the Cancer Cell Line Encyclopedia click on the array express button button in the top right corner to view the experiment in ArrayExpress. The original submitted data files can be downloaded as zip archives, and sample annotation is available in MAGE-TAB format text files. See the online tutorial ArrayExpress: Discover functional genomics data quickly and easily for further information.

It doesn't look like the whole of the original experiment is in Expression Atlas, why not?

We sometimes only include part of an experiment in Expression Atlas because (1) there are not sufficient replicates of all the sample groups within an experiment, or (2) the hybridization or sequencing was not of high enough quality. If there are still enough assays in the experiment after the removal of those with too few replicates or low quality then we continue processing the experiment for Expression Atlas.

How is microarray data quality controlled?

Microarray data quality is assessed using the arrayQualityMetrics package in R. Outlier arrays are detected using distance measures, boxplots, and MA plots. If an array is classed as an outlier by all three methods, it is excluded from further analysis. Please see the arrayQualityMetrics documentation for more details on the methods used.

How is RNA-seq data quality controlled?

RNA-seq reads are discarded based on several criteria. First, reads with quality scores less than Q10 are removed. Second, the reads are mapped against a contamination reference genome (E. coli for animal data, fungal and microbial non-redundant reference for plants). Any reads that map to the contamination reference are removed. Third, reads with "uncalled" characters (i.e. "N"s) are discarded. Lastly, for paired-end libraries, any reads whose mate was lost in the previous three steps are also discarded. Please see the iRAP documentation for more details on the methods used.

How is microarray data analysed?

Raw single-channel microarray intensities are normalized using RMA via the oligo package from Bioconductor (Affymetrix data) or using quantile normalization via the limma package (Agilent data). Two-channel Agilent data is normalized using LOESS via the limma package. Pairwise comparisons are performed using a moderated t-test for each gene using limma.

How is RNA-seq data analysed?

RNA-seq data is analysed using the iRAP pipeline. Quality-filtered reads are aligned to the latest version of the reference genome from Ensembl using TopHat2. Raw counts (number of mapped reads summarized and aggregated over each gene) are generated using htseq-count. Then, FPKM (fragments per kilobase of exon model per million mapped reads) and TPM (transcripts per million) are calculated. Pairwise comparisons are performed using a conditioned test based on the negative binomial distribution, using DESeq.

What are FPKM and TPM?

FPKM (fragments per kilobase of exon model per million reads mapped) and TPM (transcripts per million) are the most common units reported to estimate gene expression based on RNA-seq data. Both units are calculated from the number of reads that mapped to each particular gene sequence and both units are calculated taking into account two important factors in RNA-seq:

  1. The number of reads from a gene depends on its length. One expects more reads to be produced from longer genes.
  2. The number of reads from a gene depends on the sequencing depth that is the total number of reads you sequenced. One expects more reads to be produced from the sample that has been sequenced to a greater depth.

FPKM (introduced by Trapnell et al, 2010) are calculated with the following formula:

fpkm formula

where qi are raw counts (number of reads that mapped for each gene), li is gene length and total number mapped reads is the total number of mapped reads. The interpretation of FPKM is that if you sequence your RNA sample again, you expect to see for gene i, FPKMi reads divided by gene i length over a thousand and divided by the total number of reads mapped over a million.

Li and Dewey, 2011 introduced the unit TPM and Pachter, 2011 established the relationship between both units. It is possible to compute TPM from FPKM as follows:

tpm formula

How do I see how an individual experiment was analysed?

On any Experiment page, e.g. RNA-seq of coding RNA of eight barley tissues from different developmental stages you can see a breakdown of the analysis steps from raw data to the results you see in Expression Atlas by clicking the analysis methods button button in the top right corner of the page.

How can I contact you?

If you have any questions, problems or suggestions we would love to hear from you. Please email us at arrayexpress-atlas@ebi.ac.uk.

How can I keep up with the latest Expression Atlas news?

If you would like to stay up-to-date with news about our latest releases and developments, please subscribe to the Expression Atlas mailing list.

Can I use Expression Atlas anatomograms on my website?

Yes! The anatomical diagrams (anatomograms) that you see alongside the baseline data are available from GitHub. The anatomograms are licenced under Creative Commons Attribution (CC BY); if you'd like to use them, all we ask is that you attribute Expression Atlas when you do.

Searching


How do I find out what genes are expressed in a particular condition (e.g. in my favourite tissue, cell line, developmental stage)?

Use the Condition query search box on the home page to search for the condition you are interested in e.g. kidney. Click on Show anatomogram in the top left corner if you want to see the corresponding anatomogram for each of the species displayed. Your query is expanded using the Experimental Factor Ontology (EFO), so that this search will also returns matching synonyms and child terms of kidney in EFO. You will see both baseline expression and differential expression results in the condition (organism part in our example) you searched for.

Select one experiment from the Baseline multi-experiment page (e.g. GTEx) to see the results in that particular experiment. Use the Organism part box to find what genes are expressed in kidney in GTEx.

How do I search for multiple conditions at once?

Use the Condition query search box on the home page to search for as many conditions as you want. You will need to type each condition, click enter and when you are done, just search. For example, searching with liver_heart will find all experiments in which both liver and heart are studied as well as the ones analysing either liver or heart.

How do I find out in which conditions my favourite gene is expressed?

Use the Gene query search box on the home page to search for your favourite gene (e.g. SFTPC). Click on Show anatomogram in the top left corner if you want to see the corresponding anatomogram for each of the species displayed. You will see Baseline expression results in different conditions (e.g. organism part, cell line, cell type, developmental stage) and Differential expression results for biologically meaningful pairwise comparisons.

Select one experiment from the Baseline expression results (e.g. GTEx) to see the results in that particular experiment. Use the Gene query box (e.g. SFTPC in our example) to find in which tissues from GTEx gene SFTPC is expressed.

How do I search for multiple genes at once?

Use the Gene query search box on the home page to search for as many genes as you want. You will need to type each gene, click enter and when you are done, just search.

What gene identifiers can I use to search?

You may use the following identifiers to search using the Gene query box:

How do I find a particular experiment?

You can easily see all experiments in Expression Atlas by clicking in All experiments on the home page. You can narrow down the list of experiments by selecting Baseline or Differential in the first column at the bottom of the table. You can also select Plants or Animal and Fungi experiments (using the second column), experiments from a particular organism (fifth column) or experiments involving a particular variable (sixth column). Click on the Experiment title to see the experiment in Expression Atlas.

If you know the ArrayExpress accession of the experiment you want to see (e.g. E-MTAB-4202), you can link to the experiment in Expression Atlas using the following format: https://www.ebi.ac.uk/gxa/experiments/<ArrayExpress accession>

e.g. https://www.ebi.ac.uk/gxa/experiments/E-MTAB-4202

Are there any alternative ways of searching in Atlas?

It's best if you contact us and we will be able to advise you on your specific use case.

In the meantime, you can construct queries using URLs like the ones in the table below. Please be aware that the format of these URLs is subject to change. If your queries stop working, please check back here for the latest standard or email us at arrayexpress-atlas@ebi.ac.uk.

Query URL
In what conditions is ASPM differentially expressed? https://www.ebi.ac.uk/gxa/search?geneQuery=[{"value":"ASPM"}]#differential
What genes are differentially expressed in cancer? https://www.ebi.ac.uk/gxa/search?conditionQuery=[{"value":"cancer"}]#differential
Show me comparisons where zinc finger genes are differentially expressed in mice. https://www.ebi.ac.uk/gxa/search?geneQuery=[{"value":"zinc finger"}]&organism=Mus musculus#differential

Results


When I search in Expression Atlas, what do the Baseline Expression results show?

If you search for a particular gene (e.g. CFHR2), the Baseline Expression results will display all organisms and conditions in which CFHR2 is expressed above the default minimum expression level of 0.5 FPKM or 0.5 TPM.

By default, we display expression data for different tissues, e.g in which tissues is CFHR2 expressed? but you can also use the filters to find gene expression in other conditions, such as in which mouse cell types is CFHR2 expressed?

If you search for a particular condition (e. g. liver), in the Baseline Expression results you will see all organisms and experiments matching your search.

In Baseline expression results, how are baseline expression levels represented?

In Baseline expression results you will see one heatmap per species. Each heatmap shows, for each species, all conditions (columns) and all experiments (rows) that matched your search. Expression levels are displayed in the heatmap in five different colours:

  1. Grey box: expression level is below cutoff (0.5 FPKM or 0.5 TPM)
  2. Light blue box: expression level is low (between 0.5 to 10 FPKM or 0.5 to 10 TPM)
  3. Medium blue box: expression level is medium (between 11 to 1000 FPKM or 11 to 1000 TPM)
  4. Dark blue box: expression level is high (more than 1000 FPKM or more than 1000 TPM)
  5. White box: there is no data available

How can I see the results for each experiment in Baseline expression results?

Each Baseline expression experiment in Expression Atlas has its own Experiment page, e.g. Strand-specific RNA-seq of 13 human tissues from Michael Snyder's lab for the ENCODE project where you can see a heatmap showing the 50 most specifically expressed genes across all conditions studied.

You can further refine the query by narrowing the search to a particular gene (e.g. CTRB1), or gene sets (e.g. CELA3A, CELA3B, CTRB1, CTRB2, PRSS1, PRSS2), or by limiting which organism parts are searched over (e.g. genes specifically expressed in pancreas).

In the experiment page, how are baseline expression levels represented?

In the Experiment page, e.g. Transcription profiling by high throughput sequencing of different potato tissues (genotype RH89-039-16) expression levels are represented in one heatmap that shows gene expression levels for the 50 most specifically expressed genes (rows) across all conditions studied in the experiment selected (columns).

Expression levels are displayed in the heatmap by colour intensity, according to the gradient bar above the heatmap. Hover the mouse above a cell to show a tooltip with the numerical values corresponding to each colour.

Can I download the baseline expression results?

Yes, click on the Download All Results button button above the heatmap to download the data corresponding to your query. For example, if you select 'flower' in the Organism part box, use the Download all results button to download expression data for the subset of genes specifically expressed in flower in tab-delimited format with no ordering.

On the other hand, by clicking on the Download button button in the top left corner of the Experiment page you will download expression data for all genes and all conditions studied in the experiment.

Can I view the results in the Ensembl browser?

Yes, you can. From the heatmap of the Experiment page, e.g. Baseline expression from transcriptional profiling of zebrafish developmental stages just select a gene (e.g. HIST2H4A) and a condition, developmental stage in that particular experiment (e.g. gastrula 75%-epiboly) and click on the Ensembl Genome Browser button in the left of the heatmap. You will see gene expression value for gene HIST2H4A in developmental stage gastrula 75%-epiboly in the context of the genomic location of HIST2H4A.

For plant experiments, e.g. Transcriptomes for hybrids (F1s) between 18 Arabidopsis thaliana parents of the Multiparent Advanced Generation Inter-Cross (MAGIC) genetic mapping resource you can also use the Gramene Genome Browser button to see, for example, gene expression value for gene EPR1 in ecotype Sf-2 x Can-0 in the context of the genomic location of EPR1

For experiments performed in Caenorhabditis elegans or in Schistosoma mansoni such as RNA-Seq of Schistosoma mansoni (flatworms) larva and adult individuals at different life-stages you can see the results using the button.

How are similarly expressed genes computed?

Similarly expressed genes across conditions in baseline experiments (e.g. tissues, developmental stages) are shown where available. They are computed on per-experiment basis, for experiments with three or more conditions. The method for estimating them involves two steps. The first one involves k-means clustering (for all possible values of k, capped to 100) of the expression of each gene across (e.g.) tissues. The second step compares the clusters on a gene-to-gene basis and outputs a ranked list of genes with decreasing similarity of expression patterns for each gene. The top 50 similarly expressed genes can be explored on our interface. Lowly expressed genes are filtered out from the calculation. The method has been implemented within the Bioconductor package ClusterSeq.

When I search in Expression Atlas, what do the Differential Expression results show?

If you search for a particular gene (e.g. CFHR2), in the Differential expression results you will see all comparisons in all species in which CFHR2 is differentially expressed (absolute value of log2 fold-change > 1 and adjusted p-value < 0.05). Comparisons in which gene CFHR2 is differentially expressed are ordered so the one with the largest absolute value of log2 fold-change is at the top. If gene CFHR2 has identical log2 fold-change in several comparisons, then the one with the lower adjusted p-value goes first.

If you search for a particular condition (e. g. liver), in the Differential expression results you will see all comparisons in all species that matched your search. When several genes show differential expression you will see one row per each gene and comparison. Comparisons are ordered following the same criteria as explained before.

In both cases, you can narrow down the results displayed using the filters in the left where you can select a particular species (e.g. Mus musculus), experimental variable (e.g. compound) to see which comparisons involve the treatment of liver with a compound in mouse.

Click on the Comparison name (e.g. 'CCl4; 1.6 gram per kilogram' at '24 hour' vs 'none' at '0 hour') to find out more information about a particular gene-comparison combination. In microarray experiments, some genes may be targeted by more than one probe set. The table in the Differential expression results shows the values for the probe set with the largest absolute log2 fold-change. You can see details of all probe sets for a given gene on the Experiment page. For example, gene Car3 is represented by 4 probe sets in the experiment Transcriptional responses in liver upon acute CCl4 intoxication.

What is a "design element"?

On microarray experiment pages, you will see the design element name alongside the gene name. A design element is also known as a probe or probe set. This is the oligonucleotide probe (or group thereof) on the microarray that targets that gene.

What is a "comparison"?

A comparison is where two groups of samples are compared in a differential expression experiment. An example of a comparison is 'breast cancer' vs. 'normal'.

For each gene, the mean expression level of the test group (e.g. breast cancer) is compared with the mean expression level of the reference group (e.normal), and a statistical test is performed to decide whether the two means are significantly different.

What do the red and blue colours mean in the Differential expression results?

Let's search for the condition breast cancer in a differential context. A red box indicates that the gene is up-regulated in the test condition while a blue box means that the gene is down-regulated in the test condition. The colour intensity of filled boxes in the table represents how large the log2 fold-change is for each gene. The larger the log2 fold-change, the more intense the red or blue colour.

The two bars above the Differential expression results table show the red and blue colour intensities for the top 50 genes shown on the page. The colour intensities represent the log2 fold-changes for the genes shown.

How do I see the actual log2 fold-changes in the Differential expression results table?

Click on the Display Log2 Fold-Change button button to the top left corner of the table to see the numerical values of the log2 fold-change per each gene in each comparison.

Where do the p-values come from?

In a microarray experiment, each gene's mean expression level in the test group is compared with its mean expression level in the reference group using a moderated t-test. This is done using the limma package from Bioconductor.

In an RNA-seq experiment, each gene's mean expression level in the test group is compared with its mean expression level in the reference group using a conditioned test based on the negative binomial distribution, analogous to Fisher's exact test. This is done using the DESeq package in Bioconductor.

Because the same test is done on thousands of genes at once, p-values are adjusted for multiple testing using the Benjamini and Hochberg (1995) false discovery rate (FDR) correction. This is what is meant by adjusted p-value.

How are the gene set enrichment plots created?

Gene set enrichment analysis is performed using the Piano package from Bioconductor. For each comparison, enrichment of terms from GO, InterPro, and Reactome is tested for within the set of differentially expressed genes, using a variation on Fisher's exact test. Gene set enrichment plots are only shown when statistically significant enrichment of terms was detected. This means that for some experiments, the menu will not display plots for all three of the aforementioned resources.

Can I download the differential expression results?

Yes, click on the Download All Results button button above the heatmap to download the data corresponding to your query. For example, in Transcription profiling by array of pancreatic islets from Sgpp2 knockout mice after high fat diet if you select just one comparison, e.g. 'high fat diet' vs 'normal' in 'Sgpp2 knockout' from the Comparison box and click Apply, by clicking on the Download all results button you will download expression data for the subset of genes differentially expressed in that particular comparison in tab-delimited format with no ordering.

On the other hand, by clicking on the Download button button in the top left corner of the Experiment page you will download expression data for all genes in all comparisons studied in the experiment.

Jump to...

General
Searching
Results