How to get data from ArrayExpress

Downloading data

For each experiment in ArrayExpress, MAGE-TAB files describing the experiment and the associated results are available for download. MAGE-TAB is a simple tab-delimited format for sharing functional genomics data. In our example we want to download all data files for the experiment E-MEXP-3431 (Figure 9) as described in the steps below.

Figure 9 Searching ArrayExpress

 

Steps


  1. Open the ArrayExpress homepage in a new window
  2. Click on 'Experiments' [A]
  3. In the experiment search box [B] type "chronic myelogenous leukemia" and select the matching term from the suggestions. Click 'Search'.
  4. From the filter drop-down menus [C] select Homo sapiens. Click 'Filter'.
  5. Click on the accession E-MEXP-3431 [D].

 

 

 


File types in experiment directories

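There are four types of MAGE-TAB file that are used to capture information about a functional genomics experiment: the Investigation Description Format (IDF) file, the Sample and Data Relationship Format (SDRF) file, the array design file (ADF) and the data files (raw and processed).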

Each file describes a specific aspect of the selected experiment and is needed to understand what the experiment studied, how it was carried out, the results obtained and how they can be interpreted. Let's consider the file types associated with the experiment E-MEXP-3431 (Figure 10).

Figure 10 Files available

 

[A] Investigation description. The file with the '.idf.txt' extension contains generic information about the experiment including title, description, submitter contact details and protocols.

Sample and Data Relationship. The file with the '.sdrf.txt' extension consists of a table listing all the samples analysed in the experiment, their biological characteristics and the relationship between samples and data files.

[B] Data archives. These archives contain the results obtained from the experiment as raw and processed data. Both raw and processed data files are stored as .zip archives. Large datasets are split into several archives that are numbered sequentially. Archives can be downloaded just by clicking on the archive name.

[C] Array design file. This file describes how the array was manufactured and what was printed/synthesised at each position on the array. It provides the array-level annotation for the experiment, relating the row-level identifiers in the data files to biological sequence annotation.

 

Information

Be aware that the MAGE-TAB files associated with a particular experiment can also be downloaded directly from the ArrayExpress FTP site or programmatically.

Investigation Description Format file

To start, let's examine the Investigation Description Format (IDF) file, which gives an overview of the experiment, including the experimental variables (factors), quality control strategy, contact details, publication information and protocols. You'll notice that a lot of the information displayed in the interface for E-MEXP-3431, such as the description and contact details, is also stored in this file (Figure 11).

Figure 11 E-MEXP-3431 IDF

 

Notes

 

[A] General information about the experiment including title, a brief free-text description and experiment design information (cell_cycle_design etc.)

[B] The Experimental Factor Name is the name given to each experimental factor. These represent the principal variables under investigation e.g. genotype.

[C] Submitter contact details including email address of the submitter.

[D] Protocol information describing the experiment sample growth/treatment and processing steps.

[E] Included in the IDF file is an (optional) list of sources from which controlled vocabulary terms may be taken. These terms can appear elsewhere in the IDF or Sample and Data Relationship Format (SDRF) file. The Term Source Name field points to the source of the terms used (e.g. ArrayExpress). This name will appear in the corresponding Term Source REF fields of the SDRF. The Term Source File contains a filename or valid URL at which the Term Source may be accessed.

[F] The IDF and SDRF are separate text files, so the IDF contains a pointer to the associated SDRF file.
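Because the IDF is a tab-delimited file in which each row starts with a field name followed by its values, it can be read with a few lines of code. The following is a minimal sketch, not an official ArrayExpress tool: the function name `parse_idf` and the sample content (including the placeholder file name) are made up for illustration, and the row tag "SDRF File" is assumed to be the field that holds the pointer described in [F].

```python
import csv
import io


def parse_idf(text):
    """Return a dict mapping each IDF row tag to its list of values."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank rows and rows without a tag in the first column.
        if not row or not row[0].strip():
            continue
        values = [cell.strip() for cell in row[1:]]
        # Remove only trailing empty cells so that values stay aligned
        # across parallel rows (e.g. several contacts or protocols).
        while values and not values[-1]:
            values.pop()
        fields[row[0].strip()] = values
    return fields


# Hypothetical, heavily shortened IDF content for illustration only.
SAMPLE_IDF = (
    "Investigation Title\tExample experiment\n"
    "Experimental Factor Name\tgenotype\tcompound\n"
    "Term Source Name\tArrayExpress\n"
    "SDRF File\tE-XXXX-n.sdrf.txt\n"
)

idf = parse_idf(SAMPLE_IDF)
print(idf["Experimental Factor Name"])  # the experimental factors
print(idf["SDRF File"][0])              # pointer to the SDRF file
```

Keeping every value as a list, even for single-valued fields such as the title, avoids special cases when a field such as Experimental Factor Name holds several entries.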


 

Sample and Data Relationship Format file

The Sample and Data Relationship Format (SDRF) file describes the sample characteristics and the relationship between samples, arrays, data files etc. The information in the SDRF is organised so that it follows the natural flow of a functional genomics experiment. It begins by describing the experiment samples and finishes with the names of the data files generated from the analysis of the experiment results. For single-channel data, such as Affymetrix experiments, one row in the SDRF is equal to one hybridisation. For two-channel data one row is equal to one channel. More complex situations can also occur, such as pooling of samples to create a common reference, technical replicates in which an extract is hybridized more than once, or an extract that is split and labeled with more than one dye.

Let's take a closer look at the various parts of the SDRF for E-MEXP-3431, which is a single-channel Affymetrix experiment (Figure 12).

Figure 12 E-MEXP-3431 SDRF

 

Notes

[A] The first part of the SDRF describes the characteristics of the samples.

[B] The next section of the SDRF shows the processing steps applied to create the extracts and labeled extracts and which labeled extracts were used in each hybridization.

[C] The next section of the SDRF lists which array design was used for each hybridization and which data files go with each hybridization. The raw data file names are listed under the Array Data File column. The FTP location for the archive containing these files is also included in the Comment [ArrayExpress FTP file] column. The processed data files (e.g. CHP files), which have been derived from the raw data, are listed in the Derived Array Data File column. This column is used when the processed data file corresponds to a single hybridisation.

[D] The final section of the SDRF lists the experimental factors (variables) associated with each hybridization. The factor values may be sample characteristics (such as genotype) or may be an external treatment (such as growth in low oxygen conditions). It also lists another type of processed data file in the Derived Array Data Matrix File column. This column is used when the data is processed and the file produced contains data from several hybridisations.
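The link between hybridizations and data files described in the notes above can be extracted from the SDRF table directly. The sketch below is illustrative only: the function name and the sample rows are hypothetical, and it assumes the file has a single header row with a "Hybridization Name" column and an "Array Data File" column.

```python
import csv
import io


def sdrf_files_by_hybridization(text,
                                name_column="Hybridization Name",
                                file_column="Array Data File"):
    """Map each hybridization name to the data files listed for it."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # SDRF column headers can repeat (e.g. several "Protocol REF"
    # columns), so columns are looked up by position instead of
    # loading each row into a dict keyed by header.
    name_idx = header.index(name_column)
    file_idx = header.index(file_column)
    mapping = {}
    for row in reader:
        if len(row) <= max(name_idx, file_idx):
            continue  # skip blank or truncated rows
        mapping.setdefault(row[name_idx], []).append(row[file_idx])
    return mapping


# Hypothetical two-sample SDRF for illustration only.
SAMPLE_SDRF = (
    "Source Name\tProtocol REF\tProtocol REF\tHybridization Name\t"
    "Array Data File\tFactor Value[genotype]\n"
    "sample1\tP-1\tP-2\thyb1\tsample1.CEL\twild type\n"
    "sample2\tP-1\tP-2\thyb2\tsample2.CEL\tmutant\n"
)

for hyb, files in sdrf_files_by_hybridization(SAMPLE_SDRF).items():
    print(hyb, files)
```

Passing `file_column="Derived Array Data File"` would list the per-hybridization processed files in the same way, provided that column is present in the file.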
 

 

Array Design File

For microarray experiments a link to the array will be provided, from which you can download the complete array design file (ADF) (Figure 13). This file describes how an array was manufactured and what was printed/synthesised at each position on the array.

Figure 13 Array design file

 

Notes

[A] The array Affymetrix GeneChip Human Genome HG-U133A [HG-U133A] was used in the experiment E-MEXP-3431. A-AFFY-33 is the ArrayExpress accession for this array design.

[B] The array link takes you to the platform page from which you can access the ADF (.adf.txt).

[C] The header section of the ADF contains the array name, description etc.

[D] The ADF contains identifiers for the sequences spotted on the array. These identifier values are used to link array annotation to measurement values in data files (Figure 14). For each identifier, database entries or actual sequences describing the sequences on the array are provided [E].

 

 

 

 

Data Files

There are two types of data files associated with experiments: raw and processed. These are found in the E-XXXX-n.raw.1.zip and E-XXXX-n.processed.1.zip archives respectively. Large datasets are split into several raw and processed file archives, which are numbered sequentially, e.g. E-MEXP-3431.processed.1.zip, E-MEXP-3431.processed.2.zip, E-MEXP-3431.processed.3.zip, E-MEXP-3431.processed.4.zip. The names of the files in these archives correspond to the names of the files listed in the SDRF.
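The archive naming pattern described above is regular enough to generate programmatically. The helper below is a hypothetical sketch: it only builds the expected file names for a given accession and does not check whether the archives actually exist, and the number of archives must be supplied by the caller.

```python
def archive_names(accession, kind, count):
    """List the expected archive names for one experiment.

    kind is "raw" or "processed"; count is the number of archives
    the dataset was split into (not known in advance, so it is an
    input here).
    """
    if kind not in ("raw", "processed"):
        raise ValueError("kind must be 'raw' or 'processed'")
    return [f"{accession}.{kind}.{i}.zip" for i in range(1, count + 1)]


for name in archive_names("E-MEXP-3431", "processed", 4):
    print(name)
```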

For some experiments the processed data file will be in the format of a MAGE-TAB data matrix. This file contains data from more than one hybridization, scan or normalization, in a single data file. This format allows data columns to be mapped to rows in the SDRF file (Figure 14).

Figure 14 MAGE-TAB Data Matrix and its links to the SDRF and ADF

 

Notes

[A] A MAGE-TAB data matrix has two header rows. The first header row, Hybridization REF, contains references to the hybridisations. These names must match the names supplied in the Hybridization Name or Assay Name column of the SDRF [B]. The second header row, Reporter REF or CompositeElement REF, lists the quantitation types for each column (e.g. RMA Normalized, log2 ratio).

[C] The first column contains identifiers from the array design used [D].
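The two header rows and the identifier column described in the notes above can be read as follows. This is a minimal sketch under stated assumptions: the function names and the sample matrix are hypothetical, every measurement cell is assumed to be numeric, and the list of SDRF hybridization names is assumed to have been extracted separately.

```python
import csv
import io


def parse_data_matrix(text):
    """Split a MAGE-TAB data matrix into column labels and values.

    Returns (columns, values): columns is a list of
    (hybridization, quantitation type) pairs, and values maps each
    array identifier to a dict keyed by those pairs.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    hyb_row, quant_row, data_rows = rows[0], rows[1], rows[2:]
    # The first cell of each header row is its label, e.g.
    # "Hybridization REF" and "Reporter REF", so it is skipped.
    columns = list(zip(hyb_row[1:], quant_row[1:]))
    values = {}
    for row in data_rows:
        # Assumes numeric cells; real files may need missing-value handling.
        values[row[0]] = {col: float(cell)
                          for col, cell in zip(columns, row[1:])}
    return columns, values


def unmatched_hybridizations(columns, sdrf_names):
    """Return matrix hybridization names that are absent from the SDRF."""
    return sorted({hyb for hyb, _ in columns} - set(sdrf_names))


# Hypothetical data matrix for illustration only.
SAMPLE_MATRIX = (
    "Hybridization REF\thyb1\thyb2\n"
    "Reporter REF\tRMA Normalized\tRMA Normalized\n"
    "probe_1\t7.5\t8.25\n"
    "probe_2\t3.0\t2.5\n"
)

columns, values = parse_data_matrix(SAMPLE_MATRIX)
print(values["probe_1"][("hyb1", "RMA Normalized")])
print(unmatched_hybridizations(columns, ["hyb1", "hyb2"]))
```

Keying each value by the (hybridization, quantitation type) pair keeps columns distinct when one hybridization contributes several quantitation types to the same matrix.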