
Data File Formats for ArrayExpress Submission


Overview

This diagram shows how the raw data files, normalized data files and final gene expression matrix are related in an ArrayExpress experiment submission. Normalized data can be submitted as one file per hyb, or in a final gene expression matrix.


Data files flow image

 

Supported file types by submission tool

 

Submission tool   | Data type          | File format      | Quantity supported                            | File types NOT supported
MIAMExpress       | Raw                | CEL, gpr or .txt | 1 per hybridization (REQUIRED)                | Illumina, Nimblegen or ImaGene with 2 raw files per channel
MIAMExpress       | Normalized         | CHP, gpr or .txt | 1 per hybridization                           |
MIAMExpress       | Combined data file | .txt             | 1 per experiment                              |
Tab2MAGE/MAGE-TAB | Raw                | CEL, gpr or .txt | 1 or more per hybridization (1 file REQUIRED) | High-throughput sequence data (Solexa and 454) can only be submitted through MAGE-TAB
Tab2MAGE/MAGE-TAB | Normalized         | CHP, gpr or .txt | 1 or more per hybridization                   |
Tab2MAGE/MAGE-TAB | Combined data file | .txt             | 1 or more per experiment                      |

 

Raw data is required for all submissions. The normalized files and combined data file (final gene expression matrix) are optional but you should provide at least one of these file types for your submission to be fully MIAME compliant.

 

Raw data files

DO NOT EDIT YOUR RAW DATA FILES.

For Affymetrix, the raw data is the CEL file. For other platforms, the raw data is the file that contains the signal intensities, background intensities, etc., for every spot on the array, e.g. a GenePix .gpr file or an Agilent Feature Extraction software .txt file.

Our submission tools support raw data from the software listed below:

  • Affymetrix (CEL file)
  • GenePix
  • Agilent
  • ScanAlyze
  • ScanArray
  • Arrayvision
  • Spotfinder
  • BlueFuse
  • UCSF Spot
  • Applied Biosystems
  • CodeLink
  • ImaGene (2 raw files per hybridization supported by Tab2MAGE/MAGE-TAB only)
  • NimbleScan (2 raw files per hybridization supported by Tab2MAGE/MAGE-TAB only)
  • Illumina (supported by Tab2MAGE/MAGE-TAB only) - for Illumina we require data for every probe on the array. You do not need to provide data for every spot. For more information on supported Illumina data files see the Tab2MAGE documentation.

If your data is not from one of these programs then we will probably be able to convert it to a format that is supported. Please note that we are only able to handle files that can be read by a standard text editor (with the exception of binary CEL files). If you are not sure what files to provide, you can email us, either attaching an example file to the email or sending the example file to us by FTP.

 

Normalized data files

Applying a normalization algorithm to a raw data file, for example print tip normalization, produces a normalized file. Submitted normalized data files must contain data from a single hybridization only. If your normalization procedure creates a file containing data from all your hybs then you can submit this as a final gene expression matrix (FGEM).

A normalization protocol should be submitted along with your normalized data files. Please make sure your normalization protocol contains enough information to allow users to understand what the data in your normalized files means. Be precise when describing how the data was calculated: for example, 'log ratio' is not enough information for MIAME compliance, because we need to know what kind of log it is (log2, log10, natural log, etc.).
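
To illustrate why the base of the log has to be stated, the short Python sketch below computes the log of the same two-fold ratio in three bases. The intensity values and variable names are made up for illustration and do not come from any real data file.

```python
import math

# Hypothetical intensities for one spot (illustration only).
cy5 = 2000.0
cy3 = 1000.0
ratio = cy5 / cy3  # a two-fold change

# The same ratio expressed in three different log bases.
log_ratios = {
    "log2": math.log2(ratio),
    "log10": math.log10(ratio),
    "ln": math.log(ratio),
}

for base, value in log_ratios.items():
    print(f"{base}: {value:.3f}")
```

The three values differ (1.000, 0.301 and 0.693), so a column labelled only 'log ratio' cannot be interpreted without the protocol stating the base.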

Affymetrix per-hyb normalized data

For Affymetrix submissions you can submit the CHP file, or a text file from other software, as per-hyb normalized data. Each line in the text file must correspond to an Affymetrix probe set, and the probe set ID (which we call a CompositeSequence Identifier) must be provided in the first column of the file, e.g.

-Example of a single hyb GC-RMA normalized data file
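
As a rough pre-submission check of the first-column rule, a sketch like the following could be used. It assumes a tab-delimited text file with a single header line. The column headings and signal values in the sample are hypothetical, and flagging duplicate identifiers is our own assumption about what is likely to be a mistake, not a documented ArrayExpress rule.

```python
import csv
import io

def read_probe_set_ids(text, has_header=True):
    """Return the first-column values of a tab-delimited file, in order."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if has_header:
        rows = rows[1:]  # assumption: exactly one header line
    # An empty line yields an empty row; report it as a blank identifier.
    return [row[0].strip() if row else "" for row in rows]

def check_probe_set_ids(ids):
    """Count blank identifiers and list identifiers that occur more than once."""
    blanks = sum(1 for i in ids if not i)
    seen, duplicates = set(), []
    for i in ids:
        if i and i in seen and i not in duplicates:
            duplicates.append(i)
        seen.add(i)
    return blanks, duplicates

# Hypothetical file content (headings and values are made up).
sample = (
    "Probe Set ID\tGC-RMA signal\n"
    "1007_s_at\t8.21\n"
    "1053_at\t6.47\n"
    "1053_at\t6.50\n"
)
ids = read_probe_set_ids(sample)
blanks, duplicates = check_probe_set_ids(ids)
print(blanks, duplicates)
```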

Other per-hyb normalized data

The per-hyb normalized data can contain either:

 

  • Feature coordinates for reporters on the array. They must be provided in gpr format, or in the MetaColumn, MetaRow, Column, Row format which is used in the array design file (ADF). There is more information on MetaColumn, MetaRow, Column, Row coordinates in the array submission help.
  • Reporter or CompositeSequence identifiers that are on the array, e.g. for data that has been averaged over duplicate spots. In these files the first column must contain the identifier of the Reporter or CompositeSequence that the data corresponds to. These identifiers are used to link the data to the array annotation provided in the array design file (ADF). There is more information about Reporters and CompositeSequences in the array submission help.

-Example of a lowess normalized data file with MetaColumn, MetaRow, Column, Row coordinates
-Example of a median normalized data file with GenePix gpr coordinates

-Example of a median normalized data file with Reporter Identifiers
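
The three layouts above can be told apart from the header line. The sketch below is one possible way to do that. It assumes the coordinate columns come first and are headed exactly MetaColumn, MetaRow, Column, Row, or Block, Column, Row for GenePix-style files. The function name and return labels are hypothetical.

```python
ADF_COORDS = ("MetaColumn", "MetaRow", "Column", "Row")
GPR_COORDS = ("Block", "Column", "Row")  # assumed GenePix-style headings

def classify_header(header_line):
    """Guess which layout a per-hyb normalized file uses from its header line."""
    columns = [c.strip() for c in header_line.rstrip("\r\n").split("\t")]
    if tuple(columns[:4]) == ADF_COORDS:
        return "adf_coordinates"
    if tuple(columns[:3]) == GPR_COORDS:
        return "gpr_coordinates"
    # Otherwise assume the first column holds Reporter or
    # CompositeSequence identifiers.
    return "identifier"

print(classify_header("MetaColumn\tMetaRow\tColumn\tRow\tlowess log2 ratio\n"))
print(classify_header("Block\tColumn\tRow\tmedian normalized ratio\n"))
print(classify_header("Reporter Identifier\tmedian normalized ratio\n"))
```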

 

 

Final gene expression matrix (FGEM)

A final gene expression matrix (FGEM) or combined data file is a file containing data from several hybridizations. It can be created by any data processing or spreadsheet software but must be saved as a tab delimited .txt file. MIAMExpress allows you to upload only 1 FGEM per experiment. Tab2MAGE and MAGE-TAB allow you to upload multiple FGEMs per experiment.

The creation of your FGEM must be described in your transformation protocol. Be precise when describing how the data was calculated: for example, 'log ratio' is not enough information for MIAME compliance, because we need to know what kind of log it is (log2, log10, natural log, etc.).

The format of the FGEM is as follows:

 

  • each line corresponds to an array element, either a Reporter, CompositeSequence or Affymetrix probe set. The first column of the file must contain the identifier of the array element as used in the array design file (ADF).

 

  • each column corresponds to your calculated value (called a QuantitationType in ArrayExpress) e.g. Average log ratio, for a single hybridization or group of hybridizations. The name of each hybridization must be listed inside brackets after the name of the calculated value. Hybridization names must match the names given to your hybs in MIAMExpress or Tab2MAGE so that we can link this data back to the relevant sample descriptions.  MAGE-TAB has a slightly different format for final gene expression matrices which is described on this page - MAGE-TAB data matrix.
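
The column heading rule described above can be checked before upload with a sketch such as the one below. It assumes round brackets and a tab-delimited header line. Because this page does not say how a group of hybridizations is written inside the brackets, the bracketed text is treated as one label. The function names and the hyb names in the sample are hypothetical.

```python
import re

# "<calculated value>(<hybridization name>)", with optional surrounding spaces.
HEADER_RE = re.compile(r"^\s*(?P<qt>[^()]+?)\s*\((?P<hybs>[^()]+)\)\s*$")

def parse_fgem_header(header_line):
    """Split an FGEM header into the identifier column and (value, hyb) pairs."""
    columns = header_line.rstrip("\r\n").split("\t")
    id_column, data_columns = columns[0], columns[1:]
    parsed = []
    for col in data_columns:
        match = HEADER_RE.match(col)
        if match is None:
            raise ValueError(f"No bracketed hybridization name in heading: {col!r}")
        parsed.append((match.group("qt"), match.group("hybs").strip()))
    return id_column, parsed

def unknown_hybs(parsed, known_names):
    """Return bracketed labels that do not match any known hybridization name."""
    return [hyb for _, hyb in parsed if hyb not in known_names]

header = "Reporter Identifier\tAverage log ratio(Hyb1)\tAverage log ratio (Hyb3)\n"
id_column, parsed = parse_fgem_header(header)
missing = unknown_hybs(parsed, {"Hyb1", "Hyb2"})
print(id_column, parsed, missing)
```

Any label reported by unknown_hybs would need to be corrected so that it matches the hyb names entered in the submission tool.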

 

Explanation of columns in an FGEM:


Example FGEM explanation

Examples for download (files truncated for faster download):
-Example of FGEM containing RMA normalized Affy data
-Example of FGEM containing dChip normalized Affy data (each column contains average data from 2 replicate hybs)
-Example of FGEM containing mean log2 ratio, standard deviation and standard error for 2 sets of 3 replicates

 

 

Sending files by FTP

Our email account cannot receive large attachments, so if you need to send several files to us prior to submission (e.g. for us to check they are in a suitable format) you can put them on our FTP site. After putting them on the FTP site, email us with the names of the files you transferred.

Data files placed on the FTP site are NOT submitted to ArrayExpress. To submit files to ArrayExpress you must upload them using either the MIAMExpress or Tab2MAGE submission tools.

To transfer files using Windows, open Windows Explorer and enter ftp://aexpress@ftp1.ebi.ac.uk/. You will be asked to log in. The username and password are both aexpress. After logging in you can drag and drop your files across. Note: you will not be able to see any files or directories already on the FTP site.

FTP example

To transfer files using Unix, a Mac terminal window or the Windows command prompt, connect to the FTP server using the command ftp ftp1.ebi.ac.uk. The username and password are both aexpress. Use the put command to place your file (or mput for multiple files) into the default directory. Please ensure that you use unique file names. To exit FTP, type quit. On exiting, a message is printed to the screen to tell you whether your transfer was successful. Note: you will not be able to list the files in the directory or download files from the FTP site to your directory.
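
The same transfer can also be scripted. The following is a minimal sketch using Python's standard ftplib module, with the host, username and password given above. The function name upload_files and its ftp_factory parameter are our own illustrative choices, and the sketch assumes the server accepts binary uploads into the default directory.

```python
import ftplib
import os

HOST = "ftp1.ebi.ac.uk"
USER = "aexpress"
PASSWORD = "aexpress"

def upload_files(paths, ftp_factory=ftplib.FTP):
    """Upload each file to the default directory and return the names sent."""
    uploaded = []
    ftp = ftp_factory(HOST)  # ftplib.FTP(host) connects immediately
    try:
        ftp.login(USER, PASSWORD)
        for path in paths:
            name = os.path.basename(path)  # use a unique file name
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {name}", handle)
            uploaded.append(name)
    finally:
        ftp.quit()
    return uploaded
```

For example, upload_files(["mylab_hyb1.gpr"]) would send one file (the file name here is hypothetical). As with the manual transfer, the script cannot list the directory afterwards, so email us the file names once the upload has finished.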


If you have any further questions, please see our FAQ.
