What format should my raw and processed data be in?

Data files are uploaded in zipped archives. Raw files should be provided unedited in their native format, that is, the file format produced by the application that generated them, such as the software used by a scanner (e.g. gpr files from GenePix) or a sequencing instrument (e.g. BAM format). Ideally, you should be able to use the raw, unprocessed data files from any of the supported software types without editing them in any way. Our checking scripts use the column headings within each file to identify which kind of file they are dealing with and what the quantitation types are. Processed/normalised data can also be supplied in native format.
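
As an illustration of packing unedited raw files into a zipped archive, here is a minimal Python sketch. The function name `zip_data_files` and the file names in the usage note are hypothetical; this is not part of any ArrayExpress tool. The files are copied into the archive byte for byte, so their native format is preserved.

```python
import zipfile
from pathlib import Path


def zip_data_files(paths, archive_path):
    """Pack the given files, unmodified, into a single zip archive.

    Hypothetical helper: each file is stored under its base name only,
    so the archive has no nested directories.
    """
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            path = Path(path)
            # The file content is written as-is; nothing is edited or converted.
            archive.write(path, arcname=path.name)
    return archive_path
```

For example, `zip_data_files(["slide1.gpr", "slide2.gpr"], "raw_data.zip")` would produce one archive containing both files, assuming they exist in the working directory.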

The following formats are supported:

I want to submit a MAGE-TAB file

Raw (unprocessed) data files

The following list gives a brief overview of how we recognize different file formats. In each case, the data file row containing the column headings is identified by matching it to these sets of known column headings.

Generic

MetaColumn/MetaRow format files are recognized using the following column headings:

MetaColumn MetaRow Column Row

Affymetrix

Our checking scripts recognize and parse CEL and EXP files using both the old GDAC formats and the newer GCOS/XDA formats. These file formats are detected using the Affymetrix data file parser incorporated into the Tab2MAGE package. See below for notes on Affymetrix normalized data file formats.

GenePix

GenePix format files are recognized using the following column headings:

Block Column Row X Y

Agilent

A file containing these headings is recognized as an Agilent format file:

Row Col PositionX PositionY

ScanAlyze

The following column headings are recognized as being from a ScanAlyze format file:

GRID COL ROW LEFT TOP RIGHT BOT

ScanArray/QuantArray

ScanArray Express files are recognized from the following headings:

Array Column Array Row Spot Column Spot Row X Y

while the older QuantArray format has these headings:

Array Column Array Row Column Row

ArrayVision

The following column headings are recognized as indicating an ArrayVision format file:

Primary Secondary

Newer "lg2" ArrayVision files are identified by the following column headings:

Spot labels

Spotfinder

Spotfinder files are recognized by the following column headings:

MC MR SC SR

BlueFuse

A file containing the following headings is recognized as a BlueFuse file:

COL ROW SUBGRIDCOL SUBGRIDROW

UCSF Spot

UCSF Spot files are recognized by the following column headings:

Arr-colx Arr-coly Spot-colx Spot-coly

NimbleScan

NimbleScan files (Feature, Probe and Pair) all contain the following headings:

PROBE_ID X Y

Applied Biosystems

Files generated by Applied Biosystems software have the following headings:

Probe_ID Gene_ID

CodeLink

CodeLink Expression Analysis files are identified using the following:

Logical_row Logical_col Center_X Center_Y

ImaGene

ImaGene files are recognized using the following columns:

Meta Column Meta Row Column Row Field Gene ID

The ImaGene 3.0 format is also supported:

Meta_col Meta_row Sub_col Sub_row Name Selected

CSIRO Spot

CSIRO Spot files contain the following columns:

grid_c grid_r spot_c spot_r indexs
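
The heading-based recognition described in this section can be sketched roughly as follows. This is an illustrative Python sketch, not the actual checking script: the function name `identify_format` is hypothetical, the files are assumed to be tab-delimited, only a subset of the formats is included, and the way each heading row is split into individual column names is an assumption based on the lists above.

```python
import csv
import io

# Heading sets transcribed from the lists above (subset only).
# How each list splits into separate column names is an assumption.
SIGNATURES = {
    "Generic": ["MetaColumn", "MetaRow", "Column", "Row"],
    "GenePix": ["Block", "Column", "Row", "X", "Y"],
    "Agilent": ["Row", "Col", "PositionX", "PositionY"],
    "ScanAlyze": ["GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"],
    "ScanArray Express": ["Array Column", "Array Row", "Spot Column", "Spot Row", "X", "Y"],
    "QuantArray": ["Array Column", "Array Row", "Column", "Row"],
    "Spotfinder": ["MC", "MR", "SC", "SR"],
    "BlueFuse": ["COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"],
    "NimbleScan": ["PROBE_ID", "X", "Y"],
}


def identify_format(text):
    """Return (format_name, header_row_index), or (None, None) if nothing matches.

    Each row is compared with every known heading set; the first row that
    contains all headings of at least one set is taken as the header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for index, fields in enumerate(reader):
        present = {field.strip() for field in fields}
        matches = [name for name, cols in SIGNATURES.items() if set(cols) <= present]
        if matches:
            # If several sets match, prefer the one with the most headings.
            best = max(matches, key=lambda name: len(SIGNATURES[name]))
            return best, index
    return None, None
```

Matching on sets rather than on exact rows lets the header row carry any number of additional quantitation type columns, and lets it appear after a block of metadata lines at the top of the file.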


Normalized data files

Normalized data files may be submitted in any of the above formats. In addition, files can be parsed if they use one of the special column headings below to designate the column containing reporter or composite sequence identifiers:

Generic

If you have normalized data mapped to the identifiers used in your array design, you can simply use a single column containing those identifiers. MAGE-TAB supports the use of either Reporter Identifiers or CompositeSequence Identifiers for this purpose. Please see these ADF help notes for a discussion on these identifier types. Thus, either of the following sets of column headers may be used:

Reporter Identifier <QT1> <QT2> <QT3>

CompositeSequence Identifier <QT1> <QT2> <QT3>

where <QT1>, <QT2> etc. are the names of your quantitation types.

Affymetrix normalized data

MAGE-TAB recognizes and parses CHP files using both the old GDAC formats and the newer GCOS/XDA formats. In addition, Affymetrix data normalized by non-Affymetrix methods (e.g. RMA normalization) can be parsed. For such files you may use CompositeSequence identifiers (see the example above) or either of the following sets of column headers:

ProbeSet ID <QT1> <QT2> <QT3>

ProbeSet Name <QT1> <QT2> <QT3>

Again, <QT1>, <QT2> etc. are the names of your quantitation types.
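
To make the layout of these normalized files concrete, here is a hedged Python sketch that reads a file whose first column carries one of the identifier headings listed above. The function name `read_normalized` is hypothetical, and the sketch assumes a tab-delimited file with the header on the first line and the identifier in the first column; real files may differ.

```python
import csv
import io

# Identifier headings taken from the sets of column headers listed above.
IDENTIFIER_HEADINGS = (
    "Reporter Identifier",
    "CompositeSequence Identifier",
    "ProbeSet ID",
    "ProbeSet Name",
)


def read_normalized(text):
    """Return (identifier_heading, quantitation_types, rows).

    rows maps each identifier to a dict of {quantitation type: value}.
    Values are kept as strings; no numeric conversion is attempted.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [field.strip() for field in next(reader)]
    if header[0] not in IDENTIFIER_HEADINGS:
        raise ValueError(f"Unrecognized identifier column heading: {header[0]!r}")
    quantitation_types = header[1:]
    rows = {}
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        rows[fields[0]] = dict(zip(quantitation_types, fields[1:]))
    return header[0], quantitation_types, rows
```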

Please note: If you have normalised data file(s) that are not in one of the formats described above you can still submit this data. Ensure you include a ‘normalization data transformation protocol’ in your IDF describing how these files were created and what the various columns in the file(s) represent. Please, if possible, submit the file(s) as a .txt file as we cannot process certain file types such as Excel (.xls or .xlsx) format.

Data Matrix

If you wish to represent data from more than one assay, scan or normalization in a single data file, you will need to reformat it as a MAGE-TAB Data Matrix. This is a simplified format which allows data columns to be mapped to rows in the SDRF file. The first header line of a Data Matrix file describes this mapping, and the second lists the quantitation types for each column (e.g. "log2 ratio"). The first column is used to map the data rows to identifiers from the array design used. Examples are shown here:

Example non-Affymetrix data matrix

[Figure: non-Affymetrix data matrix]

In this example, five hybridizations from the 'Assay Name' column of the SDRF file [A] are being mapped to log2 ratio values. Each row of data is mapped to a Reporter Identifier [B] defined by the array design (itself indicated in the Array Design REF column defined in the SDRF).

Example Affymetrix data matrix

[Figure: Affymetrix data matrix]

In this example, two hybridizations from the 'Assay Name' column of the SDRF file [A] are being mapped to data with two different quantitation types (CELIntensity, CELStdev). Each row of data is mapped to a CompositeElement Identifier [B] defined by the array design.

There are some limitations imposed by ArrayExpress when submitting data in this format. Firstly, each data matrix should correspond to assays performed on a single array design. Experiments using multiple array designs should use one data matrix per design. Secondly, we rely on there being an ordered and regular organisation of the columns: first by assay, and then by quantitation type:

Correct

Hybridization REF    Hyb1    Hyb1    Hyb2    Hyb2
Reporter REF         QT X    QT Y    QT X    QT Y

Wrong

Hybridization REF    Hyb1    Hyb2    Hyb1    Hyb2
Reporter REF         QT X    QT X    QT Y    QT Y
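
The column ordering rule can be checked before submission with a short script. The following Python sketch is one possible way to do it, not the validation ArrayExpress itself runs; the function name `check_column_order` is hypothetical, and it takes the two header lines already split into fields, with the first field of each being the row label.

```python
def check_column_order(assay_row, qt_row):
    """Return True if columns are grouped first by assay, then by quantitation type.

    assay_row: first header line, e.g. ["Hybridization REF", "Hyb1", "Hyb1", ...]
    qt_row:    second header line, e.g. ["Reporter REF", "QT X", "QT Y", ...]
    """
    assays = assay_row[1:]
    qts = qt_row[1:]
    if len(assays) != len(qts):
        return False

    # Collect runs of adjacent columns that belong to the same assay.
    blocks = []
    for assay, qt in zip(assays, qts):
        if blocks and blocks[-1][0] == assay:
            blocks[-1][1].append(qt)
        else:
            blocks.append((assay, [qt]))

    # An assay appearing in more than one run means its columns are split up.
    names = [name for name, _ in blocks]
    if len(set(names)) != len(names):
        return False

    # Every assay must list the same quantitation types in the same order.
    first = blocks[0][1] if blocks else []
    return all(block_qts == first for _, block_qts in blocks)
```

Applied to the two layouts above, the "Correct" header lines pass this check and the "Wrong" ones fail, because in the latter the columns for Hyb1 are not adjacent.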


Illumina

Several different types of files can be generated from Illumina arrays (see www.illumina.com). MAGE-TAB can accept Illumina iDAT files and those produced by Illumina's GenomeStudio.

Illumina files may contain data from a single assay or from multiple assays. They can be referenced in the "Array Data File" or "Derived Array Data Matrix File" columns of the MAGE-TAB spreadsheet. Include the file name in every row corresponding to an assay that the file covers.
