What format should my raw and processed data be in?
Data files are uploaded in zipped archives. Raw files should be provided unedited in their native format. Native format is the file format produced by an application, such as the software used by a scanner (e.g. gpr files from GenePix) or a sequencing instrument (e.g. BAM format). Ideally, you should be able to use the raw unprocessed data files from any of the supported software types without having to edit the data files in any way. Our checking scripts use the column headings within the file to identify which kind of file they are dealing with, and what the quantitation types are. Processed/normalised data can also be supplied in native format.
The following formats are supported:
I want to submit a MAGE-TAB file
Raw (unprocessed) data files
The following list gives a brief overview of how we recognize different file formats. In each case, the data file row containing the column headings is identified by matching it to these sets of known column headings.
Generic
MetaColumn/MetaRow format files are recognized using the following column headings:
| MetaColumn | MetaRow | Column | Row |
Affymetrix
Our checking scripts recognize and parse CEL and EXP files using both the old GDAC formats and the newer GCOS/XDA formats. These file formats are detected using the Affymetrix data file parser incorporated into the Tab2MAGE package. See below for notes on Affymetrix normalized data file formats.
GenePix
GenePix format files are recognized using the following column headings:
| Block | Column | Row | X | Y |
Agilent
A file containing these headings is recognized as an Agilent format file:
| Row | Col | PositionX | PositionY |
ScanAlyze
The following column headings are recognized as being from a ScanAlyze format file:
| GRID | COL | ROW | LEFT | TOP | RIGHT | BOT |
ScanArray/QuantArray
ScanArray Express files are recognized from the following headings:
| Array Column | Array Row | Spot Column | Spot Row | X | Y |
while the older QuantArray format has these headings:
| Array Column | Array Row | Column | Row |
ArrayVision
The following column headings are recognized as indicating an ArrayVision format file:
| Primary | Secondary |
Newer "lg2" ArrayVision files are identified by the following column headings:
| Spot labels |
Spotfinder
Spotfinder files are recognized by the following column headings:
| MC | MR | SC | SR |
BlueFuse
A file containing the following headings is recognized as a BlueFuse file:
| COL | ROW | SUBGRIDCOL | SUBGRIDROW |
UCSF Spot
UCSF Spot files are recognized by the following column headings:
| Arr-colx | Arr-coly | Spot-colx | Spot-coly |
NimbleScan
NimbleScan files (Feature, Probe and Pair) all contain the following headings:
| PROBE_ID | X | Y |
Applied Biosystems
Files generated by Applied Biosystems software have the following headings:
| Probe_ID | Gene_ID |
CodeLink
CodeLink Expression Analysis files are identified using the following:
| Logical_row | Logical_col | Center_X | Center_Y |
ImaGene
ImaGene files are recognized using the following columns:
| Meta Column | Meta Row | Column | Row | Field | Gene ID |
The ImaGene 3.0 format is also supported:
| Meta_col | Meta_row | Sub_col | Sub_row | Name | Selected |
CSIRO Spot
CSIRO Spot files contain the following columns:
| grid_c | grid_r | spot_c | spot_r | indexs |
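The heading-based recognition described above can be sketched as follows. This is a minimal illustration only, not the actual checking script: it assumes tab-delimited files and exact, case-sensitive heading matches, and it prefers the largest matching heading set when more than one matches. The names `KNOWN_FORMATS` and `identify_format` are hypothetical, and only a subset of the formats listed above is included.

```python
# Heading sets copied from the list above (subset only, for illustration).
KNOWN_FORMATS = [
    ("Generic", {"MetaColumn", "MetaRow", "Column", "Row"}),
    ("GenePix", {"Block", "Column", "Row", "X", "Y"}),
    ("Agilent", {"Row", "Col", "PositionX", "PositionY"}),
    ("ScanArray Express",
     {"Array Column", "Array Row", "Spot Column", "Spot Row", "X", "Y"}),
    ("QuantArray", {"Array Column", "Array Row", "Column", "Row"}),
    ("ImaGene",
     {"Meta Column", "Meta Row", "Column", "Row", "Field", "Gene ID"}),
]


def identify_format(lines):
    """Return (format_name, header_row_index) for the first row that
    contains a complete known heading set, or (None, None) if none does."""
    for index, line in enumerate(lines):
        # Assumption: cells are separated by tabs.
        cells = {cell.strip() for cell in line.rstrip("\r\n").split("\t")}
        matches = [(len(headings), name)
                   for name, headings in KNOWN_FORMATS
                   if headings <= cells]
        if matches:
            # Assumption: the most specific (largest) heading set wins.
            return max(matches)[1], index
    return None, None
```

Because the header row is located by its content rather than its position, any preamble lines written by the scanner software before the column headings do not need to be removed.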
Normalized data files
Normalized data files may be submitted in any of the above formats. In addition, files can be parsed if they use one of the special column headings below to designate the column containing reporter or composite sequence identifiers:
Generic
If you have normalized data mapped to the identifiers used in your array design, you can simply use a single column containing those identifiers. MAGE-TAB supports the use of either Reporter Identifiers or CompositeSequence Identifiers for this purpose. Please see these ADF help notes for a discussion on these identifier types. Thus, either of the following sets of column headers may be used:
| Reporter Identifier | <QT1> | <QT2> | <QT3> |
| CompositeSequence Identifier | <QT1> | <QT2> | <QT3> |
where <QT1>, <QT2> etc. are the names of your quantitation types.
Affymetrix normalized data
MAGE-TAB recognizes and parses CHP files using both the old GDAC formats and the newer GCOS/XDA formats. In addition, Affymetrix data normalized by non-Affymetrix methods (e.g. RMA normalization) can be parsed. You may use CompositeSequence identifiers (see example above), or either of the following sets of column headers:
| ProbeSet ID | <QT1> | <QT2> | <QT3> |
| ProbeSet Name | <QT1> | <QT2> | <QT3> |
Again, <QT1>, <QT2> etc. are the names of your quantitation types.
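As an informal illustration of the header layouts above, the sketch below separates the identifier heading from the quantitation type names. It is not the parser used for submissions: it assumes a tab-delimited file whose identifier column comes first, and the function name `split_normalized_header` is hypothetical.

```python
# Identifier headings taken from the Generic and Affymetrix examples above.
IDENTIFIER_HEADINGS = (
    "Reporter Identifier",
    "CompositeSequence Identifier",
    "ProbeSet ID",
    "ProbeSet Name",
)


def split_normalized_header(header_line):
    """Return (identifier_heading, quantitation_type_names) for a header row.

    Assumption: the identifier column is the first column of the file.
    """
    cells = [cell.strip() for cell in header_line.rstrip("\r\n").split("\t")]
    if cells[0] not in IDENTIFIER_HEADINGS:
        raise ValueError(
            "first column heading %r is not a recognized identifier heading"
            % cells[0]
        )
    return cells[0], cells[1:]
```

For example, a header row of `ProbeSet ID`, `QT1`, `QT2` would yield the identifier heading `ProbeSet ID` and the two quantitation type names.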
Please note: If you have normalised data file(s) that are not in one of the formats described above you can still submit this data. Ensure you include a ‘normalization data transformation protocol’ in your IDF describing how these files were created and what the various columns in the file(s) represent.
Data Matrix
If you wish to represent data from more than one assay, scan or normalization in a single data file, you will need to reformat it as a MAGE-TAB Data Matrix. This is a simplified format which allows data columns to be mapped to rows in the SDRF file. The first header line of a Data Matrix file describes this mapping, and the second lists the quantitation types for each column (e.g. "log2 ratio"). The first column is used to map the data rows to identifiers from the array design used. Examples are shown here:
Example non-Affymetrix data matrix
In this example, five hybridizations from the 'Assay Name' column of the SDRF file [A] are being mapped to log2 ratio values. Each row of data is mapped to a Reporter Identifier [B] defined by the array design (itself indicated in the Array Design REF column defined in the SDRF).
Example Affymetrix data matrix
In this example, two hybridizations from the 'Assay Name' column of the SDRF file [A] are being mapped to data with two different quantitation types (CELIntensity, CELStdev). Each row of data is mapped to a CompositeElement Identifier [B] defined by the array design.
There are some limitations imposed by ArrayExpress when submitting data in this format. Firstly, each data matrix should correspond to assays performed on a single array design. Experiments using multiple array designs should use one data matrix per design. Secondly, we rely on there being an ordered and regular organisation of the columns: first by assay, and then by quantitation type:
Correct
| Hybridization REF | Hyb1 | Hyb1 | Hyb2 | Hyb2 |
| Reporter REF | QT X | QT Y | QT X | QT Y |
Wrong
| Hybridization REF | Hyb1 | Hyb2 | Hyb1 | Hyb2 |
| Reporter REF | QT X | QT X | QT Y | QT Y |
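The column ordering rule can be checked before submission with something like the sketch below. It is one interpretation of "ordered and regular": every assay's columns must be contiguous, and every assay must list the same quantitation types in the same order. The function name `columns_grouped_by_assay` is hypothetical, and this is not the validation ArrayExpress itself runs.

```python
def columns_grouped_by_assay(assay_row, qt_row):
    """Check the two Data Matrix header rows (each a list of cells,
    including the leading label cell) for assay-then-quantitation-type order."""
    assays = assay_row[1:]
    qts = qt_row[1:]
    if not assays or len(assays) != len(qts):
        return False

    # Collect runs of adjacent columns that share the same assay name.
    groups = []
    for assay, qt in zip(assays, qts):
        if groups and groups[-1][0] == assay:
            groups[-1][1].append(qt)
        else:
            groups.append((assay, [qt]))

    # An assay that appears in more than one run is interleaved with others.
    names = [assay for assay, _ in groups]
    if len(set(names)) != len(names):
        return False

    # Assumption: "regular" means identical quantitation types per assay.
    first_qts = groups[0][1]
    return all(group_qts == first_qts for _, group_qts in groups)
```

Applied to the two layouts above, the "Correct" header rows pass this check and the "Wrong" ones fail it, because Hyb1 and Hyb2 are interleaved.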
Illumina
The Illumina BeadStudio software (see www.illumina.com) generates data files in several closely related formats. Of these, only tab-delimited files reporting probe-level (PROBE_ID) data are supported. These files are characterised by having PROBE_ID [A] as the first column, with subsequent column headers following the pattern "Assay name.QuantitationType name" [B].
Illumina files may contain data from a single assay or from multiple assays. They can be listed in either the "Array Data File" or the "Derived Array Data Matrix File" column of the MAGE-TAB spreadsheet. The file name should be included in all of the rows corresponding to the assays it covers.
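To show how the "Assay name.QuantitationType name" pattern can be read, here is a minimal sketch that groups the column headers of such a file by assay. It assumes the file is tab-delimited and that the quantitation type is the part after the last dot, which would misread a quantitation type name that itself contains a dot. The function name `parse_illumina_header` is hypothetical.

```python
def parse_illumina_header(header_line):
    """Map each assay name to its quantitation type names, in column order."""
    cells = [cell.strip() for cell in header_line.rstrip("\r\n").split("\t")]
    if cells[0] != "PROBE_ID":
        raise ValueError("expected PROBE_ID as the first column heading")

    columns = {}
    for cell in cells[1:]:
        # Assumption: split on the LAST dot, so assay names may contain dots.
        assay, dot, quantitation_type = cell.rpartition(".")
        if not dot or not assay or not quantitation_type:
            raise ValueError("heading %r does not match 'Assay.QT'" % cell)
        columns.setdefault(assay, []).append(quantitation_type)
    return columns
```

The keys of the returned dictionary are the assay names that the file covers, which are the rows of the MAGE-TAB spreadsheet in which the file name should appear.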
