Feature extraction

Feature extraction is the process of converting the scanned image of the microarray into quantifiable (computable) values and annotating these values with gene IDs, sample names and other useful information (Figure 5) (4).

Figure 5 Feature extraction involves the conversion of the scanned microarray image to quantifiable values that are saved in binary (e.g. CEL) or text format.

This process is often performed using software provided by the microarray manufacturer. Its output is a set of raw (i.e. unprocessed) data files, which can be in binary or text format (Table 1).

Table 1 Common microarray raw data file types.

Manufacturer | Typical raw data format | How to open / Analysis software examples
Affymetrix | .CEL (binary) | R packages (affy, limma, oligo…)
Agilent | feature extraction file (tab-delimited text file per hybridisation) | R packages (e.g. limma); spreadsheet software (Excel, OpenOffice, etc.)
GenePix (scanner) | .gpr (tab-delimited text file per hybridisation) | Spreadsheet software (Excel, OpenOffice, etc.)
Illumina | .idat (binary) | R packages (e.g. illuminaio)
Illumina | txt (tab-delimited text matrix for all samples) | R packages (e.g. lumi); spreadsheet software (Excel, OpenOffice, etc.)
Nimblegen | NimbleScan, .pair (tab-delimited text matrix for all samples) | Spreadsheet software (Excel, OpenOffice, etc.)
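
Several of the text formats in Table 1 are tab-delimited, so they can also be read with general-purpose tools. The following is a minimal Python sketch of reading a tab-delimited intensity matrix. It is not any manufacturer's official parser: the column layout (one probe ID column followed by one intensity column per sample), the function name `read_intensity_matrix` and the example values are all assumptions made for illustration. Real exports often carry extra header lines and columns that would need handling.

```python
import csv
import io

# Hypothetical example content: one row per probe, one intensity column per
# sample. Real manufacturer exports differ and may include extra header lines.
EXAMPLE_TEXT = (
    "PROBE_ID\tsample_A\tsample_B\n"
    "probe_001\t120.5\t98.0\n"
    "probe_002\t2045.0\t1987.25\n"
)


def read_intensity_matrix(handle):
    """Return (sample_names, {probe_id: [intensities]}) from a tab-delimited matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)          # first row holds the column names
    sample_names = header[1:]      # every column after the probe ID is a sample
    intensities = {}
    for row in reader:
        if not row:                # skip blank lines
            continue
        intensities[row[0]] = [float(value) for value in row[1:]]
    return sample_names, intensities


samples, matrix = read_intensity_matrix(io.StringIO(EXAMPLE_TEXT))
print(samples)
print(matrix["probe_002"])
```

To read a file on disk instead of the in-memory example, pass a handle opened with `open(path, newline="")` to the same function. Binary formats such as .CEL and .idat cannot be read this way and need dedicated packages such as those listed in Table 1.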

After feature extraction, the data can be analysed. Array manufacturers often provide software to open and analyse their raw data files, but these programs may not always be available, may become obsolete after a few years, or may not be flexible enough for your needs. Several other software tools, many of them free, are suitable for the downstream processing of microarray files. Examples are the Galaxy platform, GenePattern, GeneSpring (licence required) and the statistics software R.

The functional genomics team at EMBL-EBI uses the R packages ‘oligo’, ‘limma’ and ‘lumi’ (5) to analyse Affymetrix, Agilent and Illumina microarray data for the Expression Atlas.