affyParaEBI is a distributed computing package for RMA pre-processing of Affymetrix microarrays in the ArrayExpress R/Bioconductor Workbench.
affyParaEBI (aPE) is an R/Bioconductor-based pipeline for fast pre-processing of Affymetrix chips. It allows users to quickly pre-process large numbers of CEL files into Bioconductor ExpressionSet objects that can be used for downstream analysis.
aPE works remotely on the EBI R cloud, and can be used to analyse local datasets or public datasets from the ArrayExpress Archive of Functional Genomics Data at the EBI.
Running aPE on the EBI R cloud
- Choose a dataset to preprocess or upload your own data
aPE can be used on the EBI R-Cloud to pre-process private datasets: upload the corresponding CEL files to your R/Bioconductor Workbench account. You can also pre-process public datasets available through ArrayExpress. You can use this interface to search for a dataset.
- Launch the R-Cloud Workbench, register and create a new project
The EBI R-Cloud is a new service at the EBI that allows R users to log in and run distributed computational jobs remotely on its powerful 64-bit Linux cluster. It is available through a Java client called the ArrayExpress R/Bioconductor Workbench. Open the following address in a browser and follow the instructions on the page to download or launch the Workbench: http://www.ebi.ac.uk/tools/rcloud
When connecting for the first time you will be asked to register. You will need to provide a username, password and e-mail address so that you can retrieve your long-running projects the next time you log in. Once registered, you can log in and create a new project.
Find your way around the workbench
- Run the whole pipeline
If your CEL files are in a folder 'td', running affyParaEBI within R is straightforward with the following simple code:
> library(affyParaEBI)
> cluster <- makeCluster(10,type="RCLOUD")
> e <- preproParaEBI(path=td, clust=cluster)
> stopCluster(cluster); rm(cluster)
When the pipeline finishes, 'e' will contain an ExpressionSet object that can then be used for downstream analysis.
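As a minimal sketch of where downstream analysis can start, the following inspects an ExpressionSet with standard Biobase accessors. Because the real 'e' only exists after a run on the R cloud, the example builds a small stand-in object from random numbers; the probe-set names, sample names and output file name are made up for illustration. With a real result, skip the construction step.

```r
library(Biobase)

# Stand-in for the object returned by preproParaEBI(); hypothetical toy data.
set.seed(1)
m <- matrix(rnorm(20, mean = 8), nrow = 5, ncol = 4,
            dimnames = list(paste0("probeset_", 1:5),
                            paste0("sample_", 1:4, ".CEL")))
e <- ExpressionSet(assayData = m)

dim(exprs(e))          # number of probe sets x number of samples
sampleNames(e)         # one entry per CEL file
head(featureNames(e))  # first few probe-set identifiers

# Export the expression matrix as a tab-delimited text file.
out <- file.path(tempdir(), "expression_values.txt")
write.exprs(e, file = out)
```

The same accessors apply unchanged to the object produced by the pipeline, since it is an ordinary ExpressionSet.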
The pre-processing will take time depending on the size of the dataset and how many computing nodes were allocated with "makeCluster" (10 in the above example). Typically, 50 nodes can pre-process 1000 CEL files in about 20 minutes.
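For planning purposes, the figure quoted above can be turned into a back-of-the-envelope estimate. The sketch below assumes that run time grows linearly with the number of CEL files and shrinks in proportion to the number of nodes; that scaling is an assumption made here for illustration, not a measured property of the R cloud, and the function name estimate_minutes is hypothetical.

```r
# Rough run-time estimate, ASSUMING linear scaling in the number of CEL files
# and inverse scaling in the number of nodes (an assumption, not a guarantee).
# Reference point: 50 nodes, 1000 CEL files, about 20 minutes.
estimate_minutes <- function(n_files, n_nodes,
                             ref_files = 1000, ref_nodes = 50,
                             ref_minutes = 20) {
  stopifnot(n_files > 0, n_nodes > 0)
  ref_minutes * (n_files / ref_files) * (ref_nodes / n_nodes)
}

estimate_minutes(1000, 50)  # 20, the reference point itself
estimate_minutes(500, 10)   # 50 under the linear-scaling assumption
```

Real timings will also depend on cluster load and on how well the allocated nodes perform, so treat the result as an order-of-magnitude guide only.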
If you want to pre-process one or more experiments available in ArrayExpress, you can download the CEL files to a folder via the ArrayExpress R/Bioconductor package:
> library(ArrayExpress)
> library(affyParaEBI)
> #1) Create a two-node cluster
> cluster<-makeCluster(2,type="RCLOUD")
> #2) Retrieve CEL files from experiment E-MEXP-328
> td<-tempdir()
> emexp328.raw<-ArrayExpress(input = "E-MEXP-328", path=td, save=TRUE)
> #3) Replace low performance nodes
> cluster<-clusterOptimization(cluster, subst=TRUE)
> #4) Perform parallel preprocessing of CEL files
> emexp328.proc<-preproParaEBI(path=td, clust=cluster)
> #5) Clean up cluster nodes and CEL files
> stopCluster(cluster); rm(cluster)
> file.remove(list.files(td, full.names=TRUE))
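Before launching the parallel pre-processing step, it can be worth confirming that the folder actually contains CEL files, for example after a download that may have failed. The helper below is a hypothetical convenience written in base R and is not part of affyParaEBI; the demonstration uses empty placeholder files in a scratch folder rather than real arrays.

```r
# Hypothetical helper: list the CEL files in a folder and stop early if none exist.
check_cel_folder <- function(path) {
  cel <- list.files(path, pattern = "\\.cel$", ignore.case = TRUE,
                    full.names = TRUE)
  if (length(cel) == 0) stop("No CEL files found in ", path)
  message(length(cel), " CEL file(s) found in ", path)
  invisible(cel)
}

# Demonstration on a scratch folder containing empty placeholder files.
demo_dir <- file.path(tempdir(), "cel_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.CEL", "b.cel", "notes.txt")))
cel_files <- check_cel_folder(demo_dir)
basename(cel_files)
```

In the example above, such a check would sit between steps 2 and 4, called on 'td'.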