Identifying targets for cancer using gene expression profiles
Gene expression is the transcription of DNA sequences into RNA sequences by the biochemical machinery inside the cell. For protein-coding genes, the resulting mRNA is then translated into amino acid sequences. Both RNA and protein expression levels are useful parameters to investigate when identifying and prioritising targets in drug discovery.
In this exercise, we will work with expression data derived from transcriptomic experiments, which is available as part of the Pan-cancer analysis project.
Dataset Description
The dataset contains samples from five types of cancer:
- LUAD: lung adenocarcinoma
- BRCA: breast carcinoma
- KIRC: kidney renal clear-cell carcinoma
- COAD: colon adenocarcinoma
- PRAD: prostate adenocarcinoma
The dataset consists of two files:
- Data.csv is the feature matrix. Each row corresponds to a sample, and each column (feature) corresponds to a gene. Each cell holds the expression level of that gene in that sample.
- labels.csv contains the labels, which give the cancer type of each sample.
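To make the layout of the two files concrete, the following is a minimal sketch that counts the samples and columns in the feature matrix and tallies the samples per cancer type. It assumes (this is not stated above) that both files have a single header row and that the cancer type is in the last column of labels.csv. The function name summarise_dataset is hypothetical and chosen for illustration only.

```python
import csv
from collections import Counter


def summarise_dataset(data_path, labels_path):
    """Return (number of samples, number of columns, samples per label)."""
    # Assumption: the first row of each file is a header row.
    with open(data_path, newline="") as data_file:
        reader = csv.reader(data_file)
        header = next(reader)
        # Count non-empty data rows; each one is a sample.
        n_samples = sum(1 for row in reader if row)

    with open(labels_path, newline="") as labels_file:
        reader = csv.reader(labels_file)
        next(reader)  # skip the header row
        # Assumption: the cancer type is the last column of each row.
        class_counts = Counter(row[-1] for row in reader if row)

    return n_samples, len(header), class_counts
```

Calling summarise_dataset("Data.csv", "labels.csv") would return the three values for the files described above. Note that the column count includes any sample identifier column the file may contain, so it is not necessarily equal to the number of genes.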
These files were merged into a single file, which will be used in this exercise. We explain below how the file was preprocessed in case you would like to recreate it; however, you can download the preprocessed file here.
How are the files merged? (If you downloaded the preprocessed file, you do not need to perform these steps; they are for your information only.)
- Load the Data.csv file into WEKA, then export it to arff format:
  - Run WEKA as explained in the previous section.
  - Click on “Explorer”.
  - Click Open File.
  - In “Files of Type:”, select .csv.
  - Select the file Data.csv.
  - Click the Save button to save the data in arff format.
  - Name the file Data.arff.
- Load the labels.csv file into WEKA, then export it to arff format as explained above. Name the file labels.arff.
- Using the WEKA command-line interface (CLI), type the command: java weka.core.Instances merge Data.arff labels.arff > merged.arff
Note that there are various ways and tools for appending the labels to the data file and converting from csv to arff format. The above is one way to do it using WEKA, but you can use your preferred tool or write your own script to perform the same task.
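As an illustration of the "write your own script" route, here is a minimal Python sketch that appends the label columns to the feature matrix row by row and writes a single merged csv file, which can then be opened in WEKA and saved as arff. It assumes that both files list the samples in the same order and have the same number of rows (including the header). The function name merge_csv_columns and the output name merged.csv are hypothetical. If both files carry a sample identifier column, it will appear twice in the output and one copy may need to be removed.

```python
import csv


def merge_csv_columns(data_path, labels_path, out_path):
    """Append the columns of labels_path to the rows of data_path, by position."""
    with open(data_path, newline="") as data_file, \
            open(labels_path, newline="") as labels_file:
        data_rows = list(csv.reader(data_file))
        label_rows = list(csv.reader(labels_file))

    # Rows are paired by position, so both files must have the same length.
    if len(data_rows) != len(label_rows):
        raise ValueError(
            f"row count mismatch: {len(data_rows)} vs {len(label_rows)}"
        )

    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for data_row, label_row in zip(data_rows, label_rows):
            # The header rows are concatenated in the same way as the data rows.
            writer.writerow(data_row + label_row)

    # Number of rows written, including the header row.
    return len(data_rows)
```

For the files described above, the call would be merge_csv_columns("Data.csv", "labels.csv", "merged.csv"). Pairing rows by position is a deliberate simplification; if the sample order in the two files could differ, joining on a shared sample identifier would be safer.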