Identifying targets for cancer using gene expression profiles

Gene expression begins with the transcription of DNA sequences into RNA sequences by the biochemical machinery inside the cell. For protein-coding genes, the resulting mRNA is then translated into amino acid sequences. Both RNA and protein expression levels are useful parameters to investigate when identifying and prioritising targets in drug discovery.

In this exercise, we will work with expression data derived from transcriptomic experiments, which are available as part of the Pan-cancer analysis project.

Dataset Description

The dataset contains samples from five types of cancer:

LUAD: lung adenocarcinoma
BRCA: breast carcinoma
KIRC: kidney renal clear-cell carcinoma
COAD: colon adenocarcinoma
PRAD: prostate adenocarcinoma

The dataset consists of two files:

Data.csv is the feature matrix. Each row corresponds to a sample, and each column (feature) corresponds to a gene. Each cell contains the expression value of that gene in that sample.
labels.csv contains the labels, which give the cancer type for each sample.

These two files were merged into a single file, which is used in this exercise. The preprocessing steps are described below in case you would like to reproduce them. Alternatively, you can download the preprocessed file here.

How are the files merged? (If you downloaded the preprocessed file, you can skip this step; it is for your information only.)

Load the Data.csv file into WEKA, then export it to ARFF format:
1. Run WEKA as explained in the previous section.
2. Click on “Explorer”.
3. Click Open File.
4. In “Files of Type:” select .csv.
5. Select the file Data.csv.
6. Click the Save button to save the data in ARFF format.
7. Name the file Data.arff.
Load the labels.csv file into WEKA, then export it to ARFF format as explained above. Name the file labels.arff.
Using the WEKA command-line interface (Simple CLI), type the command: java weka.core.Instances merge Data.arff labels.arff > merged.arff

Note that there are various ways and tools for appending the labels to the data file and converting from csv to arff format. The above is one way to do it using WEKA, but you can use your preferred tool or write your own script to perform the same task.
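
For readers who prefer a script, the following is a minimal Python sketch of the same idea: it joins the two tables column-wise and writes ARFF text. It is not the procedure used to build the course file. The function names merge_to_arff and read_csv are hypothetical, and the sketch assumes that both CSV files have a header row, that their rows are in the same sample order, that every column in Data.csv is numeric (any sample ID column has been dropped beforehand), and that the last column of labels.csv holds the cancer type.

```python
import csv
import io


def read_csv(path):
    """Read a CSV file into a list of rows (hypothetical helper)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def merge_to_arff(data_rows, label_rows, relation="merged"):
    """Join two tables column-wise and return the result as ARFF text.

    Assumptions: the first row of each table is a header, rows are in the
    same sample order, all feature columns are numeric, and the last column
    of the label table is the class.
    """
    data_header, *data_body = data_rows
    label_header, *label_body = label_rows
    if len(data_body) != len(label_body):
        raise ValueError("Data and labels have different numbers of samples")

    class_name = label_header[-1]
    # Nominal attribute values: the distinct labels, in sorted order.
    class_values = ",".join(sorted({row[-1] for row in label_body}))

    out = io.StringIO()
    out.write(f"@relation {relation}\n\n")
    for name in data_header:
        out.write(f"@attribute '{name}' numeric\n")
    out.write(f"@attribute '{class_name}' {{{class_values}}}\n\n@data\n")
    # Append each sample's label to its feature values.
    for features, label in zip(data_body, label_body):
        out.write(",".join(list(features) + [label[-1]]) + "\n")
    return out.getvalue()


# Example usage with the course file names (left commented out on purpose):
# arff_text = merge_to_arff(read_csv("Data.csv"), read_csv("labels.csv"))
# with open("merged.arff", "w") as handle:
#     handle.write(arff_text)
```

If your copies of the files include a sample identifier column, remove it or declare it as a string attribute before loading the result into WEKA, since the sketch declares every feature column as numeric.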
