Loading data
Dataset description
We’ll be using the “00_logFCs.tsv” file, which can be downloaded from this link on the Sanger website (outside of EMBL-EBI). The file is one of eight in essentiality_matrices.zip and contains a data matrix of depletion log fold changes for 17,995 genes, scored in each of 325 cell lines [18]. All but one of the cell lines in this matrix (HT29v1.1) are cancer cell lines [19].
Note: essentiality_matrices.zip is a rather large file (241.5 MB) and may take a while to download and unzip. The “00_logFCs.tsv” file is 104.9 MB.
To load the data file into WEKA, it first needs to be converted to a CSV file. This can be done by opening the TSV file in a program such as Excel and saving it as a CSV file. We have also made the CSV file available for direct download: download CSV file.
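If you prefer not to use a spreadsheet program, the conversion can also be scripted. The following is a minimal sketch, assuming Python 3 is installed; the function name `tsv_to_csv` is our own and is not part of WEKA or of the dataset.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Copy a tab-separated file to a comma-separated file, row by row.

    Returns the number of rows written, including the header row.
    """
    rows_written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        # Comma is the default delimiter; values containing commas are quoted.
        writer = csv.writer(dst, lineterminator="\n")
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Assuming the unzipped file is in the current working directory, calling `tsv_to_csv("00_logFCs.tsv", "00_logFCs.csv")` would write the CSV file next to it. The file is read one row at a time, so its size (about 105 MB) should not be a problem for memory.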
Model development
- Run WEKA as explained in the previous section.
- Click on “Explorer”
- Click “Open File”
- In “Files of Type:”, select “CSV data files (*.csv)”
- Check the box “Invoke options dialog”
- Select file: 00_logFCs.csv
Open the file using the settings shown in Figure 9.

Click “OK”, then tell WEKA how to load the data in the options dialog (Figure 10):
- In the fieldSeparator field, leave the value as “,”
(the comma symbol)
- In the stringAttributes field, enter 1
(to instruct WEKA that the first column in the dataset is a String, which represents the GeneID)
- In the numericAttributes field, enter: 2-last
(to instruct WEKA that the second through the last columns in the dataset are numeric)
- Click OK
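The column settings above assume that the first column holds text and that every other column holds numbers. If loading fails, it can help to check this assumption outside WEKA. The sketch below is one way to do so in Python 3; the function name `find_non_numeric_cells` is hypothetical, and the check treats empty cells and labels such as “NA” as non-numeric.

```python
import csv


def find_non_numeric_cells(csv_path):
    """Check that columns 2..last of a CSV file with a header row are numeric.

    Returns (n_data_rows, n_columns, problems), where problems is a list of
    1-based (row, column) positions whose value cannot be parsed as a number.
    The header row counts as row 1.
    """
    problems = []
    n_data_rows = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        for row_number, row in enumerate(reader, start=2):
            n_data_rows += 1
            # Skip the first column (the gene identifier, loaded as a String).
            for col_number, value in enumerate(row[1:], start=2):
                try:
                    float(value)
                except ValueError:
                    problems.append((row_number, col_number))
    return n_data_rows, len(header), problems
```

If the file matches the dataset description, `find_non_numeric_cells("00_logFCs.csv")` should report 17,995 data rows and 326 columns (the gene column plus 325 cell lines). Any positions in the returned list point to cells that the 2-last setting would not be able to read as numbers.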

The dataset has now been successfully loaded into WEKA (Figure 11).
