Loading data

Dataset description

We’ll be using the “00_logFCs.tsv” file, which can be downloaded from this link on the Sanger website (outside of EMBL-EBI). This file is one of the eight files in essentiality_matrices.zip and contains a data matrix of depletion log fold changes for 17,995 genes, scored in each of 325 cell lines [18]. All but one of the cell lines in this matrix (the one named HT29v1.1) are cancer cell lines [19].

Note: essentiality_matrices.zip is a rather large file (241.5 MB) and may take a while to download and unzip. The “00_logFCs.tsv” file is 104.9 MB.

Before the data file can be loaded, it needs to be converted to a CSV file. This can be done by opening the TSV file in a program such as Excel and saving it as a CSV file. We have also made the CSV file available for direct download: download CSV file.
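
For readers who would rather not open a file of this size in a spreadsheet program, the conversion can also be scripted. The following is a minimal sketch using only Python's standard csv module; the function name tsv_to_csv is hypothetical, and the sketch assumes the TSV file is UTF-8 encoded and strictly tab-delimited.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Copy a tab-separated file to a comma-separated file, row by row.

    Returns the number of rows written (header included).
    Assumes UTF-8 text and tab-delimited input.
    """
    rows_written = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        # The default writer dialect uses commas and quotes any field
        # that itself contains a comma.
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Calling tsv_to_csv("00_logFCs.tsv", "00_logFCs.csv") in the folder that holds the unzipped file should produce the CSV file used in the steps below. The file is processed one row at a time, so it does not need to fit in memory.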

Model development

  1. Run WEKA as explained in the previous section.
  2. Click on “Explorer”.
  3. Click “Open File”.
  4. In “Files of Type:”, select CSV data files (*.csv).
  5. Check the box “Invoke options dialog”.
  6. Select the file 00_logFCs.csv.

Open the file using the settings shown in Figure 9.

Figure 9 Invoke options dialog must be checked.

Click “OK”, then specify how WEKA should load the data (Figure 10):

  1. In the fieldSeparator field, leave the value as “,” (the comma symbol).
  2. In the stringAttributes field, enter 1 (to instruct WEKA that the first column in the dataset is a string, which represents the GeneID).
  3. In the numericAttributes field, enter 2-last (to instruct WEKA that the second through last columns in the dataset are numbers).
  4. Click OK.

Figure 10 Provide WEKA with the details required to load the data.
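
The loader settings above assume that the first column holds text and every other column holds numbers. If WEKA reports a problem while loading, a quick check of that assumption can help locate the offending column. The sketch below is illustrative only: the function name non_numeric_columns is hypothetical, and treating empty fields and “?” as missing values is an assumption that should be adjusted to match the file.

```python
import csv


def non_numeric_columns(csv_path, missing=("", "?")):
    """List the headers of columns 2..last that contain a non-numeric value.

    Column 1 is skipped because it is expected to hold the gene identifier.
    Values listed in `missing` are ignored (an assumption, see lead-in).
    """
    bad = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        for row in reader:
            for index, value in enumerate(row[1:], start=1):
                if value.strip() in missing:
                    continue
                try:
                    float(value)
                except ValueError:
                    # Fall back to a positional label if the row is
                    # longer than the header.
                    if index < len(header):
                        bad.add(header[index])
                    else:
                        bad.add("column %d" % (index + 1))
    # Report in header order, followed by any positional labels.
    named = [name for name in header[1:] if name in bad]
    extra = sorted(bad.difference(header[1:]))
    return named + extra
```

An empty list from non_numeric_columns("00_logFCs.csv") would suggest that the “2-last” setting is safe to use; any names returned point to columns worth inspecting before loading.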

The dataset has now been successfully loaded into WEKA (Figure 11).

Figure 11 The data has been loaded into WEKA.