Preprocessing data
1. Delete the Gene column (we don’t want the Gene ID to be used during clustering):
- Check the box next to Gene then click Remove (Figure 12).
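
To make the effect of this step concrete outside the WEKA interface, the following is a minimal Python sketch of removing an identifier column from a small table. The gene names, sample columns, and values are invented for illustration, and `drop_column` is a hypothetical helper, not part of WEKA.

```python
# Toy expression table: one identifier column plus numeric features.
# All names and values here are invented for illustration.
header = ["Gene", "Sample1", "Sample2"]
rows = [
    ["GENE_A", 2.5, 3.1],
    ["GENE_B", 0.4, 0.9],
]


def drop_column(header, rows, name):
    """Return a copy of the table without the column called `name`."""
    idx = header.index(name)
    new_header = header[:idx] + header[idx + 1:]
    new_rows = [row[:idx] + row[idx + 1:] for row in rows]
    return new_header, new_rows


# Remove the identifier so it cannot influence clustering.
features_header, features = drop_column(header, rows, "Gene")
print(features_header)
print(features)
```

The identifier is dropped because it is unique to each row and carries no information about similarity between rows, so it should not be passed to a clustering algorithm.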

2. WEKA automatically treats the last column as the class rather than as a feature, so the last column would not be used during clustering. To avoid losing a real feature, we will create a dummy column and assign it as the class. To create the new column (Figure 13):
- In the Filters section, click Choose
- Under filters -> unsupervised -> attribute, click Add
- Left-click on the text box showing the selected filter to open its options
- In the attributeName field, type: Class
- Click OK
- Click Apply
You should now see a new column called Class added as the last column in the dataset. The column is populated with NaN values, which indicates that it is empty.

3. Now we need to assign this column as the class column (Figure 14).
- In the Filters section, click Choose
- Under filters -> unsupervised -> attribute, click ClassAssigner (a filter that indicates which column is the class column)
- By default, ClassAssigner uses the last column as the class
- Click Apply
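
The reasoning behind steps 2 and 3 can be sketched in plain Python: if the last column is always set aside as the class, appending an empty dummy column means no real feature is lost. This is an illustration only, not WEKA code; the table values and the helper `add_empty_column` are hypothetical.

```python
import math

# Toy feature table after the identifier column has been removed.
# Names and values are invented for illustration.
header = ["Sample1", "Sample2"]
rows = [[2.5, 3.1], [0.4, 0.9]]


def add_empty_column(header, rows, name):
    """Append a column called `name` filled with NaN (i.e. empty)."""
    new_header = header + [name]
    new_rows = [row + [math.nan] for row in rows]
    return new_header, new_rows


header_with_class, rows_with_class = add_empty_column(header, rows, "Class")

# Designate the last column as the class, mirroring the default behaviour
# described for ClassAssigner in the text.
class_index = len(header_with_class) - 1

# Everything except the class column is available as a feature.
feature_names = [h for i, h in enumerate(header_with_class) if i != class_index]
feature_rows = [
    [v for i, v in enumerate(row) if i != class_index]
    for row in rows_with_class
]
print(feature_names)
print(feature_rows)
```

Without the dummy column, the column set aside as the class would have been `Sample2`, and only `Sample1` would have remained as a feature.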

Now that the dataset is preprocessed, we can visualise it.