Data clustering

We will work with k-means, a popular clustering algorithm. It starts by randomly selecting a set of centroids, i.e. data points that represent the centres of the clusters, with one centroid per cluster. Each data sample is assigned to the cluster whose centroid it is most similar to. The centroids are then recomputed from the samples assigned to them, and the two steps repeat until the assignments stop changing. k-means is fast and easy to implement. However, it requires the number of clusters to be known in advance.
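
To make the procedure concrete, the following is a minimal sketch of k-means in plain Python. It is not the code behind SimpleKMeans. The function names (`squared_distance`, `k_means`), the toy data, and the `seed` and `max_iterations` parameters are assumptions introduced for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=10):
    """Illustrative k-means; names and defaults are hypothetical."""
    rng = random.Random(seed)
    # Randomly pick k data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Toy data (assumed): two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

Note that `k` has to be passed in by the caller, which is the limitation mentioned above. Different seeds can give different starting centroids and, on harder data, different final clusters.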

Click on the Cluster tab. In the Clusterer section, choose SimpleKMeans (Figure 20).

We will apply SimpleKMeans with the default parameters (the number of clusters is 2 and the distance metric is Euclidean distance). If you would like to change any of the model’s parameters, left-click on the box showing ‘SimpleKMeans’, make the changes, and click OK. Then click Start to cluster the data.

Figure 20 Applying SimpleKMeans clustering.

The sum of squared errors (SSE) is used to assess the quality of the clusters: the lower the value, the closer the samples lie to their cluster centroids. The output shows that SSE = 251.08. You can also see that 60% of the dataset is assigned to one cluster and 40% to the other. To visualise the clusters (Figure 21):

  1. In the Result list section, right-click on the output.
  2. Click “Visualize cluster assignments”.

Figure 21 Visualisation of SimpleKMeans clustering.
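
The two figures reported in the clusterer output, the SSE and the share of samples per cluster, can be computed as in the sketch below. The data, centroids and function names are made up for illustration. The tool may preprocess attributes before measuring distances, so this is not expected to reproduce the value 251.08.

```python
from collections import Counter


def sum_of_squared_errors(points, centroids, assignments):
    """Add up the squared Euclidean distance of each point to its centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        centroid = centroids[cluster]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


def cluster_percentages(assignments):
    """Percentage of samples assigned to each cluster label."""
    counts = Counter(assignments)
    return {cluster: 100.0 * n / len(assignments) for cluster, n in counts.items()}


# Hypothetical clustering result: five 2-D samples in two clusters.
points = [(1.0, 2.0), (3.0, 2.0), (2.0, 2.0), (8.0, 8.0), (10.0, 8.0)]
assignments = [0, 0, 0, 1, 1]
centroids = [(2.0, 2.0), (9.0, 8.0)]

sse = sum_of_squared_errors(points, centroids, assignments)
shares = cluster_percentages(assignments)
print(sse)     # 4.0
print(shares)  # {0: 60.0, 1: 40.0}
```

A lower SSE for the same number of clusters indicates tighter clusters. SSE generally falls as the number of clusters grows, so it is most useful for comparing runs that use the same number of clusters.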

To overcome this shortcoming of k-means, namely that the number of clusters must be known in advance, we can apply another clustering approach called Expectation Maximization (EM). With its default parameters, EM performs cross-validation on the dataset to determine the number of clusters.
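
The idea of choosing the number of clusters by cross-validation can be sketched as follows. This is a simplified illustration, not the EM implementation used in this tutorial: it handles one numeric attribute only, models each cluster as a Gaussian, and adds clusters while the average held-out log-likelihood keeps improving. The function names, the fold count, the variance floor and the synthetic data are all assumptions.

```python
import math
import random


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def fit_em(data, k, iterations=50, seed=0, min_variance=1e-3):
    """Fit a k-component 1-D Gaussian mixture with a fixed number of EM steps."""
    rng = random.Random(seed)
    means = rng.sample(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: how strongly each component "owns" each sample.
        responsibilities = []
        for x in data:
            scores = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(scores) or 1e-300
            responsibilities.append([s / total for s in scores])
        # M-step: re-estimate each component from its weighted samples.
        for j in range(k):
            mass = sum(r[j] for r in responsibilities) or 1e-300
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, data)) / mass
            variances[j] = max(min_variance, spread)  # floor avoids collapse
            weights[j] = mass / len(data)
    return weights, means, variances


def mean_log_likelihood(data, model):
    """Average log-likelihood of data under a fitted mixture."""
    weights, means, variances = model
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(density, 1e-300))
    return total / len(data)


def choose_k_by_cross_validation(data, max_k=5, folds=5, seed=0):
    """Increase k while the cross-validated log-likelihood improves."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    best_k, best_score = 0, -math.inf
    for k in range(1, max_k + 1):
        fold_scores = []
        for f in range(folds):
            held_out = shuffled[f::folds]
            training = [x for i, x in enumerate(shuffled) if i % folds != f]
            fold_scores.append(mean_log_likelihood(held_out, fit_em(training, k, seed=seed)))
        score = sum(fold_scores) / folds
        if score <= best_score:
            break  # no improvement, so keep the previous k
        best_k, best_score = k, score
    return best_k


# Synthetic 1-D data (assumed): two groups centred at 0 and 8.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(60)] + [rng.gauss(8.0, 1.0) for _ in range(60)]
chosen_k = choose_k_by_cross_validation(data)
print("clusters chosen by cross-validation:", chosen_k)
```

Scoring on held-out folds matters here: the likelihood on the training data tends to keep rising as clusters are added, whereas the held-out likelihood stops improving once extra clusters only fit noise.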

In the Clusterer section, choose EM.

You can set the parameters by left-clicking on the text box. However, we will apply EM with the default parameters.

After reviewing the model’s parameters, click OK, then click Start to cluster the data (Figure 22).

Figure 22 Applying EM clustering.

You can see that EM created 5 clusters. To visualise the clusters (Figure 23):

  1. In the Result list section, right-click on the output.
  2. Click “Visualize cluster assignments”.

Figure 23 Visualisation of EM clustering.