Data visualisation

Click on the ‘Visualize’ tab. It shows a scatter plot for each pair of features. Plotting the features pairwise is one way of visualising high-dimensional datasets.

Figure 15 The WEKA ‘Visualize’ tab.

It is more challenging to cluster high-dimensional datasets, because the data samples are scattered across a high-dimensional space, which makes it difficult to identify similar objects. It is therefore recommended to reduce the dimensionality of the dataset. This also helps when visualising the dataset and the resulting clusters.

Dimensionality reduction is the process of projecting a high-dimensional dataset into a lower-dimensional space. One common approach is Principal Component Analysis (PCA), which projects the data onto orthogonal eigenvectors of its covariance matrix (the principal components) that capture most of the variance in the dataset. PCA is an unsupervised learning approach, which makes it suitable for unlabelled datasets such as the one we are using in this exercise. With PCA, users are free to choose the number of dimensions onto which the dataset is projected. However, each resulting feature is a combination of the original features, so the original meaning of the features is lost. PCA is therefore not recommended when it is essential to preserve the meaning of the features. In that case, other approaches, such as ranking the features by their variance and keeping the top-ranked ones, would be more suitable.
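
To make the two options above more concrete, the following is a minimal sketch written outside WEKA in Python, assuming NumPy is available. It is not how WEKA implements PCA; the function names pca_project and rank_features_by_variance, and the synthetic dataset X_demo, are hypothetical and used for illustration only.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top n_components principal components."""
    X = np.asarray(X, dtype=float)
    # Centre each feature so the covariance matrix describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues (most variance).
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # Fraction of the total variance captured by each kept component.
    explained = eigvals[order] / eigvals.sum()
    return X_centered @ components, components, explained


def rank_features_by_variance(X):
    """Return feature indices sorted from highest to lowest variance."""
    X = np.asarray(X, dtype=float)
    return np.argsort(X.var(axis=0))[::-1]


# Synthetic, illustrative data: 200 samples, 5 features with decreasing spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# Option 1: PCA. The two new features are mixtures of the original five.
projected, components, explained = pca_project(X_demo, n_components=2)

# Option 2: variance ranking. The original features are kept, only reordered.
ranking = rank_features_by_variance(X_demo)
X_top2 = X_demo[:, ranking[:2]]
```

Note the trade-off in the sketch: projected has two columns that no longer correspond to any single original feature, whereas X_top2 contains two of the original features unchanged. Variance ranking is sensitive to the scale of each feature, so it is usually worth checking that the features are measured in comparable units first.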