9. Principal Component Analysis & Clustering
Objectives
Understand principal component analysis (PCA) for dimensionality reduction.
Apply PCA to real-world problems and interpret the results.
Understand \(K\)-means and hierarchical clustering for data grouping.
Apply \(K\)-means and hierarchical clustering to real-world problems and interpret the results.
Expected time to complete: 3 hours
The previous chapters have introduced various regression and classification methods, which are all supervised learning methods. In supervised learning, we typically have access to a set of \(N\) observations (samples), with each having \(D\) features \(x_1, x_2, \ldots, x_D\) and a corresponding response \(y\). The goal of supervised learning is to predict \(y\) using \(x_1, x_2, \ldots, x_D\).
This chapter will instead focus on unsupervised learning for the setting in which we have only a set of features \(x_1, x_2, \ldots, x_D\) measured on \(N\) observations. The task here is not to predict a response, because we do not have an associated response variable \(y\). Rather, the goal is to discover or explore interesting things about the features \(x_1, x_2, \ldots, x_D\) by learning their lower-dimensional representations or groupings. For example, is there an informative way to visualise the data samples via their two-dimensional or three-dimensional representations? Can we discover interesting subgroups among the \(N\) observations? Unsupervised learning answers such questions. Here, we will focus on principal component analysis (PCA) for data visualisation or data pre-processing (typically before supervised learning methods are applied) and \(K\)-means / hierarchical clustering for discovering subgroups in data.
Ingredients: principal component analysis
Input: high-dimensional (\(D\)) features of data samples
Output: lower-dimensional (\(M<D\)) features of data samples
Model: linear (simple weighted) transformation (mapping) of data samples from high-dimensional space to lower-dimensional space
Hyperparameter: \(M\), the number of principal components, i.e. the lower dimension
Parameter(s): a \(D \times M\) transformation matrix mapping the original \(D\)-dimensional input to the transformed \(M\)-dimensional output
Loss function: maximise the variance of the lower-dimensional features in the transformed space
Learning algorithm: eigendecomposition or singular value decomposition
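To make the ingredients above concrete, here is a minimal sketch of PCA using scikit-learn. The synthetic data, the variable names, and the choice of \(M=2\) components are illustrative assumptions, not part of the chapter's case study.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: N = 200 samples with D = 5 correlated features (illustrative only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                  # hidden two-dimensional structure
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Hyperparameter M: keep 2 principal components
pca = PCA(n_components=2)
Z = pca.fit_transform(X)                            # N x M lower-dimensional features

print(pca.components_.shape)                        # (M, D): the learned transformation
print(pca.explained_variance_ratio_)                # fraction of variance captured by each component
```

The `explained_variance_ratio_` output connects directly to the loss above: the components are chosen so that the variance of the transformed features is as large as possible.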
Transparency: principal component analysis
System logic
Condition to produce certain output: to produce a certain lower-dimensional representation \(\mathbf{y}\), we can map \(\mathbf{y}\) back to the original high-dimensional space using the transpose of the transformation matrix (its columns are orthonormal, so the transpose acts as the inverse mapping) to obtain a reconstructed input \(\hat{\mathbf{x}}\). This reconstructed input \(\hat{\mathbf{x}}\) is the input that the transformation matrix maps to the output \(\mathbf{y}\), as illustrated in the sketch below.
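A minimal sketch of this reconstruction logic, assuming the `pca` object and data `X` from the previous snippet. In scikit-learn, `inverse_transform` multiplies the \(M\)-dimensional representation by the component matrix (and adds back the feature means) to return to the original \(D\)-dimensional space.

```python
# Project to M dimensions and back; the reconstruction error reflects
# the variance discarded by the components that were dropped.
Z = pca.transform(X)               # D-dimensional -> M-dimensional representation
X_hat = pca.inverse_transform(Z)   # M-dimensional -> reconstructed D-dimensional input

reconstruction_error = np.mean((X - X_hat) ** 2)
print(reconstruction_error)
```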
Ingredients: \(K\)-means clustering
Input: features of data samples
Output: cluster memberships for each sample, assigned from \(k=1, 2, \ldots, K\)
Model: find \(K\) cluster centres in the feature space so that samples in the same cluster are as similar to each other as possible
Hyperparameter(s): \(K\), the number of clusters
Parameter(s): the \(K\) cluster centres in the feature space, i.e. in terms of the \(D\) features, so there are \(K \times D\) parameters
Loss function: minimise the sum of squared distances between data samples and their assigned cluster centres
Learning algorithm: starting from randomly initialised cluster centres, alternate between assigning each sample to its nearest centre and updating each centre to the mean of its assigned samples, until the assignments stop changing (see the sketch below)
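The short sketch below implements the assignment and update steps directly, so the loss and the learning algorithm above can be read off the code. The synthetic two-dimensional data and the choice of \(K=3\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data: three blobs (illustrative only)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

K = 3                                                    # hyperparameter: number of clusters
centres = X[rng.choice(len(X), size=K, replace=False)]   # random initial centres

for _ in range(100):
    # Assignment step: each sample joins its nearest cluster centre
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centre moves to the mean of its assigned samples
    new_centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centres, centres):                # centres stopped moving: converged
        break
    centres = new_centres

# Loss: sum of squared distances between samples and their assigned centres
loss = ((X - centres[labels]) ** 2).sum()
print(labels[:10], loss)
```

In practice one would typically use `sklearn.cluster.KMeans`, which follows the same assign-and-update loop with more careful initialisation and several restarts.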
Transparency: \(K\)-means clustering
System logic
Condition to produce certain output: to produce a certain cluster assignment \(k\) (i.e. an output \(y=k\)), locate a data point \(\mathbf{x}\) whose distance to the centre of the \(k\)th cluster is smaller than its distance to any other cluster centre; such an \(\mathbf{x}\) is assigned to cluster \(k\), as illustrated below.
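A brief illustration of this condition, reusing the `centres` learned in the previous sketch (an assumption): a hypothetical new point is assigned to whichever centre it is closest to.

```python
x_new = np.array([3.9, 0.2])                             # hypothetical new data point
k = np.linalg.norm(centres - x_new, axis=1).argmin()     # index of the nearest centre
print(f"x_new is assigned to cluster {k}")
```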