This lab focuses on statistical techniques that data scientists should know: Principal Component Analysis, Partitional Clustering, and Hierarchical Clustering.
Directions for all labs (read before starting)
The “Lab” section is something you will work on with a partner using paired programming, a framework in which one partner (the “driver”) writes the code while the other (the “navigator”) reviews each step and suggests ideas.
Partners are encouraged to switch roles throughout the “Lab” section.
\(~\)
# load the following packages
# install.packages("factoextra")
# install.packages("psych")
# install.packages("corrplot")
# install.packages("cluster")
library(cluster)
library(psych)
library(factoextra)
library(ggplot2)
library(dplyr)
library(corrplot)
So far we’ve relied on data visualization as our primary method for identifying trends, but this has its limitations, especially when the number of variables is large.
Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \] Our goal in this lab is to find relationships in the columns of \(\mathbf{X}\) (ie: find patterns across variables) using principal component analysis (PCA). In doing so, we hope to come up with interpretable, low-dimensional representations of the key trends contained in our data.
\(~\)
The main idea behind PCA for dimension reduction is to replace the original variables with a small number of new, uncorrelated variables (principal components) that capture most of the variation in the data.
Consider the following example:
In Scenario A, the variables X and Y are redundant.
In Scenario B, there isn’t much redundancy between X and Y.
The next figure (below) shows the first principal component (red) and second principal component (blue) for each scenario:
The first (red) component is constructed to contain the maximum amount of variation in the data, and the second (blue) component is orthogonal to the first.
We can now define the location of each data-point using a coordinate scale in these new dimensions (ie: \(\{PC_1, PC_2\}\) rather than \(\{X, Y\}\)). In Scenario A, the \(PC_2\) coordinates have so little variation that we don’t lose much by ignoring them and only considering the \(PC_1\) coordinate of an observation (reducing two dimensions down to just one dimension).
\(~\)
Principal components can be expressed by linear combinations of the original variables:
\[PC_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p \\ PC_2 = \phi_{12} X_1 + \phi_{22} X_2 + \ldots + \phi_{p2} X_p \\ \ldots \]
The coefficients, \(\phi_{11}, \phi_{21}, \ldots\), are called loadings. They describe the contribution of a variable to a principal component. The matrix containing these loadings is sometimes called the rotation matrix.
If you plug in the values of \(X_1, \ldots, X_p\) for an observation and multiply by the loadings, the resulting value of \(PC_1\) is called a score. This is the location of that observation in the \(PC_1\) dimension (ie: its position on the \(PC_1\) axis).
The first step in PCA is to standardize \(\mathbf{X}\), which prevents variables with larger values from dominating. We want the difference between 1 m and 10 m (a value of 9) to be treated the same as the difference between 1000 mm and 10000 mm (a value of 9000), since both describe the same physical change.
After standardization, we’ll consider the covariance matrix \(\mathbf{C}= \tfrac{1}{n-1}\mathbf{X}^T\mathbf{X}\). Using techniques from linear algebra, we can factor this matrix:
\[\mathbf{C} = \mathbf{V}\mathbf{L}\mathbf{V}^T\]
Here the columns of \(\mathbf{V}\) are the eigenvectors of \(\mathbf{C}\), which become the loadings, and \(\mathbf{L}\) is a diagonal matrix containing the corresponding eigenvalues, which measure how much variation each component captures.
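As a quick numerical illustration (our own sketch using simulated data, not part of the original lab code), base R’s eigen() function recovers exactly this factorization:
## Verify that C = V L V^T on a small simulated data set
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3) ## 100 observations of 3 variables
X_std <- scale(X)                                 ## standardize first
C <- cov(X_std)                                   ## same as t(X_std) %*% X_std / (n - 1)
e <- eigen(C)                                     ## V = e$vectors, L = diag(e$values)
all.equal(C, e$vectors %*% diag(e$values) %*% t(e$vectors), check.attributes = FALSE)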
\(~\)
Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
Clustering approaches fall into two broad categories: partitional methods, which divide the observations into a pre-specified number of non-overlapping clusters, and hierarchical methods, which organize the observations into a nested hierarchy of clusters.
This lab will focus on partitional clustering, in particular \(k\)-means and PAM clustering.
\(~\)
As a motivating example, consider the “USArrests” data from the PCA lab. Shown below are two clusters (red and blue) from a partitional method.
Which cluster should contain Virginia? Why?
Good clusters contain data-points that are similar to each other (cohesive), and also dissimilar from the members of other clusters (distinct).
To assess “similarity” we’ll use distance (sometimes called a metric). There are many possible ways to define distance; one of the most popular is Euclidean distance:
\[d_{a,b} = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]
In words, the Euclidean distance between two data-points, \(x_a\) and \(x_b\), is the square root of the sum of squared differences in their values across each variable (we index these variables by \(j \in \{1, \ldots, p\}\)).
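As a quick sanity check (our own small example, not part of the lab), this formula matches what R’s dist() function computes for a pair of data-points:
a <- c(1, 4, 2)
b <- c(2, 0, 5)
sqrt(sum((a - b)^2)) ## by hand: square root of the summed squared differences
dist(rbind(a, b))    ## same value; Euclidean distance is the default for dist()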
\(~\)
Perhaps the most well-known clustering approach is k-means clustering, which groups \(n\) observations into \(k\) distinct groups (clusters) such that within-cluster variation is minimized.
The original \(k\)-means algorithm solves for the \(k\) cluster centroids (\(\mu_1, \ldots, \mu_k\)) that minimize the total within-cluster sum of squares:
\[ \text{Total Within-Cluster SS} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \]
To apply \(k\)-means, the analyst must specify \(k\), the number of clusters, then the \(k\)-means algorithm is used to group the observations such that total squared distance from each data point to the center of its assigned cluster is minimized.
The gif below illustrates the progression of the \(k\)-means algorithm for \(k = 2\):
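To make the algorithm’s two alternating steps concrete, below is a minimal sketch in R (our own illustration, not part of the lab: the name simple_kmeans is made up, the sketch ignores empty clusters and multiple restarts, and in practice you should use the kmeans() function introduced later in this lab):
simple_kmeans <- function(X, k, iters = 10) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE] ## random initial centroids
  for (i in seq_len(iters)) {
    ## Assignment step: send each data-point to its nearest centroid
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    ## Update step: move each centroid to the mean of its assigned points
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}
For example, simple_kmeans(scale(USArrests), k = 2) returns cluster labels and centroids with the same basic structure as kmeans().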
\(~\)
Clustering finds groups of similar data-points. Ideally, clusters are cohesive (all members are close to each other) with separation from each other (cluster members are far from the members of other clusters).
Partitional clustering (above) relies upon Euclidean distance: \[d_{euclidean}(a,b) = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]
However, this reliance is restrictive because Euclidean distance is not defined for categorical variables.
An alternative distance measure is Gower distance: \[d_{Gower}(a,b) = \tfrac{1}{p}\sum_{j = 1}^{p} h_j( x_{a,j}, x_{b,j})\]
The function \(h_j()\) differs depending upon whether \(X_j\) is numeric or categorical: for a categorical variable, \(h_j = 0\) when the two observations share the same value and \(h_j = 1\) when they differ; for a numeric variable, \(h_j\) is the absolute difference between the two values divided by the range of \(X_j\).
Notice that \(h_j()\) has a range of \([0,1]\) regardless of whether \(X_j\) is categorical or numeric, so Gower distance implicitly standardizes our data (meaning we don’t need to worry about using the scale() function when clustering with Gower distance).
\(~\)
A distance matrix, sometimes called a dissimilarity matrix, is an \(n\) by \(n\) matrix containing the pairwise distances between all pairs of data-points.
To better understand Gower distance and distance matrices, let’s look at a simple data set consisting of 6 yes/no questions that describe \(n = 20\) different animal species:
| | Warm_blooded? | Can_fly? | Vertebrate? | Endangered? | Lives_in_groups? | Has_hair? |
|---|---|---|---|---|---|---|
| ant | No | No | No | No | Yes | No |
| bee | No | Yes | No | No | Yes | Yes |
| cat | Yes | No | Yes | No | No | Yes |
| cephalopod | No | No | No | No | No | Yes |
| chimp | Yes | No | Yes | Yes | Yes | Yes |
| cow | Yes | No | Yes | No | Yes | Yes |
| duck | Yes | Yes | Yes | No | Yes | No |
| eagle | Yes | Yes | Yes | Yes | No | No |
| elephant | Yes | No | Yes | Yes | Yes | No |
| fly | No | Yes | No | No | No | No |
| frog | No | No | Yes | Yes | NA | No |
| hermit | No | No | Yes | No | Yes | No |
| lion | Yes | No | Yes | NA | Yes | Yes |
| lizard | No | No | Yes | No | No | No |
| lobster | No | No | No | No | NA | No |
| human | Yes | No | Yes | Yes | Yes | Yes |
| rabbit | Yes | No | Yes | No | Yes | Yes |
| salamander | No | No | Yes | No | NA | No |
| spider | No | No | No | NA | No | Yes |
| whale | Yes | No | Yes | Yes | Yes | No |
\(~\)
We can calculate a matrix of Gower distances using the
daisy() function in the cluster package:
animals <- read.csv("https://remiller1450.github.io/data/animals.csv", row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower") ## Calculate distance matrix
fviz_dist(D) ## Visualize
We can then visualize the distance matrix using
fviz_dist().
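To see where these values come from (our own check using the animals data loaded above), we can compute the Gower distance between “ant” and “bee” by hand. Every column is a yes/no factor and neither animal has missing values for this pair, so each \(h_j\) is 0 for a match and 1 for a mismatch, and the Gower distance is simply the proportion of mismatched questions:
mismatches <- animals["ant", ] != animals["bee", ] ## TRUE wherever the two animals differ
mean(mismatches)                                   ## proportion of the 6 questions that differ
as.matrix(D)["ant", "bee"]                         ## same value reported by daisy()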
\(~\)
Some clustering algorithms can use the distance matrix, rather than the raw data, to identify clusters. In this lab we’ll introduce hierarchical clustering, which can be applied to any distance matrix.
In contrast with prototype-based approaches like \(k\)-means or PAM, hierarchical clustering organizes observations into nested sets of clusters. These sets can be displayed using a tree-like structure known as a dendrogram:
ag <- agnes(D) ## Apply a hierarchical clustering algorithm
fviz_dend(ag) + labs(title = "Clustering Animals") ## Display the dendrogram
Note that we can always obtain partitional clusters by “cutting” a dendrogram:
fviz_dend(ag, k = 4) + labs(title = "Clustering Animals (k = 4)")
In this example, we asked for \(k = 4\) clusters, so the dendrogram was cut at a height of approximately 0.4. Recall that this height indicates that the members of these clusters differ by an average Gower distance of around 0.4, so we can say that they’re relatively different when considering the scale of Gower distance.
\(~\)
The USArrests data set contains four variables for each
of the 50 states.
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
If we create a scatterplot matrix (using
pairs()), we see signs of redundancy, particularly between
“Assault” and “Murder”:
## Step 1 - check for signs of redundancy
pairs(USArrests, lower.panel = NULL)
To perform PCA, we first must standardize the data (using
scale()):
## Step 2 - standardize the numeric variables
USArrests_std <- scale(USArrests)
We then use the prcomp() function to find the principal
components:
## Step 3 - perform PCA on the standardized data
P <- prcomp(USArrests_std)
P$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 -0.4181809 0.3412327 0.64922780
## Assault -0.5831836 -0.1879856 0.2681484 -0.74340748
## UrbanPop -0.2781909 0.8728062 0.3780158 0.13387773
## Rape -0.5434321 0.1673186 -0.8177779 0.08902432
Next, let’s quickly see how we could have used linear algebra instead
of prcomp():
n <- nrow(USArrests) ## find n obs
C <- 1/(n-1) * t(USArrests_std) %*% USArrests_std ## find Cov matrix
P2 <- svd(C) ## singular value decomposition (C = UDV^T)
P2$u ## rotation matrix (signs of some columns may be flipped relative to prcomp(); a component's sign is arbitrary)
## [,1] [,2] [,3] [,4]
## [1,] -0.5358995 0.4181809 -0.3412327 0.64922780
## [2,] -0.5831836 0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158 0.13387773
## [4,] -0.5434321 -0.1673186 0.8177779 0.08902432
Finally, let’s find the scores of each state in \(PC_1\) and \(PC_2\) and graph the states in these new dimensions:
scores <- as.data.frame(P$x)
scores$State <- rownames(P$x) ### row names were used to store states
ggplot(data = scores, aes(x = PC1, y = PC2, label = State)) + geom_text()
We can look at a few states to help us further understand the usefulness of PCA:
Question #1: Describe the states South Carolina and Wisconsin using the graph displayed above.
The code below creates a subset of the “colleges2019” data set with no missing data in the variables that are selected:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>%
select(Name, State, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
In the questions that follow you will progress through the basic steps of PCA:
Question #2 (Part A): Create a new data frame where all the numeric variables are standardized. To do this you must remove the “Name” and “State” columns.
Question #2 (Part B): Perform PCA using the
prcomp() function. Then, considering only loadings above
0.4 in absolute value (a commonly used threshold), come up with a label
for the first principal component.
Question #2 (Part C): Considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the second principal component.
Question #2 (Part D): Using geom_text()
create a graphic that displays the \(PC_1\) and \(PC_2\) scores of all colleges in
Iowa. Briefly comment on the position of Grinnell College in this
graph, mentioning why you think its location is so far from the other
Iowa colleges.
\(~\)
Principal component analysis is useful when your data contain several numeric variables that are correlated with each other, suggesting they might be measuring variation in the same underlying factor.
The first step in determining if PCA is viable is to look at the
pairwise relationships between variables. The corrplot
package provides several useful functions for this. Demonstrated below
are a few of the package’s functions:
## Create the correlation matrix (dropping identifiers)
cormat = cor(select(cd, -Name,-State))
## Default Corrplot
corrplot(cormat)
## Mixed corrplot
corrplot.mixed(cormat, tl.pos = 'lt', tl.cex = 0.5, number.cex = 0.55)
In the second example, the argument tl.pos moves the
text labels to the left and top, “tl”. You could also place them on the
diagonal using “d”. The arguments tl.cex and
number.cex change the size of the text labels and
correlation coefficients (respectively).
Question #3: Notice the grouping of variables with dark blue circles in the center of the corrplot. How does this grouping relate to the first principal component you described in Question 2? How would you explain this relationship?
\(~\)
Principal components are linear combinations of the original variables, so if the data are standardized, a variable’s loading is a direct reflection of its importance to that component.
The fviz_contrib() function provides a convenient way to
visualize variable contributions:
P <- prcomp(USArrests_std)
fviz_contrib(P, choice = "var", axes = 1, top = 4) ## PC1
fviz_contrib(P, choice = "var", axes = 2, top = 3) ## PC2
The horizontal reference line shows the expected contribution if every variable contributed uniformly to the principal components. Any variable whose contribution is above the reference line can be viewed as “important” in that dimension.
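For the USArrests data this reference line sits at 100/4 = 25% (a quick check we’ve added, using the convention that a uniform contribution would split 100% evenly across the \(p\) variables):
100 / ncol(USArrests_std) ## expected (uniform) contribution per variable, in percent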
Question #4: Use fviz_contrib() to
display the contributions of variables to the first principal component
in the PCA of the “colleges” data you performed in Question #2. Do the
variables that appear above the reference line correspond with the ones
you mentioned in Part B of Question #2? Briefly explain.
\(~\)
Principal components are constructed so that earlier components explain more variability in the data than the later components. Thus, not all components are equally valuable - but how should we decide which to keep and interpret and which to discard?
Recall that the matrix \(\mathbf{L}\) contains the eigenvalues of the covariance matrix. Because the data were standardized, these values must sum to \(p\) (the number of variables in the original data), so dividing each eigenvalue by \(p\) gives the proportion of variation explained by the corresponding component.
Note that the prcomp() function will report the square
root of the eigenvalues:
P <- prcomp(USArrests_std) ## US Arrests data
sum(P$sdev^2) ## Recall there were 4 variables
## [1] 4
A popular graphic for visualizing explained variation is the scree plot:
fviz_screeplot(P)
The x-axis displays each principal component, while the y-axis displays the corresponding percentage of variance explained by that component. For the “USArrests” data, the first two principal components account for nearly 90% of variation across different states.
We can find the precise y-values of this graph ourselves by dividing the eigenvalues by \(p\) (the number of columns of the original data):
P$sdev^2/ncol(USArrests_std)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
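Accumulating these values shows the total proportion explained by the first few components (a small addition to the lab code; the rounded running totals follow from the output above):
cumsum(P$sdev^2 / ncol(USArrests_std)) ## approximately 0.62, 0.87, 0.96, 1.00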
Question #5: Construct a scree plot for the PCA you performed on the “colleges” data in Question #2. Next, use the PCA results to find the exact proportion of variation explained by the first three components (ie: calculate and sum the y-values displayed on your scree plot for \(PC_1\), \(PC_2\), and \(PC_3\)).
\(~\)
Depending on the goals of an analysis, you might subjectively choose to retain a certain number of principal components by inspecting the scree plot.
Alternatively, some suggest retaining all components that explain at least as much variation as a variable in the original data. This corresponds to an eigenvalue \(\geq 1\) or explained variation \(\geq 1/p\).
While popular, neither of these methods is particularly rigorous. A more reliable way to determine the number of components to retain is parallel analysis.
Parallel analysis is a statistical approach that compares the
eigenvalues of the real data to a distribution of those found in
randomly generated data of the same size (think of a
randomization test from Sta-209). We can perform parallel analysis using
the fa.parallel() function in the psych
library:
fa.parallel(USArrests_std, fa = "pc") ## fa = "pc" indicates PCA
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
For the “USArrests” data, parallel analysis suggests we should only retain the first principal component (which we previously labeled “overall violent crime”).
Question #6: Perform parallel analysis on the “colleges” data that you’ve used in previous several questions. Report the number of components that the method suggests should be retained.
\(~\)
A biplot is a commonly used graph that combines the scores of individual observations and the loadings of the original variables on two of the principal component axes.
fviz_pca_biplot(P, axes = c(1,2), repel = TRUE) ## PC1 and PC2 for USArrests
Notice how the argument repel = TRUE can be used to
prevent text labels from overlapping.
\(~\)
In R, the kmeans() function is used to perform \(k\)-means clustering. For example, we can
use the code below to assign states in the USArrests data
into two clusters:
data("USArrests")
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
The centers argument is the number of clusters we want (ie: \(k\)), while nstart is the number of random starting configurations to try; kmeans() runs the algorithm from each start and keeps the solution with the lowest total within-cluster sum of squares. There is no guarantee that k-means finds a globally optimal set of cluster assignments, so using several starts makes the results more stable (for exact reproducibility you can also set a random seed with set.seed()). That said, in most circumstances the starting point has little or no impact on the clustering results.
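For instance (our addition; set.seed() is not part of the original lab code), fixing the seed makes the random starts, and therefore the clustering, exactly reproducible:
set.seed(123)                                     ## fix the random number generator before the random starts
k2 <- kmeans(USArrests, centers = 2, nstart = 25) ## keeps the best of 25 random starts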
Next, notice that kmeans() outputs a list containing 9
different objects. The code below prints the coordinates of the cluster
centroids (multi-dimensional means), which are stored in the
centers object within that list:
k2$centers
## Murder Assault UrbanPop Rape
## 1 11.857143 255.0000 67.61905 28.11429
## 2 4.841379 109.7586 64.03448 16.24828
These centroids are often called prototypes since each of them is a central point that can be used to describe the average characteristics of that cluster’s members.
If we are interested in the cluster assignments of individual
data-points, we can find them in the cluster object
returned by kmeans(). The code below demonstrates how to find the assigned clusters for the states “Iowa” and “Illinois”, which were used as row names in the USArrests data frame.
k2$cluster[c("Iowa", "Illinois")]
## Iowa Illinois
## 2 1
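Two other components of this list that are often useful (not shown in the original lab code) are the cluster sizes and the value of the objective function:
k2$size         ## number of states assigned to each cluster
k2$tot.withinss ## the total within-cluster sum of squares that k-means minimizes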
Question #7: Apply \(k\)-means clustering with \(k = 4\) to the USArrests data. Then, find the centroid prototype of the cluster that contains Minnesota. How does the cluster containing Minnesota compare to the other clusters (in terms of labels you derive from the cluster centroids)?
\(~\)
We can visualize the results of \(k\)-means clustering using
fviz_cluster(), which displays the clusters on a scatter
plot that uses the first two principal component scores as the
X-Y coordinates for each observation.
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
fviz_cluster(k2, data = USArrests)
Notice that fviz_cluster() labels each data-point using
the rownames of the data object. You should recall that you can
change these using the rownames() function. It’s also worth
knowing that you can also use the argument repel = TRUE (in
fviz_cluster()) to prevent overlapping labels.
Finally, instead of using PC1 and PC2, you can specify any of the original variables:
fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE, title = "Without Standardization")
Note: This code broke with recent R/ggplot2 package updates. If I can find a solution, I’ll fix it this week. Here is a quick ggplot2 recreation, but it does not show the cluster region shapes:
USArrests %>% mutate(cluster = factor(k2$cluster)) %>%
  ggplot(., aes(x = UrbanPop, y = Murder, color = cluster)) +
  geom_point(size = 3) +
  geom_point(data = as.data.frame(k2$centers),
             aes(x = UrbanPop, y = Murder),
             color = "black", size = 5, shape = 8) +
  labs(title = "Without Standardization") +
  theme_minimal()
You should note that clusters will generally appear most distinct when viewed using the PC1 and PC2 axes.
Question #8 (Part A): Use the scale()
function to standardize the USArrests data, then use
fviz_cluster() to graph the \(k\)-means clusters (for \(k = 2\)) found using the standardized data
with “UrbanPop” and “Murder” as the \(X\) and \(Y\) variables (similar to the graph in the
previous example).
Question #8 (Part B): Explain why the clusters are somewhat different after standardizing the data. Do you think the data should be standardized prior to clustering? Briefly explain.
\(~\)
As previously mentioned, the \(k\)-means algorithm requires the user to supply the value of \(k\) before the algorithm can be run.
As a general guideline, \(k\) should be large enough that clusters are relatively homogeneous, but it should also be small enough to avoid unnecessary complexity. The next few sections will cover three different methods for choosing a reasonable value of \(k\).
\(~\)
The \(k\)-means algorithm solves for centroids that minimize the total within-cluster sum of squares, \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j}\lVert x_i - \mu_j\rVert^2\). This quantity always decreases as \(k\) increases, but there are diminishing returns: eventually, adding more cluster centroids won’t lead to much improvement.
With this relationship in mind, we can graph the total within-cluster sum of squares, abbreviated “wss”, versus different choices of \(k\) using the fviz_nbclust() function:
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "wss", k.max = 8)
The Elbow Method suggests the optimal number of clusters corresponds with this plot’s “elbow”, or the point where the curve starts flattening/bending towards more gradual decreases in wss as \(k\) increases.
For the USArrests data (shown in this example), the
elbow method suggests either 2 or 3 clusters would be reasonable, as
going further doesn’t lead to much of a decrease in wss.
\(~\)
The Silhouette Method measures the fit of observations within their assigned clusters using quantities known as silhouette values, which are a comparison of the similarity of an observation to its assigned cluster (cohesion) relative to its similarity to other clusters (separation).
The difference between these two quantities (the measure of cohesion and the measure of separation) is standardized to produce a silhouette value between -1 and +1, with high values indicating an observation has “good fit” in its assigned cluster, and low values indicating “poor fit” within the assigned cluster.
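Concretely (this is the standard definition, added here for reference): if \(a(i)\) is the average distance between observation \(i\) and the other members of its assigned cluster, and \(b(i)\) is the smallest average distance between \(i\) and the members of any other cluster, then the silhouette value is \[s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\]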
If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.
The average silhouette (across the entire data set) can be
used to evaluate the effectiveness of different values of \(k\) in creating clusters that are
distinctive and cohesive. We can visualize this relationship by
supplying the “silhouette” method to fviz_nbclust().
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "silhouette", k.max = 8)
For the USArrests data, the average silhouette metric
suggests that \(k = 2\) is optimal.
Silhouettes are also useful for evaluating the cluster membership of individual observations:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
## cluster size ave.sil.width
## 1 1 29 0.60
## 2 2 21 0.58
Here we see there are a handful of states that aren’t great fits in their assigned clusters, so we’d want to avoid over-interpreting their assignments when sharing our results.
If we’d like to take a closer look, we can always filter the silhouette information:
## Filter to show low sil states
as.data.frame(sil) %>% filter(sil_width < 0.3)
## cluster neighbor sil_width
## Arkansas 2 1 0.11160017
## Missouri 1 2 0.06586521
## Rhode Island 1 2 0.17099027
## Tennessee 2 1 0.09530834
Note that the output of the silhouette() function is not
a data frame, so we must coerce it before filtering.
Question #9: For each part of the following question
you should use cds, which is a standardized subset of the
college scorecard data created by the code given below:
## Keep this code as pre-processing
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
select(Name, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
## Use the data below for this question
cds <- scale(select(cd, -Name))
Part A: Using the Elbow Method, determine an appropriate number of \(k\)-means clusters for these data. Then, report the centroids of each cluster. Note: because the data have been standardized, the units describing these centroids are standard deviations.
Part B: Use the Silhouette Method to determine an appropriate number of \(k\)-means clusters for these data.
Part C: Create a silhouette plot using the value of \(k\) selected in Part B. Which college(s) have negative silhouettes? Briefly explain what these values say about these colleges and their cluster membership.
\(~\)
The gap statistic is a statistical approach that compares the observed wss for a certain choice of \(k\) versus a distribution of wss that would be expected if there were no actual relationships between observations (ie: a null hypothesis of no clustering). A larger gap statistic indicates a greater difference between the real data and what could be explained by random chance.
For additional mathematical details, see this paper.
The code below calculates the gap statistic for \(k=1\) through \(k=5\) for the “USArrests” data:
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "gap", k.max = 5)
By default, fviz_nbclust recommends a value of \(k\) by combining the gap statistic with a standard-error rule. The aim is to provide more parsimonious results (fewer clusters) while still achieving performance that isn’t significantly different from the maximum.
Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the choice of \(k\) ultimately comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as different tools that can help guide your choice.
Question #10: Use the gap statistic to determine the
optimal choice of \(k\) for the
cds data set used in Question #9. Does this method agree
with the Elbow and Silhouette Methods? Briefly explain.
\(~\)
\(k\)-means clustering is a straightforward and widely used method, but it has a few weaknesses: it is sensitive to outliers, its centroids are averages rather than actual data-points (which can make the clusters harder to describe), and it requires numeric variables so that Euclidean distance is defined.
Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.
usarrests_std <- scale(USArrests)
us_pam <- pam(usarrests_std, k = 3) ## Apply PAM clustering
us_pam$medoids ## Print the medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Oklahoma -0.2727580 -0.2371077 0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(us_pam) ## Plot the clusters
PAM addresses many of the drawbacks of \(k\)-means: it is more robust to outliers, and the cluster centers are themselves data-points, making the results more interpretable. We’ll also see that PAM can be used to cluster data with categorical variables (via a distance measure like Gower distance).
The biggest weakness of PAM is its computational cost: it doesn’t scale well to large data sets because the medoid update step must evaluate many candidate swaps. Additionally, PAM tends to be more sensitive to the starting medoids than \(k\)-means is to the starting centroids.
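As a quick illustration of the point about categorical variables (our own sketch, reusing the Gower distance matrix D for the animals data computed earlier in this lab), pam() can also be run directly on a dissimilarity matrix:
animal_pam <- pam(D, k = 3, diss = TRUE) ## D contains Gower distances for the categorical animals data
animal_pam$medoids                       ## the animals selected as cluster prototypes (medoids)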
Question #10: Apply PAM clustering to the
cds data set used in Question #9. Your analysis
should choose \(k\) using one of the
three approaches from the previous section, and you should report the
prototypes (cluster medoids) for each cluster.
To begin, we’ll introduce the two basic philosophies behind hierarchical clustering algorithms: agglomerative methods, which start with each observation in its own cluster and repeatedly merge the most similar clusters (a “bottom-up” approach), and divisive methods, which start with every observation in a single cluster and repeatedly split it (a “top-down” approach).
The acronyms AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) describe two of the most popular methods. The code below gives an example of each:
D <- daisy(animals, metric = "gower")
ag <- agnes(D)
fviz_dend(ag, k = 3) + labs(title = "Clustering Animals (AGNES)")
di <- diana(D)
fviz_dend(di, k = 3) + labs(title = "Clustering Animals (DIANA)")
Notice that AGNES and DIANA are greedy algorithms (meaning each cluster merge/split is locally optimal, without any consideration of the overall hierarchy). Consequently, agglomerative and divisive approaches will usually produce different dendrograms.
Specific details of the AGNES and DIANA algorithms are summarized below:
AGNES Algorithm:
At each step, AGNES first identifies the pair of most similar clusters using one of the following linkage methods (a short example follows this list):
- method = "average" - the average distance between the data-points in one cluster and those in another cluster
- method = "single" - the smallest distance between any data-point in one cluster and any data-point in another cluster (nearest neighbor)
- method = "complete" - the largest distance between any data-point in one cluster and any data-point in another cluster (furthest neighbor); the pair of clusters with the smallest such value is treated as most similar
The two clusters deemed most similar (ie: smallest metric according to the selected method) are merged, and the algorithm moves on to determine the next merge. This is done until all observations have been merged into a single cluster.
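For example (a brief sketch we’ve added, reusing the animal distance matrix D from earlier in this lab), the linkage is chosen via the method argument of agnes():
ag_avg <- agnes(D, method = "average")  ## default linkage
ag_sgl <- agnes(D, method = "single")
ag_cmp <- agnes(D, method = "complete")
fviz_dend(ag_sgl) + labs(title = "Clustering Animals (single linkage)")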
DIANA Algorithm:
DIANA begins with all observations in a single cluster. At each step, the cluster with the largest diameter is split: the observation with the greatest average dissimilarity to the other members initiates a “splinter group”, and observations that are closer to the splinter group than to the remaining members are moved into it. This process is repeated until each data-point is its own cluster.
\(~\)
Question #11: The code below applies the AGNES
algorithm with method = "single" to the
USArrests data using the matrix of euclidean distances
calculated after the data are standardized. For Question #11, use the
provided code and its output to answer the following questions:
Part A: How many clusters were present at the initialization (start) of this clustering algorithm?
Part B: What was the first merge/split that
occurred? Use the distance matrix, D, to briefly justify
why this merge/split happened first. Hint: min(D)
will return the distance between the closest two data-points. You can
confirm that this distance corresponds to the merge/split you identified
by using View(as.matrix(D)).
Part C: If this dendrogram were cut to obtain \(k = 3\) clusters, how many members would belong to the largest of these clusters? Note: You may answer this question by inspecting the given output without writing any of your own code.
data("USArrests")
USArrests_std <- scale(USArrests)
D <- daisy(USArrests_std, metric = "euclidean")
ag1 <- agnes(D, method = "single")
fviz_dend(ag1, cex = 0.25) + labs(title = 'AGNES, Method = "Single"')
\(~\)
Question #12: The code below applies the DIANA
algorithm to the distance matrix calculated from the standardized
USArrests data. Using the code and output provided below,
answer the following questions:
Part A: How many clusters were present at the initialization (start) of this algorithm?
Part B: Briefly describe the merge/split resulting from the first step of the algorithm.
Part C: Which observation do you believe was used to
form the “splinter group” in the first step of the algorithm? Justify
your answer. Hint: Because each row/column of the distance
matrix corresponds to an observation, you might look at row or column
means. Before doing so, you might need to coerce D into a
matrix using as.matrix().
di1 <- diana(D)
fviz_dend(di1, cex = 0.25) + labs(title = 'DIANA')
\(~\)
Dendrograms provide a unique perspective on the relationships between data-points, but oftentimes we’d still like to report cluster membership for an optimized choice of \(k\).
Fortunately, all of the methods described in our previous clustering lab can also be applied to hierarchical methods. Unfortunately, some of them require that all of our variables be numeric.
The code below demonstrates these three methods using DIANA clustering on the US Arrests data described in Question #12.
### The elbow method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "wss")
### The average silhouette method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "silhouette")
## The gap statistic method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "gap",
k.max = 6, verbose = FALSE)
The argument FUNcluster = hcut indicates the use of
hierarchical clustering that is to be cut into clusters, while the
argument hc_func specifies the specific hierarchical
clustering algorithm that should be used.
Question #13: The code below creates
cds, a standardized subset of the Colleges 2019 data set
that was used in the partitional clustering portion of the lab. For this
question, use the average silhouette method to find the optimal value of
\(k\) for AGNES clustering performed on
these data. Then, create a dendrogram displaying the AGNES clustering
results with branches colored according to the value of \(k\) you selected. You should use euclidean
distance for this analysis.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
select(Name, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
cds <- scale(select(cd, -Name))
\(~\)
In this example you’ll explore data collected from an online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.
The personality test asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers designed 10 statements to target each of the big five traits. For example, questions tagged with “E” were designed to measure extraversion, while questions tagged with “A” were designed to measure agreeableness.
The exact questions used, along with a more detailed description of the data, can be found here: codebook link
big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
b5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics
Question #14 (Part A): Use the prcomp()
function to perform PCA on the big five data set. Be sure to standardize
the data. Then, use parallel analysis to determine the number of
components that should be retained.
Question #14 (Part B) Construct a scree plot to display the PCA results from Part A and report the variation explained by the first five principal components (the big five personality traits).
Question #14 (Part C): Use
fviz_contrib() with the argument top = 10 to
visualize the top 10 questions that contribute to the first principal
component. What label would you give this component?
Question #14 (Part D): Next, use
fviz_contrib() on components 2 through 5 (you don’t need to
record these results). How would you label these principal components?
Question #14 (Part E) Create a data frame of scores
in components 1 through 5 and add the “gender” variable from the
original data to this data frame. Then, use group_by() and
summarize() to find the average big five personality factor
scores of male and female respondents (1 = Male, 2 = Female). Which two
personality factors appear to be the most related to gender? Which two
personality factors appear to be the least related to gender? Briefly
explain.
Question #15: The “mlb” data set contains offensive statistics for all Major League Baseball players with at least 200 at bats during the 2022 season. Aside from the player ID and “AB” (number of at bats), all other columns are reported as rates (per at bat).
mlb <- read.csv("https://remiller1450.github.io/data/mlb_hitters22.csv")
For those unfamiliar with baseball, here are some brief definitions of the remaining variables:
Part A: Create a new data frame,
mlb_std, where the player IDs are used as row names, the
variable “AB” is removed, and all other variables have been standardized
by the scale() function.
Part B: These data were already standardized before you received them, in the sense that each statistic is expressed as a rate per at bat. Given this, briefly explain why the data should be further standardized (into Z-scores, by subtracting each variable’s mean and dividing by its standard deviation) prior to clustering.
Part C: Use the Silhouette Method to determine the number of clusters that should be used when applying \(k\)-means clustering to these data. Then, apply the \(k\)-means algorithm and print the cluster centroids.
Part D: Use the cluster centroids from Part C to provide a brief description of the members of Cluster #1 and Cluster #2. That is, how would you characterize the hitters in each of these clusters?
Part E: Suppose a sports columnist wants to write an article titled “7 different types of hitters”. Use PAM clustering to find prototype players to use in this article (you can use their Player IDs). Then, report the number of hitters belonging to each of these clusters.
Part F: Use the silhouette() function
to find the silhouette values for each hitter using the clusters from
Part D. Then, report the player with the lowest value (worst fit) for
each cluster. Hint: Use as.data.frame() to coerce
the output of silhouette() to a data.frame object, then add
the Player IDs. Next, use group_by() and
filter() to find the minimum silhouettes in each
cluster.
The aim of this section is to provide you an opportunity to apply clustering algorithms to new data sets in order to practice identifying scenarios where clustering might be a useful analysis approach.
Question #16: The “European foods” data set summarizes the results of a household survey administered to 16 European nations in 1975. The survey asked households whether they had various foods in their house at the time of the questionnaire. The data below display the percentage of households in each country that had the food in question (expressed as a numeric value between 0 and 100).
euro <- read.csv("https://remiller1450.github.io/data/europeanfoods.csv", row.names = 1, stringsAsFactors = TRUE)
Part A: Create a distance matrix using euclidean
distance without standardizing (since all variables are already
on a 0-100 scale). Then, visualize this matrix using
fviz_dist(). Which country’s household food habits are
most similar to Britain’s?
Part B: Use AGNES clustering with average distance as the merging method to create a dendrogram displaying these data.
Part C: Use DIANA clustering to create a dendrogram displaying these data.
Part D: Sweden, Norway, Denmark, and Finland are commonly grouped together as a region known as the Nordic Countries. Which algorithm, AGNES or DIANA, seems to do a better job capturing this grouping? Briefly explain your answer.
Part E: Use the average silhouette method to determine an optimal number of clusters for the application of AGNES clustering in Part B. Then, use this value to color the dendrogram you created in Part B.
Part F: Now use the gap statistic to determine an optimal number of clusters (again using the clustering results from Part B). Does this method agree with the average silhouette approach?
\(~\)
Question #17: The “starwars” data set documents the
attributes of major characters in the Star Wars film series. For this
question, you will perform a cluster analysis on the subset created
below. The variable definitions are largely self-explanatory, but you
can run ?starwars in the console for descriptions.
data("starwars")
sw <- starwars %>%
select(name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species) %>%
mutate_if(is.character, as.factor)
sw <- as.data.frame(sw[complete.cases(sw),])
rownames(sw) <- sw$name
sw <- select(sw, -name)
Part A: After inspecting the contents of
sw, briefly explain why Gower distance should be used to
cluster these data.
Part B: Create a distance matrix and visualize it
using fviz_dist().
Part C: Using the AGNES algorithm, create a dendrogram displaying these data.
Part D (optional, not graded): If you are familiar with Star Wars, do any relationships displayed in the dendrogram from Part C stand out as particularly interesting?