This lab focuses on statistical techniques that data scientists should know: Principal Component Analysis, Partitional Clustering, and Hierarchical Clustering.

Directions for all labs (read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand something before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

The “Lab” section is something you will work on with a partner using paired programming, a framework where one partner “drives” (typing and running the code) while the other “navigates” (reading, reviewing, and offering suggestions).

Partners are encouraged to switch roles throughout the “Lab” section.

\(~\)

Preamble

Packages and Datasets

# load the following packages
# install.packages("factoextra")
# install.packages("psych")
# install.packages("corrplot")
# install.packages("cluster")
library(cluster)
library(psych)
library(factoextra)
library(ggplot2)
library(dplyr)
library(corrplot)

Principal Component Analysis (PCA)

Introduction

So far we’ve relied on data visualization as our primary method for identifying trends, but this has its limitations, especially when the number of variables is large.

Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):

\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]

Our goal in this lab is to find relationships in the columns of \(\mathbf{X}\) (ie: find patterns across variables) using principal component analysis (PCA). In doing so, we hope to come up with interpretable, low-dimensional representations of the key trends contained in our data.

\(~\)

Dimension Reduction

The main idea behind PCA for dimension reduction is as follows:

  1. Each observation (data-point) has a location in a \(p\)-dimensional space, but not all of these dimensions are equally interesting.
  2. PCA derives a set of new dimensions ordered by variation. Dimensions without much variation (few differences across observations) aren’t very interesting and can be discarded.

Consider the following example:

In Scenario A, the variables X and Y are redundant

  • If you know an observation’s X-value, you essentially know its Y-value too.
  • One dimension (ie: a line) is enough to capture most of the variation in the variables X and Y.

In Scenario B, there isn’t much redundancy between X and Y

  • Knowing an observation’s X-value tells you very little about its Y-value.
  • One dimension cannot effectively capture the variation in these two variables; we need a two-dimensional plane.

The next figure (below) shows the first principal component (red) and second principal component (blue) for each scenario:

The first (red) component is constructed to contain the maximum amount of variation in the data, and the second (blue) component is orthogonal to the first.

We can now define the location of each data-point using coordinates in these new dimensions (ie: \(\{PC_1, PC_2\}\) rather than \(\{X, Y\}\)). In Scenario A, the \(PC_2\) coordinates have so little variation that we don’t lose much by ignoring them and only considering the \(PC_1\) coordinate of an observation (reducing two dimensions down to just one).

\(~\)

Mathematical Details

Principal components can be expressed by linear combinations of the original variables:

\[PC_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p \\ PC_2 = \phi_{12} X_1 + \phi_{22} X_2 + \ldots + \phi_{p2} X_p \\ \ldots \]

The coefficients, \(\phi_{11}, \phi_{21}, \ldots\), are called loadings. They describe the contribution of a variable to a principal component. The matrix containing these loadings is sometimes called the rotation matrix.

If you plug in the values of \(X_1, \ldots, X_p\) for an observation and multiply by the loadings, the resulting value of \(PC_1\) is called a score. This is the location of that observation in the \(PC_1\) dimension (ie: its position on the \(PC_1\) axis).

The first step in PCA is to standardize \(\mathbf{X}\), which prevents variables with larger values from dominating. We want the difference between 1 m and 10 m (a difference of 9) to be treated the same as the difference between 1,000 mm and 10,000 mm (a difference of 9,000).

After standardization, we’ll consider the covariance matrix \(\mathbf{C}= \tfrac{1}{n-1}\mathbf{X}^T\mathbf{X}\). Using techniques from linear algebra, we can factor this matrix:

\[\mathbf{C} = \mathbf{V}\mathbf{L}\mathbf{V}^T\]

  • \(\mathbf{V}\) is a matrix of eigenvectors, while \(\mathbf{L}\) is a diagonal matrix of eigenvalues
  • The columns of \(\mathbf{V}\) contain the loadings for each principal component
  • The elements of \(\mathbf{L}\) relate to the “variance explained” (importance) of each principal component
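To make these steps concrete, here is a minimal “by hand” sketch in R using the built-in USArrests data (analyzed later in this lab). It is for illustration only; we’ll rely on prcomp() in the lab, and keep in mind that eigen() may return some columns with flipped signs relative to prcomp()’s loadings.

## A minimal "by hand" PCA sketch (illustration only)
data("USArrests")
X_std <- scale(USArrests)          ## standardize the data
C <- cov(X_std)                    ## covariance matrix of the standardized data
eig <- eigen(C)                    ## factor C = VLV^T
eig$vectors                        ## columns contain the loadings of each principal component
eig$values                         ## eigenvalues (related to the variance explained by each component)
scores <- X_std %*% eig$vectors    ## scores = standardized data multiplied by the loadings
head(scores)                       ## each row gives an observation's coordinates on the PC axes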

\(~\)

Partitional Clustering

Introduction

Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):

\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]

  • PCA can be used to identify relationships in the columns of \(\mathbf{X}\) (ie: groupings/patterns among variables).
  • Clustering can be used to identify relationships in the rows of \(\mathbf{X}\) (ie: groupings/patterns among observations).

Clustering approaches fall into two broad categories:

  • Hierarchical Clustering organizes the data-points into nested subsets (tree-like structures)
  • Partitional Clustering divides the data-points into non-overlapping subsets (regions in space)

This lab will focus on partitional clustering, in particular \(k\)-means and PAM clustering.

\(~\)

Distance

As a motivating example, consider the “USArrests” data from the PCA lab. Shown below are two clusters (red and blue) from a partitional method.

Which cluster should contain Virginia? Why?

Good clusters contain data-points that are similar to each other (cohesive), and also dissimilar from the members of other clusters (distinct).

To assess “similarity” we’ll use distance (sometimes called a metric). There are many possible ways to define distance, one of the most popular is Euclidean distance:

\[d_{a,b} = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]

In words, the Euclidean distance between two data-points, \(x_a\) and \(x_b\), is the square root of the sum of squared differences in their values across each variable (we index these variables by \(j \in \{1, \ldots, p\}\)).
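As a quick check of this formula, the short example below computes the Euclidean distance between two states in the “USArrests” data by hand and compares it against R’s dist() function (unstandardized data, for illustration only):

## Euclidean distance "by hand" vs. dist()
data("USArrests")
X <- as.matrix(USArrests)
sqrt(sum((X["Iowa", ] - X["Illinois", ])^2))   ## by-hand calculation
dist(X[c("Iowa", "Illinois"), ])               ## same value from dist()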

\(~\)

K-means

Perhaps the most well-known clustering approach is k-means clustering, which groups \(n\) observations into \(k\) distinct groups (clusters) such that within cluster variation is minimized.

The original \(k\)-means algorithm solves for the \(k\) cluster centroids (\(\mu_1, \ldots, \mu_k\)) that minimize “Total Withinness” (the total within-cluster sum of squares):

\[ \text{Total Withinness} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \]

  • \(x_i\) is a data-point belonging to cluster \(j\)
  • \(\mu_j\) is the centroid (multi-dimensional mean) of cluster \(j\)
  • \(W(C_j)\) is the “withinness” of cluster \(j\), the total squared distance from each member to the cluster center

To apply \(k\)-means, the analyst must specify \(k\), the number of clusters. The algorithm then groups the observations so that the total squared distance from each data-point to the center of its assigned cluster is minimized.

The \(k\)-means Algorithm
  1. Initialize the positions of \(k\) centroids
  2. Calculate the distance between each data-point and each centroid (using Euclidean distance)
  3. Assign data-points to the nearest centroid
  4. Update the cluster centroids (to the mean of its assigned data-points)
  5. Repeat (2) - (4) until there is no change

The gif below illustrates the progression of the \(k\)-means algorithm for \(k = 2\):
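To connect these steps with code, below is a bare-bones sketch of the algorithm written in base R. The function name simple_kmeans() is made up for illustration, and the sketch assumes no cluster ever becomes empty; the kmeans() function used later in the lab is what you should use in practice.

## A bare-bones k-means sketch (illustration only; use kmeans() in practice)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]   ## Step 1: initialize k centroids
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    ## Step 2: Euclidean distance from every data-point to every centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)           ## Step 3: assign to the nearest centroid
    if (all(new_assignment == assignment)) break       ## Step 5: stop when nothing changes
    assignment <- new_assignment
    ## Step 4: update each centroid to the mean of its assigned data-points
    centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))
  }
  list(cluster = assignment, centers = centroids)
}

## Example: two clusters for the standardized USArrests data
res <- simple_kmeans(scale(USArrests), k = 2)
table(res$cluster)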

\(~\)

Hierarchical Clustering

Distance and Categorical Variables

Clustering finds groups of similar data-points. Ideally, clusters are cohesive (all members are close to each other) with separation from each other (cluster members are far from the members of other clusters).

Partitional clustering (above) relies upon Euclidean distance: \[d_{euclidean}(a,b) = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]

However, this reliance is restrictive because Euclidean distance is not defined for categorical variables.

An alternative distance measure is Gower distance: \[d_{Gower}(a,b) = \tfrac{1}{p}\sum_{j = 1}^{p} h_j( x_{a,j}, x_{b,j})\]

The function \(h_j()\) differs depending upon whether \(X_j\) is numeric or categorical

  • For numeric variables, \(h_j(x_{a,j}, x_{b,j}) = \tfrac{|x_{a,j} - x_{b,j}|}{R_j}\), where \(R_j\) is the observed range of \(X_j\)
  • For categorical variables, \(h_j(x_{a,j}, x_{b,j}) = 0\) if \(x_{a,j}\) and \(x_{b,j}\) are the same category, and \(h_j(x_{a,j}, x_{b,j}) = 1\) otherwise

Notice that \(h_j()\) has a range of \([0,1]\) regardless of whether \(X_j\) is categorical or numeric, so Gower distance implicitly standardizes our data (meaning we don’t need to worry about using the scale() function when clustering with Gower distance).
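To see how \(h_j()\) works, below is a tiny made-up example with one numeric variable and one categorical variable (the variable names and values are hypothetical), computed by hand and then verified with the daisy() function introduced in the next section:

## Hypothetical two-row example: one numeric and one categorical variable
library(cluster)
toy <- data.frame(weight = c(10, 30),               ## numeric, observed range R_j = 20
                  flies  = factor(c("yes", "no")))  ## categorical, categories differ
## By hand: numeric term = |10 - 30|/20 = 1; categorical term = 1 (mismatch)
## Gower distance = (1 + 1)/2 = 1
daisy(toy, metric = "gower")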

\(~\)

Distance Matrices

A distance matrix, sometimes called a dissimilarity matrix, is an \(n\) by \(n\) matrix containing the pairwise distances between all pairs of data-points.

To better understand Gower distance and distance matrices, let’s look at a simple data set consisting of 6 yes/no questions that describe \(n = 20\) different animal species:

Warm_blooded? Can_fly? Vertebrate? Endangered? Lives_in_groups? Has_hair?
ant No No No No Yes No
bee No Yes No No Yes Yes
cat Yes No Yes No No Yes
cephalopod No No No No No Yes
chimp Yes No Yes Yes Yes Yes
cow Yes No Yes No Yes Yes
duck Yes Yes Yes No Yes No
eagle Yes Yes Yes Yes No No
elephant Yes No Yes Yes Yes No
fly No Yes No No No No
frog No No Yes Yes NA No
hermit No No Yes No Yes No
lion Yes No Yes NA Yes Yes
lizard No No Yes No No No
lobster No No No No NA No
human Yes No Yes Yes Yes Yes
rabbit Yes No Yes No Yes Yes
salamander No No Yes No NA No
spider No No No NA No Yes
whale Yes No Yes Yes Yes No

\(~\)

We can calculate a matrix of Gower distances using the daisy() function in the cluster package:

animals <- read.csv("https://remiller1450.github.io/data/animals.csv", row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower")  ## Calculate distance matrix
fviz_dist(D)  ## Visualize

We can then visualize the distance matrix using fviz_dist().

  • Why are all entries on the diagonal of this matrix exactly 0?
  • What must be true of lions (“lio”) and flies (“fly”) for their distance to be 1?

\(~\)

Hierarchical Clustering

Some clustering algorithms can use the distance matrix, rather than the raw data, to identify clusters. In this lab we’ll introduce hierarchical clustering, which can be applied to any distance matrix.

In contrast with prototype-based approaches like \(k\)-means or PAM, hierarchical clustering organizes observations into nested sets. These sets can be displayed using a tree-like structure known as a dendrogram:

ag <- agnes(D)  ## Apply a hierarchical clustering algorithm
fviz_dend(ag) + labs(title = "Clustering Animals")  ## Display the dendrogram

  • The top of the dendrogram is a single cluster containing all observations (ie: \(k = 1\))
  • The bottom of the dendrogram has every observation in its own cluster (ie: \(k = n\))
  • The y-value at which two clusters merge is a reflection of the distance/dissimilarity between the two clusters

Note that we can always obtain partitional clusters by “cutting” a dendrogram:

fviz_dend(ag, k = 4) + labs(title = "Clustering Animals (k = 4)")

In this example, we asked for \(k = 4\) clusters, so the dendrogram was cut at a height of approximately 0.4. Recall that this height indicates that the members of these clusters differ by an average Gower distance of around 0.4, so we can say they’re relatively different when considering the scale of Gower distance.
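If you’d also like the cluster memberships implied by a cut (not just the picture), one option is to convert the AGNES result to an hclust object and use the cutree() function. A short sketch:

## Extract the k = 4 cluster memberships from the dendrogram cut
memb <- cutree(as.hclust(ag), k = 4)
table(memb)   ## number of animals in each cluster
head(memb)    ## cluster assignments for the first few animals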

\(~\)

Lab Part 1: PCA

Example

The USArrests data set contains four variables for each of the 50 states.

data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

If we create a scatterplot matrix (using pairs()), we see signs of redundancy, particularly between “Assault” and “Murder”:

## Step 1 - check for signs of redundancy
pairs(USArrests, lower.panel = NULL)

To perform PCA, we first must standardize the data (using scale()):

## Step 2 - standardize the numeric variables
USArrests_std <- scale(USArrests)

We then use the prcomp() function to find the principal components:

## Step 3 - perform PCA on the standardized data
P <- prcomp(USArrests_std)
P$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
## Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
## UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
## Rape     -0.5434321  0.1673186 -0.8177779  0.08902432
  • \(PC_1\) is the dimension of maximum variation in the data. It is primarily a combination of “Murder”, “Assault”, and “Rape” - so we might label it as “overall violent crime”.
  • \(PC_2\) is the second principal component. It is orthogonal to (uncorrelated with) \(PC_1\), and it’s largely based on “UrbanPop” (and “Murder” to a lesser extent) - we might label it as “Urbanness”.

Next, let’s quickly see how we could have used linear algebra instead of prcomp(). (Note that some columns of the result may be flipped in sign relative to prcomp()’s loadings; each principal component direction is only defined up to sign.)

n <- nrow(USArrests)  ## find n obs
C <- 1/(n-1) * t(USArrests_std) %*% USArrests_std  ## find Cov matrix
P2 <- svd(C)  ## singular value decomposition (C = UDV^T)
P2$u        ## rotation matrix
##            [,1]       [,2]       [,3]        [,4]
## [1,] -0.5358995  0.4181809 -0.3412327  0.64922780
## [2,] -0.5831836  0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158  0.13387773
## [4,] -0.5434321 -0.1673186  0.8177779  0.08902432

Finally, let’s find the scores of each state in \(PC_1\) and \(PC_2\) and graph the states in these new dimensions:

scores <- as.data.frame(P$x)
scores$State <- rownames(P$x) ### row names were used to store states
ggplot(data = scores, aes(x = PC1, y = PC2, label = State)) + geom_text()

We can look at a few states to help us further understand the usefulness of PCA:

  1. California has a large, negative score in \(PC_1\) and a high score in \(PC_2\). This means it has high rates of violent crime (notice the negative loadings) and a high degree of “urbanness”.
  2. Florida has a large, negative score in \(PC_1\) and a score of nearly zero in \(PC_2\), thus it has high rates of violent crime and is middle of the pack in “urbanness”.
  3. Vermont has a large, positive score in \(PC_1\) and a moderate negative score in \(PC_2\), thus it has low rates of violent crime and is low in “urbanness”.

Question #1: Describe the states South Carolina and Wisconsin using the graph displayed above.

Practice

The code below creates a subset of the “colleges2019” data set with no missing data in the variables that are selected:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% 
  select(Name, State, Enrollment, Adm_Rate, ACT_median, 
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)

In the questions that follow you will progress through the basic steps of PCA:

Question #2 (Part A): Create a new data frame where all the numeric variables are standardized. To do this you must remove the “Name” and “State” columns.

Question #2 (Part B): Perform PCA using the prcomp() function. Then, considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the first principal component.

Question #2 (Part C): Considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the second principal component.

Question #2 (Part D): Using geom_text() create a graphic that displays the \(PC_1\) and \(PC_2\) scores of all colleges in Iowa. Briefly comment on the position of Grinnell College in this graph, mentioning why you think its location is so far from the other Iowa colleges.

\(~\)

When to consider PCA?

Principal component analysis is useful when your data contain several numeric variables that are related to each other in ways that suggest they might be measuring variation in the same underlying factor.

The first step in determining if PCA is viable is to look at the pairwise relationships between variables. The corrplot package provides several useful functions for this. Demonstrated below are a few of the package’s functions:

## Create the correlation matrix (dropping identifiers)
cormat = cor(select(cd, -Name,-State))

## Default Corrplot
corrplot(cormat)

## Mixed corrplot 
corrplot.mixed(cormat, tl.pos = 'lt', tl.cex = 0.5, number.cex = 0.55)

In the second example, the argument tl.pos moves the text labels to the left and top, “tl”. You could also place them on the diagonal using “d”. The arguments tl.cex and number.cex change the size of the text labels and correlation coefficients (respectively).

Question #3: Notice the grouping of variables with dark blue circles in the center of the corrplot. How does this grouping relate to the first principal component you described in Question 2? How would you explain this relationship?

\(~\)

Variable Contributions

Principal components are linear combinations of the original variables, so when the data are standardized, a variable’s loading is a direct reflection of its importance to a component.

The fviz_contrib() function provides a convenient way to visualize variable contributions:

P <- prcomp(USArrests_std)
fviz_contrib(P, choice = "var", axes = 1, top = 4)  ## PC1

fviz_contrib(P, choice = "var", axes = 2, top = 3)  ## PC2

The horizontal reference line shows the expected contribution if every variable contributed uniformly to all of the principal components (ie: \(100\%/p\), or 25% for the four-variable “USArrests” data). Any variable whose contribution is above the reference line can be viewed as “important” in that dimension.

Question #4: Use fviz_contrib() to display the contributions of variables to the first principal component in the PCA of the “colleges” data you perform in Question #2. Do the variables that appear above the reference line correspond with the ones you mentioned in Part B of Question #2? Briefly explain.

\(~\)

Explained Variation

Principal components are constructed so that earlier components explain more variability in the data than the later components. Thus, not all components are equally valuable - but how should we decide which to keep and interpret and which to discard?

Recall that the matrix \(\mathbf{L}\) contains the eigenvalues of the covariance matrix. For standardized data these values sum to \(p\) (the number of variables in the original data), so dividing each eigenvalue by \(p\) gives the proportion of variation explained by the corresponding component.

Note that the prcomp() function reports the square roots of the eigenvalues (as the standard deviations stored in P$sdev):

P <- prcomp(USArrests_std)  ## US Arrests data
sum(P$sdev^2)               ## Recall there were 4 variables
## [1] 4

A popular graphic for visualizing explained variation is the scree plot:

fviz_screeplot(P)

The x-axis displays each principal component, while the y-axis displays the corresponding percentage of variance explained by that component. For the “USArrests” data, the first two principal components account for nearly 90% of variation across different states.

We can find the precise y-values of this graph ourselves by dividing the eigenvalues by \(p\) (the number of columns of the original data):

P$sdev^2/ncol(USArrests_std)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

Question #5: Construct a scree plot for the PCA you performed on the “colleges” data in Question #2. Next, use the PCA results to find the exact proportion of variation explained by the first three components (ie: calculate and sum the y-values displayed on your scree plot for \(PC_1\), \(PC_2\), and \(PC_3\)).

\(~\)

Parallel Analysis

Depending on the goals of an analysis, you might subjectively choose to retain a certain number of principal components by inspecting the scree plot.

Alternatively, some suggest retaining all components that explain at least as much variation as a variable in the original data. This corresponds to an eigenvalue \(\geq 1\) or explained variation \(\geq 1/p\).
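As a quick illustration of this rule-of-thumb using the “USArrests” PCA (recall that prcomp() stores the square roots of the eigenvalues in P$sdev):

P <- prcomp(scale(USArrests))
P$sdev^2              ## eigenvalues
which(P$sdev^2 >= 1)  ## components this rule would retain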

While popular, neither of these methods is particularly rigorous. A more reliable way to determine the number of components to retain is parallel analysis.

Parallel analysis is a statistical approach that compares the eigenvalues of the real data to a distribution of those found in randomly generated data of the same size (think of a randomization test from Sta-209). We can perform parallel analysis using the fa.parallel() function in the psych library:

fa.parallel(USArrests_std, fa = "pc")  ## fa = "pc" indicates PCA

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

For the “USArrests” data, parallel analysis suggests we should only retain the first principal component (which we previously labeled “overall violent crime”).

Question #6: Perform parallel analysis on the “colleges” data that you’ve used in previous several questions. Report the number of components that the method suggests should be retained.

\(~\)

Biplots

A biplot is a commonly used graph that combines the scores of individual observations and the loadings of the original variables on two of the principal component axes.

fviz_pca_biplot(P, axes = c(1,2), repel = TRUE)  ## PC1 and PC2 for USArrests

Notice how the argument repel = TRUE can be used to prevent text labels from overlapping.

\(~\)

Lab Part 2: Partitional Clustering

K-means in R

In R, the kmeans() function is used to perform \(k\)-means clustering. For example, we can use the code below to assign states in the USArrests data into two clusters:

data("USArrests")
k2 <- kmeans(USArrests, centers = 2, nstart = 25)

The centers argument is the number of clusters we want (ie: \(k\)), while nstart is the number of different random starting configurations to try; kmeans() runs the algorithm from each of them and keeps the best solution. There is no guarantee that \(k\)-means finds a globally optimal set of cluster assignments, so using several starts makes the results more stable (for exact reproducibility you would also set a seed with set.seed()). That said, in most circumstances the starting point has little or no impact on the clustering results.

Next, notice that kmeans() outputs a list containing 9 different objects. The code below prints the coordinates of the cluster centroids (multi-dimensional means), which are stored in the centers object within that list:

k2$centers
##      Murder  Assault UrbanPop     Rape
## 1 11.857143 255.0000 67.61905 28.11429
## 2  4.841379 109.7586 64.03448 16.24828

These centroids are often called prototypes since each of them is a central point that can be used to describe the average characteristics of that cluster’s members.

If we are interested in the cluster assignments of individual data-points, we can find them in the cluster object returned by kmeans(). The code below demonstrates how to find the assigned clusters for the states “Iowa” and “Illinois”, which were row names in the USArrests data frame.

k2$cluster[c("Iowa", "Illinois")]
##     Iowa Illinois 
##        2        1

Question #7: Apply \(k\)-means clustering with \(k = 4\) to the USArrests data. Then, find the centroid prototype of the cluster that contains Minnesota. How does the cluster containing Minnesota compare to the other clusters (in terms of labels you derive from the cluster centroids)?

\(~\)

Visualizing K-means

We can visualize the results of \(k\)-means using fviz_cluster(), which displays the clusters on a scatter plot that uses the first two principal component scores as the X-Y coordinates for each observation.

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
fviz_cluster(k2, data = USArrests)

Notice that fviz_cluster() labels each data-point using the rownames of the data object. You should recall that you can change these using the rownames() function. It’s also worth knowing that you can use the argument repel = TRUE (in fviz_cluster()) to prevent overlapping labels.

Finally, instead of using PC1 and PC2, you can specify any of the original variables:

fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE, title = "Without Standardization")

Note: This code broke with the latest R/ggplot2 package updates. If I can find a solution, I’ll fix it this week. Here is a quick ggplot2 recreation, but it does not show the cluster region shapes:

USArrests %>%
  mutate(cluster = factor(k2$cluster)) %>%
  ggplot(aes(x = UrbanPop, y = Murder, color = cluster)) +
  geom_point(size = 3) +
  geom_point(data = as.data.frame(k2$centers),   ## overlay the cluster centroids
             aes(x = UrbanPop, y = Murder),
             color = "black", size = 5, shape = 8) +
  labs(title = "Without Standardization") +
  theme_minimal()

You should note that clusters will generally appear most distinct when viewed using the PC1 and PC2 axes.

Question #8 (Part A): Use the scale() function to standardize the USArrests data, then use fviz_cluster() to graph the \(k\)-means clusters (for \(k = 2\)) found using the standardized data with “UrbanPop” and “Murder” as the \(X\) and \(Y\) variables (similar to the graph in the previous example).

Question #8 (Part B): Explain why the clusters are somewhat different after standardizing the data. Do you think the data should be standardized prior to clustering? Briefly explain.

\(~\)

Choosing \(k\)

As previously mentioned, the \(k\)-means algorithm requires the user to supply the value of \(k\) before the algorithm can be run.

As a general guideline, \(k\) should be large enough that clusters are relatively homogeneous, but it should also be small enough to avoid unnecessary complexity. The next few sections will cover three different methods for choosing a reasonable value of \(k\).

\(~\)

The Elbow Method

The \(k\)-means algorithm solves for centroids that minimize “Total Withinness”, or \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j}\lVert x_i - \mu_j\rVert^2\). This quantity always decreases as \(k\) increases, but there are diminishing returns. Eventually, adding more cluster centroids won’t lead to much improvement in Total Withinness.

With this relationship in mind, we can graph “Total Withinness”, abbreviated “wss” (within-cluster sum of squares), versus different choices of \(k\) using the fviz_nbclust() function:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "wss", k.max = 8)

The Elbow Method suggests the optimal number of clusters corresponds with this plot’s “elbow”, or the point where the curve starts flattening/bending towards more gradual decreases in wss as \(k\) increases.

For the USArrests data (shown in this example), the elbow method suggests either 2 or 3 clusters would be reasonable, as going further doesn’t lead to much of a decrease in wss.

\(~\)

The Silhouette Method

The Silhouette Method measures the fit of observations within their assigned clusters using quantities known as silhouette values, which are a comparison of the similarity of an observation to its assigned cluster (cohesion) relative to its similarity to other clusters (separation).

  • Cohesion is defined as the mean distance of a data-point to the other members of its assigned cluster
  • Separation is defined as the mean distance of a data-point to the members of the next closest cluster.

The difference between these two quantities (the measure of cohesion and the measure of separation) is standardized to produce a silhouette value between -1 and +1, with high values indicating an observation has “good fit” in its assigned cluster, and low values indicating “poor fit” within the assigned cluster.
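In symbols, if \(a_i\) denotes the cohesion measure and \(b_i\) the separation measure for observation \(i\), its silhouette value is:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\]

so \(s_i\) is near +1 when an observation is much closer to its own cluster than to the next closest cluster, near 0 when it sits between two clusters, and negative when it is, on average, closer to another cluster.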

If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.

The average silhouette (across the entire data set) can be used to evaluate the effectiveness of different values of \(k\) in creating clusters that are distinctive and cohesive. We can visualize this relationship by supplying the “silhouette” method to fviz_nbclust().

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "silhouette", k.max = 8)

For the USArrests data, the average silhouette metric suggests that \(k = 2\) is optimal.

Silhouettes are also useful for evaluating the cluster membership of individual observations:

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
##   cluster size ave.sil.width
## 1       1   29          0.60
## 2       2   21          0.58

Here we see there are a handful of states that aren’t great fits in their assigned clusters, so we’d want to avoid over-interpreting their assignments when sharing our results.

If we’d like to take a closer look, we can always filter the silhouette information:

## Filter to show low sil states
as.data.frame(sil) %>% filter(sil_width < 0.3)
##              cluster neighbor  sil_width
## Arkansas           2        1 0.11160017
## Missouri           1        2 0.06586521
## Rhode Island       1        2 0.17099027
## Tennessee          2        1 0.09530834

Note that the output of the silhouette() function is not a data frame, so we must coerce it before filtering.

Question #9: For each part of the following question you should use cds, which is a standardized subset of the college scorecard data created by the code given below:

## Keep this code as pre-processing
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median, 
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name

## Use the data below for this question
cds <- scale(select(cd, -Name))

Part A: Using the Elbow Method, determine an appropriate number of \(k\)-means clusters for these data. Then, report the centroids of each cluster. Note: because the data have been standardized, the units describing these centroids are standard deviations.

Part B: Use the Silhouette Method to determine an appropriate number of \(k\)-means clusters for these data.

Part C: Create a silhouette plot using the value of \(k\) selected in Part B. Which college(s) have negative silhouettes? Briefly explain what these values say about these colleges and their cluster membership.

\(~\)

The Gap Statistic

The gap statistic is a statistical approach that compares the observed wss for a certain choice of \(k\) versus a distribution of wss that would be expected if there were no actual relationships between observations (ie: a null hypothesis of no clustering). A larger gap statistic indicates a greater difference between the real data and what could be explained by random chance.

For additional mathematical details, see this paper.

The code below calculates the gap statistic for \(k=1\) through \(k=5\) for the “USArrests” data:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "gap", k.max = 5)

By default, fviz_nbclust will use two criteria when recommending a value of \(k\) using the gap statistic:

  1. The recommended \(k\) cannot appear after a decrease in the gap statistic (so only \(k \in \{1, \ldots, 4\}\) are eligible in this example).
  2. Among the eligible values, the recommendation is the smallest \(k\) where the gap statistic is within one standard error of its maximum value.

The aim is to provide more parsimonious results (fewer clusters) while still achieving performance that isn’t significantly different from the maximum.
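If you’d like to see the underlying numbers (rather than just the plot from fviz_nbclust()), the gap statistic can also be computed directly with the clusGap() function from the cluster package. A short sketch is shown below; B is the number of simulated reference data sets and is kept small here for speed.

set.seed(123)  ## the reference data sets are randomly generated
gap_stat <- clusGap(USArrests, FUNcluster = kmeans, nstart = 25, K.max = 5, B = 50)
gap_stat                 ## gap statistic and its standard error for each k
fviz_gap_stat(gap_stat)  ## factoextra's plot of a clusGap object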

Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the choice of \(k\) ultimately comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as different tools that can help guide your choice.

Question #10: Use the gap statistic to determine the optimal choice of \(k\) for the cds data set used in Question #9. Does this method agree with the Elbow and Silhouette Methods? Briefly explain.

\(~\)

PAM Clustering

\(k\)-means clustering is a straightforward and widely used method, but it has a few weaknesses:

  1. As a measure of central tendency, the mean is sensitive to outliers, so \(k\)-means centroids tend to be strongly influenced by unusual data-points.
  2. The centroids found by \(k\)-means aren’t actual data-points, so using them to describe the typical members of a cluster isn’t always accurate, especially when the number of variables is large and the data contain outliers.
  3. Distances that rely upon \(x_i - \mu_k\) aren’t defined for categorical variables, so \(k\)-means will only work when all variables are numeric.

Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.

usarrests_std <- scale(USArrests)
us_pam <- pam(usarrests_std, k = 3)  ## Apply PAM clustering
us_pam$medoids  ## Print the medoids
##                   Murder    Assault   UrbanPop       Rape
## New Mexico     0.8292944  1.3708088  0.3081225  1.1603196
## Oklahoma      -0.2727580 -0.2371077  0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(us_pam) ## Plot the clusters

PAM addresses many of the drawbacks of \(k\)-means. It is more robust to outliers, and the cluster centers are themselves data-points, making the results more interpretable. And, we’ll later see PAM can be used to cluster data with categorical variables.

The biggest weakness of PAM is its computational cost: it doesn’t scale well to large datasets because each medoid update requires comparing many candidate medoid swaps. Additionally, PAM tends to be more sensitive to the starting medoids than \(k\)-means is to the starting centroids.

Question #10: Apply PAM clustering to the cds data set used in Question #9. Your analysis should choose \(k\) using one of the three approaches from the previous sections, and you should report the prototypes (cluster medoids) for each cluster.

Lab Part 3: Hierarchical Clustering

Hierarchical Clustering Algorithms

To begin, we’ll introduce two basic philosophies behind different types of hierarchical clustering algorithms:

  • Agglomerative, or “bottom-up” approaches, start with each data-point as its own cluster (leaf) and combine similar clusters (branches) until all of the data-points are grouped together in a single cluster (root).
  • Divisive, or “top-down” approaches, start with all data-points in a single cluster (root), then divide that cluster into smaller clusters (branches). The resulting branches are further divided until each data-point is its own cluster (leaf).

The acronyms AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) describe two of the most popular methods. The code below gives an example of each:

D <- daisy(animals, metric = "gower")

ag <- agnes(D)
fviz_dend(ag, k = 3) + labs(title = "Clustering Animals (AGNES)")

di <- diana(D)
fviz_dend(di, k = 3) + labs(title = "Clustering Animals (DIANA)")

Notice that AGNES and DIANA are greedy algorithms (meaning each cluster merge/split is locally optimal, without any consideration of the overall hierarchy). Consequently, agglomerative and divisive approaches will usually produce different dendrograms.

Specific details of the AGNES and DIANA algorithms are summarized below:

AGNES Algorithm:

At each step, AGNES first identifies the pair of most similar clusters using one of the following methods:

  • method = "average" - the average distance between all data-points in one cluster relative to those in another cluster
  • method = "single" - the smallest distance between a single data-point in one cluster relative to another cluster (nearest neighbor)
  • method = "complete - the smallest instance of the largest distance between a single data-point in one cluster relative to another cluster (furthest neighbor)

The two clusters deemed most similar (ie: smallest metric according to the selected method) are merged and the algorithm moves on to determine the next merge. This is done until all clusters have been merged.
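To see how the choice of method matters in practice, the short sketch below fits AGNES to the animals distance matrix D (from earlier in the lab) using each linkage method and compares their agglomerative coefficients (stored in $ac), a rough summary of how strong the clustering structure found by each linkage is (values closer to 1 indicate stronger structure):

## Compare the three linkage methods on the animals Gower distance matrix
ag_avg <- agnes(D, method = "average")
ag_sin <- agnes(D, method = "single")
ag_com <- agnes(D, method = "complete")
c(average = ag_avg$ac, single = ag_sin$ac, complete = ag_com$ac)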

DIANA Algorithm:

  • DIANA begins by selecting the existing cluster with the largest diameter, defined as the largest dissimilarity between any two of its members.
  • Within the selected cluster, DIANA locates the most disparate observation, defined as the observation with the largest average dissimilarity to the other members of the cluster.
  • The most disparate observation is moved out of the cluster into a “splinter group”, which is the starting point of a new cluster (branch).
  • Members of the original cluster are iteratively reassigned to either the “splinter group” or the “original party” (initial cluster), depending on which group they are closer to on average.

This process is repeated until each data-point is its own cluster.

\(~\)

Question #11: The code below applies the AGNES algorithm with method = "single" to the USArrests data using the matrix of euclidean distances calculated after the data are standardized. For Question #11, use the provided code and its output to answer the following questions:

Part A: How many clusters were present at the initialization (start) of this clustering algorithm?

Part B: What was the first merge/split that occurred? Use the distance matrix, D, to briefly justify why this merge/split happened first. Hint: min(D) will return the distance between the closest two data-points. You can confirm that this distance corresponds to the merge/split you identified by using View(as.matrix(D)).

Part C: If this dendrogram were cut to obtain \(k = 3\) clusters, how many members would belong to the largest of these clusters? Note: You may answer this question by inspecting the given output without writing any of your own code.

data("USArrests")
USArrests_std <- scale(USArrests)
D <- daisy(USArrests_std, metric = "euclidean")

ag1 <- agnes(D, method = "single")
fviz_dend(ag1, cex = 0.25) + labs(title = 'AGNES, Method = "Single"')

\(~\)

Question #12: The code below applies the DIANA algorithm to the distance matrix calculated from the standardized USArrests data. Using the code and output provided below, answer the following questions:

Part A: How many clusters were present at the initialization (start) of this algorithm?

Part B: Briefly describe the merge/split resulting from the first step of the algorithm.

Part C: Which observation do you believe was used to form the “splinter group” in the first step of the algorithm? Justify your answer. Hint: Because each row/column of the distance matrix corresponds to an observation, you might look at row or column means. Before doing so, you might need to coerce D into a matrix using as.matrix().

di1 <- diana(D)
fviz_dend(di1, cex = 0.25) + labs(title = 'DIANA')

\(~\)

Where to cut?

Dendrograms provide a unique perspective on the relationships between data-points, but oftentimes we’d still like to report cluster membership for an optimized choice of \(k\).

Fortunately, all of the methods described in the partitional clustering portion of this lab can also be applied to hierarchical methods. Unfortunately, some of them require that all of our variables be numeric.

The code below demonstrates these three methods using DIANA clustering on the USArrests data described in Question #12.

### The elbow method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "wss")

### The average silhouette method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "silhouette")

## The gap statistic method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "gap", 
             k.max = 6, verbose = FALSE)

The argument FUNcluster = hcut indicates the use of hierarchical clustering that is to be cut into clusters, while the argument hc_func specifies the specific hierarchical clustering algorithm that should be used.

Question #13: The code below creates cds, a standardized subset of the Colleges 2019 data set that was used in the partitional clustering portion of the lab. For this question, use the average silhouette method to find the optimal value of \(k\) for AGNES clustering performed on these data. Then, create a dendrogram displaying the AGNES clustering results with branches colored according to the value of \(k\) you selected. You should use Euclidean distance for this analysis.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median, 
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
cds <- scale(select(cd, -Name))

\(~\)

Lab Part 4: Practice

PCA

In this example you’ll explore data collected from an online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.

The personality test asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers designed 10 statements to target each of the big five traits. For example, questions tagged with “E” were designed to measure extraversion, while questions tagged with “A” were designed to measure agreeableness.

The exact questions used, along with a more detailed description of the data, can be found here: codebook link

big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
b5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics

Question #14 (Part A): Use the prcomp() function to perform PCA on the big five data set. Be sure to standardize the data. Then, use parallel analysis to determine the number of components that should be retained.

Question #14 (Part B) Construct a scree plot to display the PCA results from Part A and report the variation explained by the first five principal components (the big five personality traits).

Question #14 (Part C): Use fviz_contrib() with the argument top = 10 to visualize the top 10 questions that contribute to the first principal component. What label would you give this component?

Question #14 (Part D): Next, use fviz_contrib() on components 2 through 5 (you don’t need to record these results). How would you label these principal components?

Question #14 (Part E) Create a data frame of scores in components 1 through 5 and add the “gender” variable from the original data to this data frame. Then, use group_by() and summarize() to find the average big five personality factor scores of male and female respondents (1 = Male, 2 = Female). Which two personality factors appear to be the most related to gender? Which two personality factors appear to be the least related to gender? Briefly explain.

Partitional Clustering

Question #15: The “mlb” data set contains offensive statistics for all Major League Baseball players with at least 200 at bats during the 2022 season. Aside from the player ID and “AB” (number of at bats), all other columns are reported as rates (per at bat).

mlb <- read.csv("https://remiller1450.github.io/data/mlb_hitters22.csv")

For those unfamiliar with baseball, here are some brief definitions of the remaining variables:

  • R - runs scored per at bat
  • H - hit rate (per at bat)
  • X2B - double rate (per at bat)
  • HR - home run rate (per at bat)
  • RBI - runs batted in per at bat
  • SB - stolen bases (scaled by number of batting attempts)
  • CS - caught stealing rate (per at bat)
  • BB - bases on balls (walks) per at bat
  • SO - strike outs per at bat

Part A: Create a new data frame, mlb_std, where the player IDs are used as row names, the variable “AB” is removed, and all other variables have been standardized by the scale() function.

Part B: These data were standardized prior to you receiving them to reflect the rate of each event per batting attempt. Given this standardization has taken place, briefly explain why the data should be further standardized (into Z-scores by subtracting each variable’s mean and scaling by its standard deviation) prior to clustering.

Part C: Use the Silhouette Method to determine the number of clusters that should be used when applying \(k\)-means clustering to these data. Then, apply the \(k\)-means algorithm and print the cluster centroids.

Part D: Use the cluster centroids from Part C to provide a brief description of the members of Cluster #1 and Cluster #2. That is, how would you characterize the hitters in each of these clusters?

Part E: Suppose a sports columnist wants to write an article titled “7 different types of hitters”. Use PAM clustering to find a prototype player for each type (you may use their Player IDs). Then, report the number of hitters belonging to each of these clusters.

Part F: Use the silhouette() function to find the silhouette values for each hitter using the clusters from Part E. Then, report the player with the lowest value (worst fit) for each cluster. Hint: Use as.data.frame() to coerce the output of silhouette() to a data.frame object, then add the Player IDs. Next, use group_by() and filter() to find the minimum silhouettes in each cluster.

Hierarchical Clustering

The aim of this section is to provide you an opportunity to apply clustering algorithms to new data sets in order to practice identifying scenarios where clustering might be a useful analysis approach.

Question #16: The “European foods” data set summarizes the results of a household survey administered to 16 European nations in 1975. The survey asked households whether they had various foods in their house at the time of the questionnaire. The data below display the percentage of households in each country that had the food in question (expressed as a numeric value between 0 and 100).

euro <- read.csv("https://remiller1450.github.io/data/europeanfoods.csv", row.names = 1, stringsAsFactors = TRUE)

Part A: Create a distance matrix using euclidean distance without standardizing (since all variables are already on a 0-100 scale). Then, visualize this matrix using fviz_dist(). Which country’s household food habits are most similar to Britain’s?

Part B: Use AGNES clustering with average distance as the merging method to create a dendrogram displaying these data.

Part C: Use DIANA clustering to create a dendrogram displaying these data.

Part D: Sweden, Norway, Denmark, and Finland are commonly grouped together as a region known as the Nordic Countries. Which algorithm, AGNES or DIANA, seems to do a better job capturing this grouping? Briefly explain your answer.

Part E: Use the average silhouette method to determine an optimal number of clusters for the application of AGNES clustering in Part B. Then, use this value to color the dendrogram you created in Part B.

Part F: Now use the gap statistic to determine an optimal number of clusters (again using the clustering results from Part B). Does this method agree with the average silhouette approach?

\(~\)

Question #17: The “starwars” data set documents the attributes of major characters in the Star Wars film series. For this question, you will perform a cluster analysis on the subset created below. The variables definitions are largely self-explanatory, but you can run ?starwars in the console for descriptions.

data("starwars")
sw <- starwars %>%
  select(name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species) %>% 
  mutate_if(is.character, as.factor)
sw <- as.data.frame(sw[complete.cases(sw),])
rownames(sw) <- sw$name
sw <- select(sw, -name)

Part A: After inspecting the contents of sw, briefly explain why Gower distance should be used to cluster these data.

Part B: Create a distance matrix and visualize it using fviz_dist().

Part C: Using the AGNES algorithm, create a dendrogram displaying these data.

Part D (optional, not graded): If you are familiar with Star Wars, do any relationships displayed in the dendrogram from Part C stand out as particularly interesting?