Packages and Datasets

# load the following packages
library(cluster)
library(psych)
library(factoextra)
library(ggplot2)
library(dplyr)
library(corrplot)

Partitional Clustering

Introduction

Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):

\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]

  • PCA can be used to identify relationships in the columns of \(\mathbf{X}\) (ie: groupings/patterns among variables, how variables are related).
  • Clustering can be used to identify relationships in the rows of \(\mathbf{X}\) (ie: groupings/patterns among observations, how different cases are related).

Clustering approaches fall into two broad categories:

  • Hierarchical Clustering organizes the data-points into nested subsets (tree-like structures)
  • Partitional Clustering divides the data-points into non-overlapping subsets (regions in space)

This lab will focus on partitional clustering, in particular \(k\)-means and PAM clustering.

\(~\)

Distance

As a motivating example, consider the “USArrests” data from the PCA lab. Shown below are two clusters (red and blue) from a partitional method.

Which cluster should contain Virginia? Why?

Good clusters contain data-points that are similar to each other (cohesive), and also dissimilar from the members of other clusters (distinct).

To assess “similarity” we’ll use distance (sometimes called a metric). There are many possible ways to define distance, one of the most popular is Euclidean distance:

\[d_{a,b} = \sqrt{\sum_{i = 1}^{p} (x_{ai} - x_{bi})^2}\]

In words, the Euclidean distance between two data-points, \(x_a\) and \(x_b\), is the square root of the sum of squared differences in their values across each variable (we index these variables by \(i \in \{1, \ldots, p\}\)).
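To make the formula concrete, here is a quick sketch (using two rows of the USArrests data, which is loaded later in this lab) comparing the by-hand computation to R’s built-in dist() function:

```r
## Euclidean distance between two states, computed by hand
## and checked against R's built-in dist() function
data("USArrests")
a <- USArrests["Iowa", ]
b <- USArrests["Illinois", ]
sqrt(sum((a - b)^2))                      # square root of summed squared differences
dist(USArrests[c("Iowa", "Illinois"), ])  # same value via dist()
```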

\(~\)

K-means

Perhaps the most well-known clustering approach is k-means clustering, which groups \(n\) observations into \(k\) distinct groups (clusters) such that within cluster variation is minimized.

The original \(k\)-means algorithm solves for the \(k\) cluster centroids (\(\mu_1, \ldots, \mu_k\)) that minimize “Total Withinness”, the total within-cluster sum of squares:

\[ \text{Total Withinness} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2 \]

  • \(x_i\) is a data-point belonging to cluster \(j\)
  • \(\mu_j\) is the centroid (multi-dimensional mean) of cluster \(j\)
  • \(W(C_j)\) is the “withinness” of cluster \(j\), measuring the total squared distance of its members to the cluster center

To apply \(k\)-means, the analyst must specify \(k\), the number of clusters, then the \(k\)-means algorithm is used to group the observations such that total squared distance from each data point to the center of its assigned cluster is minimized.

The \(k\)-means Algorithm
  1. Initialize the positions of \(k\) centroids (random starting point)
  2. Calculate the distance between each data-point and each centroid (using Euclidean distance)
  3. Assign data-points to the nearest centroid
  4. Update the cluster centroids (to the mean of its assigned data-points)
  5. Repeat (2) - (4) until there is no change
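For illustration, the steps above can be sketched in a few lines of R. This is a minimal, un-optimized version that assumes no cluster ever becomes empty; in practice you should use the kmeans() function shown later in the lab, which handles such edge cases:

```r
## A minimal sketch of the k-means algorithm (illustration only)
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  ## Step 1: initialize centroids at k randomly chosen data-points
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in 1:max_iter) {
    ## Step 2: Euclidean distance from each data-point to each centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    ## Step 3: assign each data-point to its nearest centroid
    cl <- apply(d, 1, which.min)
    ## Step 4: update each centroid to the mean of its assigned points
    ## (assumes every cluster is non-empty)
    new_centroids <- apply(X, 2, function(col) tapply(col, cl, mean))
    ## Step 5: stop when the centroids no longer change
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  list(cluster = cl, centers = centroids)
}
```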

The gif below illustrates the progression of the \(k\)-means algorithm for \(k = 2\):

\(~\)

Distance and Categorical Variables

Clustering finds groups of similar data-points. Ideally, clusters are cohesive (all members are close to each other) with separation from each other (cluster members are far from the members of other clusters).

Partitional clustering (above) relies upon Euclidean distance: \[d_{euclidean}(a,b) = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]

However, this reliance is very restrictive because Euclidean distance is not defined for categorical variables.

An alternative distance measure is Gower distance: \[d_{Gower}(a,b) = \tfrac{1}{p}\sum_{i = 1}^{p} h_i( x_{a,i}, x_{b,i})\]

The function \(h_i()\) differs depending upon whether \(X_i\) is numeric or categorical

  • For numeric variables, \(h_i(x_{a,i}, x_{b,i}) = \tfrac{|x_{a,i} - x_{b,i}|}{R_i}\), where \(R_i\) is the observed range of \(X_i\)
  • For categorical variables, \(h_i(x_{a,i}, x_{b,i}) = 0\) if \(x_{a,i}\) and \(x_{b,i}\) are the same category, and \(h_i(x_{a,i}, x_{b,i}) = 1\) otherwise

Notice that \(h_i()\) has a range of \([0,1]\) regardless of whether \(X_i\) is categorical or numeric, so Gower distance implicitly standardizes our data (so we don’t need to worry about using the scale() function when clustering with Gower distance).
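To make \(h_i()\) concrete, here is a small sketch on a hypothetical mixed data frame (toy is made up for illustration), checking the by-hand calculation against daisy() from the cluster package:

```r
library(cluster)

## Hypothetical data: one numeric variable, one categorical variable
toy <- data.frame(age = c(25, 40, 60),
                  pet = factor(c("cat", "dog", "cat")))

## By-hand Gower distance between rows 1 and 2
h_age <- abs(25 - 40) / (60 - 25)  # numeric: |difference| / observed range
h_pet <- 1                         # categorical: different categories
(h_age + h_pet) / 2                # average over p = 2 variables, approx 0.714

## Same value from daisy()
as.matrix(daisy(toy, metric = "gower"))[1, 2]
```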

\(~\)

Distance Matrices

A distance matrix, sometimes called a dissimilarity matrix, is an \(n\) by \(n\) matrix containing the pairwise distances between all pairs of data-points.

To better understand Gower distance and distance matrices, let’s look at a simple data set consisting of 6 yes/no questions that describe \(n = 20\) different animal species:

             Warm_blooded? Can_fly? Vertebrate? Endangered? Lives_in_groups? Has_hair?
ant          No            No       No          No          Yes              No
bee          No            Yes      No          No          Yes              Yes
cat          Yes           No       Yes         No          No               Yes
cephalopod   No            No       No          No          No               Yes
chimp        Yes           No       Yes         Yes         Yes              Yes
cow          Yes           No       Yes         No          Yes              Yes
duck         Yes           Yes      Yes         No          Yes              No
eagle        Yes           Yes      Yes         Yes         No               No
elephant     Yes           No       Yes         Yes         Yes              No
fly          No            Yes      No          No          No               No
frog         No            No       Yes         Yes         NA               No
hermit       No            No       Yes         No          Yes              No
lion         Yes           No       Yes         NA          Yes              Yes
lizard       No            No       Yes         No          No               No
lobster      No            No       No          No          NA               No
human        Yes           No       Yes         Yes         Yes              Yes
rabbit       Yes           No       Yes         No          Yes              Yes
salamander   No            No       Yes         No          NA               No
spider       No            No       No          NA          No               Yes
whale        Yes           No       Yes         Yes         Yes              No

\(~\)

We can calculate a matrix of Gower distances using the daisy() function in the cluster package:

animals <- read.csv("https://remiller1450.github.io/data/animals.csv", row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower")  ## Calculate distance matrix
fviz_dist(D)  ## Visualize

We can then visualize the distance matrix using fviz_dist().

  • Why are all entries on the diagonal of this matrix exactly 0?
  • What must be true of lions (“lio”) and flies (“fly”) for their distance to be 1?

\(~\)

Lab Start: Partitional Clustering

K-means in R

In R, the kmeans() function is used to perform \(k\)-means clustering. For example, we can use the code below to assign states in the USArrests data into two clusters:

data("USArrests")
k2 <- kmeans(USArrests, centers = 2, nstart = 25)

The centers argument is the number of clusters we want (ie: \(k\)), while nstart is the number of random starting points to try. There is no guarantee that \(k\)-means finds a globally optimal set of cluster assignments, so the algorithm is run nstart times and the result with the lowest total within-cluster sum of squares is kept. In most circumstances the starting point has little or no impact on the clustering results, but a larger nstart makes the results more stable. For exact reproducibility, you should also call set.seed() before running kmeans().
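As a sketch of a fully reproducible call (set.seed() controls the random starts; tot.withinss is the total within-cluster sum of squares of the best run):

```r
## Reproducible k-means: fix the seed, then run from 25 random starts
set.seed(42)
k2a <- kmeans(USArrests, centers = 2, nstart = 25)
k2a$tot.withinss  # total within-cluster sum of squares of the best run
```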

Next, notice that kmeans() outputs a list containing 9 different objects. The code below prints the coordinates of the cluster centroids (multi-dimensional means), which are stored in the centers object within that list:

k2$centers
##      Murder  Assault UrbanPop     Rape
## 1 11.857143 255.0000 67.61905 28.11429
## 2  4.841379 109.7586 64.03448 16.24828

These centroids are often called prototypes since each of them is a central point that can be used to describe the average characteristics of that cluster’s members.

If we are interested in the cluster assignments of individual data-points, we can find them in the cluster object returned by kmeans(). The code below demonstrates how to find the assigned clusters for the states “Iowa” and “Illinois”, which are row names in the USArrests data frame.

k2$cluster[c("Iowa", "Illinois")]
##     Iowa Illinois 
##        2        1

Question #1: Apply \(k\)-means clustering with \(k = 4\) to the USArrests data. Then, find the centroid prototype of the cluster that contains Minnesota. How does the cluster containing Minnesota compare to the other clusters (in terms of labels you derive from the cluster centroids)?

\(~\)

Visualizing K-means

We can visualize the results of \(k\)-means using fviz_cluster(), which displays the clusters on a scatter plot that uses the first two principal component scores as the X-Y coordinates for each observation.

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
fviz_cluster(k2, data = USArrests)

Notice that fviz_cluster() labels each data-point using the rownames of the data object. You should recall that you can change these using the rownames() function. It’s also worth knowing that you can use the argument repel = TRUE (in fviz_cluster()) to prevent overlapping labels.

Finally, instead of using PC1 and PC2, you can specify any of the original variables:

fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE, title = "Without Standardization")

Note: This code broke with the latest R/ggplot2 package updates. If I can find a solution, I’ll fix it this week. Here is a quick ggplot to recreate it, but it does not show the region shapes:

USArrests %>%
  mutate(cluster = factor(k2$cluster)) %>%
  ggplot(aes(x = UrbanPop, y = Murder, color = cluster)) +
  geom_point(size = 3) +
  geom_point(data = as.data.frame(k2$centers),
             aes(x = UrbanPop, y = Murder),
             color = "black", size = 5, shape = 8) +
  labs(title = "Without Standardization") +
  theme_minimal()

k3 <- kmeans(scale(USArrests), centers = 2, nstart = 25)

fviz_cluster(k3, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = TRUE, repel = TRUE, title = "With Standardization")

USArrests %>%
  mutate(cluster = factor(k3$cluster)) %>%
  ggplot(aes(x = UrbanPop, y = Murder, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "With Standardization") +
  theme_minimal()

Note that the centroids in k3$centers are on the standardized scale, so they are not overlaid on this plot of the raw variables.

You should note that clusters will generally appear most distinct when viewed using the PC1 and PC2 axes.

Question #2 (Part A): Use the scale() function to standardize the USArrests data, then use fviz_cluster() to graph the \(k\)-means clusters (for \(k = 2\)) found using the standardized data with “UrbanPop” and “Murder” as the \(X\) and \(Y\) variables (similar to the graph in the previous example).

Question #2 (Part B): Explain why the clusters are somewhat different after standardizing the data. Do you think the data should be standardized prior to clustering? Briefly explain.

\(~\)

Choosing \(k\)

As previously mentioned, the \(k\)-means algorithm requires the user to supply the value of \(k\) before the algorithm can be run.

As a general guideline, \(k\) should be large enough that clusters are relatively homogeneous, but it should also be small enough to avoid unnecessary complexity. The next few sections will cover three different methods for choosing a reasonable value of \(k\).

\(~\)

The Elbow Method

The \(k\)-means algorithm solves for centroids that minimize “Total Withinness”, or \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j}(x_i - \mu_j)^2\). This quantity always decreases as \(k\) increases, but there are diminishing returns. Eventually, adding more cluster centroids won’t lead to much improvement in Total Withinness.

With this relationship in mind, we can graph “Total Withinness”, abbreviated “wss” (within-cluster sum of squares), versus different choices of \(k\) using the fviz_nbclust() function:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "wss", k.max = 8)

The Elbow Method suggests the optimal number of clusters corresponds with this plot’s “elbow”, or the point where the curve starts flattening/bending towards more gradual decreases in wss as \(k\) increases.

For the USArrests data (shown in this example), the elbow method suggests either 2 or 3 clusters would be reasonable, as going further doesn’t lead to much of a decrease in wss.

\(~\)

The Silhouette Method

The Silhouette Method measures the fit of observations within their assigned clusters using quantities known as silhouette values, which are a comparison of the similarity of an observation to its assigned cluster (cohesion) relative to its similarity to other clusters (separation).

  • Cohesion is defined as the mean distance of a data-point to the other members of its assigned cluster
  • Separation is defined as the mean distance of a data-point to the members of the next closest cluster.

The difference between these two quantities (the measure of cohesion and the measure of separation) is standardized to produce a silhouette value between -1 and +1, with high values indicating an observation has “good fit” in its assigned cluster, and low values indicating “poor fit” within the assigned cluster.

If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.
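As a minimal sketch, the silhouette value for a single observation can be written as a small R function, where a_i and b_i are hypothetical cohesion and separation values:

```r
## Silhouette value for one observation i:
##   a_i = mean distance to the other members of its own cluster (cohesion)
##   b_i = mean distance to the members of the next closest cluster (separation)
##   s_i = (b_i - a_i) / max(a_i, b_i), always in [-1, 1]
silhouette_value <- function(a_i, b_i) (b_i - a_i) / max(a_i, b_i)

silhouette_value(a_i = 1, b_i = 3)  # good fit: 2/3
silhouette_value(a_i = 3, b_i = 1)  # poor fit: -2/3
```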

The average silhouette (across the entire data set) can be used to evaluate the effectiveness of different values of \(k\) in creating clusters that are distinctive and cohesive. We can visualize this relationship by supplying the “silhouette” method to fviz_nbclust().

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "silhouette", k.max = 8)

For the USArrest data, the average silhouette metric suggests that \(k = 2\) is optimal.

Silhouettes are also useful for evaluating the cluster membership of individual observations:

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
##   cluster size ave.sil.width
## 1       1   29          0.60
## 2       2   21          0.58

Here we see there are a handful of states that aren’t great fits in their assigned clusters, so we’d want to avoid over-interpreting their assignments when sharing our results.

If we’d like to take a closer look, we can always filter the silhouette information:

## Filter to show low sil states
as.data.frame(sil) %>% filter(sil_width < 0.3)
##              cluster neighbor  sil_width
## Arkansas           2        1 0.11160017
## Missouri           1        2 0.06586521
## Rhode Island       1        2 0.17099027
## Tennessee          2        1 0.09530834

Note that the output of the silhouette() function is not a data frame, so we must coerce it before filtering.

Question #3: For each part of the following question you should use cds, which is a standardized subset of the college scorecard data created by the code given below:

## Keep this code as pre-processing
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median,
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name

## Use the data below for this question
cds <- scale(select(cd, -Name))

Part A: Using the Elbow Method, determine an appropriate number of \(k\)-means clusters for these data. Then, report the centroids of each cluster. Note: because the data have been standardized, the units describing these centroids are standard deviations.

Part B: Use the Silhouette Method to determine an appropriate number of \(k\)-means clusters for these data.

Part C: Create a silhouette plot using the value of \(k\) selected in Part B. Which college(s) have negative silhouettes? Briefly explain what these values say about these colleges and their cluster membership.

\(~\)

The Gap Statistic

The gap statistic is a statistical approach that compares the observed wss for a certain choice of \(k\) versus a distribution of wss that would be expected if there were no actual relationships between observations (ie: a null hypothesis of no clustering). A larger gap statistic indicates a greater difference between the real data and what could be explained by random chance.

For additional mathematical details, see this paper.

The code below calculates the gap statistic for \(k=1\) through \(k=5\) for the “USArrests” data:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "gap", k.max = 5)

By default, fviz_nbclust will use two criteria when recommending a value of \(k\) using the gap statistic:

  1. The recommended \(k\) cannot appear after a decrease in the gap statistic (so only \(k \in \{1, \ldots, 4\}\) are eligible in this example).
  2. Among the eligible values, the recommendation is the smallest \(k\) where the gap statistic is within one standard error of its maximum value.

The aim is to provide more parsimonious results (fewer clusters) while still achieving performance that isn’t significantly different from the maximum.
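If you want the underlying numbers rather than the plot, the gap statistic can also be computed directly with clusGap() from the cluster package (the choices of B, the number of bootstrap reference sets, and K.max below are illustrative):

```r
library(cluster)

## Gap statistic for k = 1 through 5, using 50 bootstrap reference sets
set.seed(123)
gap <- clusGap(USArrests, FUNcluster = kmeans, nstart = 25, K.max = 5, B = 50)

## Print the gap statistic table; "firstSEmax" applies the
## one-standard-error rule described above
print(gap, method = "firstSEmax")
```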

Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the choice of \(k\) ultimately comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as different tools that can help guide your choice.

Question #4: Use the gap statistic to determine the optimal choice of \(k\) for the cds data set used in Question #3. Does this method agree with the Elbow and Silhouette Methods? Briefly explain.

\(~\)

PAM Clustering

\(k\)-means clustering is a straightforward and widely used method, but it has a few weaknesses:

  1. As a measure of central tendency, the mean is sensitive to outliers, so \(k\)-means centroids tend to be strongly influenced by unusual data-points.
  2. The centroids found by \(k\)-means aren’t actual data-points, so using them to describe the typical members of a cluster isn’t always accurate, especially when the number of variables is large and the data contain outliers.
  3. Distances that rely upon \(x_i - \mu_k\) aren’t defined for categorical variables, so \(k\)-means will only work when all variables are numeric.

Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.

usarrests_std <- scale(USArrests)
us_pam <- pam(usarrests_std, k = 3)  ## Apply PAM clustering
us_pam$medoids  ## Print the medoids
##                   Murder    Assault   UrbanPop       Rape
## New Mexico     0.8292944  1.3708088  0.3081225  1.1603196
## Oklahoma      -0.2727580 -0.2371077  0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(us_pam) ## Plot the clusters

PAM addresses many of the drawbacks of \(k\)-means. It is more robust to outliers, and the cluster centers are themselves data-points, making the results more interpretable. And, we’ll later see PAM can be used to cluster data with categorical variables.

The biggest weakness of PAM is its computational cost. PAM doesn’t scale well to large datasets because the medoid update step must search over the actual data-points rather than simply averaging. Additionally, PAM tends to be more sensitive to the starting medoids than \(k\)-means is to the starting centroids.

Question #5: Apply PAM clustering to the cds data set used in Questions #3 and #4. Your analysis should choose \(k\) using one of the three approaches from the previous section, and you should report the prototypes (cluster medoids) for each cluster.

Question #6 (Baseball Stuff): The “mlb” data set contains offensive statistics for all Major League Baseball players with at least 200 at bats during the 2022 season. Aside from the player ID and “AB” (number of at bats), all other columns are reported as rates (per at bat).

mlb <- read.csv("https://remiller1450.github.io/data/mlb_hitters22.csv")

For those unfamiliar with baseball, here are some brief definitions of the remaining variables:

  • R - runs scored per at bat
  • H - hit rate (per at bat)
  • X2B - double rate (per at bat)
  • HR - home run rate (per at bat)
  • RBI - runs batted in per at bat
  • SB - stolen bases (scaled by number of batting attempts)
  • CS - caught stealing rate (per at bat)
  • BB - bases on balls (walks) per at bat
  • SO - strike outs per at bat

Part A: Create a new data frame, mlb_std, where the player IDs are used as row names, the variable “AB” is removed, and all other variables have been standardized by the scale() function.

Part B: These data were standardized prior to you receiving them to reflect the rate of each event per batting attempt. Given this standardization has taken place, briefly explain why the data should be further standardized (into Z-scores by subtracting each variable’s mean and scaling by its standard deviation) prior to clustering.

Part C: Use the Silhouette Method to determine the number of clusters that should be used when applying \(k\)-means clustering to these data. Then, apply the \(k\)-means algorithm and print the cluster centroids.

Part D: Use the cluster centroids from Part C to provide a brief description of the members of Cluster #1 and Cluster #2. That is, how would you characterize the hitters in each of these clusters?

Part E: Suppose a sports columnist wants to write an article titled “7 different types of hitters”. Use PAM clustering to find a prototype player for each type to use in this article (report their Player IDs). Then, report the number of hitters belonging to each of these clusters.

Part F: Use the silhouette() function to find the silhouette values for each hitter using the clusters from Part E. Then, report the player with the lowest value (worst fit) for each cluster. Hint: Use as.data.frame() to coerce the output of silhouette() to a data.frame object, then add the Player IDs. Next, use group_by() and filter() to find the minimum silhouettes in each cluster.

Lab Part 2: Hierarchical Clustering

An Example

Some clustering algorithms can use the distance matrix, rather than the raw data, to identify clusters. In this part we’ll introduce hierarchical clustering, which can be applied to any distance matrix.

In contrast with prototype-based approaches like \(k\)-means or PAM (partitioning around medoids), hierarchical clustering organizes observations into nested sets. These sets can be displayed using a tree-like structure known as a dendrogram:

ag <- agnes(D)  ## Apply a hierarchical clustering algorithm
fviz_dend(ag) + labs(title = "Clustering Animals")  ## Display the dendrogram

  • The top of the dendrogram is a single cluster containing all observations (ie: \(k = 1\))
  • The bottom of the dendrogram has every observation in its own cluster (ie: \(k = n\))
  • The y-value at which two clusters merge is a reflection of the distance/dissimilarity between the two clusters

Note that we can always obtain partitional clusters by “cutting” a dendrogram:

fviz_dend(ag, k = 4) + labs(title = "Clustering Animals (k = 4)")

In this example, we asked for \(k = 4\) clusters, so the dendrogram was cut at a height of approximately 0.4. Recall that this height indicates that the members of these clusters differ by an average Gower distance of around 0.4, so we can say that they’re relatively different when considering the scale of Gower distance.
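If you want the cluster memberships themselves (not just the colored dendrogram), one option is to cut the tree with cutree() after converting the fit with as.hclust(). A sketch, assuming ag is the AGNES fit on the animals distance matrix from above:

```r
library(cluster)

## Recreate the fit from earlier in the lab
animals <- read.csv("https://remiller1450.github.io/data/animals.csv",
                    row.names = 1, stringsAsFactors = TRUE)
D  <- daisy(animals, metric = "gower")
ag <- agnes(D)

## Cut the dendrogram into k = 4 partitional clusters
memberships <- cutree(as.hclust(ag), k = 4)
table(memberships)   # cluster sizes
memberships["cat"]   # which of the 4 clusters contains the cat
```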

Hierarchical Clustering Algorithms

To begin, we’ll introduce two basic philosophies behind different types of hierarchical clustering algorithms:

  • Agglomerative, or “bottom-up” approaches, that start with each data-point as its own cluster (leaf) and combine similar clusters (branches) until all of the data-points are grouped together in single cluster (root).
  • Divisive, or “top-down” approaches, that start all data-points in a single cluster (root), then divide that cluster into smaller clusters (branches). The resulting branches are further divided until each data-point is its own cluster (leaf).

The acronyms AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) describe two of the most popular methods. The code below gives an example of each:

D <- daisy(animals, metric = "gower")

ag <- agnes(D)
fviz_dend(ag, k = 3) + labs(title = "Clustering Animals (AGNES)")

di <- diana(D)
fviz_dend(di, k = 3) + labs(title = "Clustering Animals (DIANA)")

Notice that AGNES and DIANA are greedy algorithms (meaning each cluster merge/split is locally optimal, without any consideration of the overall hierarchy). Consequently, agglomerative and divisive approaches will usually produce different dendrograms.

Specific details of the AGNES and DIANA algorithms are summarized below:

AGNES Algorithm:

At each step, AGNES first identifies the pair of most similar clusters using one of the following methods:

  • method = "average" - the average distance between all data-points in one cluster relative to those in another cluster
  • method = "single" - the smallest distance between a single data-point in one cluster relative to another cluster (nearest neighbor)
  • method = "complete" - the largest distance between a single data-point in one cluster relative to another cluster (furthest neighbor)

The two clusters deemed most similar (ie: smallest metric according to the selected method) are merged and the algorithm moves on to determine the next merge. This is done until all clusters have been merged.
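A quick way to see how much the linkage method matters is to fit AGNES with each option on the same distance matrix (here D is the animals Gower distance matrix from earlier in the lab):

```r
library(cluster)
library(factoextra)
library(ggplot2)

## Distance matrix from earlier in the lab
animals <- read.csv("https://remiller1450.github.io/data/animals.csv",
                    row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower")

## Fit AGNES under each linkage method and compare the dendrograms
ag_avg  <- agnes(D, method = "average")
ag_sing <- agnes(D, method = "single")
ag_comp <- agnes(D, method = "complete")

fviz_dend(ag_avg)  + labs(title = "Average linkage")
fviz_dend(ag_sing) + labs(title = "Single linkage")
fviz_dend(ag_comp) + labs(title = "Complete linkage")
```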

DIANA Algorithm:

  • DIANA begins by selecting the existing cluster with the largest diameter, which is defined as the average pairwise distance among its members.
  • Within the selected cluster, DIANA locates the most disparate observation, which is defined by having the largest average pairwise distance from the other members of the cluster.
  • The most disparate observation is moved out of the cluster into a “splinter group”, which is the starting point of a new cluster (branch).
  • Members of the original cluster are iteratively assigned to the “splinter group” or “original party” (initial cluster) using the specified distance metric (usually average pairwise distance)

This process is repeated until each data-point is its own cluster.

\(~\)

Question #7: The code below applies the AGNES algorithm with method = "single" to the USArrests data using the matrix of euclidean distances calculated after the data are standardized. For this question, use the provided code and its output to answer the following questions:

Part A: How many clusters were present at the initialization (start) of this clustering algorithm?

Part B: What was the first merge/split that occurred? Use the distance matrix, D, to briefly justify why this merge/split happened first. Hint: min(D) will return the distance between the closest two data-points. You can confirm that this distance corresponds to the merge/split you identified by using View(as.matrix(D)).

Part C: If this dendrogram were cut to obtain \(k = 3\) clusters, how many members would belong to the largest of these clusters? Note: You may answer this question by inspecting the given output without writing any of your own code.

data("USArrests")
USArrests_std <- scale(USArrests)
D <- daisy(USArrests_std, metric = "euclidean")

ag1 <- agnes(D, method = "single")
fviz_dend(ag1, cex = 0.25) + labs(title = 'AGNES, Method = "Single"')

\(~\)

Question 8: The code below applies the DIANA algorithm to the distance matrix calculated from the standardized USArrests data. Using the code and output provided below, answer the following questions:

Part A: How many clusters were present at the initialization (start) of this algorithm?

Part B: Briefly describe the merge/split resulting from the first step of the algorithm.

Part C: Which observation do you believe was used to form the “splinter group” in the first step of the algorithm? Justify your answer. Hint: Because each row/column of the distance matrix corresponds to an observation, you might look at row or column means. Before doing so, you might need to coerce D into a matrix using as.matrix().

di1 <- diana(D)
fviz_dend(di1, cex = 0.25) + labs(title = 'DIANA')

\(~\)

Where to cut?

Dendrograms provide a unique perspective on the relationships between data-points, but oftentimes we’d still like to report cluster membership for an optimized choice of \(k\).

Fortunately, all of the methods described in our previous clustering lab can also be applied to hierarchical methods. Unfortunately, some of them require that all of our variables be numeric.

The code below demonstrates these three methods using DIANA clustering on the US Arrests data described in Question 8.

### The elbow method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "wss")

### The average silhouette method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "silhouette")

## The gap statistic method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "gap",
             k.max = 6, verbose = FALSE)

The argument FUNcluster = hcut indicates the use of hierarchical clustering that is to be cut into clusters, while the argument hc_func specifies the specific hierarchical clustering algorithm that should be used.

Question 9: The code below creates cds, a standardized subset of the Colleges 2019 data set that was used in the partitional clustering portion of the lab. For this question, use the average silhouette method to find the optimal value of \(k\) for AGNES clustering performed on these data. Then, create a dendrogram displaying the AGNES clustering results with branches colored according to the value of \(k\) you selected. You should use euclidean distance for this analysis.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median,
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
cds <- scale(select(cd, -Name))

Hierarchical Clustering Practice (Required)

The aim of this section is to provide you an opportunity to apply clustering algorithms to new data sets in order to practice identifying scenarios where clustering might be a useful analysis approach.

Question 10: The “European foods” data set summarizes the results of a household survey administered to 16 European nations in 1975. The survey asked households whether they had various foods in their house at the time of the questionnaire. The data below display the percentage of households in each country that had the food in question (expressed as a numeric value between 0 and 100).

euro <- read.csv("https://remiller1450.github.io/data/europeanfoods.csv", row.names = 1, stringsAsFactors = TRUE)

Part A: Create a distance matrix using euclidean distance without standardizing (since all variables are already on a 0-100 scale). Then, visualize this matrix using fviz_dist(). Which country’s household food habits are most similar to Britain’s?

Part B: Use AGNES clustering with average distance as the merging method to create a dendrogram displaying these data.

Part C: Use DIANA clustering to create a dendrogram displaying these data.

Part D: Sweden, Norway, Denmark, and Finland are commonly grouped together as a region known as the Nordic Countries. Which algorithm, AGNES or DIANA, seems to do a better job capturing this grouping? Briefly explain your answer.

Part E: Use the average silhouette method to determine an optimal number of clusters for the application of AGNES clustering in Part B. Then, use this value to color the dendrogram you created in Part B.

Part F: Now use the gap statistic to determine an optimal number of clusters (again using the clustering results from Part B). Does this method agree with the average silhouette approach?

\(~\)

Question 11: The “starwars” data set documents the attributes of major characters in the Star Wars film series. For this question, you will perform a cluster analysis on the subset created below. The variables definitions are largely self-explanatory, but you can run ?starwars in the console for descriptions.

data("starwars")
sw <- starwars %>%
  select(name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species) %>%
  mutate_if(is.character, as.factor)
sw <- as.data.frame(sw[complete.cases(sw),])
rownames(sw) <- sw$name
sw <- select(sw, -name)

Part A: After inspecting the contents of sw, briefly explain why Gower distance should be used to cluster these data.

Part B: Create a distance matrix and visualize it using fviz_dist().

Part C: Using the AGNES algorithm, create a dendrogram displaying these data.

Part D (optional, not graded): If you are familiar with Star Wars, do any relationships displayed in the dendrogram from Part C stand out as particularly interesting?