Please document your answers to all homework questions using R Markdown, submitting your compiled output on Gradescope.
\(~\)
Unsupervised learning approaches, such as PCA, are frequently used in the analysis of genomic data, which often contain thousands of genetic variables.
The dataset NCI60 in the ISLR package
contains 6830 gene expression measurements (variables) for 64 cancer
cell lines (observations). Each cell line has a known cancer type;
however, the goal of this analysis is to explore the extent to which gene
expression data can be used to characterize and identify different types
of cancer.
library(ISLR)
data("NCI60")
nciData <- NCI60$data
labels <- NCI60$labs
Part A: Perform PCA on the gene expression data (be
sure to standardize), storing your results in an object named
nci_pca. Report the amount of variation explained by the
first three principal components.
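One possible starting point is sketched below (a suggestion, not the only valid approach); `scale. = TRUE` handles the standardization, and the second row of the `importance` matrix gives the proportion of variance explained:

```r
## PCA with standardization (center and scale each gene)
nci_pca <- prcomp(nciData, center = TRUE, scale. = TRUE)

## Proportion of variance explained by the first three PCs
summary(nci_pca)$importance[2, 1:3]
```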
Part B: Construct a data frame containing scores for
the first three principal components and add the “labels” vector as an
additional column. Next, filter this data frame to include only the
“PROSTATE”, “OVARIAN”, “COLON”, and “MELANOMA” cancer types. Then, using
the filtered data, create a 3-D scatter plot (plotly) that
displays each sample’s scores in \(PC_1\), \(PC_2\), and \(PC_3\) and is colored by cancer type.
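A sketch of one way to set this up (assuming `nci_pca` was created in Part A and `labels` was loaded earlier):

```r
library(plotly)

## Data frame of scores for the first three PCs plus the cancer-type labels
score_df <- data.frame(nci_pca$x[, 1:3], type = labels)

## Keep only the four requested cancer types
keep <- c("PROSTATE", "OVARIAN", "COLON", "MELANOMA")
score_sub <- score_df[score_df$type %in% keep, ]

## 3-D scatter plot of PC scores, colored by cancer type
plot_ly(score_sub, x = ~PC1, y = ~PC2, z = ~PC3, color = ~type,
        type = "scatter3d", mode = "markers")
```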
\(~\)
The “wines” data set records the results of a chemical analysis on wine samples produced by three different cultivators in the same region of Italy.
wines <- read.csv("https://remiller1450.github.io/data/wines.csv")
The goal of this application is to identify and assess differences across wines and cultivators.
Part A: Use functions in the corrplot
package to visualize a correlation matrix of these data. Then, briefly
justify principal component analysis as a reasonable approach to
analyzing these data.
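A minimal sketch, assuming the "Origin" variable is the only non-numeric column:

```r
library(corrplot)

## Correlation matrix of the quantitative variables (dropping "Origin")
wine_cor <- cor(wines[, names(wines) != "Origin"])
corrplot(wine_cor, method = "circle", type = "upper")
```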
Part B: Perform PCA on the wines data (be sure to
standardize and remove the "Origin" variable), storing your results in an
object named wines_pca. Then, use parallel analysis to
determine how many components should be retained.
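One way to sketch this, using the `fa.parallel()` function from the psych package for the parallel analysis (an assumption; other parallel-analysis implementations would also work):

```r
library(psych)

## PCA on the standardized data, excluding the grouping variable
wines_pca <- prcomp(wines[, names(wines) != "Origin"],
                    center = TRUE, scale. = TRUE)

## Parallel analysis: compare observed eigenvalues to those from random data
fa.parallel(wines[, names(wines) != "Origin"], fa = "pc")
```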
Part C: Calculate and report the amount of variation explained by each of the components you chose to retain in Part B.
Part D: Create a graph that displays the top contributors to \(PC_1\). Using this graph, determine a label for describing this component (Hint: phenols are a group of phytochemicals that account for antioxidant activity, flavonoids are the largest group of phenolic compounds, and proanthocyanins are polyphenols).
Part E: Create a graph that displays the top contributors to \(PC_2\). Briefly describe how this component appears to capture wine characteristics that are different from those that most strongly contribute to \(PC_1\).
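For Parts D and E, one option is the `fviz_contrib()` function from the factoextra package (a sketch; assumes `wines_pca` was created in Part B):

```r
library(factoextra)

## Bar charts of variable contributions to PC1 and PC2
fviz_contrib(wines_pca, choice = "var", axes = 1)
fviz_contrib(wines_pca, choice = "var", axes = 2)
```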
Part F: Create a biplot displaying scores and
variable loadings for the first two principal components. Use the
col.ind argument to color the individual observations
according to the 3 different values of the “Origin” variable in the
original data set. You might also use the argument
label = FALSE to suppress labeling. Then, using the biplot,
briefly interpret the differences you see across these three
cultivators.
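A possible sketch using factoextra (assuming `wines_pca` from Part B; `label = "none"` is one way to suppress labeling in this package):

```r
library(factoextra)

## Biplot of scores and loadings for the first two PCs,
## with individuals colored by cultivator
fviz_pca_biplot(wines_pca,
                col.ind = factor(wines$Origin),
                label = "none")
```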
Part G: Use group_by and
summarize to find the average scores in \(PC_1\), \(PC_2\), and \(PC_3\) for each of the three cultivators.
Briefly comment on how this information relates to the biplot you
created in Part F.
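One way to organize this calculation (a sketch; assumes `wines_pca` from Part B):

```r
library(dplyr)

## Attach the first three PC scores to the cultivator labels
wine_scores <- data.frame(Origin = wines$Origin, wines_pca$x[, 1:3])

## Average score in each component, by cultivator
wine_scores %>%
  group_by(Origin) %>%
  summarize(across(PC1:PC3, mean))
```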
\(~\)
\(~\)
The code given below loads the responses to an online questionnaire aimed at measuring "dark triad" personality traits, which are Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy).
For more information you can visit this link: http://openpsychometrics.org/tests/SD3/
dark = read.delim("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
Part A: After removing variables that are not survey items and standardizing, perform PCA and create a scree plot showing the variance explained in each principal component. Based upon this plot, does it seem like 3 underlying factors capture most of the variation in the survey items? Briefly explain.
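A rough sketch of Part A; note that the column-name pattern used below (items named M1-M9, N1-N9, and P1-P9, following the SD3 trait abbreviations) is an assumption that should be checked against `names(dark)`:

```r
## Assuming the survey items are the columns named M1-M9, N1-N9, P1-P9
items <- dark[, grepl("^[MNP][0-9]", names(dark))]

## PCA on the standardized items
dark_pca <- prcomp(items, center = TRUE, scale. = TRUE)

## Scree plot of variance in each principal component
screeplot(dark_pca, type = "lines")
```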
Part B: Using the principal component loadings, relate each of the “dark triad” traits to the most appropriate principal component. Your submitted answer should summarize your label assignments and how you came up with them.
Part C: The graphic below displays the distribution of scores in the first three principal components for respondents from 3 different countries of origin, Germany (DE), India (IN), and the Philippines (PH). Recreate this graphic, replacing the names "PC1", "PC2", and "PC3" with the names of the dark triad traits you determined in Part B. It's acceptable if your scores are mirrored across the y-axis (vertical line at zero) relative to the target visualization shown below.
Part D: Using the principal component loadings and the survey items/scale information shown below to inform your answer, which of the 3 countries of origin depicted in the above visualization tends to exhibit the most narcissism (excessive self-love) among its respondents? Explain your answer.
The Appendix of Jones and Paulhus (2014) (https://journals.sagepub.com/doi/full/10.1177/1073191113514105) displays actual question wording used.
The goal in this question is to apply clustering methods to find possible groupings/patterns among individuals killed during police interactions in 2015. The data for this application come from the FiveThirtyEight article “Where Have Police Killed Americans in 2015”.
For this question, you should use the processed data stored in the
dataframe pk, which is created below:
## Source data from FiveThirtyEight
police <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv", stringsAsFactors = TRUE)
## Select important variables and coerce misread variable types
## (select() and mutate() require the dplyr package)
library(dplyr)
pk <- select(police, age, gender, raceethnicity, state, cause, armed, share_white, share_black, share_hispanic, h_income, pov, urate, college) %>%
mutate(age = as.numeric(as.character(age)), pov = as.numeric(as.character(pov)),
share_white = as.numeric(as.character(share_white)),
share_black = as.numeric(as.character(share_black)),
share_hispanic = as.numeric(as.character(share_hispanic)))
## Add individual name, city as rownames
rownames(pk) <- paste0(police$name, " (", police$city, ")")
## Remove cases with missing data in any of the kept variables
pk <- pk[complete.cases(pk),]
Part A: Based upon an inspection of the data frame,
pk, should Euclidean distance or Gower distance be used if
the goal is to display a distance matrix summarizing the
similarities/differences of these data points? Briefly explain.
Part B: Find the distance matrix using the metric
you specified in Part A and store it in an object named D,
then apply PAM clustering with \(k =
4\) and print the medoids.
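A sketch of Part B using the cluster package (assuming Gower distance is the appropriate choice, given the mix of numeric and categorical variables in `pk`):

```r
library(cluster)

## Gower distance accommodates mixed numeric/categorical variables
D <- daisy(pk, metric = "gower")

## PAM clustering with k = 4; id.med holds the row indices of the medoids
pk_pam <- pam(D, k = 4)
pk[pk_pam$id.med, ]
```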
Part C: The fviz_nbclust() function is
not compatible with clustering done using Gower distance, so the choice
of \(k\) must be optimized manually.
The code below uses a for loop to display average
silhouette widths for several possible choices of \(k\). Based upon these results, which choice
of \(k\) would you recommend? Briefly
explain.
k_seq <- 2:10
for(i in 1:length(k_seq)){
pam_res <- pam(D, k = k_seq[i])
print(paste("k=", k_seq[i], "Avg sil", round(pam_res$silinfo$avg.width, 3)))
}
## [1] "k= 2 Avg sil 0.189"
## [1] "k= 3 Avg sil 0.115"
## [1] "k= 4 Avg sil 0.138"
## [1] "k= 5 Avg sil 0.133"
## [1] "k= 6 Avg sil 0.135"
## [1] "k= 7 Avg sil 0.124"
## [1] "k= 8 Avg sil 0.105"
## [1] "k= 9 Avg sil 0.103"
## [1] "k= 10 Avg sil 0.095"
Part D: Apply PAM clustering using the choice of \(k\) you identified in Part C. Then, using the medoids of each cluster, come up with a brief description of the different clusters found by the algorithm. Your description may focus on three or four variables that appear most different.
Part E: Apply DIANA clustering to these data,
storing the results as pk_diana, then use the command
cutree(pk_diana, k = 2) to find the cluster assignment of each
observation. Add these clusters as an additional variable to
pk.
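One way this might look (a sketch; assumes the Gower distance matrix `D` from Part B):

```r
library(cluster)

## Divisive hierarchical clustering on the distance matrix
pk_diana <- diana(D)

## Cut the tree into two clusters and attach the assignments to pk
pk$cluster <- cutree(pk_diana, k = 2)
```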
Part F: Use the table() function to
create a table displaying cluster assignments as rows and frequencies of
each category of the variable “raceethnicity” as columns. Do these
distributions appear to align with the characteristics of the clusters
reported in Part D?
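Assuming the cluster assignments were stored in a column named `cluster` in Part E, the cross-tabulation might look like:

```r
## Rows: DIANA cluster assignments; columns: race/ethnicity categories
table(pk$cluster, pk$raceethnicity)
```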