Please document your answers to all homework questions using R Markdown, submitting your compiled output on Gradescope.
\(~\)
Unsupervised learning approaches, such as PCA, are frequently used in the analysis of genomic data, which often contain thousands of genetic variables.
The dataset NCI60 in the ISLR package
contains 6830 gene expression measurements (variables) for 64 cancer
cell lines (observations). Each cell line has a known cancer type;
however, the goal of this analysis is to explore the extent to which gene
expression data can be used to characterize and identify different types
of cancer.
library(ISLR)
data("NCI60")
nciData <- NCI60$data
labels <- NCI60$labs
Part A: Perform PCA on the gene expression data (be
sure to standardize), storing your results in an object named
nci_pca. Report the amount of variation explained by the
first three principal components.
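One possible starting point is sketched below (a suggestion, not the only valid approach); `scale. = TRUE` handles the standardization, and the second row of the `importance` matrix gives the proportion of variance explained:

```r
## PCA with standardization (center and scale each gene)
nci_pca <- prcomp(nciData, center = TRUE, scale. = TRUE)

## Proportion of variance explained by the first three PCs
summary(nci_pca)$importance[2, 1:3]
```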
Part B: Construct a data frame containing scores for
the first three principal components and add the “labels” vector as an
additional column. Next, filter this data frame to include only the
“PROSTATE”, “OVARIAN”, “COLON”, and “MELANOMA” cancer types. Then, using
the filtered data, create a 3-D scatter plot (plotly) that
displays each sample’s scores in \(PC_1\), \(PC_2\), and \(PC_3\) and is colored by cancer type.
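A sketch of one way to set this up (assuming `nci_pca` was created in Part A and `labels` was loaded earlier):

```r
library(plotly)

## Data frame of scores for the first three PCs plus the cancer-type labels
score_df <- data.frame(nci_pca$x[, 1:3], type = labels)

## Keep only the four requested cancer types
keep <- c("PROSTATE", "OVARIAN", "COLON", "MELANOMA")
score_sub <- score_df[score_df$type %in% keep, ]

## 3-D scatter plot of PC scores, colored by cancer type
plot_ly(score_sub, x = ~PC1, y = ~PC2, z = ~PC3, color = ~type,
        type = "scatter3d", mode = "markers")
```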
\(~\)
The “wines” data set records the results of a chemical analysis on wine samples produced by three different cultivators in the same region of Italy.
wines <- read.csv("https://remiller1450.github.io/data/wines.csv")
The goal of this application is to identify and assess differences across wines and cultivators.
Part A: Use functions in the corrplot
package to visualize a correlation matrix of these data. Then, briefly
justify principal component analysis as a reasonable approach to
analyzing these data.
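A minimal sketch, assuming the "Origin" variable is the only non-numeric column:

```r
library(corrplot)

## Correlation matrix of the quantitative variables (dropping "Origin")
wine_cor <- cor(wines[, names(wines) != "Origin"])
corrplot(wine_cor, method = "circle", type = "upper")
```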
Part B: Perform PCA on the wines data (be sure to
standardize and remove the "Origin" variable), storing your results in an
object named wines_pca. Then, use parallel analysis to
determine how many components should be retained.
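One way to sketch this, using the `fa.parallel()` function from the psych package for the parallel analysis (an assumption; other parallel-analysis implementations would also work):

```r
library(psych)

## PCA on the standardized data, excluding the grouping variable
wines_pca <- prcomp(wines[, names(wines) != "Origin"],
                    center = TRUE, scale. = TRUE)

## Parallel analysis: compare observed eigenvalues to those from random data
fa.parallel(wines[, names(wines) != "Origin"], fa = "pc")
```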
Part C: Calculate and report the amount of variation explained by each of the components you chose to retain in Part B.
Part D: Create a graph that displays the top contributors to \(PC_1\). Using this graph, determine a label for describing this component (Hint: phenols are a group of phytochemicals that account for antioxidant activity, flavonoids are the largest group of phenolic compounds, and proanthocyanins are polyphenols).
Part E: Create a graph that displays the top contributors to \(PC_2\). Briefly describe how this component appears to capture wine characteristics that are different from those that most strongly contribute to \(PC_1\).
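For Parts D and E, one option is the `fviz_contrib()` function from the factoextra package (a sketch; assumes `wines_pca` was created in Part B):

```r
library(factoextra)

## Bar charts of variable contributions to PC1 and PC2
fviz_contrib(wines_pca, choice = "var", axes = 1)
fviz_contrib(wines_pca, choice = "var", axes = 2)
```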
Part F: Create a biplot displaying scores and
variable loadings for the first two principal components. Use the
col.ind argument to color the individual observations
according to the 3 different values of the “Origin” variable in the
original data set. You might also use the argument
label = FALSE to suppress labeling. Then, using the biplot,
briefly interpret the differences you see across these three
cultivators.
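A possible sketch using factoextra (assuming `wines_pca` from Part B; `label = "none"` is one way to suppress labeling in this package):

```r
library(factoextra)

## Biplot of scores and loadings for the first two PCs,
## with individuals colored by cultivator
fviz_pca_biplot(wines_pca,
                col.ind = factor(wines$Origin),
                label = "none")
```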
Part G: Use group_by and
summarize to find the average scores in \(PC_1\), \(PC_2\), and \(PC_3\) for each of the three cultivators.
Briefly comment on how this information relates to the biplot you
created in Part F.
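One way to organize this calculation (a sketch; assumes `wines_pca` from Part B):

```r
library(dplyr)

## Attach the first three PC scores to the cultivator labels
wine_scores <- data.frame(Origin = wines$Origin, wines_pca$x[, 1:3])

## Average score in each component, by cultivator
wine_scores %>%
  group_by(Origin) %>%
  summarize(across(PC1:PC3, mean))
```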
\(~\)
\(~\)
The code given below loads the responses to an online questionnaire aimed at measuring "dark triad" personality traits, which are Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy).
For more information you can visit this link: http://openpsychometrics.org/tests/SD3/
dark = read.delim("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
Part A: After removing variables that are not survey items and standardizing, perform PCA and create a scree plot showing the variance explained in each principal component. Based upon this plot, does it seem like 3 underlying factors capture most of the variation in the survey items? Briefly explain.
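A rough sketch of Part A; note that the column-name pattern used below (items named M1-M9, N1-N9, and P1-P9, following the SD3 trait abbreviations) is an assumption that should be checked against `names(dark)`:

```r
## Assuming the survey items are the columns named M1-M9, N1-N9, P1-P9
items <- dark[, grepl("^[MNP][0-9]", names(dark))]

## PCA on the standardized items
dark_pca <- prcomp(items, center = TRUE, scale. = TRUE)

## Scree plot of variance in each principal component
screeplot(dark_pca, type = "lines")
```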
Part B: Using the principal component loadings, relate each of the “dark triad” traits to the most appropriate principal component. Your submitted answer should summarize your label assignments and how you came up with them.
Part C: The graphic below displays the distribution of scores in the first three principal components for respondents from 3 different countries of origin, Germany (DE), India (IN), and the Philippines (PH). Recreate this graphic, replacing the names "PC1", "PC2", and "PC3" with the names of the dark triad traits you determined in Part B. It's acceptable if your scores are mirrored across the y-axis (vertical line at zero) relative to the target visualization shown below.
Part D: Using the principal component loadings and the survey items/scale information shown below to inform your answer, which of the 3 countries of origin depicted in the above visualization tends to exhibit the most narcissism (excessive self-love) among its respondents? Explain your answer.
The Appendix of Jones and Paulhus (2014) (https://journals.sagepub.com/doi/full/10.1177/1073191113514105) displays actual question wording used.
The goal in this question is to apply clustering methods to find possible groupings/patterns among individuals killed during police interactions in 2015. The data for this application come from the FiveThirtyEight article “Where Have Police Killed Americans in 2015”.
For this question, you should use the processed data stored in the
dataframe pk, which is created below:
## Source data from FiveThirtyEight
police <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv", stringsAsFactors = TRUE)
## Select important variables and coerce misread variable types
## (select() and mutate() require the dplyr package)
library(dplyr)
pk <- select(police, age, gender, raceethnicity, state, cause, armed, share_white, share_black, share_hispanic, h_income, pov, urate, college) %>%
mutate(age = as.numeric(as.character(age)), pov = as.numeric(as.character(pov)),
share_white = as.numeric(as.character(share_white)),
share_black = as.numeric(as.character(share_black)),
share_hispanic = as.numeric(as.character(share_hispanic)))
## Add individual name, city as rownames
rownames(pk) <- paste0(police$name, " (", police$city, ")")
## Remove cases with missing data in any of the kept variables
pk <- pk[complete.cases(pk),]
Part A: Based upon an inspection of the data frame,
pk, should Euclidean distance or Gower distance be used if
the goal is to display a distance matrix summarizing the
similarities/differences of these data points? Briefly explain.
Part B: Find the distance matrix using the metric
you specified in Part A and store it in an object named D,
then apply PAM clustering with \(k =
4\) and print the medoids.
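A sketch of Part B using the cluster package (assuming Gower distance is the appropriate choice, given the mix of numeric and categorical variables in `pk`):

```r
library(cluster)

## Gower distance accommodates mixed numeric/categorical variables
D <- daisy(pk, metric = "gower")

## PAM clustering with k = 4; id.med holds the row indices of the medoids
pk_pam <- pam(D, k = 4)
pk[pk_pam$id.med, ]
```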
Part C: The fviz_nbclust() function is
not compatible with clustering done using Gower distance, so the choice
of \(k\) must be optimized manually.
The code below uses a for loop to display average
silhouette widths for several possible choices of \(k\). Based upon these results, which choice
of \(k\) would you recommend? Briefly
explain.
k_seq <- 2:10
for(i in 1:length(k_seq)){
pam_res <- pam(D, k = k_seq[i])
print(paste("k=", k_seq[i], "Avg sil", round(pam_res$silinfo$avg.width, 3)))
}
## [1] "k= 2 Avg sil 0.189"
## [1] "k= 3 Avg sil 0.115"
## [1] "k= 4 Avg sil 0.138"
## [1] "k= 5 Avg sil 0.133"
## [1] "k= 6 Avg sil 0.135"
## [1] "k= 7 Avg sil 0.124"
## [1] "k= 8 Avg sil 0.105"
## [1] "k= 9 Avg sil 0.103"
## [1] "k= 10 Avg sil 0.095"
Part D: Apply PAM clustering using the choice of \(k\) you identified in Part C. Then, using the medoids of each cluster, come up with a brief description of the different clusters found by the algorithm. Your description may focus on three or four variables that appear most different.
Part E: Apply DIANA clustering to these data,
storing the results as pk_diana, then use the command
cutree(pk_diana, k = 2) to find the cluster assignment of each
observation. Add these clusters as an additional variable to
pk.
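One way this might look (a sketch; assumes the Gower distance matrix `D` from Part B):

```r
library(cluster)

## Divisive hierarchical clustering on the distance matrix
pk_diana <- diana(D)

## Cut the tree into two clusters and attach the assignments to pk
pk$cluster <- cutree(pk_diana, k = 2)
```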
Part F: Use the table() function to
create a table displaying cluster assignments as rows and frequencies of
each category of the variable “raceethnicity” as columns. Do these
distributions appear to align with the characteristics of the clusters
reported in Part D?
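Assuming the cluster assignments were stored in a column named `cluster` in Part E, the cross-tabulation might look like:

```r
## Rows: DIANA cluster assignments; columns: race/ethnicity categories
table(pk$cluster, pk$raceethnicity)
```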