Load the following packages, installing if necessary:
library(ggplot2)
library(dplyr)
library(gridExtra)
The distribution of a variable is a description of how frequently each value of that variable shows up. We saw that the way we describe a distribution depends on whether the variable is categorical or quantitative.
Question 1: Below is a bar chart representing the hair color of students in a statistics class (color may be exaggerated in the chart). Describe the distribution of the haircolor variable.
(In the graph code chunks I have put the term ‘echo=FALSE’ in the brackets. This keeps the code from being shown in the pdf, which saves a little bit of space.) We will talk more about how to make these graphs next week.
Answer: The most common hair color is brown, followed closely by blond, then by black. Red is the least common. It is helpful to give approximate values for each of these.
For this set of questions we are going to use the College data set presented in the last few sets of slides. Read in the data using the following code.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Question 2: How many observations and variables are there in the dataset? Explain how you found this answer and show any code (if you used any).
Answer: There are 1095 observations and 23 variables. This info is available in the Environment panel after loading the data.
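The same counts can also be checked in code rather than read off the Environment panel. The sketch below is one possible way to do it; `data_size` is a hypothetical helper name that is not part of the lab, and the block assumes `colleges` has already been created by the `read.csv()` call above.

```r
# Hypothetical helper (not from the lab): rows are observations, columns are variables
data_size <- function(df) {
  c(observations = nrow(df), variables = ncol(df))
}

# Only runs if `colleges` was already read in with read.csv() as shown earlier
if (exists("colleges")) {
  print(data_size(colleges))
}
```

Calling `dim(colleges)` directly gives the same two numbers without the labels.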
Question 3: Look at the conditional bar chart below. Is there an association between the region and the type of college (public vs private) in our sample? Justify your answer using 1 or 2 sentences.
Answer: Yes. The regions do not all have the same proportion of public vs. private colleges, so knowing the region changes the proportion of each type of school you should expect.
We will be using the Titanic dataset built into R, providing information on the fate of the passengers on the fatal maiden voyage of the ocean liner Titanic summarized according to economic status (class), sex, age, and survival. See ?Titanic for more details.
## head() shows us the first 6 rows of a data.frame
head(titanic)
## Class Sex Age Survived
## 3 3rd Male Child No
## 3.1 3rd Male Child No
## 3.2 3rd Male Child No
## 3.3 3rd Male Child No
## 3.4 3rd Male Child No
## 3.5 3rd Male Child No
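The built-in `Titanic` object is a table of counts, not a data frame with one row per passenger, and the code that created `titanic` is not shown here. The sketch below is one way such a data frame could be built. It is an assumption about how the data were prepared, not necessarily the method actually used, and `counts` is a name introduced only for illustration.

```r
# Assumption: `titanic` is a one-row-per-passenger version of the built-in Titanic table
counts <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Repeat each row Freq times, then keep only the four descriptive columns
titanic <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                  c("Class", "Sex", "Age", "Survived")]

head(titanic)
nrow(titanic)  # one row per person in the table
```

Combinations with a count of zero are dropped by `rep()`, which would explain why the first row name in the output above is 3 rather than 1.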
Question 4: Use the following bar charts to answer the questions.
Part A: Did more people survive or not survive the Titanic accident? Which bar chart is more helpful to answer this?
Answer: More people did not survive. The dodged bar chart is more helpful to see this.
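The comparison can also be checked against the built-in `Titanic` table instead of being read off the chart. This is a minimal sketch; `survival_totals` is a name introduced here for illustration.

```r
# Sum the built-in Titanic table over Class, Sex, and Age,
# keeping only the 4th dimension (Survived)
survival_totals <- margin.table(Titanic, margin = 4)
survival_totals
```

The "No" total is the larger of the two, which agrees with the answer above.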
Part B: How many male passengers survived? Roughly what percent of male passengers survived?
Answer: Around 367 male passengers survived (guesstimating). This is roughly 21%.
Part C: How many female passengers survived? Roughly what percent of female passengers survived?
Answer: Around 344 female passengers survived. This is roughly 73%.
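The estimates in Parts B and C can be checked against exact counts from the built-in `Titanic` table. The sketch below is one way to do this; the object names are introduced here for illustration.

```r
# Two-way table of counts: rows are Sex (dimension 2), columns are Survived (dimension 4)
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))
sex_by_survival

# Row proportions: within each sex, the share that did and did not survive
survival_rate_by_sex <- prop.table(sex_by_survival, margin = 1)
round(100 * survival_rate_by_sex, 1)
```

Using `margin = 1` divides each row by its own total, so each percentage is out of all passengers of that sex, which is what Parts B and C ask for.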
Part D: Suppose a friend says Sex doesn’t seem to affect Survival that much because the number of males and females that survived is similar. Explain why they are not correct.
Answer: This is not accounting for the fact that different numbers of male and female passengers were on the boat. Most male passengers died. Most female passengers survived.
Question 5: The order of variables has been swapped. Use this bar chart to make another argument (different than your previous one) for why Survival and Sex are associated.
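Answer: If we look at survivors, we can see it is about a 50/50 split between male and female passengers. But deaths were overwhelmingly amongst male passengers. If Sex and Survival were not associated, the male/female split would be about the same in both groups, so Sex is still associated with Survival.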
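This argument conditions on Survived rather than on Sex, and it can be checked the same way from the built-in `Titanic` table. This is a minimal sketch with object names introduced for illustration.

```r
# Two-way table of counts: rows are Sex, columns are Survived
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))

# Column proportions: within each outcome, the share that was male or female
sex_share_by_outcome <- prop.table(sex_by_survival, margin = 2)
round(100 * sex_share_by_outcome, 1)
```

Here `margin = 2` divides each column by its own total, so the "Yes" column shows the sex split among survivors and the "No" column shows it among those who died.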