Before you begin, load the following packages, installing them first
if necessary with install.packages("ggplot2") and install.packages("dplyr").
I tend to put “message=F, warning=F” in the R code chunk setup when I load
these; otherwise you get a bunch of messages that take up space on your page.
Using “echo=F” will prevent the code from showing. Using “eval=F” will prevent
the code from running, but will still show what you have written.
library(ggplot2)
library(dplyr)
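As a hedged sketch of the chunk options described above: instead of repeating the options in each chunk header, one option is to set them once in a setup chunk with knitr's opts_chunk$set(). This assumes the knitr package is installed (it is if you can knit an .Rmd file), and the values shown are illustrative, not a required setup.

```r
# In an .Rmd file, per-chunk options go in the chunk header, e.g.
#   {r, message=FALSE, warning=FALSE}
# The call below applies the same options to every later chunk instead.
library(knitr)

opts_chunk$set(
  message = FALSE,  # hide package start-up messages
  warning = FALSE,  # hide warnings
  echo    = TRUE,   # still show the code (echo = FALSE would hide it)
  eval    = TRUE    # still run the code (eval = FALSE would show it without running)
)
```

Options written in an individual chunk header still override these defaults for that chunk.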
We will work through a series of questions in this lab, which will cover some of the topics we went over in the Introduction and Quantitative Variables (Part 1) slides. Fill in answers to the questions presented throughout this document. You may work together with others, but each student will need to submit their own version of this file to Gradescope as a .pdf with all answers filled out.
You may use the template .Rmd file found here to fill in your answers.
The distribution of a variable is a description of how frequently values of that variable show up. We saw that the way we describe the distribution of a variable depends on whether the variable is categorical or quantitative.
Question 1: What things do we need to include to describe the distribution of a quantitative variable?
Answer: Shape, center, spread, outliers, and context
Question 2: If I have a symmetric distribution with no (or few) outliers, which measures of center and spread should I use?
Answer: mean and standard deviation
Question 3: If I have a skewed distribution or one with many or big outliers, which measures of center and spread should I use?
Answer: median and IQR
Question 4: Why do skew/outliers make mean and standard deviation bad measures of center and spread?
Answer: The mean uses every value, so really big or really small outliers pull it up or down, respectively. The standard deviation squares each value's distance from the mean, so outliers inflate it even more.
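The answers to Questions 2 through 4 can be checked with a small experiment. The sketch below uses made-up numbers (not the College data) and only base R functions: it computes the four summaries for a small data set, then again after adding one large outlier.

```r
# Made-up values for illustration only (not from the College data)
x     <- c(10, 12, 13, 14, 16)
x_out <- c(x, 100)   # the same values plus one large outlier

# Compare the non-resistant summaries (mean, sd) with the resistant ones (median, IQR)
comparison <- data.frame(
  data   = c("no outlier", "with outlier"),
  mean   = c(mean(x), mean(x_out)),
  median = c(median(x), median(x_out)),
  sd     = c(sd(x), sd(x_out)),
  IQR    = c(IQR(x), IQR(x_out))
)
print(comparison)
```

With these numbers the mean jumps from 13 to 27.5 while the median only moves from 13 to 13.5. The standard deviation grows from about 2.2 to about 35.6, while the IQR only changes from 2 to 3.25.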
For the next (many) questions we are going to use the College data set presented in the last few sets of slides. Read in the dataset for the College data using the following code. In the R file for this lab you will also see code that creates these graphics (we will talk more about this in a couple of weeks) and a few other code chunks that you may run to get summary statistics for this data.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Part A: Describe the distribution of admission rates for the colleges in our sample (include context).
Answer: Admission rate for the colleges is left skewed with a median of 69.2% and an IQR of around 24%. There are no outliers.
Part B: Describe the distribution of enrollment for our sample of colleges.
Answer: Enrollment for the colleges is right skewed with several large outliers. The median enrollment is 2733 with an IQR of around 5800 students.
Part C: Describe the distribution of the percent of female students for colleges in our sample.
Answer: Percent female enrollment is unimodal and symmetric. The mean is 58% with a standard deviation of 10%.
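The numbers in descriptions like these come from summary statistics. As a rough sketch, the hypothetical helper below (describe_quantitative is not part of the lab's R file) collects the center and spread measures for any numeric vector. It is shown on made-up values, and the column name in the commented-out call is an assumption about the College data.

```r
# Hypothetical helper: numeric summaries for describing a quantitative variable
describe_quantitative <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values
  q <- quantile(x, c(0.25, 0.5, 0.75))   # Q1, median, Q3
  c(n = length(x), mean = mean(x), sd = sd(x),
    Q1 = q[[1]], median = q[[2]], Q3 = q[[3]], IQR = q[[3]] - q[[1]])
}

# Made-up right-skewed values for illustration (not the College data)
toy <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 20)
toy_summary <- describe_quantitative(toy)
print(round(toy_summary, 2))

# With the lab data the call might look like this (column name is an assumption):
# describe_quantitative(colleges$Enrollment)
```

Here the mean (5.2) sits well above the median (3), which is what we expect from a right-skewed variable. Shape and outliers still have to be judged from a histogram or box plot.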
Part A: Using the box plot of admission rates, give me approximate values for the following:
Answer: See summary() output above.
Part B: Compare your answer to the previous question with the summary() output for Part 2A above. Are they similar?
Answer: See summary() output above.
Part C: In 2019, Grinnell had an admission rate of 0.24 (it has since dropped to around 0.12). Where does Grinnell fall on this boxplot? Is it an outlier college?
Answer: It is on the far left side. It is a tough call whether it is an outlier or not. If not, it is still very low compared to most colleges.
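One way to make a "tough call" like this less subjective is the conventional 1.5 × IQR rule that many box plots use to flag outliers. The lab does not say which rule drew this plot, so treat that as an assumption. The helper names below are hypothetical and the admission rates are made up, so the result does not answer the question for the real data.

```r
# Hypothetical helper: cutoffs ("fences") from the 1.5 * IQR rule
outlier_fences <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3  <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Hypothetical helper: is a single value outside the fences computed from x?
is_outlier <- function(value, x) {
  f <- outlier_fences(x)
  value < f[["lower"]] || value > f[["upper"]]
}

# Made-up admission rates for illustration (not the College data)
rates  <- c(0.24, 0.55, 0.60, 0.65, 0.70, 0.72, 0.78, 0.85, 0.90)
fences <- outlier_fences(rates)
print(fences)
print(is_outlier(0.24, rates))

# With the lab data the call might look like this (column name is an assumption):
# is_outlier(0.24, colleges$Adm_Rate)
```

In this made-up sample the lower fence is 0.33, so 0.24 is flagged. The real data have different quartiles and may give a different answer.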
Part D: Describe the distribution of the average net tuition (tuition left to be paid after grants and scholarships) for the colleges.
Answer: Average net tuition is right skewed with several large outliers. The median is $12,000 with an IQR of about $6,000.
We have seen the terms population, parameter, sample, statistic, and observation in the Introduction slides. These terms are important for helping us describe data and understand what the purpose of a study is. Being able to read a summary of a study and label these individual parts is going to be an important skill we will use all semester.
When we are describing the population, sample, and observations in a study, we want to provide adequate context to explain the study and data. The following are some things to consider when reading a description of a study.
Who – Who collected the data, who is the data collected on? How many observations are there?
What – What variables were data recorded on?
When – When was the data collected? Populations can change over time and old data does not always reflect how things are now
Where – Where was the data collected? Different geographical areas can have vastly different populations
Why – Why was the data collected? What research question(s) were the investigators trying to answer?
How – How was the data physically collected?
We may not always use all of these terms in our own descriptions, but they are useful to add context to our data, and potentially see if there are any issues with the study.
Part A: (Healthcare Opinions) In 2009, the Pew Research Center wanted to learn more about public opinion on the idea of the public option for health coverage. One thing that they wanted to know was the percentage of adult U.S. residents who favored a public option for health coverage in October 2009. In a poll of 1500 randomly selected adult residents of the United States, they found that 55% of adult residents favored a government health insurance plan to compete with private plans. Source
population: all adult U.S. residents in 2009
sample: 1500 adult U.S. residents in 2009
observation: an adult U.S. resident in 2009
What is the variable of interest in this study? Is it categorical or quantitative?
The variable of interest is ‘Whether or not the adult favored a public option’. Categorical.
There are many correct answers to this question. My thought is that the data is probably not that useful for this question. Public opinion can change rapidly over time. This data is 15+ years old and may not reflect current opinions about healthcare questions.
Part B: (National household size) The American Community Survey (ACS) conducts yearly surveys. One thing that is of interest is the average household size. In April 2022, the ACS had surveyed 1,980,550 U.S. households and found the average household size to be 2.50. Source
Describe the population in this study: all US households in April 2022
Describe the sample in this study: 1,980,550 US households in April 2022
Describe an observation in this study: a US household in April 2022
What is the variable of interest in this study? Is it categorical or quantitative?
The variable of interest is ‘household size’. Quantitative.
Part C: (Real Life Engineering Example) Forty prismatic lithium-ion pouch cells were built at the University of Michigan Battery Laboratory. Cells were formed using two different formation protocols: “fast formation” and “baseline formation”. After formation, the cells were put under cycle life testing at room temperature and 45 °C. Cells were cycled until the discharge capacities dropped below 50% of the initial capacities and the number of cycles was recorded.
Describe an observation in this study: a prismatic lithium-ion cell built at this lab
Describe the sample in this study: 40 prismatic lithium-ion cells built at this lab
Describe the population in this study: All lithium-ion cells. The population is the big group we eventually want to say things about, so I am not limiting it to just the batteries produced in this lab. We will talk more about this later.
What question do you think the researchers were trying to answer?:
Do the different formation protocols cause batteries to have different lifespans (in terms of number of cycles)?
Using the side-by-side box plots below, answer the following questions.
Part A: What is the shape of ‘South East’s’ box plot? What about ‘Mid East’?
Answer: The boxplot for South East is right skewed. The boxplot for Mid East is roughly symmetric.
Part B: Which region’s boxplot has the largest median and what is the value of the median?
Answer: There are 3 regions tied for the largest median with a value of 24: Rocky Mountains, New England, and Mid East
Part C: Which region has the largest IQR? Give an approximate value of the IQR for this region and show your calculation.
Answer: The region with the largest IQR is New England. The value of the IQR is approximately 29-22=7. The visual width of the box corresponds to IQR. When making quick comparisons we can just look at this width to determine the region with largest IQR.
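The same comparison can be done numerically. As a minimal sketch using base R's tapply() on made-up data (the two regions and their values are invented for illustration), we can compute the IQR within each group and pick the largest. The column names in the commented-out call are assumptions about the College data.

```r
# Made-up median ACT scores for two hypothetical regions
toy <- data.frame(
  region = rep(c("A", "B"), each = 5),
  act    = c(20, 21, 22, 23, 24,
             18, 21, 24, 27, 30)
)

# IQR of act within each region (Q3 - Q1, as in the hand calculation)
iqr_by_region <- tapply(toy$act, toy$region, IQR)
print(iqr_by_region)

# Name of the region with the largest IQR
largest <- names(which.max(iqr_by_region))
print(largest)

# With the lab data the call might look like this (column names are assumptions):
# tapply(colleges$ACT_median, colleges$Region, IQR, na.rm = TRUE)
```

Region B has the wider box here (IQR of 6 versus 2), matching what we would see by comparing box widths on a side-by-side plot.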
Part D: What does it mean to say two variables are associated?
Answer: Knowing something about one variable gives us information about the other.
Part E: According to this plot, does it look like there is an association between Region and Median ACT of colleges?
Answer: Yes. Some regions have higher or lower median ACTs than others. Knowing which region a college is in gives us a better idea of what median ACT it could have.
Part A: Using the scatterplot above, describe whether there is a relationship/association between Average Faculty Salary and Median 10-year Salary (the median salary of graduates from the college 10 years after receiving their degree) for our sample of colleges.
Answer: Yes, there is a relationship. It looks like generally as Average Faculty Salary goes up, so does Median 10-yr salary of graduates. “There is a moderately strong, positive linear relationship between Average Faculty Salary and Median 10-yr Salary of graduates. There are a couple of outliers with large 10-yr median graduate salaries.”
Part B: This is another scatterplot similar to the previous one, but I have added information on whether the colleges are public or private. Is the relationship between Average Faculty Salary and Median 10-year Salary different for public and private colleges? Briefly explain (1 or 2 sentences).
Answer: The relationship between Average Faculty salary and Median 10-yr salary of graduates is very similar for both private and public colleges. Private colleges tend to have slightly higher median 10-yr salaries of graduates than public colleges for similar values of Average Faculty Salaries.