Question 1A: What are the explanatory and response variables?
Question 1B: Describe the relationship between the two variables. Make sure to discuss unusual observations, if any.
Question 1C: Can we conclude that having a bachelor’s degree increases one’s income?
The following scatterplot was created as part of a study evaluating the relationship between estimated life expectancy at birth (as of 2014) and percentage of internet users (as of 2009) in 208 countries for which such data were available.
Question 2A: Describe the relationship between life expectancy and percentage of internet users. Make sure to mention all 4 aspects.
Question 2B: What type of study is this?
Question 2C: State a possible confounding variable that might explain this relationship and describe its potential effect.
According to the Happy Planet Index: 2024 Report, “The Happy Planet Index is a measure of sustainable wellbeing, evaluating countries by how efficiently they deliver long, happy lives for their residents using our limited environmental resources.” We are actually going to use the 2012 data from Happy Planet Index for formatting reasons (it’s a pain to format the data set for this lab). A link to the site is available here.
Make a new RMarkdown document and give it an appropriate name. I only want to see the questions, your answers, and the necessary code for this lab. Use appropriate formatting so your answers are easy to read and distinguish from the questions. Copy the wording of each question into the new document as you answer it.
Copy the code below to read the 2012 Happy Planet data into R. Load the appropriate packages as well.
library(ggplot2)
library(dplyr)
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=TRUE, sep=",")
Happy = data.frame(data)
Happy$Region = as.factor(Happy$Region)
We are going to examine the following variables:
Happiness: Larger values indicate greater happiness, health, and well-being of the country’s citizens. Happiness does not have a measurement unit.
Footprint: Ecological Footprint is a measure of the (per capita) ecological impact a country has. Larger values indicate a greater ecological impact.
LifeExpectancy: Life Expectancy refers to the average life expectancy (in years) for a country.
GDPperCapita: The GDP (gross domestic product) of a country divided by its population, measured in US dollars.
Region: Region of the country in the world.
Happiness with Footprint. Suppose we want to answer the question “Is ecological footprint a linear predictor of happiness?” Look at the scatterplot below and answer the following questions.
# Scatterplot of Footprint vs Happiness
theme_set(theme_bw())
ggplot(Happy, aes(x=Footprint, y=Happiness)) + geom_point()
Question 4A: Which is the explanatory variable and which is the response variable? Explain.
Question 4B: Describe the relationship between the happiness of a country and its ecological footprint.
Question 4C: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.
Question 5: Use the filter() function
to remake this scatterplot but only for Region 1. How would you describe
the relationship for Region 1 countries? What does Region 1 seem to
correspond to?
Now let’s change things up and try to find a variable that does a good job predicting happiness. One of the ways we can try to do this is to look for variables that have a high correlation with Happiness. (This does not always work, but it can be a good place to start.)
Often when we examine the correlation between many
variables at the same time, it can be helpful to arrange them in a table
or matrix. The output below computes Pearson’s correlation
coefficient between the variables Happiness,
LifeExpectancy, and GDPperCapita. We can look at the
intersection of variables in the table to see the corresponding
correlation between them. The correlation between a variable and itself
is always 1, so we don’t care about looking at these entries.
Happy %>% select(Happiness, LifeExpectancy, GDPperCapita) %>%
cor(use="complete.obs")
##                Happiness LifeExpectancy GDPperCapita
## Happiness      1.0000000      0.8334278    0.6976830
## LifeExpectancy 0.8334278      1.0000000    0.6662072
## GDPperCapita   0.6976830      0.6662072    1.0000000
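The use="complete.obs" argument in the code above deserves a quick illustration. The sketch below uses a small made-up data frame (the name toy and its values are hypothetical, not part of the Happy Planet data) to show what happens with and without that argument, and how to pull a single correlation out of the matrix.

```r
# Hypothetical toy data (not the Happy Planet data); one value is missing
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(5, 3, NA, 2, 1)
)

# Without `use`, any pair involving the column with the NA comes back as NA
cor_default <- cor(toy)
cor_default

# "complete.obs" drops every row containing an NA before computing anything
cor_complete <- cor(toy, use = "complete.obs")
cor_complete

# Pull a single entry out of the matrix by row and column name
cor_complete["a", "b"]

# Same value from the two-vector form of cor() on the complete rows only
complete_rows <- toy[complete.cases(toy), ]
cor(complete_rows$a, complete_rows$b)
```

Note that "complete.obs" removes a row from every correlation in the matrix, even for pairs of variables that were not missing anything in that row.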
Question 6A: Of the two other variables in the correlation matrix, which has the strongest correlation with Happiness?
Question 6B: Make a scatterplot of this variable and Happiness. Describe the relationship. Is Pearson’s correlation appropriate to use in this scenario?
Question 6C: Explain why just relying on the correlation coefficient to determine linear relationships is insufficient.
Question 7: For each of the six plots, identify the strength of the relationship (e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.
Question 8: Match each correlation to the corresponding scatterplot.
We have seen how to calculate test statistics in R, which mostly
involved lots of parentheses and multiplication or division. We then saw
how to get p-values using the pt() function (means) or
pnorm() function (proportions). There are actually some
built-in functions in R that will handle the test statistic and p-value
computation.
The two functions we need are t.test() and
prop.test().
Proportions are very straightforward. We do not need an actual data set to run the hypothesis test, just the results.
Suppose in a random sample of 100 college students we found that 45 preferred online classes (vs. in-person). We might ask, is this evidence that the true proportion that prefer online classes is less than half?
\(H_0\): \(p = 0.5\)
\(H_A\): \(p < 0.5\)
The prop.test function requires arguments that give the
number of responses of interest, the sample size, and the hypothesized
proportion.
prop.test(x = 45, n = 100, p = 0.5)
##
## 1-sample proportions test with continuity correction
##
## data: 45 out of 100, null probability 0.5
## X-squared = 0.81, df = 1, p-value = 0.3681
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3514281 0.5524574
## sample estimates:
## p
## 0.45
One thing to note about this p-value is that, by default, it is based on the ‘not equal’ hypothesis, which is not always what we want. In this particular example, recall that we got the not-equal p-value by multiplying the one-sided p-value by 2, so we can undo that by dividing by 2. (This shortcut works here because the sample proportion, 0.45, falls on the side of 0.5 that \(H_A\) points to.) The test statistic that is reported is actually squared (not helpful), so we take the square root to undo it, double-checking whether it should be negative (it should, why?). The test statistic and p-value for this hypothesis test should be:
HypTest = prop.test(x = 45, n = 100, p = 0.5)
# test-stat
-sqrt(0.81)
## [1] -0.9
# p-value
HypTest$p.value/2
## [1] 0.1840601
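As an alternative to halving the p-value, prop.test() also accepts alternative and correct arguments. The sketch below reuses the same 45-out-of-100 example. It assumes the by-hand calculation from earlier labs used the usual z-statistic without a continuity correction, which is why the last part turns the correction off. The object names (less_test, two_sided, z, uncorrected) are just for illustration.

```r
# One-sided test requested directly; the p-value no longer needs halving
less_test <- prop.test(x = 45, n = 100, p = 0.5, alternative = "less")
less_test$p.value

# Matches half of the default two-sided p-value for this example
two_sided <- prop.test(x = 45, n = 100, p = 0.5)
two_sided$p.value / 2

# By-hand z-statistic and lower-tail p-value (no continuity correction)
z <- (0.45 - 0.5) / sqrt(0.5 * (1 - 0.5) / 100)
z
pnorm(z)

# With correct = FALSE, prop.test() agrees with the by-hand calculation
uncorrected <- prop.test(x = 45, n = 100, p = 0.5,
                         alternative = "less", correct = FALSE)
uncorrected$p.value
```

The default output earlier in this section includes the continuity correction, so its test statistic and p-value differ slightly from the by-hand z-test.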
For a difference in proportions, we use the same function but need counts and sample sizes for each group.
\(H_0\): \(p_1 - p_2 = 0\)
\(H_A\): \(p_1 - p_2 \neq 0\)
We may need to get these counts from data though, which we’ve seen
how to do, possibly involving group_by() and/or
summarize() functions.
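Suppose that we had a random sample of 200 Grinnell students, 120 of whom gave the response of interest, and a second random sample of 150 students, 65 of whom did.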
HypTest = prop.test(x = c(120, 65), n = c(200, 150))
HypTest
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(120, 65) out of c(200, 150)
## X-squared = 8.8979, df = 1, p-value = 0.002855
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.05643812 0.27689522
## sample estimates:
## prop 1 prop 2
## 0.6000000 0.4333333
Note: We can ignore the degrees of freedom for now. They belong to the chi-squared statistic; the test is not using the t-distribution. We also get a confidence interval for the parameter, here \(p_1 - p_2\).
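When the counts have to come from a data set, group_by() and summarize() can produce them in a form that prop.test() accepts. The sketch below uses a made-up data frame (survey, with hypothetical columns group and answer) purely to show the mechanics; it is not real survey data.

```r
library(dplyr)

# Hypothetical survey data: a group label and a yes/no answer per student
survey <- data.frame(
  group  = c(rep("A", 40), rep("B", 30)),
  answer = c(rep("yes", 24), rep("no", 16), rep("yes", 12), rep("no", 18))
)

# Count the responses of interest and the sample size within each group
counts <- survey %>%
  group_by(group) %>%
  summarize(successes = sum(answer == "yes"), total = n())
counts

# Feed the two columns to prop.test() as the x and n vectors
two_prop <- prop.test(x = counts$successes, n = counts$total)
two_prop
```

The order of the rows in counts determines which group is treated as group 1 and which as group 2 in the output.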
For means we will use the t.test function. I will
demonstrate this on the HappyPlanet data, but we should be aware that
this is not a true random sample, and whatever HTs or CIs we calculate
from this data don’t have a clear meaning. At its most basic level, the
function requires a vector of data as an input:
t.test(Happy$Happiness)
##
## One Sample t-test
##
## data: Happy$Happiness
## t = 51.542, df = 142, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 5.691873 6.145889
## sample estimates:
## mean of x
## 5.918881
The t.test function defaults to testing a mean value of 0 (not always what we want) and to a two-sided test. We can change these settings, along with the confidence level, using the following arguments.
t.test(Happy$Happiness, mu = 5, alternative = "greater", conf.level = 0.90)
##
## One Sample t-test
##
## data: Happy$Happiness
## t = 8.0017, df = 142, p-value = 1.973e-13
## alternative hypothesis: true mean is greater than 5
## 90 percent confidence interval:
## 5.771026 Inf
## sample estimates:
## mean of x
## 5.918881
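To connect t.test() back to the by-hand approach with pt(), here is a small sketch on made-up numbers (the vector x and the value mu0 are hypothetical, not taken from the Happy Planet data). It computes the test statistic and one-sided p-value manually and compares them to what t.test() reports.

```r
# Hypothetical sample values (not taken from the Happy Planet data)
x <- c(5.1, 6.3, 4.8, 7.2, 5.9, 6.6, 5.4, 6.1)
mu0 <- 5

# Test statistic: (sample mean - hypothesized mean) / standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Upper-tail p-value from the t-distribution with n - 1 degrees of freedom
p_val <- pt(t_stat, df = length(x) - 1, lower.tail = FALSE)

# Built-in version of the same test
result <- t.test(x, mu = mu0, alternative = "greater")

# Side-by-side comparison of the two approaches
c(by_hand = t_stat, built_in = unname(result$statistic))
c(by_hand = p_val, built_in = result$p.value)
```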
When working with a two-sample t-test, you must pass in two vectors of data. The next example compares mean happiness between Regions 1 and 2. Remember that the interpretation is unclear here, since the data are almost a census of countries.
# pull() extracts the Happiness column as a plain numeric vector
data_R1 = Happy %>% filter(Region == "1") %>% pull(Happiness)
data_R2 = Happy %>% filter(Region == "2") %>% pull(Happiness)
t.test(x = data_R1, y = data_R2, alternative = "two.sided", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: data_R1 and data_R2
## t = -3.6687, df = 43.264, p-value = 0.000664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9943255 -0.2890078
## sample estimates:
## mean of x mean of y
## 6.912500 7.554167
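t.test() also has a formula interface that can save the step of splitting the data into two vectors, as long as the grouping variable has exactly two levels. The sketch below shows both forms on a made-up data frame (toy, with hypothetical columns score and group) and checks that they give the same test statistic.

```r
# Hypothetical data frame with a two-level grouping factor
toy <- data.frame(
  score = c(6.1, 7.0, 6.8, 7.4, 6.5, 7.9, 7.2, 8.1, 7.6, 7.7),
  group = factor(rep(c("1", "2"), each = 5))
)

# Two-vector form: split the scores by group first
by_vectors <- t.test(x = toy$score[toy$group == "1"],
                     y = toy$score[toy$group == "2"])

# Formula form: response ~ grouping variable
by_formula <- t.test(score ~ group, data = toy)

# Both forms run the same Welch two-sample t-test
by_vectors$statistic
by_formula$statistic
```

For the Happy Planet data, Region has more than two levels, so the formula form would only work after filtering down to the two regions being compared.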
You may now use t.test() and prop.test() for hypothesis tests on HW assignments.