The dataset below contains the results of a poll based on a random sample. It has two variables: response, each respondent's answer to the poll question, and political, the respondent's self-reported political ideology.
A number of randomly sampled registered voters from Tampa, FL were asked if they thought workers who have illegally entered the US should be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and have to leave the country.
## Copy and run this code to create table
library(ggplot2)
library(dplyr)
immigration <- read.csv("https://collinn.github.io/data/immigrationpoll.csv")
with(immigration, table(response, political)) %>% addmargins(1)
##                        political
## response                conservative liberal moderate
##   Apply for citizenship           57     101      120
##   Guest worker                   121      28      113
##   Leave the country              179      45      126
##   Not sure                        15       1        4
##   Sum                            372     175      363
We will make a confidence interval to answer the question “What proportion of conservative Tampa voters support workers being ‘allowed to keep their jobs and apply for US citizenship’?”
Part A: Describe the parameter of interest, including which symbol we use for it.
Answer: p = the proportion of all conservative Tampa voters who support workers being ‘allowed to keep their jobs and apply for US citizenship’
Part B: What is the corresponding value of the statistic and what symbol do we use for it?
# p-hat
57 / 372
## [1] 0.1532258
Part C: What is the sample size for the group of conservatives?
# n
372
Part D: Check the conditions for making a confidence interval.
# Random sample: met
# Success condition: 57 successes (met)
# Failure condition: 315 failures (met)
372 - 57
## [1] 315
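The success and failure counts above can also be checked with a small function. This is a minimal sketch: `check_success_failure` is a hypothetical helper, and its default threshold of 10 is an assumed convention, so substitute whatever cutoff your course uses.

```r
# Hypothetical helper: success-failure check for one group.
# The default threshold of 10 is an assumption, not taken from the text above.
check_success_failure <- function(successes, n, threshold = 10) {
  failures <- n - successes
  list(
    successes = successes,
    failures  = failures,
    met       = successes >= threshold && failures >= threshold
  )
}

# Conservatives who chose "Apply for citizenship": 57 of 372
check_success_failure(57, 372)
```

The same call works for any other cell of the table, for example `check_success_failure(101, 175)` for liberals.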
Part E: Create a 95% confidence interval for the parameter.
p_hat = (57/372)
n = 372
p_hat - 1.96 * sqrt(p_hat * (1-p_hat) / n)
## [1] 0.1166213
p_hat + 1.96 * sqrt(p_hat * (1-p_hat) / n)
## [1] 0.1898303
Part F: Interpret the confidence interval.
Answer: We are 95% confident that the true proportion of conservative Tampa voters who favor a citizenship option for workers who have illegally entered the US is between .117 and .190. That is somewhere between roughly 1 in 9 and 1 in 5 conservative voters.
(Alternative) We are 95% confident that the true percentage of conservative Tampa voters who favor a citizenship option for workers who have illegally entered the US is between 11.7% and 19.0%. That is somewhere between roughly 1 in 9 and 1 in 5 conservative voters.
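The Part E arithmetic can be wrapped in a reusable function so the same steps apply to any count and confidence level. This is a sketch of the normal-approximation (Wald) interval used above. The name `wald_ci` and its arguments are hypothetical, and it uses the exact `qnorm` critical value rather than the rounded 1.96, so results may differ from hand calculations in the fourth decimal place.

```r
# Hypothetical helper: Wald (normal-approximation) interval for one proportion.
wald_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n
  # Two-sided critical value, e.g. about 1.96 when conf_level = 0.95
  z  <- qnorm(1 - (1 - conf_level) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

# Conservatives who chose "Apply for citizenship": 57 of 372
wald_ci(57, 372)
```

Note that base R's `prop.test(57, 372)` reports a score-based interval rather than this Wald interval, so its endpoints will not match exactly.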
Let’s see if there is a difference between conservatives and liberals in terms of proportions that support workers being ‘allowed to keep their jobs and apply for US citizenship.’
Part A: What is the value of the statistic of interest? (Make ‘Liberal’ the first group)
#p-hat_L - p-hat_C
(101/175) - (57/372)
## [1] 0.4239171
Part B: Check the conditions to make a confidence interval.
# Random samples: met
# Independent groups: met
# Success condition (Liberal): 101 successes (met)
# Failure condition (Liberal): 175-101 = 74 failures (met)
# Success condition (Conservative): 57 successes (met)
# Failure condition (Conservative): 372-57 = 315 failures (met)
Part C: Make a 90% confidence interval.
p1 = (101/175)
p2 = (57/372)
n1 = 175
n2 = 372
qnorm(.95)
## [1] 1.644854
(p1 - p2) - 1.645 * sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
## [1] 0.3552326
(p1 - p2) + 1.645 * sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
## [1] 0.4926015
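A 90% interval leaves 5% in each tail, which is why the critical value comes from `qnorm(.95)` rather than `qnorm(.90)`. The calculation above can be sketched as a function. `diff_prop_ci` and its argument names are hypothetical, and it uses the unrounded critical value, so its endpoints may differ slightly from the ones computed with 1.645.

```r
# Hypothetical helper: Wald interval for a difference in proportions
# (group 1 minus group 2), assuming independent random samples.
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.90) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  # conf_level = 0.90 puts 5% in each tail, so this is qnorm(0.95)
  z  <- qnorm(1 - (1 - conf_level) / 2)
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(estimate = p1 - p2,
    lower    = (p1 - p2) - z * se,
    upper    = (p1 - p2) + z * se)
}

# Liberals (101 of 175) minus conservatives (57 of 372)
diff_prop_ci(101, 175, 57, 372)
```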
Part D: Interpret the confidence interval. Include a sentence summary for someone who has no stats background.
Answer: We are 90% confident that the true percentage of liberal Tampa voters who support a citizenship option is between 35.5 and 49.3 percentage points higher than the true percentage of conservative Tampa voters who do. According to this survey, liberal Tampa voters are much more likely to support a citizenship option than conservative Tampa voters.
Part E: According to the CI, is it plausible there is no difference between the groups?
Answer: No. The entire confidence interval is positive, and zero is nowhere close to it.
The fraction of workers who are considered “supercommuters”, because they commute more than 90 minutes to get to work, varies by state. Suppose that 1% of Nebraska residents and 6% of New York residents are supercommuters. Now suppose that we plan a study to survey 1000 people from each state, and we will compute the sample proportions \(\hat{p}_{NE}\) for Nebraska and \(\hat{p}_{NY}\) for New York.
Answer: These questions describe a sampling distribution in different words. For \(\hat{p}_{NE}\), the associated mean will be 0.01 and the standard deviation will be
sqrt(.01*.99 / 1000)
## [1] 0.003146427
in repeated samples of size 1000.
For \(\hat{p}_{NY}\), the associated mean will be 0.06 and the standard deviation will be
sqrt(.06*.94 / 1000)
## [1] 0.007509993
in repeated samples of size 1000.
For the difference \(\hat{p}_{NY} - \hat{p}_{NE}\), the associated mean will be 0.05 (because 0.06 - 0.01 = 0.05) and the standard deviation will be
sqrt(.06*.94 / 1000 + .01*.99/1000)
## [1] 0.008142481
in repeated samples of size 1000.
The mean of the differences in this distribution will be 0.05. The standard deviation tells us that, across repeated studies, a typical \(\hat{p}_{NY} - \hat{p}_{NE}\) will be about 0.008 away from this mean.
\(SD_{\hat{p}_{NY} - \hat{p}_{NE}}^2 = SD_{\hat{p}_{NY}}^2 + SD_{\hat{p}_{NE}}^2\)
\(Var(\hat{p}_{NY} - \hat{p}_{NE}) = Var(\hat{p}_{NY}) + Var(\hat{p}_{NE})\)
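Because the two samples are independent, the variance of the difference is the sum of the variances of the two individual sample proportions. The standard deviations themselves do not add; their squares do.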
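The variance rule above can be checked with a quick simulation. This is a sketch under stated assumptions: each survey is modeled as an independent binomial draw with the proportions given in the problem (0.01 and 0.06), and the seed and number of repetitions are arbitrary choices.

```r
set.seed(1)      # arbitrary seed, only so the run is reproducible
reps <- 100000   # number of simulated studies (arbitrary)
n    <- 1000     # people surveyed in each state

# Simulated sample proportions for each state, drawn independently
p_hat_ny <- rbinom(reps, size = n, prob = 0.06) / n
p_hat_ne <- rbinom(reps, size = n, prob = 0.01) / n
diffs    <- p_hat_ny - p_hat_ne

mean(diffs)   # should land close to 0.05
sd(diffs)     # should land close to sqrt(.06*.94/1000 + .01*.99/1000)

# The variance of the difference should be close to the sum of the variances
var(diffs)
var(p_hat_ny) + var(p_hat_ne)
```

The last two lines should agree closely but not exactly, since simulated values carry their own sampling noise.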