library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
bs <- replicate(n, {
sb <- sample(x, replace = TRUE)
statistic(sb)
})
data.frame(Sample = seq_len(n),
Statistic = bs)
}
This lab is intended to serve as an introduction to the construction of bootstrap confidence intervals in R.
Before we get into any exercises, let’s take a little bit of time to introduce some new functions and techniques that we will be using to answer the questions.
Helpful Note: This lab is written to be self-contained, meaning that everything that I will ask of you will be introduced in this lab. There is absolutely no expectation that you memorize everything.
The two exceptions to this are the use of ggplot() and
the use of dplyr functions, which were introduced in
previous labs. Extra guidance will be provided for questions involving
dplyr.
Often we will try and use real data for our problems, but it can also be instructive to generate random data. Not only does this permit us to generate as much data as we wish, it has the added bonus that we are able to know exactly the true values of our population.
For example, we can generate random data that is normally distributed
with the function rnorm() (“RandomNORMal”). The
rnorm() function takes three arguments: n,
indicating the sample size, mean, indicating the mean, and
sd, indicating the standard deviation. If I wanted to
generate 5 random variables from a normal distribution with mean value
10 and standard deviation 15, I would do:
rnorm(n = 5, mean = 100, sd = 15)
## [1] 106.0594 83.6008 75.4706 119.3057 107.7605
You’ll note that each time we generate random data, it gives us
something, well, random. We may find ourselves in situations where we
want to be able to recreate exactly a random process. We can do this
with the function set.seed():
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.97688 101.99322 99.48934 99.32535 150.06757
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.97688 101.99322 99.48934 99.32535 150.06757
As with everything else in R, we can assign these random
values to a variable and do things with it. For example, here we collect
a sample of 50 observations and compute the sample mean and standard
deviation
x <- rnorm(50, 100, 15)
# Sample mean
mean(x)
## [1] 103.9545
# Standard deviation
sd(x)
## [1] 14.08321
By default, the rnorm() function returns a
vector instead of a data.frame. If we wish to plot the random
data we create, we first have to put it in a data.frame
## Sample data and put into data.frame
x <- rnorm(n = 65, mean = 100, sd = 15)
df <- data.frame(x = x)
## Using gh() (assigned above) to make histograms that don't look awful
ggplot(df, aes(x)) +
gh(bins = 10)
In this section, we briefly demonstrated how to generate random data in R. This involved using:
rnorm(n, mean, sd) function to specify the
parametersmean(), sd(), and
length() to find the mean, standard deviation, and standard
errorggplot()Question 1: Generate 100 observations from the normal distribution with a mean value of 50 and a standard deviation of 20, then create a histogram showing the distribution of your sample.
Generating data using rnorm() is equivalent to
collecting a sample from a population of infinite size. Further, each
time we generate data, the observations are all completely different.
What we need next is a way to sample data from an existing,
finite population and we do this with the function
sample().
Here, for example, I take a sample of size 5 from the numbers 1 to 10
# This is my "population"
X <- 1:10
# Here, I take a sample of size 10
sample(X, size = 10)
## [1] 8 1 9 4 7 2 10 3 6 5
By default, the sample() function performs sampling
without replacement, meaning that after a number has been sampled
once, it cannot be sampled again. As such, sampling 10 numbers from the
set 1:10 will give us the same 10 numbers in a different order. We can
sample with replacement by passing a second argument,
replace = TRUE
# Sample with replacement
sample(X, size = 10, replace = TRUE)
## [1] 4 4 9 10 4 8 1 8 9 7
Now I should see that some of my numbers are repeated in the sample. Sampling with replacement is the type of sampling that we will be using with the bootstrap.
We can perform this kind of sampling on variables in a data.frame as
well. Consider the trees dataset built-in to R which
contains measurements of felled cherry trees. The distribution of tree
heights can be visualized simply enough:
ggplot(trees, aes(Height)) +
gh()
Here, I will sample tree height with replacement and again visualize it
## Sample tree heights
# (if I do not pass in `size = `, by default it will sample the same size)
th <- sample(trees$Height, replace = TRUE)
## Put sampled data in data.frame
df <- data.frame(Height = th)
## Create plot
ggplot(df, aes(Height)) +
gh() +
labs(title = "Distribution of Resampled Tree Height")
This is the equivalent of collecting one bootstrapped sample and plotting its distribution.
In this section, we introduced the sample() function, as
well as the argument replace = TRUE that allows us to
sample with replacement. This will be important in the
bootstrapping process.
Question 2: Do the histogram for all felled cherry tree heights and the histogram of the bootstrap sample of cherry tree heights above look similar? Explain why it does or does not (not how!).
We previously saw the qnorm() and qt()
functions which could be used to find quantiles of the normal and
t-distribution respectively.
We started with the coverage we wanted and divide the difference by two. For example, to find the middle 95%, we know we are excluding 5% at the edges; 2.5% on the left side and 2.5% on the right side. This gives us the quantiles (0.025, 0.975). Similarly, if we wanted an 80% confidence interval, our quantiles should be (0.1, 0.9).
When the distribution does not look like a Normal or t-distribution,
we cannot use the qnorm() or qt() functions.
We will instead use the quantile() function in R.
The quantile() function works by specifying the variable
you want quantiles of and also the specific quantile values. See the
example below to get the 2.5 and 97.5 quantiles for the cherry tree
heights.
quantile(x = trees$Height, probs=c(.025, .975))
## 2.5% 97.5%
## 63.75 86.25
The data below contains the total monthly precipitation in Grinnell, IA from February 2014 to February 2024 (121 total months). We will treat all of these 121 months as the population we are interested in.
The following code creates a histogram of the
precipitation variable, the amount of rainfall on a rainy
day (inches), and finds the mean and standard deviation of this
variable.
## rain data
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
ggplot(rain, aes(precip)) +
gh(bins = 8) +
labs(x = "Precipitation (Inches)",
title = "Rain Precipitation 2014-2024\nGrinnell, IA")
# mean
mean(rain$precip)
## [1] 2.069835
# standard deviation
sd(rain$precip)
## [1] 1.966139
Question 3
Use the qnorm() function and these values for the
mean and standard deviation to find the the 0.025 and 0.975 quantiles.
Does the lower value make sense?
Using the quantile() function find the 0.025 and 0.975 quantiles
of the precip variable. Do these values seem more
reasonable for the histogram?
Why are the quantiles you found in the two previous parts so different? Think about the histogram shape.
We can think of the process of bootstrapping as following a simple algorithm:
$n$, we sample with
replacement a new sample from the original. This is the equivalent
of pulling out a single marble from a bag of marbles, noting its color,
then putting it back and drawing from the bag againTo help us with this procedure, we will need a new function called
bootstrap() which I provided at the top of this lab. The
bootstrap function takes three arguments: the data I want to bootstrap,
the statistic I want to bootstrap, and the number of
bootstrapped samples I wish to collect (by default, it will collect
1000). We will illustrate its use again using the rain data
to find a confidence interval for the true average rainfall per
month.
## Want to bootstrap the sample of rainy days, rain$precip
mean_boot <- bootstrap(rain$precip, mean)
head(mean_boot)
## Sample Statistic
## 1 1 2.243058
## 2 2 2.237107
## 3 3 1.880083
## 4 4 1.976942
## 5 5 2.163223
## 6 6 2.264050
head(mean_boot) prints out the first 6 rows of the
data.frame returned by bootstrap(): we see that the first
column indicates the sample number, while the second column indicates
the sample statistic, in our case the mean. Because it is a data.frame,
we can easily plot it:
ggplot(mean_boot, aes(Statistic)) +
gh(bins = 15)
Remember: this is a sampling distribution of the mean value.
We resampled our data 1,000 times, computed the sample mean each time,
and this is the distribution of values that were returned. From this
point, we can use the quantile() function to return the
quantiles that we want. Because we want the middle 95%, we need to take
our quantiles at 2.5% and 97.5%
quantile(mean_boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 1.741868 2.432165
Question 4
For this problem, we will assume that we have only collected data on
40 random months in this period, and our goal will be to estimate the
true value of the mean (\(\mu =
2.069\)) given the sample that we have. In other words, we will
be using rs (Rain Sample) as our observed sample
dataset.
## Create a random index of months for our sample
# (copy these lines into your lab)
set.seed(123)
idx <- sample(1:nrow(rain), size = 40, replace = FALSE)
rs <- rain[idx, ]
# histogram of the sample
ggplot(rs, aes(x=precip)) + gh(bins=10)
Using the random sample of 40 months of Grinnell rain data above, make a bootstrap distribution for the mean using 2000 bootstrap samples.
Use the quantile() function to get the 2.5 and 97.5
percentiles of the bootstrap distribution. This makes a 95% CI for the
mean precipitation per month in Grinnell. Is the true mean value of
\(\mu = 2.09\) within this
interval?
The following makes a 95% CI for the median, which we couldn’t do with the Normal or t-distribution methods (CLT only works for means and proportions!!).
med_boot <- bootstrap(rs$precip, median, n=2000)
ggplot(med_boot, aes(x=Statistic)) + gh(bins=10)
quantile(x=med_boot$Statistic, probs=c(.025, .975))
## 2.5% 97.5%
## 1.049875 2.415000
The Statkey Website we used previously for sample distributions has a handy feature that lets us make bootstrap distributions as well, but only for means/proportions/diff. in means/diff. in proportions/Median/SD. Slope/correlation are included but we have not done those yet, so please ignore them.
Underneath “Bootstrap Confidence Intervals” click “Single Mean/Median/Std. Dev. Button at the top of the graph allow us to change datasets, edit data, upload data, generate samples, and reset the plot. Instead of Mustang Price, change the drop down to Manhattan Apartments (Rent). In the top right we can see a histogram of the original sample as well as the sample statistics (\(\bar{x}\) = 3156.5, \(s\)=1372.069, n=20).
Part A: Generate 5000 samples to create a bootstrap distribution. What does a dot on this distribution represent?
Part B: Click on the “Two-tail” button towards the
top of the dot plot. Click on the bubble in the middle and enter
0.95 to make a 95% confidence interval for the mean.
Interpret this interval by telling me what we are estimating and what we
think of the value.
Part C: Look at the shape of the bootstrap distribution. Is it completely Normal? Does this prevent us from making the confidence interval?
Go to this Statkey Link to work with difference in proportions. Go to the “Student Survey: Smoke by Sex” dataset. The “count” in the top right tells us how many of each group admitted to smoking (timeframe indeterminate) in this survey.
Part A: What are the sample proportions of each group that admitted to smoking? Do they seem to be the same/similar?
Part B: Make a 90% bootstrap confidence interval and record the endpoint values.
Part C: Do these groups seem very different in smoking status once we account for sampling variability?
Part D: What condition do we need to be true to make a bootstrap confidence interval?