library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Better histograms
gh <- function(bins = 10) {
geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Bootstrapping function
bootstrap <- function(x, statistic, n = 1000L) {
bs <- replicate(n, {
sb <- sample(x, replace = TRUE)
statistic(sb)
})
data.frame(Sample = seq_len(n),
Statistic = bs)
}
This lab is intended to serve as an introduction to the construction of bootstrap confidence intervals in R.
Before we get into any exercises, let’s take a little bit of time to introduce some new functions and techniques that we will be using to answer the questions.
Helpful Note: This lab is written to be self-contained, meaning that everything that I will ask of you will be introduced in this lab. There is absolutely no expectation that you memorize everything.
The two exceptions to this are the use of ggplot()
and
the use of dplyr
functions, which were introduced in the
previous lab. Extra guidance will be provided for questions involving
dplyr.
Often we will try and use real data for our problems, but it can also be instructive to generate random data. Not only does this permit us to generate as much data as we wish, it has the added bonus that we are able to know exactly the true values of our population.
For example, we can generate random data that is normally distributed
with the function rnorm()
(“RandomNORMal”). The
rnorm()
function takes three arguments: n
,
indicating the sample size, mean
, indicating the mean, and
sd
, indicating the standard deviation. If I wanted to
generate 5 random variables from a normal distribution with mean value
10 and standard deviation 15, I would do:
rnorm(n = 5, mean = 100, sd = 15)
## [1] 104.95887 85.72174 102.14465 121.27190 68.37742
You’ll note that each time we generate random data, it gives us
something, well, random. We may find ourselves in situations where we
want to be able to recreate exactly a random process. We can do this
with the function set.seed()
:
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.97688 101.99322 99.48934 99.32535 150.06757
set.seed(35)
rnorm(5, 100, 15)
## [1] 115.97688 101.99322 99.48934 99.32535 150.06757
As with everything else in R
, we can assign these random
values to a variable and do things with it. For example, here we collect
a sample of 50 observations and compute the sample mean and standard
deviation
x <- rnorm(50, 100, 15)
# Sample mean
mean(x)
## [1] 103.9545
# Standard deviation
sd(x)
## [1] 14.08321
By default, the rnorm()
function returns a
vector instead of a data.frame. If we wish to plot the random
data we create, we first have to put it in a data.frame
## Sample data and put into data.frame
x <- rnorm(50, 100, 15)
df <- data.frame(x = x)
## Using gh() (assigned above) to make histograms that don't look awful
ggplot(df, aes(x)) +
gh(bins = 10)
In this section, we briefly demonstrated how to generate random data in R. This involved using:
rnorm(n, mean, sd)
function to specify the
parametersmean()
, sd()
, and
length()
to find the mean, standard deviation, and standard
errorggplot()
Question 1: Generate 100 observations from the normal distribution with a mean value of 50 and a standard deviation of 20, then create a histogram showing the distribution of your sample.
Generating data using rnorm()
is equivalent to
collecting a sample from a population of infinite size. Further, each
time we generate data, the observations are all completely different.
What we need next is a way to sample data from an existing,
finite population and we do this with the function
sample()
.
Here, for example, I take a sample of size 5 from the numbers 1 to 10
# This is my "population"
X <- 1:10
# Here, I take a sample of size 10
sample(X, size = 10)
## [1] 8 2 9 10 7 4 3 6 1 5
By default, the sample()
function performs sampling
without replacement, meaning that after a number has been sampled
once, it cannot be sampled again. As such, sampling 10 numbers from the
set 1:10 will give us the same 10 numbers in a different order. We can
sample with replacement by passing a second argument,
replace = TRUE
# Sample with replacement
sample(X, size = 10, replace = TRUE)
## [1] 6 1 8 8 1 1 8 7 2 9
Now I should see that some of my numbers are repeated in the sample. Sampling with replacement is the type of sampling that we will be using with the bootstrap.
We can perform this kind of sampling on variables in a data.frame as
well. Consider the trees
dataset built-in to R which
contains measurements of felled cherry trees. The distribution of tree
heights can be visualized simply enough:
ggplot(trees, aes(Height)) +
gh()
Here, I will sample tree height with replacement and again visualize it
## Sample tree heights
# (if I do not pass in `size = `, by default it will sample the same size)
th <- sample(trees$Height, replace = TRUE)
## Put sampled data in data.frame
df <- data.frame(Height = th)
## Create plot
ggplot(df, aes(Height)) +
gh() +
labs(title = "Distribution of Resampled Tree Height")
This is the equivalent of collecting one bootstrapped sample and plotting its distribution.
In this section, we introduced the sample()
function, as
well as the argument replace = TRUE
that allows us to
sample with replacement. This will be important in the
bootstrapping process.
Question 2: Do the histogram for all felled cherry tree heights and the histogram of the bootstrap sample of cherry tree heights above look similar? Explain.
We previously saw the qnorm()
and qt()
functions which could be used to find quantiles of the normal and
t-distribution respectively.
We started with the coverage we wanted and divide the difference by two. For example, to find the middle 95%, we know we are excluding 5% at the edges; 2.5% on the left side and 2.5% on the right side. This gives us the quantiles (0.025, 0.975). Similarly, if we wanted an 80% confidence interval, our quantiles should be (0.1, 0.9).
When the distribution does not look like a Normal or t-distribution,
we cannot use the qnorm()
or qt()
functions.
We will instead use the quantile()
function in R.
The quantile()
function works by specifying the variable
you want quantiles of and also the specific quantile values. See the
example below to get the 2.5 and 97.5 quantiles for the cherry tree
heights.
quantile(x = trees$Height, probs=c(.025, .975))
## 2.5% 97.5%
## 63.75 86.25
The data below contains the total monthly precipitation in Grinnell, IA from February 2014 to February 2024 (121 total months). We will treat all of these 121 months as the population we are interested in.
The following code creates a histogram of the
precipitation
variable, the amount of rainfall on a rainy
day (inches), and finds the mean and standard deviation of this
variable.
## rain data
rain <- read.csv("http://collinn.github.io/data/grinnell_rain.csv")
ggplot(rain, aes(precip)) +
gh(bins = 8) +
labs(x = "Precipitation (Inches)",
title = "Rain Precipitation 2014-2024\nGrinnell, IA")
# mean
mean(rain$precip)
## [1] 2.069835
# standard deviation
sd(rain$precip)
## [1] 1.966139
Question 3
Use the qnorm()
function and these values for the
mean and standard deviation to find the the 0.025 and 0.975 quantiles.
Does the lower value make sense?
Using the quantile() function find the 0.025 and 0.975 quantiles
of the precip
variable. Do these values seem more
reasonable for the histogram?
Why are the quantiles you found in the two previous parts so different? Think about the histogram shape.
We can think of the process of bootstrapping as following a simple algorithm:
To help us with this procedure, we will need a new function called
bootstrap()
which I provided at the top of this lab. The
bootstrap function takes three arguments: the data I want to bootstrap,
the statistic I want to bootstrap, and the number of
bootstrapped samples I wish to collect (by default, it will collect
1000). We will illustrate its use again using the USArrests
data to find a confidence interval for the number of per capita murders
in the United States.
## Want to bootstrap the sample of rainy days, rain$precip
mean_boot <- bootstrap(rain$precip, mean)
head(mean_boot)
## Sample Statistic
## 1 1 2.269752
## 2 2 2.082810
## 3 3 2.013554
## 4 4 1.930165
## 5 5 2.276942
## 6 6 2.158264
head(mean_boot)
prints out the first 6 rows of the
data.frame returned by bootstrap()
: we see that the first
column indicates the sample number, while the second column indicates
the sample statistic, in our case the mean. Because it is a data.frame,
we can easily plot it:
ggplot(mean_boot, aes(Statistic)) +
gh(bins = 15)
Remember: this is a sampling distribution of the mean value.
We resampled our data 1,000 times, computed the sample mean each time,
and this is the distribution of values that were returned. From this
point, we can use the quantile()
function to return the
quantiles that we want. Because we want the middle 95%, we need to take
our quantiles at 2.5% and 97.5%
quantile(mean_boot$Statistic, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 1.731037 2.436981
Question 4
For this problem, we will assume that we have only collected data on
40 random months in this period, and our goal will be to estimate the
true value of the mean (\(\mu =
2.069\)) given the sample that we have. In other words, we will
be using rs
(Rain Sample) as our observed sample
dataset.
## Create a random index of months for our sample
# (copy these lines into your lab)
set.seed(123)
idx <- sample(1:nrow(rain), size = 40, replace = FALSE)
rs <- rain[idx, ]
# histogram of the sample
ggplot(rs, aes(x=precip)) + gh(bins=10)
Using the random sample of 40 months of Grinnell rain data above, make a bootstrap distribution for the mean using 2000 bootstrap samples.
Use the quantile()
function to get the 2.5 and 97.5
percentiles of the bootstrap distribution. This makes a 95% CI for the
mean precipitation per month in Grinnell. Is the true mean value of
\(\mu = 2.09\) within this
interval?
The following makes a 95% CI for the median, which we couldn’t do with the Normal or t-distribution methods.
med_boot <- bootstrap(rs$precip, median, n=2000)
ggplot(med_boot, aes(x=Statistic)) + gh(bins=10)
quantile(x=med_boot$Statistic, probs=c(.025, .975))
## 2.5% 97.5%
## 1.049875 2.415000