library(dplyr)
library(ggplot2)
# Pretty plots
theme_set(theme_bw())
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
We were first introduced to linear regression earlier this semester. We continue this topic today, revisiting the key R functions used for linear regression and introducing some new tools along the way.
Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable with a single explanatory variable, and the relationship is assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:
\[ e_i = y_i - \hat{y}_i \] Ultimately, the estimated values of the \(\beta\)'s are those chosen so as to minimize the sum of the squared residuals, \(\sum_{i=1}^n e_i^2\).
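As a quick illustration (a minimal sketch using made-up vectors x and y, not any of the lab's data), the least squares estimates can be computed directly: the estimated slope is the sample covariance of X and y divided by the sample variance of X, and the intercept then follows from the two means.
## A small made-up example: compute the least squares estimates "by hand"
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b1 <- cov(x, y) / var(x)       # estimated slope
b0 <- mean(y) - b1 * mean(x)   # estimated intercept
## Residuals e_i = y_i - yhat_i; their squared sum is what least squares minimizes
e <- y - (b0 + b1 * x)
sum(e^2)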
For practice, we will use the mtcars dataset that comes built-in to R. You can use ?mtcars to get a description of each of the data variables.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Similar to the ANOVA and t-test functions, the lm() function (Linear Model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicle's miles per gallon (mpg) using the size of the engine (disp):
## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)
## Summary function to display output
summary(fit)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
From here, we get a lot of useful summary information. Try to identify the following: the estimated values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\), their standard errors and p-values, and the R-squared value.
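If needed, individual pieces of this output can also be pulled directly from the fitted model object; a small sketch using standard extractor functions (coef(), confint(), and the r.squared element of the summary object):
## Estimated coefficients (the beta-hats)
coef(fit)
## Confidence intervals for the coefficients
confint(fit)
## R-squared, taken from the summary object
summary(fit)$r.squared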
It’s worth observing that the effect size, or the value of \(\hat{\beta}_1\), isn’t particularly large: without more context, it is difficult to tell whether this is because the effect itself is small or because mpg and engine displacement are measured on very different scales. We can investigate this further by creating a plot, where we find evidence for the latter. Additionally, recall that we can use the geom_smooth() function with ggplot to illustrate how our regression line will look:
ggplot(mtcars, aes(disp, mpg)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) # remove standard error
The apparent strength of this relationship corroborates what we found in
the linear model summary information above.
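Equivalently, the same line can be drawn directly from the estimated coefficients using geom_abline(); a small sketch, assuming fit is the model estimated above:
## Draw the fitted line from the estimated intercept and slope
ggplot(mtcars, aes(disp, mpg)) +
  geom_point() +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], color = "blue")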
Question 1: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt). Create a scatterplot of these two variables and use geom_smooth(method = lm) to add a linear regression line. How does this relationship appear?

We have seen previously that we can extend our linear model to include more than one predictor (independent) variable. For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
Written as an equation, this has the form
\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find: the intercept, 9.27, is the predicted odontoblast length for the orange juice group at a dose of 0 (even though in this study dose never takes on the value 0); each one-unit increase in dose is associated with an increase of 9.764 in predicted length; and guinea pigs given ascorbic acid (VC) have a predicted length 3.7 lower than those given orange juice at the same dose.

A plot of this regression looks like this:
Looking at this, we see that the intercept for the red line is indeed about 9.27, while the intercept for ascorbic acid is shifted 3.7 lower. The slope of the two lines remains the same.
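A plot like this can be constructed by attaching the model's fitted values to the data and drawing one line per supplement group; a small sketch, where tg_fit and the pred column are names introduced just for this example:
## Fit the model and store its fitted values
tg_fit <- lm(len ~ dose + supp, ToothGrowth)
ToothGrowth$pred <- predict(tg_fit)
## Points plus one fitted line per supplement; the two lines share a slope
ggplot(ToothGrowth, aes(dose, len, color = supp)) +
  geom_point() +
  geom_line(aes(y = pred))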
Question 2: Consider the dogs dataset we used in class:

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

Fit a linear model with speed as the dependent variable and include both color and size as the independent variables. Looking at the summary output, what variable(s) have become the reference variables?

Question 3: For this question, we will be creating a linear model to predict enrollment at primarily undergraduate institutions in the United States:
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
In a non-statistical context, the idea of a line specifying the relationship between two variables is considered a functional relationship, i.e., we can directly find the correct value of \(y\) given a value of \(X\). This is precisely the kind of relationship we would see, for example, when describing the value of a temperature in Celsius (y) given a temperature in Fahrenheit (X):
\[ y = \left(\frac{5}{9}\right)(X - 32) \]
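To emphasize the point, a functional relationship can be written as an ordinary R function with no error term at all; a small illustrative sketch, where f_to_c is a name made up for this example:
## Celsius is determined exactly by Fahrenheit -- no error term needed
f_to_c <- function(x) (5 / 9) * (x - 32)
f_to_c(c(32, 212))   # freezing and boiling points of water: 0 and 100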
In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that perfectly determines a value of weight given a certain height:
When describing a statistical relationship between two variables, it is custom to account for this variability about the line with the inclusion of an error term, \(\epsilon\) (epsilon):
\[ y = \beta_0 + X \beta_1 + \epsilon \]
In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observed value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\).
One of the primary assumptions for linear regression is that the error terms are normally distributed with a mean of \(0\) and variance \(\sigma^2\). Notationally, this is written
\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means that we expect our response variable, on average, to fall near our line of best fit, with the errors scattered around that line by roughly plus or minus some amount. By investigating our residuals, we can accomplish two goals: checking whether the Normality assumption is reasonable, and looking for patterns that suggest the linear model is a poor fit.
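To see what this assumption looks like when it holds exactly, we can simulate data from such a model; a small sketch, where the intercept, slope, and error standard deviation are made-up values:
## Simulate y = 2 + 3x + epsilon, with epsilon ~ N(0, sd = 2)
set.seed(10)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100, mean = 0, sd = 2)
## Fit the model; the residuals should look roughly Normal and centered at 0
sim_fit <- lm(y ~ x)
hist(sim_fit$residuals)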
Once we have fit a model with lm(), we can access the residuals using the $ operator, just as we would for a data frame. Here, for example, are the residuals from the model fit in Question 1:
## Fit model
fit <- lm(hp ~ wt, mtcars)
## Get residuals
res <- fit$residuals
## Plot them
hist(res, breaks = 8)
Here we see a slight skew, indicating that there are more large positive residuals than there are large negative ones, yet most appear to be centered around 0.
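Another common way to judge Normality, included here as an optional sketch, is a normal quantile-quantile plot of the same residuals using base R's qqnorm() and qqline():
## Points falling close to the reference line are consistent with Normality
qqnorm(res)
qqline(res)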
It is also useful to plot the residuals against the original variable in the model, in this case weight. Below is a plot showing how the residuals look for each value of wt:
## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
Here, we see the same thing again: many residuals are close to zero, with a handful of very large residuals making up the right side of the skew. While the residuals here don't appear perfectly normal, they are likely close enough for our purposes.
Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:
fit <- lm(weight ~ height, women)
ggplot(women, aes(height, weight)) +
geom_smooth(method = lm, se = FALSE) +
geom_point() + ggtitle("Fitted Line")
ggplot(women, aes(height, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
## Another residual plot, this time for a model predicting displacement (disp) with weight (wt)
fit <- lm(disp ~ wt, mtcars)
ggplot(mtcars, aes(wt, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
For the dogs dataset, construct a histogram of the residuals to check the Normality assumption for linear regression for a model predicting speed with breed.
Continuing with the college dataset from Question 3: for the model predicting enrollment using ACT_median, Debt_median, and Private, check the residual assumptions by looking at a histogram of the residuals, and also plot the residuals against the two quantitative variables. For these last two plots, also color the residuals by Private vs. Public. What do you see?