library(dplyr)
library(ggplot2)
# Pretty plots
theme_set(theme_bw())
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
We were first introduced to linear regression earlier this semester. Today we continue that topic, revisiting the key R functions used for linear regression and introducing some new tools.
Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable with a single explanatory variable, assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:
\[ e_i = y_i - \hat{y}_i \] Ultimately, the estimated values for the \(\beta\)'s are those chosen so as to minimize the residual sum of squares, \(\sum_{i=1}^n e_i^2\).
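As a quick illustration (a sketch, using the built-in mtcars data introduced just below), the least-squares estimates have a closed form that we can compute by hand and compare against lm():
data(mtcars)
x <- mtcars$disp; y <- mtcars$mpg
b1 <- cov(x, y) / var(x)        # slope that minimizes the sum of squared residuals
b0 <- mean(y) - b1 * mean(x)    # intercept
c(b0, b1)                       # matches the coefficients from lm() (~29.60 and ~-0.0412)
coef(lm(mpg ~ disp, mtcars))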
For practice, we will use the mtcars dataset that comes built-in to R. You can use ?mtcars to get a description of each of the variables.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Similar to the ANOVA and t-test functions, the lm() function (linear model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicle's miles per gallon (mpg) using the size of the engine (disp):
## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)
## Summary function to display output
summary(fit)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
From here, we get a lot of useful summary information: the estimated coefficients, their standard errors and p-values, the residual standard error, and the R-squared value. Try to identify each of these in the output.
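If you want to pull these quantities out programmatically rather than reading them off the printout, a few accessors help (a brief sketch using the fit object from above):
coef(fit)                # estimated intercept and slope
confint(fit)             # 95% confidence intervals for the coefficients
summary(fit)$r.squared   # proportion of variance in mpg explained by disp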
It’s worth observing that the effect size, i.e., the value of \(\hat{\beta}_1\), isn’t particularly large. Without more context, it is difficult to tell whether this is because the effect itself is small or because of differences in scale between mpg and engine displacement; a plot will give us evidence for the latter. Recall that we can use the geom_smooth() function with ggplot to illustrate how our regression line will look:
ggplot(mtcars, aes(disp, mpg)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) # se = FALSE removes the standard error band
The apparent strength of this relationship corroborates what we found in
the linear model summary information above.
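As an aside (not part of the original lab), for simple linear regression the R-squared reported above is just the squared correlation between the two variables, which we can verify directly:
r <- cor(mtcars$disp, mtcars$mpg)
r     # about -0.85: a fairly strong negative linear relationship
r^2   # about 0.718, matching the Multiple R-squared above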
Question 1: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt). Create a scatterplot of these two variables and use geom_smooth(method = lm) to add a linear regression line. How does this relationship appear?
Solutions:
Part A: The null hypothesis is that there is no linear relationship between the two variables, i.e., that the slope \(\beta_1 = 0\).
Part B
ggplot(mtcars, aes(x=wt, y=hp)) + geom_point() + geom_smooth(method=lm, se=F)
The relationship appears roughly linear, though there is some hint of curvature.
Part C
lm(hp ~ wt, mtcars) %>% summary()
##
## Call:
## lm(formula = hp ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.430 -33.596 -13.587 7.913 172.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.821 32.325 -0.056 0.955
## wt 46.160 9.625 4.796 4.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.44 on 30 degrees of freedom
## Multiple R-squared: 0.4339, Adjusted R-squared: 0.4151
## F-statistic: 23 on 1 and 30 DF, p-value: 4.146e-05
Part D: R-squared is 0.4339. About 43% of the variability in horsepower can be explained by our linear regression model using weight as a predictor.
Part E
Intercept = -1.821. For a vehicle with 0 weight, the predicted horsepower is -1.821, which does not make physical sense (it is an extrapolation well outside the data).
Slope = 46.16. For each additional 1,000 lbs of weight (the units of wt), the predicted horsepower increases by 46.16.
Part F
# predicted mpg for a car with disp = 300
29.599855 - .041215*300
## [1] 17.23536
# residual when the observed mpg is 15
15 - (29.599855 - .041215*300)
## [1] -2.235355
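The same calculation can be done with predict(); this is a sketch assuming Part F refers to the mpg ~ disp model fit earlier:
fit_mpg <- lm(mpg ~ disp, mtcars)
predict(fit_mpg, newdata = data.frame(disp = 300))        # predicted mpg, about 17.24
15 - predict(fit_mpg, newdata = data.frame(disp = 300))   # residual, about -2.24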
We have seen previously that we can extend our linear model to include more than one predictor (independent) variable. For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
Written as an equation, this has the form
\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:
the intercept of 9.27 is the predicted odontoblast length for the orange juice group at a dose of zero (an extrapolation, since dose never takes on the value 0); each one-unit increase in dose is associated with a 9.764 increase in predicted length; and the ascorbic acid (VC) group is predicted to be 3.7 units lower than the orange juice group at every dose.
A plot of this regression looks like this:
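(The rendered figure is not reproduced here; the sketch below shows one way such a plot could be generated by overlaying the model's fitted values, giving two parallel lines, one per supplement. The name fit_tooth is just a local name used for this sketch.)
fit_tooth <- lm(len ~ dose + supp, ToothGrowth)
ggplot(ToothGrowth, aes(dose, len, color = supp)) +
  geom_point() +
  geom_line(aes(y = fitted(fit_tooth)))   # parallel fitted lines for OJ and VC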
Looking at this, we see indeed that the intercept for the red (orange juice) line is about 9.27, while the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.
Question 2: Consider the dogs dataset we used in class.
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
Fit a linear model with speed as the dependent variable and include both color and size as the independent variables. Looking at the summary output, what variable(s) have become the reference variables?
lm(speed ~ color + size, dogs) %>% summary()
##
## Call:
## lm(formula = speed ~ color + size, data = dogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1940 -9.0959 0.0206 9.6088 22.7700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.932 1.005 22.825 < 2e-16 ***
## colorbrown 4.045 1.873 2.160 0.031353 *
## colorwhite 4.761 1.815 2.623 0.009053 **
## coloryellow -3.701 1.421 -2.605 0.009541 **
## sizemedium -5.217 1.529 -3.412 0.000711 ***
## sizesmall -3.119 1.458 -2.139 0.033010 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.17 on 394 degrees of freedom
## Multiple R-squared: 0.06122, Adjusted R-squared: 0.04931
## F-statistic: 5.139 on 5 and 394 DF, p-value: 0.0001402
# large black dogs are the reference, as these are not mentioned in coefficients
# The tests are for whether the coefficients are zero,
# not whether they are different from each other
22.932 - 5.217 + 4.761
## [1] 22.476
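As a check, the same prediction can be obtained with predict(), here assuming the hand calculation above is for a medium white dog (intercept + sizemedium + colorwhite):
fit_dogs <- lm(speed ~ color + size, dogs)
predict(fit_dogs, newdata = data.frame(color = "white", size = "medium"))   # about 22.48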
Question 3: For this question, we will be creating a linear model to predict enrollment at primarily undergraduate institutions in the United States.
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
lm(Enrollment ~ ACT_median + Private, college) %>% summary()
##
## Call:
## lm(formula = Enrollment ~ ACT_median + Private, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13197 -3309 -355 1600 43985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16565.59 1328.19 -12.47 <2e-16 ***
## ACT_median 808.32 54.66 14.79 <2e-16 ***
## PrivatePublic 9147.79 395.55 23.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6408 on 1092 degrees of freedom
## Multiple R-squared: 0.3893, Adjusted R-squared: 0.3882
## F-statistic: 348 on 2 and 1092 DF, p-value: < 2.2e-16
Very generally, as median ACT goes up, so does predicted enrollment. Holding median ACT fixed, public colleges have higher enrollment on average than private colleges.
Part B: Create two scatter plots investigating the relationship between ACT_median and both Debt_median and Salary10yr_median. Which appears to be most correlated with ACT_median? Which do you think would be a better variable to include in our linear model?
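One way Part B could be approached (a sketch; the solution for this part is not shown in the original) is with two scatterplots:
ggplot(college, aes(ACT_median, Debt_median)) + geom_point()
ggplot(college, aes(ACT_median, Salary10yr_median)) + geom_point()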
Part C: Create two additional linear models for predicting Enrollment, one adding Debt_median to the model in Part A, the other adding Salary10yr_median. Is what you found consistent with what you chose in Part B? What metric(s) would you use in support of your choice?
lm(Enrollment ~ ACT_median + Salary10yr_median + Private, college) %>% summary()
##
## Call:
## lm(formula = Enrollment ~ ACT_median + Salary10yr_median + Private,
## data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13199 -3335 -390 1570 44107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.652e+04 1.330e+03 -12.419 <2e-16 ***
## ACT_median 7.705e+02 7.331e+01 10.510 <2e-16 ***
## Salary10yr_median 1.827e-02 2.362e-02 0.774 0.439
## PrivatePublic 9.168e+03 3.965e+02 23.123 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6409 on 1091 degrees of freedom
## Multiple R-squared: 0.3896, Adjusted R-squared: 0.3879
## F-statistic: 232.1 on 3 and 1091 DF, p-value: < 2.2e-16
lm(Enrollment ~ ACT_median + Debt_median + Private, college) %>% summary()
##
## Call:
## lm(formula = Enrollment ~ ACT_median + Debt_median + Private,
## data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13021 -3239 -393 1564 43696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.389e+04 1.510e+03 -9.199 < 2e-16 ***
## ACT_median 8.453e+02 5.528e+01 15.291 < 2e-16 ***
## Debt_median -1.974e-01 5.395e-02 -3.658 0.000266 ***
## PrivatePublic 8.699e+03 4.120e+02 21.112 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6372 on 1091 degrees of freedom
## Multiple R-squared: 0.3967, Adjusted R-squared: 0.395
## F-statistic: 239.1 on 3 and 1091 DF, p-value: < 2.2e-16
Going by R-squared, both models perform roughly the same. In the first model, the Salary10yr_median coefficient is not statistically significant.
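To make that comparison explicit, the (adjusted) R-squared values can be pulled directly from each model (a sketch):
m_salary <- lm(Enrollment ~ ACT_median + Salary10yr_median + Private, college)
m_debt   <- lm(Enrollment ~ ACT_median + Debt_median + Private, college)
summary(m_salary)$adj.r.squared   # about 0.388
summary(m_debt)$adj.r.squared     # about 0.395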
In a non-statistical context, the idea of a line specifying the relationship between two variables is considered a functional relationship, i.e., we can directly find the correct value of \(y\) given a value of \(X\). This is precisely the kind of relationship we would see, for example, when describing the value of a temperature in Celsius (\(y\)) given a temperature in Fahrenheit (\(X\)):
\[ y = \frac{5}{9}\left(X - 32\right) \]
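For instance, this conversion can be written as an ordinary R function with no error term; every input maps to exactly one output (f_to_c is just an illustrative name):
f_to_c <- function(f) (5 / 9) * (f - 32)   # deterministic: no epsilon involved
f_to_c(c(32, 212))                          # 0 and 100 degrees Celsius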
In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that would perfectly determine a value for weight given a certain height:
When describing a statistical relationship between two variables, it is customary to account for this variability about the line by including an error term, \(\epsilon\) (epsilon):
\[ y = \beta_0 + X \beta_1 + \epsilon \]
In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observed value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\).
One of the primary assumptions for linear regression is that the error terms are normally distributed with a mean value of \(0\) and variability of \(\sigma^2\). Notationally, this is written
\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means that we expect our response variable, on average, to fall near the line of best fit, with the size of the scatter around that line governed by \(\sigma^2\). By investigating our residuals, we can accomplish two goals: checking whether this normality assumption is reasonable, and looking for patterns that suggest a straight line is a poor description of the relationship.
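To see what this assumption looks like in practice, here is a small simulation sketch (not from the lab) where data are generated exactly according to the model:
set.seed(42)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200, mean = 0, sd = 3)   # epsilon ~ N(0, 3^2)
hist(lm(y ~ x)$residuals)                       # roughly normal, centered at 0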
Once we have fit a model with lm(), we can access the residuals using the $ operator, just as we would for a data frame. Here, for example, are the residuals from the model fit in Question 1:
## Fit model
fit <- lm(hp ~ wt, mtcars)
## Get residuals
res <- fit$residuals
## Plot them
hist(res, breaks = 8)
Here we see a slight right skew, indicating that there are more large positive residuals than large negative ones, yet most appear to be centered around 0.
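Another common check of normality (not used in this lab, but handy to know) is a normal Q-Q plot of the residuals:
qqnorm(res)   # points should fall near a straight line if the residuals are normal
qqline(res)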
It is also useful to plot the residuals against the original variable in the model, in this case weight. Below is a plot that shows how the residuals look for each value of wt:
## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
Here, we see the same thing again: many residuals are close to zero, with a handful of very large residuals that make up the right tail of the distribution. While the residuals here are not perfectly normal, they are likely close enough for our purposes.
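Base R can also draw a residuals-versus-fitted plot directly from the model object, which serves much the same purpose:
plot(fit, which = 1)   # residuals against fitted values, with a smoothed trend line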
Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:
fit <- lm(weight ~ height, women)
ggplot(women, aes(height, weight)) +
geom_smooth(method = lm, se = FALSE) +
geom_point() + ggtitle("Fitted Line")
ggplot(women, aes(height, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
fit <- lm(disp ~ wt, mtcars)
ggplot(mtcars, aes(wt, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
For the dogs dataset, construct a histogram of residuals to check the normality assumption of linear regression for a model predicting speed with breed.
model = lm(speed ~ breed, data=dogs)
ggplot(dogs, aes(x=model$residuals)) +
geom_histogram()
This is a continuation of the college dataset used in Question 3. For the model predicting Enrollment using ACT_median, Debt_median, and Private, check the residual assumptions by looking at a histogram of the residuals, and also plot the residuals against the two quantitative variables. For these last two plots, also color the residuals by Private vs. Public. What do you see?
fit <- lm(Enrollment ~ ACT_median + Debt_median + Private, college)
ggplot(data=college, aes(x=fit$residuals)) + geom_histogram()
ggplot(college, aes(ACT_median, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
ggplot(college, aes(Debt_median, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
ggplot(college, aes(ACT_median, fit$residuals, color=Private)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
ggplot(college, aes(Debt_median, fit$residuals, color=Private)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")