library(dplyr)
library(ggplot2)
# Pretty plots
theme_set(theme_bw())
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
We were first introduced to linear regression earlier this semester. Today we continue the topic, revisiting the key R functions used for linear regression and introducing some new tools.
Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable using a single explanatory variable, and assumes a relationship of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:
\[ e_i = y_i - \hat{y}_i \] Ultimately, the estimated values of the \(\beta\)’s are those chosen so as to minimize the sum of squared residuals, \(\sum_{i=1}^n e_i^2\).
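As a quick illustration (this code is an addition, not part of the original lab), we can check with the built-in mtcars data that the estimates returned by lm() do minimize the sum of squared residuals; any other line, such as one with its intercept shifted, produces a larger value:
## Fit a simple linear model and compute its residual sum of squares
fit_demo <- lm(hp ~ wt, data = mtcars)
coef(fit_demo)                        # beta0-hat and beta1-hat
rss_lm <- sum(fit_demo$residuals^2)
## Compare against a hand-picked line with the intercept shifted by 5
b0 <- coef(fit_demo)[1] + 5
b1 <- coef(fit_demo)[2]
rss_alt <- sum((mtcars$hp - (b0 + b1 * mtcars$wt))^2)
c(lm = rss_lm, shifted = rss_alt)     # the lm() fit has the smaller RSS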
In a non-statistical context, a line specifying the relationship between two variables describes a functional relationship, i.e., I can directly find the correct value of \(y\) given a value of \(X\). This is precisely the kind of relationship we see, for example, when converting a temperature in Fahrenheit (X) to a temperature in Celsius (y):
\[ y = \left(\frac{5}{9}\right)(X - 32) \]
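For instance, converting a few Fahrenheit temperatures in R (a small added illustration) gives exact Celsius values, with no need for an error term:
fahrenheit <- c(32, 68, 98.6, 212)
celsius <- (5 / 9) * (fahrenheit - 32)
celsius   # 0, 20, 37, 100 -- every X maps to exactly one y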
In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that perfectly determines a value for weight given a certain height.
When describing a statistical relationship between two variables, it is customary to account for this variability about the line by including an error term, \(\epsilon\) (epsilon):
\[ y = \beta_0 + X \beta_1 + \epsilon \]
In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observed value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\).
One of the primary assumptions of linear regression is that the error terms are normally distributed with a mean of \(0\) and variance \(\sigma^2\). Notationally, this is written
\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means we expect our response variable to fall, on average, near the line of best fit, with the errors scattered above and below the line. By investigating our residuals, we can accomplish two goals: checking whether this assumption is reasonable, and spotting patterns that suggest the model is missing something.
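To make this assumption concrete, here is a small simulation (an added illustration, not part of the original lab code) where the data are generated exactly according to this model; the residuals from the fitted line then look roughly bell-shaped and centered at 0:
## Simulate y = beta0 + beta1*x + epsilon, with epsilon ~ N(0, sigma^2)
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, mean = 0, sd = 2)   # beta0 = 2, beta1 = 3, sigma = 2
sim_fit <- lm(y ~ x)
## Residuals should look approximately Normal with mean 0
hist(sim_fit$residuals)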
Once we have fit a model with lm(), we can access the
residuals using the $, just as we would for a data
frame.
## Fit model
fit <- lm(hp ~ wt, mtcars)
## Get residuals
res <- fit$residuals
## Plot them
hist(res, breaks = 8)
Here we see a slight skew, indicating that there are more large
(positive) residuals than there are large (negative) ones, yet most appear
to be centered around 0.
It is also useful to plot the residuals against the original predictor in the
model, in this case weight. Below is a plot showing how the
residuals look for each value of wt.
## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
Here, we see the same thing again – many residuals are close to zero,
with a handful of very large residuals that make up the right side of
the skew. While it doesn’t appear that the residuals here are
exceptionally normal, they are likely close enough for our purposes.
Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:
fit <- lm(weight ~ height, women)
ggplot(women, aes(height, weight)) +
geom_smooth(method = lm, se = FALSE) +
geom_point() + ggtitle("Fitted Line")
ggplot(women, aes(height, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
As another example, here is the residual plot from a model using weight to predict engine displacement (disp) in mtcars:
fit <- lm(disp ~ wt, mtcars)
ggplot(mtcars, aes(wt, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
The Normal QQ plot (quantile-quantile) compares the quantiles of your data to the theoretical quantiles of a Normal distribution with the same mean and standard deviation. If your data resemble a Normal distribution, the points should fall along a roughly straight line.
The function used to assess this is qqnorm(); its input is the variable whose distribution we want to compare against the Normal. Recall that the distribution of the residuals from the linear model using wt to predict hp was skewed. This is how we get the normal QQ plot:
# res = residuals from linear model
qqnorm(res)
# add line for comparison
qqline(res)
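For comparison (an added illustration), a QQ plot of values simulated directly from a Normal distribution falls close to the reference line:
# simulated Normal data should track the reference line closely
z <- rnorm(100)
qqnorm(z)
qqline(z)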
For the dogs dataset, we could construct linear models
predicting speed using combinations of color,
size, or breed.
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
head(dogs)
## breed speed color size
## 1 Greyhound 34.45972 black large
## 2 Greyhound 40.15329 white large
## 3 Greyhound 36.94788 black large
## 4 Greyhound 40.84922 white large
## 5 Greyhound 45.70215 black large
## 6 Greyhound 42.97792 white large
Residuals help us see how well the model is performing. Answer the following questions related to residuals.
dogs %>% ggplot(aes(x=speed, y = color)) + geom_boxplot() +
geom_jitter(height = 0.1) # height controls vertical space in jitter
Part A: (From lecture): Based on the boxplots, does it look like color is going to help you make very accurate predictions of speed?
Part B: Use the following code to make a linear model predicting speed using color. Based on the summary output, argue for color being a useful predictor of speed. Then, also using the summary output, argue that color is not a useful predictor for speed.
model_color = lm(speed~color, data=dogs)
model_color %>% summary()
##
## Call:
## lm(formula = speed ~ color, data = dogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1100 -9.8896 0.0764 9.2445 24.8540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.8482 0.8005 26.045 <2e-16 ***
## colorbrown 1.9612 1.7899 1.096 0.2739
## colorwhite 4.2360 1.7899 2.367 0.0184 *
## coloryellow -2.3968 1.3864 -1.729 0.0846 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.32 on 396 degrees of freedom
## Multiple R-squared: 0.03152, Adjusted R-squared: 0.02418
## F-statistic: 4.296 on 3 and 396 DF, p-value: 0.005334
Part C: Using the residual plot for the linear regression, argue whether you think color is useful for predicting speed.
ggplot(dogs, aes(speed, model_color$residuals, color=color)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
# pick better color choices
Part D: Using the following plots, argue whether the Normality and Homoscedasticity conditions are met for the linear model, and what this implies about how well the model is doing.
# histogram of residuals
ggplot(dogs, aes(y=model_color$residuals)) +
geom_histogram()
# Normal QQ Plot
qqnorm(model_color$residuals)
qqline(model_color$residuals)
Part E: Add breed information into the
residual plots as a color/fill in the aes() argument.
Explain what patterns you see in the residuals that indicate
breed could be a useful predictor.
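As a starting point (one possible sketch, not the only answer), breed can be swapped in for color in the aes() call of the earlier residual plot:
## One way to color the residual plot by breed
ggplot(dogs, aes(speed, model_color$residuals, color = breed)) +
geom_point() + ggtitle("Residual Plot by Breed") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")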
Goal: After identifying breed as a potentially useful predictor, we are going to repeat several of the parts from Question 1.
Part A: Make boxplots of the dog breeds and their speeds. Based on the boxplots, does it look like breed is going to help you make very accurate predictions of speed?
Part B: Fit a linear model predicting speed using dog breed. Based on the summary output, argue for breed being a useful predictor of speed.
Part C: Using the following residual plots, do you see any violation of model assumptions? Explain.
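A minimal sketch of the kinds of residual plots to examine (assuming the Part B model is saved as model_breed, a name chosen here for illustration):
## Assumes the Part B model: model_breed <- lm(speed ~ breed, data = dogs)
model_breed <- lm(speed ~ breed, data = dogs)
# histogram of residuals
ggplot(dogs, aes(x = model_breed$residuals)) + geom_histogram()
# Normal QQ plot
qqnorm(model_breed$residuals)
qqline(model_breed$residuals)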
We are going back to the college dataset for this problem.
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Part A: Use the following code to make a linear
model predicting Enrollment based on whether the college is
private or public (we did something similar in the previous lab). Based
on the summary output, is whether a college is public or private a
useful predictor of enrollment?
fit = lm(Enrollment ~ Private, data=college)
fit %>% summary()
Part B: Let’s say I wanted a more sophisticated model. Which model below performs better for prediction? Explain using the following summary output.
lm(Enrollment ~ Private + ACT_median, data=college) %>% summary()
lm(Enrollment ~ Private + Debt_median, data=college) %>% summary()
Part C: Based on the following boxplots, explain how we could have identified which variable would be more helpful in Part B. (Hint: which variable is going to give us information we kind of already know about?)
college %>% ggplot(aes(x = ACT_median, y = Private)) +
geom_boxplot()
college %>% ggplot(aes(x = Debt_median, y = Private)) +
geom_boxplot()
Part D: Check the residual conditions for the better-fitting model from Part B by looking at the QQ plot, the histogram of residuals, and the plot of residuals vs. the quantitative predictor. For these last two plots, also color/fill the residuals by Private vs. Public. What do you see?
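A minimal sketch of these checks (assuming, for illustration only, that the ACT_median model from Part B is the better one; swap in Debt_median if your answer differs):
## Hypothetical: replace ACT_median with whichever model fit better in Part B
fit2 <- lm(Enrollment ~ Private + ACT_median, data = college)
# QQ plot of residuals
qqnorm(fit2$residuals)
qqline(fit2$residuals)
# histogram of residuals, filled by Private vs. Public
ggplot(college, aes(x = fit2$residuals, fill = Private)) + geom_histogram()
# residuals vs. the quantitative predictor, colored by Private vs. Public
ggplot(college, aes(ACT_median, fit2$residuals, color = Private)) +
geom_point() +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")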
Part E: Some might argue that checking residual conditions demonstrates a ‘prediction’ mindset (see reading assignment). Make an argument for this statement.
Part F: A more ‘explanation focused’ approach may still find value in using the better regression model from Part B even if the conditions/assumptions are not met. Using the summary output, generally explain the relationship between Enrollment and the predictors. Did we still learn something? Note: violation of the assumptions/conditions means that the p-values are not accurate (how inaccurate is complicated). The results can still be used as an exploratory analysis to guide future research.