library(dplyr)
library(ggplot2)

# Pretty plots
theme_set(theme_bw())

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Refresher on Linear Regression in R

We were first introduced to linear regression earlier this semester. We continue this topic today, revisiting the key R functions used for linear regression and introducing some new tools.

Simple Linear Regression

Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable with a single explanatory variable, with the relationship assumed to be of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:

\[ e_i = y_i - \hat{y}_i \] Ultimately, the estimated values of the \(\beta\)’s are those chosen so as to minimize the residual sum of squares, \(\sum_{i=1}^n e_i^2\).
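To make this concrete, here is a minimal sketch (not part of the original lab) that numerically searches for the intercept and slope minimizing the residual sum of squares, using the mtcars data that appears later in this lab, and compares the result with what lm() reports:

## Residual sum of squares for a candidate (intercept, slope)
rss <- function(beta, x, y) sum((y - (beta[1] + beta[2] * x))^2)

## Numerically minimize the RSS for the model hp ~ wt
ls_fit <- optim(c(0, 0), rss, x = mtcars$wt, y = mtcars$hp)
ls_fit$par

## lm() arrives at (essentially) the same coefficients, up to numerical tolerance
fit <- lm(hp ~ wt, mtcars)
coef(fit)

## Predicted values and residuals from the fitted line
yhat <- predict(fit)
head(mtcars$hp - yhat)   # matches head(fit$residuals)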

Error

In a non-statistical context, the idea of a line specifying the relationship between two variables is considered a functional relationship, i.e., I can directly find the correct value of \(y\) given a value of \(X\). This is precisely the kind of relationship we would see, for example, when describing the value of a temperature in Celsius (y) given a temperature in Fahrenheit (X)

\[ y = \frac{5}{9}\left(X - 32\right) \]
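In R, this conversion is exact for every input; there is no leftover error to model (a quick illustration, not part of the original lab):

## Exact (functional) conversion from Fahrenheit to Celsius
f_to_c <- function(f) (5 / 9) * (f - 32)
f_to_c(c(32, 98.6, 212))   # exactly 0, 37, and 100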

In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that perfectly determines a value of weight for a given height:
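(A scatterplot appeared here in the original page. A minimal sketch that produces a comparable figure, assuming it was built from R's built-in women dataset used later in this lab, is below.)

## Height vs. weight: the points scatter around a line rather than falling exactly on it
ggplot(women, aes(height, weight)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)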

When describing a statistical relationship between two variables, it is customary to account for this variability about the line with the inclusion of an error term, \(\epsilon\) (epsilon):

\[ y = \beta_0 + X \beta_1 + \epsilon \]

In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observed value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\).

One of the primary assumptions for linear regression is that the error terms are normally distributed with a mean of \(0\) and a variance of \(\sigma^2\). Notationally, this is written

\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means that, on average, we expect our response variable to fall near the line of best fit, with the error scattering points above and below the line by some amount. By investigating our residuals, we can accomplish two goals:

  1. First, much of the theory around linear regression rests on the assumption that the errors are normally distributed. If this is not true, we may have issues in making future predictions.
  2. Second, patterns in our residuals can indicate that we are missing critical information. By comparing our residuals against variables not yet in the model, we can often identify additional variables worth including.
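To make the error assumption concrete, here is a small simulation sketch (not from the original lab) that generates data following the model exactly, with errors drawn from \(N(0, \sigma^2)\):

set.seed(1)

## Simulate y = beta0 + beta1 * x + epsilon, with epsilon ~ N(0, sigma^2)
x <- runif(200, 0, 10)
eps <- rnorm(200, mean = 0, sd = 2)   # sigma = 2
y <- 5 + 3 * x + eps

## The simulated errors are centered at 0 and roughly bell-shaped
hist(eps, breaks = 15)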

Once we have fit a model with lm(), we can access the residuals using the $ operator, just as we would for a data frame (the residuals() function works as well).

## Fit model
fit <- lm(hp ~ wt, mtcars)

## Get residuals
res <- fit$residuals

## Plot them
hist(res, breaks = 8)

Here we see a slight skew, indicating that there are more large (positive) residuals than there are large (negative) ones, yet most appear to be centered around 0.

It is also useful to plot the residuals against the original explanatory variable in the model, in this case weight. Below is a plot showing how the residuals look for each value of wt.

## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Here, we see the same thing again – many residuals are close to zero, with a handful of very large residuals that make up the right side of the skew. While it doesn’t appear that the residuals here are exceptionally normal, they are likely close enough for our purposes.

Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:

fit <- lm(weight ~ height, women)

ggplot(women, aes(height, weight)) +
  geom_smooth(method = lm, se = FALSE) +
  geom_point() + ggtitle("Fitted Line")

ggplot(women, aes(height, fit$residuals)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

As another example, consider a model using weight to predict engine displacement; we can examine its residuals in the same way:

fit <- lm(disp ~ wt, mtcars)

ggplot(mtcars, aes(wt, fit$residuals)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Checking to see if a distribution is Normal (QQ plot)

The Normal QQ plot (quantile-quantile) compares the quantiles of your data to the theoretical quantiles of a Normal distribution with the same mean and standard deviation. If your data resemble a Normal distribution, you should expect the points to fall along a straight line.

The function used to assess this is qqnorm(); its input is the variable whose distribution we want to check for Normality. Recall that the distribution of the residuals from the linear model using wt to predict hp was skewed. This is how we get the Normal QQ plot:

# res = residuals from linear model
qqnorm(res)

# add line for comparison
qqline(res)
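For comparison (a sketch not in the original lab), a QQ plot of data drawn from an actual Normal distribution hugs the reference line closely:

## QQ plot of genuinely Normal data, for reference
z <- rnorm(500)
qqnorm(z)
qqline(z)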

Question 1

For the dogs dataset, we could construct linear models predicting speed using combinations of color, size, or breed.

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
head(dogs)
##       breed    speed color  size
## 1 Greyhound 34.45972 black large
## 2 Greyhound 40.15329 white large
## 3 Greyhound 36.94788 black large
## 4 Greyhound 40.84922 white large
## 5 Greyhound 45.70215 black large
## 6 Greyhound 42.97792 white large

Residuals help us see how well the model is performing. Answer the following questions related to residuals.

dogs %>% ggplot(aes(x=speed, y = color)) + geom_boxplot() +
  geom_jitter(height = 0.1) # height controls vertical space in jitter

Part A: (From lecture): Based on the boxplots, does it look like color is going to help you make very accurate predictions of speed?

Part B: Use the following code to make a linear model predicting speed using color. Based on the summary output, argue for color being a useful predictor of speed. Then, also using the summary output, argue that color is not a useful predictor for speed.

model_color = lm(speed~color, data=dogs)
model_color %>% summary()
## 
## Call:
## lm(formula = speed ~ color, data = dogs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1100  -9.8896   0.0764   9.2445  24.8540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.8482     0.8005  26.045   <2e-16 ***
## colorbrown    1.9612     1.7899   1.096   0.2739    
## colorwhite    4.2360     1.7899   2.367   0.0184 *  
## coloryellow  -2.3968     1.3864  -1.729   0.0846 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.32 on 396 degrees of freedom
## Multiple R-squared:  0.03152,    Adjusted R-squared:  0.02418 
## F-statistic: 4.296 on 3 and 396 DF,  p-value: 0.005334

Part C: Using the residual plot for the linear regression, argue whether you think color is useful for predicting speed.

ggplot(dogs, aes(speed, model_color$residuals, color=color)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")


Part D: Using the following plots, argue whether the Normality and homoscedasticity conditions are met for the linear model, and what this implies about how well the model is doing.

# histogram of residuals
ggplot(dogs, aes(y=model_color$residuals)) +
  geom_histogram()

# Normal QQ Plot
qqnorm(model_color$residuals)
qqline(model_color$residuals)

Part E: Add breed information into the residual plots as a color/fill in the aes() argument. Explain what patterns you see in the residuals that indicate breed could be a useful predictor.

Question 2

Goal: After identifying breed as a potentially useful predictor, we are going to repeat several of the parts from Question 1.

Part A: Make boxplots of the dog breeds and their speeds. Based on the boxplots, does it look like breed is going to help you make very accurate predictions of speed?

Part B: Fit a linear model predicting speed using dog breed. Based on the summary output, argue for breed being a useful predictor of speed.

Part C: Using the following residual plots, do you see any violation of model assumptions? Explain.
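(The rendered plots are not reproduced here; the sketch below, which assumes a model named model_breed fit as in Part B, produces comparable residual plots.)

model_breed <- lm(speed ~ breed, data = dogs)

## Histogram of residuals
ggplot(dogs, aes(x = model_breed$residuals)) +
  geom_histogram(bins = 20)

## Normal QQ plot of residuals
qqnorm(model_breed$residuals)
qqline(model_breed$residuals)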

Question 3

We are going back to the college dataset for this problem.

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Part A: Use the following code to make a linear model predicting Enrollment based on whether the college is private or public (we did something similar in the previous lab). Based on the summary output, is whether a college is public or private a useful predictor of enrollment?

fit = lm(Enrollment ~ Private, data=college)
fit %>% summary()

Part B: Let’s say I wanted a more sophisticated model. Which model below performs better for prediction? Explain using the following summary output.

lm(Enrollment ~ Private + ACT_median, data=college) %>% summary()
lm(Enrollment ~ Private + Debt_median, data=college) %>% summary()

Part C: Based on the following boxplots, explain how we could have identified which variable would be more helpful in Part B. (Hint: which variable is going to give us information we kind of already know about?)

college %>% ggplot(aes(x = ACT_median, y = Private)) +
  geom_boxplot()

college %>% ggplot(aes(x = Debt_median, y = Private)) +
  geom_boxplot()

Part D: Check the residual conditions for the better-fitting model in Part B by looking at the QQ plot, the histogram of residuals, and also the plot of residuals vs. the quantitative predictor. For these last two plots, also color/fill the residuals by Private vs. Public. What do you see?

Part E: Some might argue that checking residual conditions demonstrates a ‘prediction’ mindset (see reading assignment). Make an argument for this statement.

Part F: A more ‘explanation focused’ approach may still find value using the better regression model in part C even if the conditions/assumptions are not met. Using the summary output, generally explain the relationship between Enrollment and the predictors. Did we still learn something? Note: Violation of the assumptions conditions actually means that the p-values are not accurate (how inaccurate is complicated). The results can be used to guide future research as an exploratory analysis.