library(dplyr)
library(ggplot2)

# Pretty plots
theme_set(theme_bw())

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Re-introduction to Linear Regression in R

We were first introduced to linear regression earlier this semester. We continue this topic today, re-demonstrating the key functions in R that are used for linear regression, as well as some new tools for doing so.

Simple Linear Regression

Recall that simple linear regression (as opposed to multivariate linear regression) models a continuous/quantitative outcome variable using a single explanatory variable, and is assumed to be of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \] Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:

\[ e_i = y_i - \hat{y}_i \] Ultimately, the estimated values of the \(\beta\)'s are those chosen so as to minimize the sum of squared residuals, \(\sum_{i=1}^n e_i^2\).
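
To make this concrete, here is a small sketch using simulated data (not the data used below): the closed-form least-squares estimates, \(\hat{\beta}_1 = \text{cov}(X, y)/\text{var}(X)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{X}\), should match what lm() returns.

## Simulate data from a known line plus Normal error
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)

## Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)       # slope
b0 <- mean(y) - b1 * mean(x)   # intercept

## These should agree with the coefficients returned by lm()
c(b0, b1)
coef(lm(y ~ x))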


For practice, we will use the mtcars dataset that comes built into R. You can use ?mtcars to get a description of each of the variables in the data.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Similar to the ANOVA and t-test functions, the lm() function (Linear Model) operates using a formula syntax with an outcome and predictor variables. Here, we use lm() to estimate a vehicle's miles per gallon (mpg) using the size of its engine (disp)

## First fit model with lm
fit <- lm(mpg ~ disp, mtcars)

## Summary function to display output
summary(fit)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

From here, we get a lot of useful summary information. Try to identify the following:

  1. The slope estimate is \(\hat{\beta}_1 = -0.041\), indicating that for each additional cubic inch of displacement, the vehicle is expected to get about 0.04 fewer miles per gallon
  2. Based on the p-value reported in the summary, we have evidence that the relationship between mpg and displacement is statistically significant, meaning we reject the null hypothesis that there is no linear relationship between the two
  3. We have a multiple R-squared of 0.718, indicating that about 72% of the total variance in mpg is explained by vehicle displacement (these can also be extracted in code, as sketched below)
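
Here is a quick sketch of pulling these values out of the fitted model object; the displacement value of 200 is arbitrary, chosen only to illustrate predict().

## Slope and intercept estimates
coef(fit)

## Full coefficient table (estimates, standard errors, t values, p-values)
summary(fit)$coefficients

## Multiple R-squared
summary(fit)$r.squared

## Predicted mpg for a vehicle with 200 cubic inches of displacement
predict(fit, newdata = data.frame(disp = 200))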

It's worth observing that the effect size, i.e., the value of \(\hat{\beta}_1\), isn't particularly large. Without more context, it is hard to tell whether this is because the effect itself is small or because mpg and engine displacement are measured on very different scales. We can investigate this by creating a plot, where we find evidence for the latter. Recall that we can use the geom_smooth() function with ggplot to show what our regression line looks like

ggplot(mtcars, aes(disp, mpg)) + 
  geom_point() + 
  geom_smooth(method = lm, se = FALSE) # remove standard error

The apparent strength of this relationship corroborates what we found in the linear model summary information above.


Question 1: For this question, we will be using the mtcars dataset to investigate the relationship between horsepower (hp) and vehicle weight (wt)


Multivariate Regression

We have seen previously that we can extend our linear model to include more than one predictor or independent variable. For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:

data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:

  1. An intercept of 9.27 (note, however, that this is not a meaningful term on its own, as our independent variable dose never takes on the value 0)
  2. A slope of 9.76, indicating that every additional milligram of vitamin C is associated with a 9.76-unit increase in odontoblast length
  3. A coefficient of -3.7 associated with the indicator for ascorbic acid (VC). As this is an indicator term, it effectively shifts the regression line vertically, up or down according to its sign. Additionally, because the indicator is for ascorbic acid, orange juice has become the reference level (a way to change this is sketched below)

A plot of this regression looks like this:
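
One way to sketch a plot like this is to color the points by supplement type and add each fitted line with geom_abline(), using the coefficients estimated above (the object name fit_tg is ours, chosen just for this sketch):

## Refit the model and pull out its coefficients
fit_tg <- lm(len ~ dose + supp, ToothGrowth)
b <- coef(fit_tg)

ggplot(ToothGrowth, aes(dose, len, color = supp)) +
  geom_point() +
  ## Orange juice (reference level): intercept b[1], slope b[2]
  geom_abline(intercept = b[1], slope = b[2], color = "red") +
  ## Ascorbic acid: same slope, intercept shifted by b[3] = -3.7
  geom_abline(intercept = b[1] + b[3], slope = b[2], color = "blue")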

Looking at this, we see that the intercept for the red line (orange juice) is indeed about 9.27, with the intercept for ascorbic acid shifted 3.7 lower. The slope of the two lines remains the same.
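
If we instead wanted ascorbic acid (VC) to serve as the reference level, one option (a sketch, done on a copy of the data so the original is left untouched) is to relevel the supp factor before fitting; the indicator then becomes suppOJ with a coefficient of +3.7.

## Copy the data and make VC the reference level instead of OJ
tg <- ToothGrowth
tg$supp <- relevel(tg$supp, ref = "VC")

## Same model as before, but the indicator is now for orange juice
lm(len ~ dose + supp, tg) %>% summary()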

Question 2: Consider the dog dataset we used in class

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")

Question 3: For this question, we will be creating a linear model to predict enrollment at primarily undergraduate institutions in the United States

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Error

In a non-statistical context, the idea of a line specifying the relationship between two variables is called a functional relationship, i.e., given a value of \(X\), we can directly find the correct value of \(y\). This is precisely the kind of relationship we would see, for example, when describing a temperature in Celsius (\(y\)) given a temperature in Fahrenheit (\(X\)):

\[ y = \frac{5}{9}\left(X - 32\right) \]
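
As a quick check of this functional relationship in R (the helper name f_to_c is just for illustration), note that there is no error term: each Fahrenheit value maps to exactly one Celsius value.

## Fahrenheit to Celsius is purely functional: no error term needed
f_to_c <- function(f) (5 / 9) * (f - 32)
f_to_c(c(32, 212))   # freezing and boiling points of water: 0 and 100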

In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that perfectly determines a value of weight given a particular height:
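
A scatterplot along these lines can be sketched with the built-in women dataset (the same height and weight data used for the model later in this lab):

## Height and weight: the points scatter about any single line we might draw
ggplot(women, aes(height, weight)) + geom_point()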

When describing a statistical relationship between two variables, it is customary to account for this variability about the line with the inclusion of an error term, \(\epsilon\) (epsilon):

\[ y = \beta_0 + X \beta_1 + \epsilon \]

In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observed value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\)

One of the primary assumptions for linear regression is that the error terms are normally distributed with a mean value of \(0\) and variability of \(\sigma^2\). Notationally, this is written

\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means that we expect our response variable, on average, to fall near our line of best fit, with the errors scattered around this line by plus or minus some amount. By investigating our residuals, we can accomplish two goals:

  1. First, much of the theory around linear regression rests on the assumption that the errors are normally distributed. If this is not true, we may have issues in making future predictions
  2. Second, patterns in our residuals can indicate that we are missing critical information. By plotting our residuals against variables not included in the model, we can often identify additional variables to include (see the sketch following this list).
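
For example, we might fit a model for mpg that uses only displacement and then plot its residuals against a variable left out of the model, such as wt; a visible trend in such a plot would suggest the omitted variable carries information the model is missing. The object name fit_disp below is ours, chosen just for this sketch.

## Model that leaves weight out
fit_disp <- lm(mpg ~ disp, mtcars)

## Residuals plotted against the omitted variable, wt
ggplot(mtcars, aes(wt, resid(fit_disp))) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed")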

Once we have fit a model with lm(), we can access the residuals using the $, just as we would for a data frame. Here, for example, are the residuals from the model fit in Question 1

## Fit model
fit <- lm(hp ~ wt, mtcars)

## Get residuals
res <- fit$residuals

## Plot them
hist(res, breaks = 8)

Here we see a slight skew, indicating that there are more large (positive) residuals than there are large (negative) ones, yet most appear to be centered around 0.

It is also useful to plot the residuals against the original variable in the model, in this case weight. Below is a plot that shows how the residuals look for each value of wt

## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Here, we see the same thing again: many residuals are close to zero, with a handful of very large residuals that make up the right side of the skew. While the residuals here do not appear perfectly normal, they are likely close enough for our purposes.

Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:

fit <- lm(weight ~ height, women)

ggplot(women, aes(height, weight)) +
  geom_smooth(method = lm, se = FALSE) +
  geom_point() + ggtitle("Fitted Line")

ggplot(women, aes(height, fit$residuals)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
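
The same kind of residual plot can be made for any fitted model; for example, here is one for a model using vehicle weight to predict engine displacement: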

fit <- lm(disp ~ wt, mtcars)

ggplot(mtcars, aes(wt, fit$residuals)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Checking to see if a distribution is Normal (QQ plot)

The Normal QQ plot (quantile-quantile plot) compares the quantiles of your data to the theoretical quantiles of a Normal distribution with the same mean and standard deviation. You should expect to see roughly a straight line if your data resemble a Normal distribution.

The function used to assess this is qqnorm(); its input is the variable whose distribution we wish to check. Recall that the distribution of residuals from the linear model using wt to predict hp was skewed. This is how we get the normal QQ plot:

# res = residuals from the hp ~ wt model fit above
qqnorm(res)

# add line for comparison
qqline(res)
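
For comparison, a QQ plot of data simulated directly from a Normal distribution (a quick sketch) should fall nearly on the reference line:

## QQ plot of truly Normal data for comparison
set.seed(2)
z <- rnorm(100)
qqnorm(z)
qqline(z)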


Question 4

For the dogs dataset, construct a histogram of the residuals to check the Normality assumption of linear regression for a model predicting speed with breed.

Question 5

This is a continuation of the college dataset from Question 3. For the model predicting enrollment using ACT_median, Debt_median, and Private, check the residual assumptions by looking at a histogram of the residuals, and also plot the residuals against the two quantitative variables. For these last two plots, also color the residuals by Private vs. Public. What do you see?