Multivariate Regression

We have now seen that we can extend our linear model to include more than one predictor (independent variable). For example, one of our studies considered the growth of odontoblasts in guinea pigs. In this study, there were two variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The fitted linear model looked like this:

library(ggplot2)
library(dplyr)
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()

## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.76 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \]

That is, we find:

An intercept of 9.27, the predicted length for orange juice at a dose of 0 (note, however, that this is not a meaningful term, as our predictor variable dose never takes on the value 0)
A slope of 9.76, indicating that each additional milligram per day of vitamin C is associated with a 9.76 unit increase in predicted odontoblast length, holding the route of administration fixed
A coefficient of -3.7 on the indicator for ascorbic acid (VC). Because this is an indicator term, it shifts the regression line vertically by that amount, here downward because the sign is negative. Additionally, because the indicator is for ascorbic acid, it tells us that orange juice is the reference level
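
To see the indicator term at work, the sketch below refits the same model and compares predictions for the two routes at an identical dose. The object names `fit`, `new_obs`, and `preds` are illustrative and not part of the original lab.

```r
# Refit the model from above (ToothGrowth ships with base R)
data(ToothGrowth)
fit <- lm(len ~ dose + supp, data = ToothGrowth)

# Two hypothetical observations: same dose, different routes
new_obs <- data.frame(
  dose = c(1, 1),
  supp = factor(c("OJ", "VC"), levels = levels(ToothGrowth$supp))
)
preds <- predict(fit, newdata = new_obs)

# The gap between the two predictions equals the suppVC coefficient
preds[2] - preds[1]
coef(fit)["suppVC"]
```

Because the model has no interaction term, the same gap appears at every dose, which is why the two fitted lines are parallel.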

A plot of this regression looks like this:

Looking at this, we see that the intercept for the red line (orange juice, the reference level) is indeed about 9.27, while the line for ascorbic acid is shifted 3.7 lower. The slope of the two lines is the same.
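
One possible way to draw a plot like the one described is sketched below. It is not necessarily the code used to produce the original figure; the names `plot_data`, `fitted_len`, and `parallel_plot` are assumptions made for illustration. With ggplot2's default palette, the first factor level (OJ) is drawn in a reddish color.

```r
library(ggplot2)

data(ToothGrowth)
fit <- lm(len ~ dose + supp, data = ToothGrowth)

# Attach fitted values so each route gets its own line with a shared slope
plot_data <- ToothGrowth
plot_data$fitted_len <- fitted(fit)

parallel_plot <- ggplot(plot_data, aes(x = dose, y = len, color = supp)) +
  geom_point(alpha = 0.6) +          # observed lengths
  geom_line(aes(y = fitted_len)) +   # one fitted line per route
  labs(x = "Dose (mg/day)", y = "Odontoblast length", color = "Route")

parallel_plot
```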

College Data

For the following questions, our goal is to create a linear model to predict enrollment at primarily undergraduate institutions in the United States.

college_data <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Question 1

Create a correlation heatmap and identify some of the variables that may have a linear relationship (positive or negative) with enrollment.
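
As a reminder of the mechanics, here is a minimal sketch of a correlation heatmap built with base R and ggplot2. It uses the built-in `mtcars` data as a stand-in rather than the college data, and the course may prefer a dedicated heatmap package instead; the names `num_data`, `cor_mat`, `cor_long`, and `heatmap_plot` are illustrative.

```r
library(ggplot2)

# Illustrative data: mtcars stands in for the college data here
num_data <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Pairwise correlations, dropping rows with missing values
cor_mat <- cor(num_data, use = "complete.obs")

# Reshape the matrix to long form: one row per pair of variables
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "r")

heatmap_plot <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(x = NULL, y = NULL, fill = "Correlation")

heatmap_plot
```

Only numeric columns can be passed to `cor()`, so any categorical variables need to be dropped or recoded first.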

Question 2

Part A: Create a linear model with Enrollment as our response variable and ACT_median and Private as our explanatory variables. Does there seem to be a linear relationship between the response and predictor variables?
Part B: Using the summary information, explain how ACT_median and Private impact the predicted enrollment.
Part C: Using multiple \(R^2\), explain how well this model fits.
Part D: Use a scatterplot to check whether Enrollment actually has a linear relationship with the predictors. Are there any issues? Regardless, we will continue to practice with this model in the following questions.
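
Parts B and C rely on reading values off the model summary. As a sketch of where those values live, the example below pulls them from the ToothGrowth model fitted earlier rather than from the college model; `fit` and `fit_summary` are illustrative names.

```r
data(ToothGrowth)
fit <- lm(len ~ dose + supp, data = ToothGrowth)
fit_summary <- summary(fit)

# Coefficient table: estimates, standard errors, t values, p-values
fit_summary$coefficients

# Multiple R-squared: share of the variance in len explained by the model
# (this matches the 0.7038 reported in the summary output above)
fit_summary$r.squared
```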

Question 3

We are now going to see if adding other variables gives us a better performing linear model.

Part A: Create two scatterplots investigating the relationship between ACT_median and each of Debt_median and Salary10yr_median. Which appears to be more strongly correlated with ACT_median? Which do you think would be the better variable to add to the linear model in Question 2?
Part B: Create two additional linear models for predicting Enrollment, one adding Debt_median to the model in Question 2 Part A, the other adding Salary10yr_median. Is what you found consistent with what you chose in Question 3 Part A? What metric(s) would you use in support of your choice?
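
Comparing candidate models side by side can be sketched as follows. The example again uses `mtcars` as a stand-in, so the variables and the names `base_fit`, `fit_hp`, `fit_qsec`, and `comparison` are assumptions for illustration, not the college analysis itself.

```r
# Baseline model and two extensions, each adding one predictor
base_fit <- lm(mpg ~ wt + am, data = mtcars)
fit_hp   <- update(base_fit, . ~ . + hp)
fit_qsec <- update(base_fit, . ~ . + qsec)

models <- list(base = base_fit, plus_hp = fit_hp, plus_qsec = fit_qsec)

# Collect multiple and adjusted R-squared for each model
comparison <- data.frame(
  model         = names(models),
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names     = NULL
)

comparison
```

Multiple \(R^2\) never decreases when a predictor is added, so adjusted \(R^2\) is usually the fairer basis for comparing models with different numbers of predictors.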

Question 4