We have now seen that we can extend our linear model to include more than one predictor (independent variable). For example, in one of our studies we considered the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
library(ggplot2)
library(dplyr)
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.600 -3.700  0.373  2.116  8.800
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
Written as an equation, this has the form
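\[ \hat{y} = 9.27 + 9.76 \times (\text{dose}) - 3.70 \times \mathbb{1}_{VC} \]
That is, we find:
The intercept, 9.27, is the predicted length for orange juice at a dose of 0 (note that dose never takes on the value 0 in these data).
Each one-unit increase in dose raises the predicted length by 9.76, regardless of the route of administration.
Switching from orange juice to ascorbic acid (VC) lowers the predicted length by 3.70 at any dose.
A plot of this regression looks like this: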
Looking at this, we see that the intercept for the red line (orange juice) is indeed about 9.3, while the intercept for ascorbic acid is shifted 3.7 lower. The slope is the same for both lines.
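A plot like the one described above can be drawn with a short sketch such as the following. It refits the same model, stores the fitted values, and draws one fitted line per supplement. The names `tg`, `fit`, and `p` are illustrative choices, and the styling is an assumption rather than the original figure's code.

```r
library(ggplot2)

data(ToothGrowth)

# Refit the additive model from the summary above
fit <- lm(len ~ dose + supp, data = ToothGrowth)

# Work on a copy so the built-in data set is left unchanged
tg <- ToothGrowth
tg$fitted <- predict(fit)

# One fitted line per supplement: same slope, different intercepts
p <- ggplot(tg, aes(x = dose, y = len, color = supp)) +
  geom_point() +
  geom_line(aes(y = fitted)) +
  labs(x = "Dose", y = "Length", color = "Supplement")
print(p)
```

Because the model has no interaction term, the two lines are forced to be parallel; the only difference between them is the vertical shift given by the `suppVC` coefficient.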
For the following questions, our goal is to create a linear model to predict enrollment at primarily undergraduate institutions in the United States.
college_data <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Question 1: Create a correlation heatmap and identify some of the variables that may have a linear relationship (positive or negative) with enrollment.
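One possible way to build a correlation heatmap is sketched below. It is only one approach among several and is not necessarily the one used in this course. The helper `cor_heatmap` is a hypothetical name, and the sketch is demonstrated on the built-in `mtcars` data so that it runs without downloading anything; the same function could be applied to `college_data`.

```r
library(ggplot2)

# Build a correlation heatmap from the numeric columns of a data frame
cor_heatmap <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  # pairwise.complete.obs skips missing values pair by pair
  cor_mat <- cor(numeric_cols, use = "pairwise.complete.obs")
  # Long format: one row per (Var1, Var2) pair, correlation stored in Freq
  cor_long <- as.data.frame(as.table(cor_mat))
  ggplot(cor_long, aes(x = Var1, y = Var2, fill = Freq)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                         midpoint = 0, limits = c(-1, 1)) +
    labs(x = NULL, y = NULL, fill = "Correlation") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Demonstrated on built-in data so the sketch runs offline
heat <- cor_heatmap(mtcars)
print(heat)
```

Strongly colored tiles in the row for the response variable point to candidate predictors; pale tiles indicate little linear association.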
Question 2:
Part A: Create a linear model with Enrollment as our response variable and ACT_median and Private as our explanatory variables. Does there seem to be a linear relationship between the response and predictor variables?
Part B: Using the summary information, explain how ACT_median and Private impact the predicted enrollment.
Part C: Using multiple \(R^2\), explain how well this model fits.
Part D: Use a scatterplot to check whether Enrollment actually has a linear relationship with the predictors. Are there any issues? Regardless, we will continue to practice with this model in the following questions.
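For Parts B and C, the relevant pieces of a model summary can be pulled out as plain numbers. The sketch below illustrates this on the ToothGrowth model from the start of this section rather than on the college data; the variable names `fit_summary`, `coef_table`, `r2`, and `adj_r2` are illustrative.

```r
data(ToothGrowth)
fit <- lm(len ~ dose + supp, data = ToothGrowth)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(fit_summary)
print(round(coef_table, 4))

# Multiple and adjusted R-squared as plain numbers
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared
cat("Multiple R-squared:", round(r2, 4), "\n")
cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

Multiple \(R^2\) is the proportion of variation in the response that the model explains, so a value of 0.70 means about 70% of the variation in length is accounted for by dose and supplement together.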
Question 3: We are now going to see whether adding other variables gives us a better-performing linear model.
Part A: Create two scatterplots investigating the relationship between ACT_median and each of Debt_median and Salary10yr_median. Which appears to be more strongly correlated with ACT_median? Which do you think would be a better variable to add to the linear model in Question 2?
Part B: Create two additional linear models for predicting Enrollment, one adding Debt_median to the model in Question 2 Part A, the other adding Salary10yr_median. Is what you found consistent with what you chose in Question 3 Part A? What metric(s) would you use in support of your choice?
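One way to organize the comparison in Part B is sketched below. The function `compare_enrollment_models` is a hypothetical helper, and it assumes the data frame contains the columns named in the questions above. Restricting all three fits to rows that are complete on every variable is a design choice made here so that the models are compared on the same observations.

```r
# Fit the base model and two extended models, then compare R-squared values.
# ASSUMPTION: `data` has the columns named in the questions above.
compare_enrollment_models <- function(data) {
  vars <- c("Enrollment", "ACT_median", "Private",
            "Debt_median", "Salary10yr_median")
  # Same rows for every model, so the fits are directly comparable
  complete <- data[complete.cases(data[, vars]), ]

  models <- list(
    base   = lm(Enrollment ~ ACT_median + Private, data = complete),
    debt   = lm(Enrollment ~ ACT_median + Private + Debt_median,
                data = complete),
    salary = lm(Enrollment ~ ACT_median + Private + Salary10yr_median,
                data = complete)
  )

  data.frame(
    model  = names(models),
    r2     = sapply(models, function(m) summary(m)$r.squared),
    adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
    row.names = NULL
  )
}
```

Calling `compare_enrollment_models(college_data)` would return one row per model. Multiple \(R^2\) can never decrease when a predictor is added, so adjusted \(R^2\), which penalizes extra predictors, is usually the fairer metric for this kind of comparison.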
Question 4: Find Grinnell in the college dataset. Make an enrollment prediction for Grinnell using the better regression model in Question 3. Interpret the residual. Did our regression model perform well for Grinnell?
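The prediction and residual for a single school can be computed with a sketch like the one below. The helper `school_residual` is hypothetical, and the column name `Name` is an assumption about the dataset; check `names(college_data)` and change `name_col` if the school names are stored under a different column.

```r
# Observed value, predicted value, and residual for one school.
# ASSUMPTION: school names live in a column called `Name`.
school_residual <- function(fit, data, pattern, name_col = "Name") {
  row <- data[grepl(pattern, data[[name_col]]), , drop = FALSE]
  stopifnot(nrow(row) == 1)  # the pattern should match exactly one school

  predicted <- unname(predict(fit, newdata = row))
  observed <- row$Enrollment

  # Residual is observed minus predicted
  c(observed = observed, predicted = predicted,
    residual = observed - predicted)
}
```

With a fitted model stored in, say, `best_fit` (a hypothetical name), the call would be `school_residual(best_fit, college_data, "Grinnell")`. A positive residual means the school enrolls more students than the model predicts; a negative residual means it enrolls fewer.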