Multivariate Regression

We have now seen that we can extend our linear model to include more than one predictor or independent variables. For example, in one of our studies we consider the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:

library(ggplot2)
library(dplyr)
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.600 -3.700  0.373  2.116  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.2725     1.2824   7.231 1.31e-09 ***
## dose          9.7636     0.8768  11.135 6.31e-16 ***
## suppVC       -3.7000     1.0936  -3.383   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared:  0.7038, Adjusted R-squared:  0.6934 
## F-statistic: 67.72 on 2 and 57 DF,  p-value: 8.716e-16

Written as an equation, this has the form

\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:

  1. An intercept of 9.27 (note however that this is not a meaningful term as our predictor variable dose never takes on the value 0)
  2. A slope of 9.76, indicating that every additional milligram of vitamin C resulted in 9.76 millimeter change in odontoblast length
  3. A -3.7 associated with the indicator for ascorbic acid. As this is an indicator term, it effectively shifts the regression line vertically, according to the sign. Additionally, because this is an indicator for ascorbic acid, it tells us that orange juice has become the reference variable

A plot of this regression looks like this:

Looking at this, we see indeed that the intercept for the red line is about 9.2, with the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.

College Data

For the following questions, our goal is to create a linear model to predict enrollment at primary undergraduate institutions in the United States.

college_data <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

Question 1

Create a correlation heatmap and identify some of the variables that may have a linear relationship (positive or negative) with enrollment.

colleges_quant_only = college_data %>% select(where(is.numeric))
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.1
## corrplot 0.95 loaded
## Store the correlation matrix
my_corr_matrix = cor(x = colleges_quant_only, method = "pearson", use = "complete.obs")
## Plot using corrplot, change colors if needed
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust")

Question 2

  • Part A: Create a linear model with Enrollment as our response variable and ACT_median and Private as our explanatory variables. Does there seem to be a linear relationship between the response and predictor variables?
lm(Enrollment ~ ACT_median + Private, college_data) %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ ACT_median + Private, data = college_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13197  -3309   -355   1600  43985 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -16565.59    1328.19  -12.47   <2e-16 ***
## ACT_median       808.32      54.66   14.79   <2e-16 ***
## PrivatePublic   9147.79     395.55   23.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6408 on 1092 degrees of freedom
## Multiple R-squared:  0.3893, Adjusted R-squared:  0.3882 
## F-statistic:   348 on 2 and 1092 DF,  p-value: < 2.2e-16

Answer: Yes, based on the F-statistic and p-value there does (would need to check the scatterplot to be sure).

  • Part B: Using the summary information, explain how ACT_median and Private impact the predicted enrollment.

Answer: Larger ACT_median results in increased predictions for Enrollment. A college being Private results in a lower predicted enrollment.

  • Part C: Using multiple \(R^2\), explain how well this model fits.

Answer: The model fit is only OK-ish. We are explaining only about 40% of the variation in enrollment. Much of the variation is unaccounted for.

  • Part D: Use a scatterplot to make sure Enrollment actually has a linear relationship with the predictors. Are there any issues? Regardless we will continue to practice in the following questions.
college_data %>% ggplot(aes(x = ACT_median, y=Enrollment, color=Private)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The relationship between ACT_median and Enrollment for public colleges may be non-linear.

Question 3

We are now going to see if adding other variables gives us a better performing linear model.

  • Part A: Create two scatter plots investigating the relationship between Enrollment and both Debt_median and Salary10yr_median. Which appears to be most correlated with Enrollment? Which do you think would be a better variable to add to the linear model in Question 2?
college_data %>% ggplot(aes(x=ACT_median, y = Debt_median)) + 
  geom_point()

college_data %>% ggplot(aes(x=ACT_median, y = Salary10yr_median)) + 
  geom_point()

# relationship between ACT_median and Salary10yr_median looks
# much stronger and also linear => should be better to add?
  • Part B: Create two additional linear models for predicting Enrollment, one adding Debt_median to the model in Q2A, the other adding Salary10yr_median. Is what you found consistent with what you chose in Question 3 Part A? What metric(s) would you use in support of your choice.
lm(Enrollment ~ ACT_median + Salary10yr_median + Private, college_data) %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ ACT_median + Salary10yr_median + Private, 
##     data = college_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13199  -3335   -390   1570  44107 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.652e+04  1.330e+03 -12.419   <2e-16 ***
## ACT_median         7.705e+02  7.331e+01  10.510   <2e-16 ***
## Salary10yr_median  1.827e-02  2.362e-02   0.774    0.439    
## PrivatePublic      9.168e+03  3.965e+02  23.123   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6409 on 1091 degrees of freedom
## Multiple R-squared:  0.3896, Adjusted R-squared:  0.3879 
## F-statistic: 232.1 on 3 and 1091 DF,  p-value: < 2.2e-16
lm(Enrollment ~ ACT_median + Debt_median + Private, college_data) %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ ACT_median + Debt_median + Private, 
##     data = college_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13021  -3239   -393   1564  43696 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.389e+04  1.510e+03  -9.199  < 2e-16 ***
## ACT_median     8.453e+02  5.528e+01  15.291  < 2e-16 ***
## Debt_median   -1.974e-01  5.395e-02  -3.658 0.000266 ***
## PrivatePublic  8.699e+03  4.120e+02  21.112  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6372 on 1091 degrees of freedom
## Multiple R-squared:  0.3967, Adjusted R-squared:  0.395 
## F-statistic: 239.1 on 3 and 1091 DF,  p-value: < 2.2e-16

Answer: Adding either actually performs about the same judging by \(R^2\), the F-stat, and p-value.

Question 4

Find Grinnell in the college dataset. Make an enrollment prediction for Grinnell using the better regression model in Question 3. Interpret the residual. Did our regression model perform well for Grinnell?

college_data[college_data$Name=="Grinnell College",]$ACT_median
## [1] 32
college_data[college_data$Name=="Grinnell College",]$Debt_median
## [1] 15000
model = lm(Enrollment ~ ACT_median + Debt_median + Private, college_data)

predict(model, newdata=college_data[college_data$Name=="Grinnell College",])
##      314 
## 10198.94
# could also have used values directly with the equation

Question 5

dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
dogs %>% ggplot(aes(x=speed, y = color)) + geom_boxplot() +
  geom_jitter(height = 0.1) # height controls vertical space in jitter

Part A: Not really. Colors all have very similar medians. In addition colors have so much variability that the medians (and averages) are not going to do well.

model_color = lm(speed~color, data=dogs)
model_color %>% summary()
## 
## Call:
## lm(formula = speed ~ color, data = dogs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1100  -9.8896   0.0764   9.2445  24.8540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.8482     0.8005  26.045   <2e-16 ***
## colorbrown    1.9612     1.7899   1.096   0.2739    
## colorwhite    4.2360     1.7899   2.367   0.0184 *  
## coloryellow  -2.3968     1.3864  -1.729   0.0846 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.32 on 396 degrees of freedom
## Multiple R-squared:  0.03152,    Adjusted R-squared:  0.02418 
## F-statistic: 4.296 on 3 and 396 DF,  p-value: 0.005334

Part B: Based on the p-value (small) there is strong evidence of color being a useful predictor for speed. Based on the incredibly small \(R^2\) value, color explains almost none of the variability in speed, and so is not a useful predictor.

ggplot(dogs, aes(speed, model_color$residuals, color=color)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Part C: It is not useful. The linear pattern in the residuals indicates something weird is going on. The particular pattern we see is just due to only using one value for each color to predict speeds.

# histogram of residuals
ggplot(dogs, aes(y=model_color$residuals)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# Normal QQ Plot
qqnorm(model_color$residuals)
qqline(model_color$residuals)

Part D: The histogram of residuals is not Normal. In fact it is multimodal which indicates most residuals are not around 0 but are in fact around 10 or -8 or -15. We are not getting small prediction error most of the time (bad!)

The QQplot deviates from the lines further indicating the residuals from this model do not follow a Normal distribution.

ggplot(dogs, aes(speed, model_color$residuals, color=breed)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

ggplot(dogs, aes(y=model_color$residuals, fill = breed)) +
  geom_histogram()

Part E: Dogs of each breed tend to all have similar residual values. This means we could take advantage of that info to make better predictions by adding breed to the model (or using it instead).

Question 6

# scatterplot
dogs %>% ggplot(aes(x=speed, y = breed, fill=breed)) + geom_boxplot() +
  geom_jitter(height = 0.1)

# linear model
model_breed = lm(speed~breed, data=dogs)
model_breed %>% summary()
## 
## Call:
## lm(formula = speed ~ breed, data = dogs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1858 -1.4602  0.0305  1.4408  8.3773 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.8860     0.3449   28.66   <2e-16 ***
## breedChihuahua        5.3204     0.4878   10.91   <2e-16 ***
## breedCorgie          14.7062     0.4878   30.15   <2e-16 ***
## breedGerman Shepard  19.8338     0.4878   40.66   <2e-16 ***
## breedGreyhound       29.9318     0.4878   61.36   <2e-16 ***
## breedMastiff         -5.7189     0.4878  -11.72   <2e-16 ***
## breedPoodle          20.0611     0.4878   41.12   <2e-16 ***
## breedSaint Bernard    4.9668     0.4878   10.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.439 on 392 degrees of freedom
## Multiple R-squared:  0.9555, Adjusted R-squared:  0.9547 
## F-statistic:  1202 on 7 and 392 DF,  p-value: < 2.2e-16
# histogram of residuals
ggplot(dogs, aes(y=model_breed$residuals)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

# Normal QQ plot
qqnorm(model_breed$residuals)
qqline(model_breed$residuals)

# residuals vs dog breed
ggplot(dogs, aes(y=breed, x=model_breed$residuals)) +
  geom_boxplot() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Part A: Breed should help us make good predictions. Most dog speeds are very close to their respective group medians/means.

Part B: The adjusted \(R^2\) value using dog breed as a predictor is extemely high. Roughly 95% of differences in dog speed can be accounted for by breed according to our model. We also have a very strong evidence (p-val < .0001) of breed being a useful predictor for speed.

Part C: No. Residual histogram looks roughly Normal. There is some deviation in the tails of the QQplot, but overall quantiles roughly follow the straighline suggesting Normal approximation is OK. Boxplots of each breed show that each is performing very similarly in terms of residual sizes outside of a few outliers for St. Bernard, German Shepherd, and Chihuahua .

Question 7

college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

fit = lm(Enrollment ~ Private, data=college)
fit %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ Private, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10924  -2164  -1196    817  47067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2720.4      275.9   9.861   <2e-16 ***
## PrivatePublic   8604.7      431.3  19.951   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7017 on 1093 degrees of freedom
## Multiple R-squared:  0.267,  Adjusted R-squared:  0.2663 
## F-statistic:   398 on 1 and 1093 DF,  p-value: < 2.2e-16

Part A: Yes, Private/Public is a useful predictor of Enrollment. We have a very small p-value (strong evidence) from the output to support enrollment differs between Private and Public colleges. In addition, we get \(R^2\)=.26, so around 26% of variation in enrollment is explained by whether the college is public or private.

lm(Enrollment ~ Private + ACT_median, data=college) %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ Private + ACT_median, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13197  -3309   -355   1600  43985 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -16565.59    1328.19  -12.47   <2e-16 ***
## PrivatePublic   9147.79     395.55   23.13   <2e-16 ***
## ACT_median       808.32      54.66   14.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6408 on 1092 degrees of freedom
## Multiple R-squared:  0.3893, Adjusted R-squared:  0.3882 
## F-statistic:   348 on 2 and 1092 DF,  p-value: < 2.2e-16
lm(Enrollment ~ Private + Debt_median, data=college) %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ Private + Debt_median, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10744  -2333  -1200    813  47032 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3560.87097 1088.76969   3.271  0.00111 ** 
## PrivatePublic 8492.78722  453.59167  18.723  < 2e-16 ***
## Debt_median     -0.04662    0.05842  -0.798  0.42503    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7018 on 1092 degrees of freedom
## Multiple R-squared:  0.2674, Adjusted R-squared:  0.266 
## F-statistic: 199.3 on 2 and 1092 DF,  p-value: < 2.2e-16

Part B: The adj. \(R^2\) for the model using ACT-Median is higher and so, based on this measure, would be better for prediction.

college %>% ggplot(aes(x = ACT_median, y = Private)) +
  geom_boxplot()

college %>% ggplot(aes(x = Debt_median, y = Private)) +
  geom_boxplot()

Part C: The boxplots suggest that Debt_median may not be as helpful in predicting Enrollment once we already know whether a school is public or private. The distributions of debt are different between public and private schools, meaning much of the information in Debt_median overlaps with information we already get from the public/private variable.

In contrast, ACT_median looks fairly similar between public and private schools, so it may provide new information that is not already captured by school type. Because of this, ACT_median could potentially add more useful predictive information to the model after accounting for whether a school is public or private.

fit = lm(Enrollment ~ ACT_median + Private, college)
fit %>% summary()
## 
## Call:
## lm(formula = Enrollment ~ ACT_median + Private, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13197  -3309   -355   1600  43985 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -16565.59    1328.19  -12.47   <2e-16 ***
## ACT_median       808.32      54.66   14.79   <2e-16 ***
## PrivatePublic   9147.79     395.55   23.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6408 on 1092 degrees of freedom
## Multiple R-squared:  0.3893, Adjusted R-squared:  0.3882 
## F-statistic:   348 on 2 and 1092 DF,  p-value: < 2.2e-16
ggplot(data=college, aes(y=fit$residuals)) + geom_histogram()

ggplot(data=college, aes(y=fit$residuals, fill=Private)) + geom_histogram()

ggplot(college, aes(ACT_median, fit$residuals)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

ggplot(college, aes(ACT_median, fit$residuals, color=Private)) +
  geom_point() + ggtitle("Residual Plot") +
  geom_hline(yintercept = 0, color = 'red', linetype = "dashed")

Part D: From the histograms: The residuals are not normally distributed (right skewed) for public schools, indicating there may be issues in predicting enrollment for public schools, ie. more info might be needed. From the scatterplot: there is a negative linear relationship between the residuals and ACT-median for private colleges, further suggesting there is a predictor variable unaccounted for that could help us.

Part E: Checking residual conditions focuses onis looking at whether the regression model produces reliable predictions by examining how well the model errors behave. In this sense, residual analysis investigates the practical performance and predictive usefulness of the model rather than estimating or interpreting population relationships.

Part F: Yes, we still learned useful information from the model even if the regression assumptions are not fully met. The summary output suggests that both whether a college is public or private and the median ACT score are strongly associated with Enrollment.

Private schools tend to have higher enrollment values than public schools on average, and schools with higher median ACT scores also tend to have larger enrollments. The positive coefficient for ACT_median suggests that, holding school type constant, enrollment tends to increase as median ACT score increases.