We have now seen that we can extend our linear model to include more than one predictor or independent variables. For example, in one of our studies we consider the growth of odontoblasts in guinea pigs. In this study, there were two separate variables of interest: the dose of vitamin C delivered and the route of administration (either orange juice or ascorbic acid). The linear model constructed looked like this:
library(ggplot2)
library(dplyr)
data(ToothGrowth)
lm(len ~ dose + supp, ToothGrowth) %>% summary()
##
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.600 -3.700 0.373 2.116 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.2725 1.2824 7.231 1.31e-09 ***
## dose 9.7636 0.8768 11.135 6.31e-16 ***
## suppVC -3.7000 1.0936 -3.383 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.236 on 57 degrees of freedom
## Multiple R-squared: 0.7038, Adjusted R-squared: 0.6934
## F-statistic: 67.72 on 2 and 57 DF, p-value: 8.716e-16
Written as an equation, this has the form
\[ \hat{y} = 9.27 + 9.764 \times (\text{dose}) - 3.7 \times \mathbb{1}_{VC} \] That is, we find:
dose never takes on the
value 0)A plot of this regression looks like this:
Looking at this, we see indeed that the intercept for the red line is about 9.2, with the intercept for ascorbic acid is shifted 3.7 lower. The slope for these two lines remains the same.
For the following questions, our goal is to create a linear model to predict enrollment at primary undergraduate institutions in the United States.
college_data <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Create a correlation heatmap and identify some of the variables that may have a linear relationship (positive or negative) with enrollment.
Part A: Create a linear model with Enrollment as our response variable and ACT_median and Private as our explanatory variables. Does there seem to be a linear relationship between the response and predictor variables? Justify your answer using appropriate parts of the output.
Part B: Using the summary information, explain how ACT_median and Private impact the predicted enrollment.
Part C: Using multiple \(R^2\), explain how well this model fits.
Part D: Use scatterplots to make sure Enrollment actually has a linear relationship with the predictors. Are there any issues? Regardless we will continue to practice in the following questions.
We are now going to see if adding other variables gives us a better performing linear model.
Part A: Create two scatter plots investigating the relationship between ACT_median and both Debt_median and Salary10yr_median. Which appears to be most correlated with ACT_median? Which do you think would be a better variable to add to the linear model in Question 2?
Part B: Create two additional linear models for predicting Enrollment, one adding Debt_median to the model in Q2A, the other adding Salary10yr_median. Is what you found consistent with what you chose in Question 3 Part A? What metric(s) would you use in support of your choice.
Part C: Interpret all of the coefficients in the better model you have identified.
Part D: What does the F-test and corresponding p-value tell you for this model?
Find Grinnell in the college dataset. Make an enrollment prediction for Grinnell using the better regression model in Question 3. Interpret the residual. Did our regression model perform well for Grinnell?
We continue this topic today, re-demonstrating the key functions in R that are used for linear regression, as well as some new tools for doing so.
Recall that simple linear regression (as opposed to multivariate linear regression) is a continuous/quantitative outcome variable with one explanatory variable, assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \]
Note: that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \]
Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:
\[ e_i = y_i - \hat{y}_i \]
Ultimately, the predicted values for \(\beta\)’s are those chosen as to minimize the residual squared error, \(\sum_{i=1}^n e_i^2\).
In a non-statistical context, the idea of a line specifying the relationship between two variables is considered a functional relationship, i.e., I can directly find the correct value of \(y\) given a value of \(X\). This is precisely the kind of relationship we would see, for example, when describing the value of a temperature in Celsius (y) given a temperature in Fahrenheit (X)
\[ y = -32 - \left(\frac{5}{9}\right) X \]
In contrast, we may be interested in establishing a statistical linear relationship between two variables, such as height and weight. In this case, there is no straight line that would perfectly equate a value for weight given a certain height:
When describing a statistical relationship between two variables, it is custom to account for this variability about the line with the inclusion of an error term, \(\epsilon\) (epsilon):
\[ y = \beta_0 + X \beta_1 + \epsilon \]
In other words, we assume there is a linear relationship between \(y\) and \(X\) (given by \(y = \beta_0 + X \beta_1\)), plus some error term. The difference between our observered value of \(y\) and the predicted value \(\hat{y}\) is our estimate of the error term, also called a residual, i.e., \(e_i = y_i - \hat{y}_i\)
One of the primary assumptions for linear regression is that the error terms are normally distributed with a mean value of \(0\) and variability of \(\sigma^2\). Notationally, this is written
\[ \epsilon \sim N(0, \sigma^2) \] In plain language, this means that we expect, on average for our response variable to fall near our line of best fit, but that the error around this line should be plus or minus some value. By investigating our residuals, we can accomplish two goals:
Once we have fit a model with lm(), we can access the
residuals using the $, just as we would for a data
frame.
## Fit model
fit <- lm(hp ~ wt, mtcars)
## Get residuals
res <- fit$residuals
## Plot them
hist(res, breaks = 8)
Here we see a slight skew, indicating that there are more large
(positive) residuals than there large (negative) ones, yet most appear
to be centered around 0.
Also useful is to plot residuals against the original variable in the
model, in this case, weight. Below is a plot that shows how the
residuals look for each value of wt
## Use fit$residuals in the y spot since `residuals` does not exist in mtcars
ggplot(mtcars, aes(wt, fit$residuals)) + geom_point() +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
Here, we see the same thing again – many residuals are close to zero,
with a handful of very large residuals that make up the right side of
the skew. While it doesn’t appear that the residuals here are
exceptionally normal, they are likely close enough for our purposes.
Contrast this with the model for height and weight we saw above, which has errors that are very clearly not normally distributed:
fit <- lm(weight ~ height, women)
ggplot(women, aes(height, weight)) +
geom_smooth(method = lm, se = FALSE) +
geom_point() + ggtitle("Fitted Line")
ggplot(women, aes(height, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
## `geom_smooth()` using formula = 'y ~ x'
fit <- lm(disp ~ wt, mtcars)
ggplot(mtcars, aes(wt, fit$residuals)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
The Normal QQ Plot (quantile-quantile) compares the quantiles of your data to the theoretical distributions of a Normal distribution that has the same mean and standard deviation. You should expect to see a straight line if your data seems to resemble a Normal distribution.
The function used to assess this is qqnorm(). The input is then the variable we want to find out if it has a Normal distribution. Recall that the distribution for residuals in the Linear Model using wt to predict `MPG’ was skewed. This is how we get the normal QQ plot.
# res = residuals from linear model
qqnorm(res)
# add line for comparison
qqline(res)
For the dogs dataset, we could construct linear models
predicting speed using combinations of color,
size, or breed.
dogs <- read.csv("https://collinn.github.io/data/dogs.csv")
head(dogs)
## breed speed color size
## 1 Greyhound 34.45972 black large
## 2 Greyhound 40.15329 white large
## 3 Greyhound 36.94788 black large
## 4 Greyhound 40.84922 white large
## 5 Greyhound 45.70215 black large
## 6 Greyhound 42.97792 white large
Residuals help us see how well the model is performing. Answer the following questions related to residuals.
dogs %>% ggplot(aes(x=speed, y = color)) + geom_boxplot() +
geom_jitter(height = 0.1) # height controls vertical space in jitter
Part A: (From lecture): Based on the boxplots, does it look like color is going to help you make very accurate predictions of speed?
Part B: Use the following code to make a linear model predicting speed using color. Based on the summary output, argue for color being a useful predictor of speed. Then, also using the summary output, argue that color is not a useful predictor for speed.
model_color = lm(speed~color, data=dogs)
model_color %>% summary()
##
## Call:
## lm(formula = speed ~ color, data = dogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1100 -9.8896 0.0764 9.2445 24.8540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.8482 0.8005 26.045 <2e-16 ***
## colorbrown 1.9612 1.7899 1.096 0.2739
## colorwhite 4.2360 1.7899 2.367 0.0184 *
## coloryellow -2.3968 1.3864 -1.729 0.0846 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.32 on 396 degrees of freedom
## Multiple R-squared: 0.03152, Adjusted R-squared: 0.02418
## F-statistic: 4.296 on 3 and 396 DF, p-value: 0.005334
Part C: Using the residual plot for the linear regression, argue whether you think color is useful for predicting speed.
ggplot(dogs, aes(speed, model_color$residuals, color=color)) +
geom_point() + ggtitle("Residual Plot") +
geom_hline(yintercept = 0, color = 'red', linetype = "dashed")
# pick better color choices
Part D: Using the following plots argue whether the Normality and Homoscedescatic conditions are met for linear model, and what this implies about how well the model is doing.
# histogram of residuals
ggplot(dogs, aes(y=model_color$residuals)) +
geom_histogram()
# Normal QQ Plot
qqnorm(model_color$residuals)
qqline(model_color$residuals)
Part E: Add breed information into the
residual plots as a color/fill in the aes() argument.
Explain what patterns you see in the residuals that indicate
breed could be a useful predictor.
Goal: After identifying breed being a potentially useful predictor, we are going to repeat questions several of the parts in Question 1.
Part A: Make a boxplots of the dog breeds and their speeds. Based on the boxplot, explain how it look like color is going to help you make very accurate predictions of speed?
Part B: Fit a linear model predicting speed using dog breed. Based on the summary output, argue for breed being a useful predictor of speed.
Part C: Using the following residual plots, do you see any violation of model assumptions? Explain.
We are going back to the college dataset for this problem.
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
Part A: Use the following code to make a linear
model predicting Enrollment based on whether the college is
private or public (we did something similar in the previous lab). Based
on the summary output, is whether a college is public or private a
useful predictor of enrollment?
fit = lm(Enrollment ~ Private, data=college)
fit %>% summary()
Part B: Let’s say I wanted a more sophisticated model. Which model below performs better for prediction? Explain using the following summary output.
lm(Enrollment ~ Private + ACT_median, data=college) %>% summary()
lm(Enrollment ~ Private + Debt_median, data=college) %>% summary()
Part C: Based on the following boxplots, explain how we could have identified which variable would be more helpful in Part B. (Hint: which variable is going to give us information we kind of already know about?)
college %>% ggplot(aes(x = ACT_median, y = Private)) +
geom_boxplot()
college %>% ggplot(aes(x = Debt_median, y = Private)) +
geom_boxplot()
Part D: Check the residual conditions for the better fitting model in Part B by looking at the QQPlot, the histogram of residuals, and also the plot of residuals vs the quantitative predictor. For these last two plots, also color/fill the residuals by Private vs. Public. What do you see?
Part E: Some might argue that checking residual conditions demonstrates a ‘prediction’ mindset (see reading assignment). Make an arguement for this statement.
Part F: A more ‘explanation focused’ approach may still find value using the better regression model in part C even if the conditions/assumptions are not met. Using the summary output, generally explain the relationship between Enrollment and the predictors. Did we still learn something? Note: Violation of the assumptions conditions actually means that the p-values are not accurate (how inaccurate is complicated). The results can be used to guide future research as an exploratory analysis.