The goal for this lab is two-fold: to continue practicing regression
content, and to introduce more functions and R methods
that are helpful for working with regression techniques.
Run the following code to load packages and datasets.
# edit this R chunk from {r} to {r, warning=F, message=F} to hide package messages
library(ggplot2)
library(dplyr)
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=T, sep=",")
HappyPlanet = data.frame(data) %>% select(-Region.2)
This lab will cover methods for analyzing two quantitative variables
(we will get to categorical variables, as covered in lecture, shortly).
We will return to the Happy Planet data that we used in the previous
regression lab. The starting point in such an analysis should always be
to graph the involved variables using a scatter plot, which is done
using geom_point() in ggplot:
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point()
When looking at a scatter plot we want to judge the following: the direction of the relationship (positive or negative), its form (linear or curved), its strength, and any unusual observations.
These attributes can be difficult to judge, but we can use smoothing to help assess form:
## Add loess smoother with error band
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess")
The "loess" smoothing method essentially takes a locally
weighted average of the trend present in a certain region of the graph,
thereby allowing the trend line to have different slopes in different
locations.
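The amount of smoothing can be adjusted with the span argument of geom_smooth() (a quick sketch; the specific span values below are arbitrary choices, and 0.75 is the geom_smooth() default). Smaller spans follow local wiggles more closely, while larger spans give a smoother overall curve.
## Compare a wiggly (small span) and a smooth (large span) loess curve
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess", span = 0.3, se = FALSE, color = "orange") +
  geom_smooth(method = "loess", span = 1, se = FALSE)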
We can plot a "loess" smoother on top of a linear
regression line to judge whether a relationship can reasonably be
considered “linear”:
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
In this example, we see that the regression line is largely within
the error band of the "loess" smoother, suggesting it is
reasonable to conclude that there is an approximately linear
relationship between the variables “LifeExpectancy” and “Happiness”.
To provide a second example, the data below exhibits a non-linear relationship, as the regression line is noticeably above the “loess” smoother’s error bands for GDP per capita values less than $2000, and noticeably below it for values between $2000 and $10000.
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(HappyPlanet, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
It might be more reasonable to describe this relationship as logarithmic (changing at a slowing rate). Other types of relationships you should be comfortable identifying are exponential (changing at an increasing rate), piecewise linear (two distinct slopes in different parts of the graph), and quadratic (parabolic or U-shaped).
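One way to informally check the logarithmic description is to put GDPperCapita on a log scale: if the relationship is roughly logarithmic, the trend should look approximately linear on that scale. The following is only a sketch using ggplot2's scale_x_log10().
## Plot GDP per capita on a log10 x-axis; a roughly straight trend supports a logarithmic description
ggplot(HappyPlanet, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
  scale_x_log10() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red")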
(We saw this first part in the previous lab.) Oftentimes, when we try
to examine the correlation between many variables at the same time, it
can be helpful to arrange the correlations in a table or matrix. The output below
computes Pearson’s correlation coefficient between the quantitative
variables in the HappyPlanet data.
HappyPlanet_quant_only = HappyPlanet %>% select(where(is.numeric))
cor(x = HappyPlanet_quant_only, method = "pearson")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.38873868 -0.17287036 -0.1933923 -0.34928587
## Happiness -0.38873868 1.00000000 0.83238194 0.6435921 0.97440023
## LifeExpectancy -0.17287036 0.83238194 1.00000000 0.6272413 0.92450723
## Footprint -0.19339232 0.64359212 0.62724130 1.0000000 0.69322700
## HLY -0.34928587 0.97440023 0.92450723 0.6932270 1.00000000
## HPI -0.22590887 0.58999384 0.54337762 -0.1806185 0.55727149
## HPIRank 0.21939986 -0.55521562 -0.52066824 0.2100918 -0.52612709
## GDPperCapita NA NA NA NA NA
## HDI NA NA NA NA NA
## Population 0.09641703 0.04270338 0.01359268 -0.0686804 0.02377288
## HPI HPIRank GDPperCapita HDI Population
## Region -0.2259089 0.2193999 NA NA 0.09641703
## Happiness 0.5899938 -0.5552156 NA NA 0.04270338
## LifeExpectancy 0.5433776 -0.5206682 NA NA 0.01359268
## Footprint -0.1806185 0.2100918 NA NA -0.06868040
## HLY 0.5572715 -0.5261271 NA NA 0.02377288
## HPI 1.0000000 -0.9869268 NA NA 0.13781113
## HPIRank -0.9869268 1.0000000 NA NA -0.15832782
## GDPperCapita NA NA 1 NA NA
## HDI NA NA NA 1 NA
## Population 0.1378111 -0.1583278 NA NA 1.00000000
Each cell of the correlation matrix contains the pairwise correlation between the two variables represented by the corresponding row and column labels. This is why all of the diagonal elements of the matrix have a value of exactly 1, since every variable is perfectly correlated with itself.
You should also notice the NA values in the matrix, which are due to
some variables not having data on every country. We can use the argument
use = "complete.obs" to only use countries with complete
data (this will filter the data to remove any rows with missing values
in one or more variables); however, you should notice that this will
affect the entire correlation matrix, not just the variables with
missing data:
cor(x = HappyPlanet_quant_only, method = "pearson", use = "complete.obs")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.39403612 -0.18329022 -0.19614183 -0.35715858
## Happiness -0.39403612 1.00000000 0.83342784 0.64329499 0.97482271
## LifeExpectancy -0.18329022 0.83342784 1.00000000 0.62661887 0.92457559
## Footprint -0.19614183 0.64329499 0.62661887 1.00000000 0.69236646
## HLY -0.35715858 0.97482271 0.92457559 0.69236646 1.00000000
## HPI -0.23185284 0.59024399 0.54409200 -0.18107454 0.55783015
## HPIRank 0.22555826 -0.55519491 -0.52064898 0.21120155 -0.52611546
## GDPperCapita -0.24143178 0.69768299 0.66620716 0.88482249 0.75209315
## HDI -0.14590423 0.83559974 0.92656383 0.73675814 0.90758362
## Population 0.09973282 0.04256389 0.01391005 -0.06962928 0.02364168
## HPI HPIRank GDPperCapita HDI Population
## Region -0.23185284 0.22555826 -0.24143178 -0.145904231 0.099732817
## Happiness 0.59024399 -0.55519491 0.69768299 0.835599738 0.042563894
## LifeExpectancy 0.54409200 -0.52064898 0.66620716 0.926563827 0.013910049
## Footprint -0.18107454 0.21120155 0.88482249 0.736758140 -0.069629281
## HLY 0.55783015 -0.52611546 0.75209315 0.907583624 0.023641676
## HPI 1.00000000 -0.98695406 -0.02162382 0.376950260 0.138471998
## HPIRank -0.98695406 1.00000000 0.04132563 -0.352355297 -0.158931586
## GDPperCapita -0.02162382 0.04132563 1.00000000 0.779706121 -0.042591186
## HDI 0.37695026 -0.35235530 0.77970612 1.000000000 -0.008187599
## Population 0.13847200 -0.15893159 -0.04259119 -0.008187599 1.000000000
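If you would rather not discard whole countries, cor() also accepts use = "pairwise.complete.obs", which computes each correlation from all countries observed on that particular pair of variables, so correlations between fully observed variables are left unchanged (a sketch):
## Each pairwise correlation uses every country with data on that pair of variables
cor(x = HappyPlanet_quant_only, method = "pearson", use = "pairwise.complete.obs")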
For data sets containing a large number of quantitative variables it
can be difficult to wade through the large number of correlations
reported in the correlation matrix. In these circumstances it can be
beneficial to visualize the correlation matrix and use aesthetics like
color and position to help understand the relationships between
variables in the data set. The corrplot library and
corrplot() function are demonstrated below:
# install.packages("corrplot")
library(corrplot)
## Store the correlation matrix
my_corr_matrix = cor(x = HappyPlanet_quant_only, method = "pearson", use = "complete.obs")
## Plot using corrplot
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust")
The argument order="hclust" applies hierarchical
clustering to rearrange the ordering of variables in the matrix such
that clusters appear together. This allows us to more easily locate
blocks of variables that are all highly correlated with each other in
the same way. For example, “Footprint”, “GDPperCapita”, “Happiness”,
“HLY”, “LifeExpectancy”, and “HDI” form a cluster of positively
correlated variables.
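Since the correlation matrix is symmetric, you can also show just the upper triangle with the type argument to reduce clutter (a sketch; the other arguments are the same as above):
## Show only the upper triangle of the (symmetric) correlation matrix
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order = "hclust", type = "upper")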
Note: You can manually change the colors that the corrplot uses by adding the following bit of code:
# DO NOT USE THESE COLORS!!!
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust", col= colorRampPalette(c("green","blue", "purple"))(10))
Question 1: What benefit does adding the ‘heatmap’ information to the correlation matrix provide?
Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable with one explanatory variable, and is assumed to be of the form
\[ y = \beta_0 + X\beta_1 + \epsilon \]
Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):
\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \] Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:
\[ e_i = y_i - \hat{y}_i \]
Ultimately, the estimated values for the \(\beta\)’s are those chosen so as to minimize the sum of squared residuals, \(\sum_{i=1}^n e_i^2\).
We can use the lm() function to find these estimates
using our sample data, thereby allowing us to write out the equation of
the fitted regression line:
## Fit the model
fitted_model = lm(Happiness ~ LifeExpectancy, data = HappyPlanet)
## Print the coefficient estimates
coef(fitted_model)
## (Intercept) LifeExpectancy
## -1.1037161 0.1035194
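As a sketch of the “minimize the sum of squared residuals” idea (assuming Happiness and LifeExpectancy have no missing values, which the correlation matrix above suggests), we can compare the fitted model’s residual sum of squares to that of a line with a slightly different slope; the fitted line should come out smaller.
## Residual sum of squares for the fitted line
sum(resid(fitted_model)^2)
## Residual sum of squares for a line with the slope nudged to 0.11 (same intercept)
sum((HappyPlanet$Happiness - (-1.1037161 + 0.11 * HappyPlanet$LifeExpectancy))^2)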
In the previous scatterplots we saw a linear relationship between Happiness and LifeExpectancy in the HappyPlanet data, allowing us to perform this linear regression.
In this example, \(\hat{\beta}_0 = -1.104\) and \(\hat{\beta}_1 = 0.104\), so we estimate the happiness of a country whose citizens have an average life expectancy of zero to be -1.104 (a meaningless extrapolation), and we expect each additional year of life expectancy to increase a country’s predicted happiness score by 0.104 points.
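To illustrate this interpretation, predict() applies the fitted equation to new values of LifeExpectancy; the values 60 and 61 below are arbitrary and chosen only to show the 0.104-point difference per additional year (a sketch):
## Predicted Happiness for life expectancies of 60 and 61 years; the difference is the slope
predict(fitted_model, newdata = data.frame(LifeExpectancy = c(60, 61)))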
The summary() function can be applied to your linear
model to give us more info about it.
summary(fitted_model)
##
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = HappyPlanet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27971 -0.58685 -0.00041 0.61388 1.47744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.103716 0.398908 -2.767 0.00642 **
## LifeExpectancy 0.103519 0.005804 17.835 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7637 on 141 degrees of freedom
## Multiple R-squared: 0.6929, Adjusted R-squared: 0.6907
## F-statistic: 318.1 on 1 and 141 DF, p-value: < 2.2e-16
For now we will just focus on the Multiple R-squared and
Estimate parts, but we will return to the rest later. The
other parts correspond to hypothesis tests for the slope/intercept
coefficients being 0, with small p-values suggesting statistically
discernible slope or intercept coefficients.
Multiple R-squared is the \(R^2\) we saw in last week’s lecture.
This saves us from having to go back to the correlation plot and square
the correlation to calculate \(R^2\).
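As a quick check (assuming Happiness and LifeExpectancy are fully observed, as the correlation matrix suggests), the squared correlation should reproduce the Multiple R-squared reported by summary():
## Squared correlation vs. R-squared from the model summary; both should be about 0.693
cor(HappyPlanet$LifeExpectancy, HappyPlanet$Happiness)^2
summary(fitted_model)$r.squared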
We are going to return to the iris dataset we used
at the very end of the last lab. More information about the data is available in R’s built-in documentation (run ?iris).
# load `iris` dataset
data(iris)
Part A: Make a correlation heat-map matrix of the
quantitative variables in the iris dataset. Which two
variables have the strongest correlation?
Part B: Make a scatterplot of the two strongly
correlated variables in Part A. Describe the relationship with context.
Use the loess smoother to justify the form of the
relationship.
Part C: Fit a regression model using
Petal.Width to predict Petal.Length. Write out
the coefficient values and interpret them.
Part D: Would it be appropriate to predict
Petal.Length for an iris flower with
Petal.Width = 4? What about
Petal.Width = 0.8?
Part E: Suppose I collect a measurement on a different iris flower with a Petal length of 3.25 and a Petal width of 0.8. Calculate the predicted value for this flower according to the regression equation. Calculate and interpret the residual for this flower.
As we have seen in today’s lecture, we can extend linear regression
to work with categorical predictors. We saw this briefly using the
iris data in the last lab. We revisit it below.
lm(Petal.Length ~ Species, data = iris)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesversicolor Speciesvirginica
## 1.462 2.798 4.090
By default R will make one of the categories (the first factor level) the reference
group. This impacts our coefficient interpretations. Notice that Setosa
is our reference group.
1.462 is the predicted petal length for a Setosa flower.
2.798 is the adjustment we make to get the prediction for versicolor’s petal length, which is 1.462 + 2.798 = 4.26.
4.09 is the adjustment we make to get the prediction for virginica’s petal length, which is 1.462 + 4.09 = 5.552.
Sanity check: Double check the above plot to make sure our calculations make sense.
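If you would like a different category to serve as the reference group, one option is relevel(); the sketch below uses virginica as the reference purely for illustration.
## Refit with virginica as the reference group; the other coefficients become adjustments from virginica
lm(Petal.Length ~ relevel(Species, ref = "virginica"), data = iris)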
If you want R not to use a reference group, use the
following code which forces an intercept of 0.
lm(Petal.Length ~ Species+0, data=iris)
##
## Call:
## lm(formula = Petal.Length ~ Species + 0, data = iris)
##
## Coefficients:
## Speciessetosa Speciesversicolor Speciesvirginica
## 1.462 4.260 5.552
Now all of these coefficients correspond directly to the group predictions.
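As a sanity check (a sketch using dplyr, which was loaded at the start of the lab), these coefficients should match the species means of Petal.Length:
## Species means of Petal.Length; these should equal the coefficients above
iris %>% group_by(Species) %>% summarize(mean_petal_length = mean(Petal.Length))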
Question 3: Write out the linear regression equation
for both of these lm() outputs corresponding to the iris
species using indicator notation.
Creating indicator variables (also called One-Hot Encoding) is not necessary
for running the lm() code in the previous examples. The
following is a scenario where it could be helpful, along with how to perform it
using the mutate and ifelse functions in
R. Remember that ifelse first takes a
condition, then returns the first specified value if the condition is met,
and returns the second specified value if the condition is not
met.
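A tiny standalone sketch of the ifelse() pattern (the toy vector is made up for illustration):
## ifelse(condition, value_if_true, value_if_false), applied element-by-element
ifelse(c(1, 5, 7, 2) == 7, "region seven", "other region")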
# indicators for regions 1 and 7
HappyPlanet_indicators = HappyPlanet %>%
mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))
You can use indicators to subset or filter data (though this is not
necessarily efficient). Indicators allow us to specify only certain
regions to use in a regression. For example, the following code uses all
of Regions 1 through 7 in the Region variable. The result is
messy.
# regression for Happiness using Regions as predictors
# names could be a lot better
lm(data = HappyPlanet, Happiness~as.factor(Region)+0)
##
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = HappyPlanet)
##
## Coefficients:
## as.factor(Region)1 as.factor(Region)2 as.factor(Region)3 as.factor(Region)4
## 6.913 7.554 5.988 4.048
## as.factor(Region)5 as.factor(Region)6 as.factor(Region)7
## 5.586 6.317 5.737
If instead we were only interested in making predictions for Regions 1 and 7, we could just use the corresponding indicators. So indicators can be helpful for singling out a subset of categories.
# regression for Happiness only Regions 1 and 7
lm(data=HappyPlanet_indicators, Happiness~Reg1_ind+Reg7_ind)
##
## Call:
## lm(formula = Happiness ~ Reg1_ind + Reg7_ind, data = HappyPlanet_indicators)
##
## Coefficients:
## (Intercept) Reg1_ind1 Reg7_ind1
## 5.71304 1.19946 0.02399
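To read this output (a sketch of the arithmetic): the predicted Happiness for a Region-1 country is the intercept plus the Reg1_ind1 coefficient, which matches the Region-1 coefficient from the all-regions model above.
## Intercept + Region-1 adjustment = predicted Happiness for Region 1 (about 6.91)
5.71304 + 1.19946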
Including categorical variable information in our regression can be
useful for seeing whether different groups behave differently. Consider the iris
data again with the Petal.Length vs
Petal.Width plot we saw previously; this time the points
are colored by Species.
## scatterplot
iris %>% ggplot(aes(Petal.Width,Petal.Length,color=Species)) + geom_point()
Question 5: What behavior can we generally see with the addition of the Species information? Should we be using the same linear regression equation to make predictions for each species?
Question 6: Modify the iris data
regression from Question 2C to include the Species
variable. There should be 4 total coefficients in the regression
output.
Below, the regression fit from Question 6 is shown on the scatterplot.
# install.packages("moderndive")
library(moderndive)
## scatterplot with regression
ggplot(data = iris, aes(Petal.Width, Petal.Length, color=Species)) +
geom_point() + geom_parallel_slopes(se=F)
Question 7: Use the above scatterplot output to make interpretations of the coefficients in Question 6. Hint: Remember reference groups!
The following is a modification of the linear model using all of the same variables we were just using, but specified in a different way. It allows each group to have its own slope and intercept, rather than the same slope with different intercepts.
# scatterplot
ggplot(data = iris, aes(Petal.Width, Petal.Length, color=Species)) +
geom_point() + geom_smooth(method = "lm", se=F)
# regression
lm(data=iris, Petal.Length ~ Petal.Width*Species)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
##
## Coefficients:
## (Intercept) Petal.Width
## 1.3276 0.5465
## Speciesversicolor Speciesvirginica
## 0.4537 2.9131
## Petal.Width:Speciesversicolor Petal.Width:Speciesvirginica
## 1.3228 0.1008
# regression to just the Setosa group
lm(data=iris[iris$Species=="setosa",], Petal.Length~Petal.Width)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris[iris$Species ==
## "setosa", ])
##
## Coefficients:
## (Intercept) Petal.Width
## 1.3276 0.5465
Question 9: Interpret the Intercept
and Petal.Width coefficients in the
lm(data=iris, Petal.Length ~ Petal.Width*Species) model.
These should match the regression for just the Setosa data.
Question 10: (Tricky) Interpret the
Speciesversicolor and
Petal.Width:Speciesversicolor coefficients. Hint:
It may be helpful to compare the previous coefficients in Question 9
with the following regression output and see how they match up. They
should be slope and intercept adjustments.
lm(data=iris[iris$Species=="versicolor",], Petal.Length~Petal.Width)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris[iris$Species ==
## "versicolor", ])
##
## Coefficients:
## (Intercept) Petal.Width
## 1.781 1.869
Question 11: Using the
lm(data=iris, Petal.Length ~ Petal.Width*Species) model,
what is the predicted petal length for a virginica flower with a petal
width of 2? How does this compare to the regression model that ignored
the species variable? Are the two answers very different?
We can look at the \(R^2\) output from the different linear regression models to get a quick sense of how the different models are performing. Remember that \(R^2\) is the ‘percentage of variation in y explained by our regression model’, with values closer to 1 indicating a better fit and values closer to 0 indicating a poor fit. For reference, a strong correlation of 0.7 translates to an \(R^2\) of \(0.7^2 = 0.49\). “Multiple” \(R^2\) generalizes to the case of multiple predictors and is not directly translatable back to a single correlation \(r\), but it otherwise still tells us how well the model fits.
The following code gives the \(R^2\) values for the different linear models with the iris data.
# regression ignoring species data
summary(lm(data=iris, Petal.Length ~ Petal.Width))$r.squared
## [1] 0.9271098
# regression with parallel slopes, intercept depends on species
summary(lm(data=iris, Petal.Length ~ Petal.Width + Species))$r.squared
## [1] 0.9551318
# regression with different slopes and different intercepts for each species
summary(lm(data=iris, Petal.Length ~ Petal.Width*Species))$r.squared
## [1] 0.9594774
Question 12: How well does it seem the regression ignoring species performs?
Question 13: How well does the model with parallel slopes perform compared to ignoring species? Is the model harder to interpret than when we ignored species? Does the increase in complexity seem justified by better performance?
Question 14: Does the addition of different slopes to the regression result in increased performance? Do you think the increase in complexity is worth the better performance?