Outline

The goal of this lab is two-fold: to continue practicing regression content, and to introduce more functions and R methods that are helpful when working with regression techniques.

Run the following code to load packages and datasets.

# edit this R chunk from {r} to {r, warning=F, message=F} to hide package messages
library(ggplot2)
library(dplyr)

data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=T, sep=",")
HappyPlanet = data.frame(data) %>% select(-Region.2)

Assessing “Linearity”

This lab covers methods for analyzing two quantitative variables (we will get to categorical variables, like those in the lecture, shortly). We will return to the Happy Planet data that we used in the previous regression lab. The starting point in such an analysis should always be to graph the variables involved using a scatterplot, which is done using geom_point() in ggplot:

ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point()

When looking at a scatter plot we want to judge the following:

  1. Form - is there a pattern/relationship? If so, is it linear or non-linear?
  2. Direction - do higher values in one variable tend to correspond with higher values in the other (a positive relationship)? Or do higher values tend to pair with lower values (a negative or inverse relationship)?
  3. Strength - how closely do the data adhere to the pattern/relationship you identified (i.e., strong, moderate, weak, etc.)?

These attributes can be difficult to judge, but we can use smoothing to help assess form:

## Add loess smoother with error band
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess")

The "loess" smoothing method essentially takes a locally weighted average of the trend present in a certain region of the graph, thereby allowing the trend line to have different slopes in different locations.
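The amount of smoothing is controlled by the span argument of geom_smooth() (the loess default is 0.75). You do not need to tune this for the lab, but here is a quick sketch of what a smaller span looks like (the value 0.4 is purely for illustration):

## Smaller span values follow local wiggles more closely; larger values give a smoother curve
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess", span = 0.4)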

We can plot a "loess" smoother on top of a linear regression line to judge whether a relationship can reasonably be considered “linear”:

## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(HappyPlanet, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

In this example, we see that the regression line is largely within the error band of the "loess" smoother, suggesting it is reasonable to conclude that there is an approximately linear relationship between the variables “LifeExpectancy” and “Happiness”.

To provide a second example, the data below exhibit a non-linear relationship: the regression line is noticeably above the “loess” smoother’s error band for GDP per capita values less than $2000, and noticeably below it for values between $2000 and $10000.

## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(HappyPlanet, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red")

It might be more reasonable to describe this relationship as logarithmic (changing at a slowing rate). Other types of relationships you should be comfortable identifying include exponential (changing at an increasing rate), piecewise linear (two distinct slopes in different parts of the graph), and quadratic (parabolic or U-shaped).
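As an optional check (not something the lab requires), you can redo the loess-versus-line comparison after log-transforming GDPperCapita; if the relationship really is roughly logarithmic, the red line should now stay much closer to the loess smoother. Rows with a missing GDPperCapita are simply dropped with a warning.

## Optional sketch: does a log transform of GDPperCapita make the relationship look linear?
ggplot(HappyPlanet, aes(x = log(GDPperCapita), y = Happiness)) + geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red")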

Correlation Matrix Heatmap

(We saw the first part of this in the previous lab.) Oftentimes when we want to examine the correlation between many variables at the same time, it can be helpful to arrange the correlations in a table or matrix. The output below computes Pearson’s correlation coefficient between the quantitative variables in the HappyPlanet data.

HappyPlanet_quant_only = HappyPlanet %>% select(where(is.numeric))
cor(x = HappyPlanet_quant_only, method = "pearson")
##                     Region   Happiness LifeExpectancy  Footprint         HLY
## Region          1.00000000 -0.38873868    -0.17287036 -0.1933923 -0.34928587
## Happiness      -0.38873868  1.00000000     0.83238194  0.6435921  0.97440023
## LifeExpectancy -0.17287036  0.83238194     1.00000000  0.6272413  0.92450723
## Footprint      -0.19339232  0.64359212     0.62724130  1.0000000  0.69322700
## HLY            -0.34928587  0.97440023     0.92450723  0.6932270  1.00000000
## HPI            -0.22590887  0.58999384     0.54337762 -0.1806185  0.55727149
## HPIRank         0.21939986 -0.55521562    -0.52066824  0.2100918 -0.52612709
## GDPperCapita            NA          NA             NA         NA          NA
## HDI                     NA          NA             NA         NA          NA
## Population      0.09641703  0.04270338     0.01359268 -0.0686804  0.02377288
##                       HPI    HPIRank GDPperCapita HDI  Population
## Region         -0.2259089  0.2193999           NA  NA  0.09641703
## Happiness       0.5899938 -0.5552156           NA  NA  0.04270338
## LifeExpectancy  0.5433776 -0.5206682           NA  NA  0.01359268
## Footprint      -0.1806185  0.2100918           NA  NA -0.06868040
## HLY             0.5572715 -0.5261271           NA  NA  0.02377288
## HPI             1.0000000 -0.9869268           NA  NA  0.13781113
## HPIRank        -0.9869268  1.0000000           NA  NA -0.15832782
## GDPperCapita           NA         NA            1  NA          NA
## HDI                    NA         NA           NA   1          NA
## Population      0.1378111 -0.1583278           NA  NA  1.00000000

Each cell of the correlation matrix contains the pairwise correlation between the two variables represented by the corresponding row and column labels. This is why all of the diagonal elements of the matrix have a value of exactly 1, since every variable is perfectly correlated with itself.
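If you only need a single correlation rather than the whole matrix, you can pass the two columns to cor() directly; the result should match the corresponding cell of the matrix above (about 0.832 for Happiness and LifeExpectancy).

## Correlation between just two variables
cor(HappyPlanet$Happiness, HappyPlanet$LifeExpectancy, method = "pearson")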

You should also notice the NA values in the matrix, which are due to some variables not having data on every country. We can use the argument use = "complete.obs" to only use countries with complete data (this will filter the data to remove any rows with missing values in one or more variables); however, you should notice that this will affect the entire correlation matrix, not just the variables with missing data:

cor(x = HappyPlanet_quant_only, method = "pearson", use = "complete.obs")
##                     Region   Happiness LifeExpectancy   Footprint         HLY
## Region          1.00000000 -0.39403612    -0.18329022 -0.19614183 -0.35715858
## Happiness      -0.39403612  1.00000000     0.83342784  0.64329499  0.97482271
## LifeExpectancy -0.18329022  0.83342784     1.00000000  0.62661887  0.92457559
## Footprint      -0.19614183  0.64329499     0.62661887  1.00000000  0.69236646
## HLY            -0.35715858  0.97482271     0.92457559  0.69236646  1.00000000
## HPI            -0.23185284  0.59024399     0.54409200 -0.18107454  0.55783015
## HPIRank         0.22555826 -0.55519491    -0.52064898  0.21120155 -0.52611546
## GDPperCapita   -0.24143178  0.69768299     0.66620716  0.88482249  0.75209315
## HDI            -0.14590423  0.83559974     0.92656383  0.73675814  0.90758362
## Population      0.09973282  0.04256389     0.01391005 -0.06962928  0.02364168
##                        HPI     HPIRank GDPperCapita          HDI   Population
## Region         -0.23185284  0.22555826  -0.24143178 -0.145904231  0.099732817
## Happiness       0.59024399 -0.55519491   0.69768299  0.835599738  0.042563894
## LifeExpectancy  0.54409200 -0.52064898   0.66620716  0.926563827  0.013910049
## Footprint      -0.18107454  0.21120155   0.88482249  0.736758140 -0.069629281
## HLY             0.55783015 -0.52611546   0.75209315  0.907583624  0.023641676
## HPI             1.00000000 -0.98695406  -0.02162382  0.376950260  0.138471998
## HPIRank        -0.98695406  1.00000000   0.04132563 -0.352355297 -0.158931586
## GDPperCapita   -0.02162382  0.04132563   1.00000000  0.779706121 -0.042591186
## HDI             0.37695026 -0.35235530   0.77970612  1.000000000 -0.008187599
## Population      0.13847200 -0.15893159  -0.04259119 -0.008187599  1.000000000
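An alternative to use = "complete.obs" is use = "pairwise.complete.obs", which computes each correlation from the rows that are complete for that particular pair of variables. This keeps more of the data, but different cells of the matrix may then be based on different sets of countries.

## Each pairwise correlation uses whatever rows are complete for that pair of variables
cor(x = HappyPlanet_quant_only, method = "pearson", use = "pairwise.complete.obs")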

For data sets containing many quantitative variables, it can be difficult to wade through all of the correlations reported in the correlation matrix. In these circumstances it can be beneficial to visualize the correlation matrix and use aesthetics like color and position to help understand the relationships between variables in the data set. The corrplot package and its corrplot() function are demonstrated below:

# install.packages("corrplot")
library(corrplot)

## Store the correlation matrix
my_corr_matrix = cor(x = HappyPlanet_quant_only, method = "pearson", use = "complete.obs")

## Plot using corrplot
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust")

The argument order="hclust" applies hierarchical clustering to rearrange the ordering of variables in the matrix such that clusters appear together. This allows us to more easily locate blocks of variables that are all highly correlated with each other in the same way. For example, “Footprint”, “GDPperCapita”, “Happiness”, “HLY”, “LifeExpectancy”, and “HDI” form a cluster of positively correlated variables.

Note: You can manually change the colors that corrplot uses by adding the following bit of code:

# DO NOT USE THESE COLORS!!!
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust", col= colorRampPalette(c("green","blue", "purple"))(10))

Question 1: What benefit does adding the ‘heatmap’ information to the correlation matrix provide?


Simple Linear Regression

Recall that simple linear regression (as opposed to multiple linear regression) models a continuous/quantitative outcome variable using a single explanatory variable, and assumes a relationship of the form

\[ y = \beta_0 + X\beta_1 + \epsilon \]

Note that this formula represents “a line + error”. Once we have estimated values for \(\beta\) with \(\hat{\beta}\), we can also give a predicted value of \(y\) for a given value of \(X\):

\[ \hat{y} = \hat{\beta}_0 + X \hat{\beta}_1 \]

Our estimate of the error term is called the residual and is given by the difference between the true value of \(y\) and our prediction:

\[ e_i = y_i - \hat{y}_i \]

Ultimately, the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are chosen so as to minimize the residual sum of squares, \(\sum_{i=1}^n e_i^2\).

We can use the lm() function to find these estimates using our sample data, thereby allowing us to write out the equation of the fitted regression line:

## Fit the model
fitted_model = lm(Happiness ~ LifeExpectancy, data = HappyPlanet)

## Print the coefficient estimates
coef(fitted_model)
##    (Intercept) LifeExpectancy 
##     -1.1037161      0.1035194
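As a quick sanity check (not something you need to do in practice), the simple linear regression estimates can also be computed directly from the least-squares formulas; since neither column has missing values here, these should reproduce the coef() output above.

## beta1_hat = cov(x, y) / var(x);  beta0_hat = ybar - beta1_hat * xbar
b1 = cov(HappyPlanet$LifeExpectancy, HappyPlanet$Happiness) / var(HappyPlanet$LifeExpectancy)
b0 = mean(HappyPlanet$Happiness) - b1 * mean(HappyPlanet$LifeExpectancy)
c(b0, b1)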

In the previous scatterplots we saw a linear relationship between Happiness and LifeExpectancy in the HappyPlanet data, allowing us to perform this linear regression.

In this example, \(\hat{\beta}_0 = -1.104\) and \(\hat{\beta}_1 = 0.104\), so we estimate the happiness of a country whose citizens have an average life expectancy of zero to be -1.104 (a meaningless extrapolation), and we expect each 1-year increase in a country’s average life expectancy to increase its predicted happiness score by about 0.104 points.
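To get a prediction from the fitted line, you can either plug a value into the equation by hand or use predict(). For example, using a (hypothetical) average life expectancy of 70 years gives roughly -1.104 + 0.104(70) ≈ 6.1:

## Predicted Happiness for a hypothetical country with LifeExpectancy = 70
predict(fitted_model, newdata = data.frame(LifeExpectancy = 70))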

The summary() Function

The summary() function can be applied to your linear model to give us more info about it.

summary(fitted_model)
## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = HappyPlanet)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27971 -0.58685 -0.00041  0.61388  1.47744 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.103716   0.398908  -2.767  0.00642 ** 
## LifeExpectancy  0.103519   0.005804  17.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7637 on 141 degrees of freedom
## Multiple R-squared:  0.6929, Adjusted R-squared:  0.6907 
## F-statistic: 318.1 on 1 and 141 DF,  p-value: < 2.2e-16

For now we will just focus on the Multiple R-squared and Estimate parts; we will return to the rest later. The other parts correspond to hypothesis tests of whether the slope and intercept coefficients are 0, with small p-values suggesting statistically discernible slope or intercept coefficients. Multiple R-squared is the \(R^2\) we saw in last week’s lecture, which saves us from having to go back to the correlation and square it to calculate \(R^2\).
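If you just want the R-squared value by itself, you can pull it out of the summary object; for a simple linear regression it should equal the squared correlation between the two variables.

## Extract R-squared from the model summary
summary(fitted_model)$r.squared

## For simple linear regression this matches the squared correlation
cor(HappyPlanet$Happiness, HappyPlanet$LifeExpectancy)^2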

Question 2 – Iris data

We are going to return to the iris dataset we used at the very end of the last lab. More information about the data is available by running ?iris in R.

# load `iris` dataset
data(iris)

Part A: Make a correlation heat-map matrix of the quantitative variables in the iris dataset. Which two variables have the strongest correlation?

Part B: Make a scatterplot of the two strongly correlated variables in Part A. Describe the relationship with context. Use the loess smoother to justify the form of the relationship.

Part C: Fit a regression model using Petal.Width to predict Petal.Length. Write out the coefficient values and interpret them.

Part D: Would it be appropriate to predict Petal.Length for an iris flower with Petal.Width = 4? What about Petal.Width = 0.8?

Part E: Suppose I collect a measurement on a different iris flower with a Petal length of 3.25 and a Petal width of 0.8. Calculate the predicted value for this flower according to the regression equation. Calculate and interpret the residual for this flower.

Linear Regression with Categorical Predictors

As we have seen in today’s lecture, we can extend linear regression to work with categorical predictors. We saw this briefly using the iris data in the last lab. We revisit it below.

lm(Petal.Length ~ Species, data = iris)
## 
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
## 
## Coefficients:
##       (Intercept)  Speciesversicolor   Speciesvirginica  
##             1.462              2.798              4.090

By default R will make one of the categories a reference group (the first level of the factor, which is alphabetical unless you specify otherwise). This impacts our coefficient interpretations. Notice that Setosa is our reference group.

1.462 is the predicted Petal length for a Setosa flower.

2.798 is the adjustment we make to get the prediction for versicolor’s petal length, which will be 1.462 + 2.798 = 4.26.

4.09 is the adjustment we make to get the prediction for virginica’s petal length, which will be 1.462 + 4.09 = 5.552.

Sanity check: Double-check these calculations against a plot of Petal.Length by Species (such as the scatterplot from the previous lab) to make sure they make sense.
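If that plot is not handy, a quick alternative check (a sketch, not required for the lab) is to compute the species means of Petal.Length directly; they should match the predictions 1.462, 4.26, and 5.552 worked out above.

## Group means of Petal.Length by Species
iris %>% group_by(Species) %>% summarize(mean_Petal.Length = mean(Petal.Length))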

If you want R not to use a reference group, use the following code, which forces an intercept of 0.

lm(Petal.Length ~ Species+0, data=iris)
## 
## Call:
## lm(formula = Petal.Length ~ Species + 0, data = iris)
## 
## Coefficients:
##     Speciessetosa  Speciesversicolor   Speciesvirginica  
##             1.462              4.260              5.552

Now all of these coefficients correspond directly to the group predictions.

Question 3: Write out the linear regression equation for both of these lm() outputs corresponding to the iris species using indicator notation.
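Side note: if you would rather keep a reference group but change which category plays that role, one option is base R's relevel() function. A sketch (the copy iris_relevel is a name introduced here just for illustration):

## Sketch: make versicolor the reference group instead of setosa
iris_relevel = iris
iris_relevel$Species = relevel(iris_relevel$Species, ref = "versicolor")
lm(Petal.Length ~ Species, data = iris_relevel)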

More on Indicators

Creating indicator variables (also called one-hot encoding) is not necessary for running the lm() code in the previous examples. The following is a scenario where it can be helpful, along with how to create indicators using the mutate() and ifelse() functions in R. Remember that ifelse() takes a condition first, then returns its second argument if the condition is met and its third argument if it is not.

# indicators for regions 1 and 7
HappyPlanet_indicators = HappyPlanet %>% 
  mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
         Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))

You can use indicators to subset or filter data (though this is not necessarily efficient); a quick sketch of this is shown below. Indicators also allow us to specify only certain regions to use in a regression. The regression following the sketch uses all seven levels of the Region variable, and the result is messy.
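For example, here is a sketch of using the indicator to keep only the Region 1 countries (filtering on Region directly would work just as well):

## Keep only the Region 1 countries using the indicator column
HappyPlanet_indicators %>% filter(Reg1_ind == "1")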

# regression for Happiness using Regions as predictors
# names could be a lot better
lm(data = HappyPlanet, Happiness~as.factor(Region)+0)
## 
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = HappyPlanet)
## 
## Coefficients:
## as.factor(Region)1  as.factor(Region)2  as.factor(Region)3  as.factor(Region)4  
##              6.913               7.554               5.988               4.048  
## as.factor(Region)5  as.factor(Region)6  as.factor(Region)7  
##              5.586               6.317               5.737

If instead we were only interested in making predictions for Regions 1 and 7, we could just use the corresponding indicators. So indicators can be helpful for choosing a subset of categories.

# regression for Happiness only Regions 1 and 7
lm(data=HappyPlanet_indicators, Happiness~Reg1_ind+Reg7_ind)
## 
## Call:
## lm(formula = Happiness ~ Reg1_ind + Reg7_ind, data = HappyPlanet_indicators)
## 
## Coefficients:
## (Intercept)    Reg1_ind1    Reg7_ind1  
##     5.71304      1.19946      0.02399
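These coefficients are again group means in disguise: the intercept should be the mean Happiness for countries in neither region, and adding the Reg1_ind1 or Reg7_ind1 coefficient should recover the Region 1 and Region 7 means from the earlier output (6.913 and 5.737). A quick check:

## Mean Happiness for the three groups: neither region, Region 1, Region 7
HappyPlanet_indicators %>%
  group_by(Reg1_ind, Reg7_ind) %>%
  summarize(mean_Happiness = mean(Happiness), .groups = "drop")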

Combining Quantitative and Categorical Predictors

Including categorical variable information in our regression can be useful for seeing whether different groups behave differently. Consider the iris data again with the Petal.Length vs Petal.Width plot we saw previously; this time the species information is included.

## scatterplot
iris %>% ggplot(aes(Petal.Width,Petal.Length,color=Species)) + geom_point()

Question 5: What pattern do we generally see once the Species information is added? Should we be using the same linear regression equation to make predictions for each species?

Question 6: Modify the iris data regression from Question 2C to include the Species variable. There should be 4 total coefficients in the regression output.

Below, the regression from Question 6 is shown on top of the scatterplot.

# install.packages("moderndive")
library(moderndive)
## scatterplot with regression
ggplot(data = iris, aes(Petal.Width, Petal.Length, color=Species)) +
  geom_point() + geom_parallel_slopes(se=F)

Question 7: Use the above scatterplot output to make interpretations of the coefficients in Question 6. Hint: Remember reference groups!

The following is a modification of the linear model using the same variables as before, but specified in a different way: it allows each group to have its own slope and its own intercept, rather than just different intercepts with a shared slope.

# scatterplot
ggplot(data = iris, aes(Petal.Width, Petal.Length, color=Species)) +
  geom_point() + geom_smooth(method = "lm", se=F)

# regression
lm(data=iris, Petal.Length ~ Petal.Width*Species)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width * Species, data = iris)
## 
## Coefficients:
##                   (Intercept)                    Petal.Width  
##                        1.3276                         0.5465  
##             Speciesversicolor               Speciesvirginica  
##                        0.4537                         2.9131  
## Petal.Width:Speciesversicolor   Petal.Width:Speciesvirginica  
##                        1.3228                         0.1008
# regression to just the Setosa group
lm(data=iris[iris$Species=="setosa",], Petal.Length~Petal.Width)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris[iris$Species == 
##     "setosa", ])
## 
## Coefficients:
## (Intercept)  Petal.Width  
##      1.3276       0.5465

Question 9: Interpret the Intercept and Petal.Width coefficients in the lm(data=iris, Petal.Length ~ Petal.Width*Species) model. These should match the regression fit to just the Setosa data.

Question 10: (Tricky) Interpret the Speciesversicolor and Petal.Width:Speciesversicolor coefficients. Hint: It may be helpful to compare the previous coefficients in Question 9 with the following regression output and see how they match up. They should be slope and intercept adjustments.

lm(data=iris[iris$Species=="versicolor",], Petal.Length~Petal.Width)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris[iris$Species == 
##     "versicolor", ])
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.781        1.869

Question 11: Using the lm(data=iris, Petal.Length ~ Petal.Width*Species) model, what is the predicted petal length for a virginica flower with a petal width of 2? How does this compare to the regression model that ignored the species variable? Are the two answers very different?


We can look at the \(R^2\) output from the different linear regression models to get a quick sense of how they are performing. Remember that \(R^2\) is the ‘percentage of variation in y explained by our regression model’, with values closer to 1 indicating a better fit and values closer to 0 indicating a poorer fit. For reference, a strong correlation of 0.7 would translate to an \(R^2\) of \(0.7^2 = 0.49\). “Multiple” \(R^2\) generalizes this to multiple predictors and is not directly translatable back to a single correlation \(r\), but it otherwise still tells us how well the model fits.

The following code gives the \(R^2\) values for the different linear models with the iris data.

# regression ignoring species data
summary(lm(data=iris, Petal.Length ~ Petal.Width))$r.squared
## [1] 0.9271098
# regression with parallel slopes, intercept depends on species
summary(lm(data=iris, Petal.Length ~ Petal.Width + Species))$r.squared
## [1] 0.9551318
# regression with different slopes and different intercepts for each species
summary(lm(data=iris, Petal.Length ~ Petal.Width*Species))$r.squared
## [1] 0.9594774

Question 12: How well does the regression that ignores species seem to perform?

Question 13: How well does the model with parallel slopes perform compared to ignoring species? Is the model harder to interpret than when we ignored species? Does the increase in complexity seem justified by better performance?

Question 14: Does the addition of different slopes to the regression result in increased performance? Do you think the increase in complexity is worth the better performance?