library(ggplot2)
library(dplyr)

data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=T, sep=",")
Happy = data.frame(data)
Happy$Region = as.factor(Happy$Region)

Predicting `Happiness` with `LifeExpectancy`

Let’s go ahead and try to use LifeExpectancy to predict Happiness. Use the scatterplot below and the correlation matrix above to answer the following questions.

# Scatterplot of Happiness vs. LifeExpectancy
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point()

Question 1A: Which of these is the explanatory and which is the response variable? Explain.

Happiness is the response since that is what we are predicting, life expectancy is the explanatory because we are using it to predict.

Question 1B: Describe the general relationship between LifeExpectancy and Happiness.

There is a positive moderately strong linear relationship between Life Expectancy and Happiness.

Question 1C: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.

Yes. The relationship is linear, so Pearson’s correlation can be used.

Question 1D: State the value of the correlation between LifeExpectancy and Happiness. Interpret the value of this correlation.

The correlation is 0.833. It tells us that the linear relationship between the variables is positive and strong.

Linear Regression

Next, we are going to fit the linear regression line to the Happiness and LifeExpectancy scatterplot we just saw a second ago. To do this, we are going to use the lm() function in R. It stands for Linear Model. The syntax that the function uses is variable1 ~ variable2 which means “predict variable1 using variable 2”, and then we also need to tell R which data set these variables are coming from. Take a second to read the code that makes the linear regression line below, then use the output to answer the following questions.

# Linear regression for HDI vs Happiness.
fit = lm(data=Happy, Happiness~LifeExpectancy)
fit

## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = Happy)
## 
## Coefficients:
##    (Intercept)  LifeExpectancy  
##        -1.1037          0.1035

# Plot regression on scatterplot
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point() +
  geom_smooth(method='lm', se=F) +
  geom_label(x=55, y=8, label = paste("Predicted Happiness = -1.104 + 0.104*LifeExpectancy"))

Question 2A: State the regression equation using the variable names.

Predicted Happiness = -1.104 + 0.104 \(\times\) LifeExpectancy.

Question 2B: What is the value of the slope? Interpret the value of the slope in context.

The slope value is 0.104. For every 1-year increase in life expectancy, the predicted Happiness increases by 0.104.

Question 2C: What is the value of the intercept? Would it be appropriate to interpret the y-intercept? If yes, interpret the value of the y-intercept. If not, explain why.

The value of the intercept is -1.104. It is not appropriate to interpret because happiness can’t be negative and LifeExpectancy for a country cannot reasonably be zero.

Question 2D: What is the predicted happiness for a country that has a life expectancy of 77.9 years? Show your calculation.

-1.104 + 0.104*(77.9)

## [1] 6.9976

Question 2E: What is the value of the residual for the United States? Interpret the value of the residual. The value of the US’s LifeExpectancy and Happiness variables are:

Happy %>% filter(Country == "United States of America") %>% select(LifeExpectancy, Happiness)

##   LifeExpectancy Happiness
## 1           77.9       7.9

# residual: e = y - y_hat = observed - predicted Happiness
7.9 - 6.9976

## [1] 0.9024

We have under-predicted the US’s happiness by 0.9024.

Question 2F: What is the value of the coefficient of determination (R^2) between the happiness of a country and its life expectancy? Interpret the value of R^2 (do not use correlation interpretation).

cor(Happy$Happiness, Happy$LifeExpectancy)^2

## [1] 0.6928597

We can square the correlation value we got earlier to get \(R^2 = 0.693\). 69.3% of variation in Happiness values for countries can be explained with our regression model using Life Expectancy as our predictor.

Linear Regression (Categorical Predictor)

(The following is going to be adapting the linear regression stuff for a quantitative variable to use with categorical, much of this will be you figuring things out via context)

We are going to use linear regression using a categorical predictor now. We will use a separate dataset for this, since the number of categories in the Happy Planet data set is too large to practice these ideas. Instead we will use the Iris dataset. We are going to use the Species variable to help us predict values of Petal.Length for the iris flowers. Use the following linear regression results and corresponding graph to answer the following questions.

## 
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
## 
## Coefficients:
##       (Intercept)  Speciesversicolor   Speciesvirginica  
##             1.462              2.798              4.090

Question 3A: What species does the intercept correspond to in the linear regression output?

Answer: Setosa

Question 3B: Write out the linear regression equation with appropriate notation and context.

Answer: Predicted petal length = 1.462 + 2.789\(\times\)1\(_{versicolor}\) + 4.09\(\times\)1\(_{virginica}\)

Question 3C: Using the regression equation: What is the predicted petal length for a Setosa flower?

Answer: 1.462

Question 3D: Using the regression equation: What is the predicted petal length for a Versicolor flower?

Answer: 1.462+2.798 = 4.26

Question 3E: Using the regression equation: What is the predicted petal length for a Virginica flower?

Answer: 1.462+4.090 = 5.552

Question 3F: Explain how the regression equation relates to the graphs (i.e. which parts of the graphs do the slopes/intercept correspond to).

Answer: The intercept corresponds to the mean for setosa. The slopes make adjustments to the intercept to give us the means of versicolor and virginica groups.

More on Indicators

Creating indicator variables (or One-Hot Encoding) is not necessary for running the lm() code in the previous examples. The following is a scenario where it could be helpful, and how to perform it using mutate and ifelse functions in R. Remember that ifelse first takes a condition, then performs a specified outcome if the condition is met, and gives the 2nd specified outcome if the condition is not met.

# indicators for regions 1 and 7
HappyPlanet_indicators = Happy %>%
  mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
         Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))

You can use indicators to subset or filter data (though this is not necessarily efficient). Indicators allow us to specify only certain regions to use in a regression. For example, the following code uses all Regions 1 through 7 in the Regions variable. The result is messy.

# regression for LifeExpectancy using Regions as predictors
# names could be a lot better
lm(data = Happy, Happiness~as.factor(Region)+0)

## 
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = Happy)
## 
## Coefficients:
## as.factor(Region)1  as.factor(Region)2  as.factor(Region)3  as.factor(Region)4  
##              6.913               7.554               5.988               4.048  
## as.factor(Region)5  as.factor(Region)6  as.factor(Region)7  
##              5.586               6.317               5.737

Then we can add these variables to a regression like the following.

# regression for Happiness using Life Expectancy and Regions 1 and 2
lm(data = HappyPlanet_indicators, Happiness~LifeExpectancy+Reg1_ind+Reg7_ind)

## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy + Reg1_ind + Reg7_ind, 
##     data = HappyPlanet_indicators)
## 
## Coefficients:
##    (Intercept)  LifeExpectancy       Reg1_ind1       Reg7_ind1  
##        -1.0563          0.1029          0.5492         -0.5215

Question 4: Attempt to interpret the coefficients in the previous regression using Region 1 as an indicator predictor. Writing out the equation for different groups may help.

Answer: For countries in Regions 2-6, the regression equation is predicted Happiness = -1.0563 + .1029 x LE.

The indicator is an intercept adjustment made that tells us for countries in Region 1 we should actually be using the equation predicted Happiness = (-1.0563-.5215) + .1029 x LE = -1.5778 + .1029 x LE. Which means Region 1 has a lower baseline compared to Regions 2-6

Question 5: Using the Iris data, create indicator variables for each of the 3 species. Then create a new linear regression using the indicators to verify we get the exact same coefficients.

iris_ind = iris %>%
  mutate(setosa_ind = ifelse(Species=="setosa",1,0),
         virginica_ind = ifelse(Species=="virginica",1,0),
         versicolor_ind = ifelse(Species=="versicolor",1,0))
head(iris_ind) %>% select(-Species)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width setosa_ind virginica_ind
## 1          5.1         3.5          1.4         0.2          1             0
## 2          4.9         3.0          1.4         0.2          1             0
## 3          4.7         3.2          1.3         0.2          1             0
## 4          4.6         3.1          1.5         0.2          1             0
## 5          5.0         3.6          1.4         0.2          1             0
## 6          5.4         3.9          1.7         0.4          1             0
##   versicolor_ind
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

lm(data=iris_ind, Petal.Length~setosa_ind+virginica_ind+versicolor_ind)

## 
## Call:
## lm(formula = Petal.Length ~ setosa_ind + virginica_ind + versicolor_ind, 
##     data = iris_ind)
## 
## Coefficients:
##    (Intercept)      setosa_ind   virginica_ind  versicolor_ind  
##          4.260          -2.798           1.292              NA

Question 6 (Unemployment)

predicted percent below poverty line = 4.60 + 2.05 * (Unemployment Rate)
For a county with no unemployment, we predict 4.60% of the population to be below the poverty line.
For every 1% increase of unemployment in a county, we predict that 2.05% more of the population will be below the poverty line.
46% of variability in percent below the poverty line of a county is explained using this linear regression with unemployment rate as a predictor.
correlation = 0.678233, make sure to double check positive sign from scatterplot (positive relationship)

sqrt(.46)

## [1] 0.678233

Question 7 (Cherry Trees)

cherry = trees %>% rename(Diameter = Girth)

cherry %>% ggplot(aes(x = Height, y = Volume)) + geom_point()

cherry %>% ggplot(aes(x = Diameter, y = Volume)) + geom_point()

The relationship between height and volume for these trees is moderate, positive, and linear. There is an outlier for large volume.
The relationship between diameter and volume for these trees is very strong, positive, and linear. There is an outlier for large volume and diameter.
Diameter. The relationship is much stronger between volume and this variable.

lm(data = cherry, Volume~Diameter)

## 
## Call:
## lm(formula = Volume ~ Diameter, data = cherry)
## 
## Coefficients:
## (Intercept)     Diameter  
##     -36.943        5.066

Intercept: For a cherry tree with a diameter of 0 we predict a volume of -37 cubic feet. Impossible. Diameter cannot be 0, volume cannot be negative!

Slope: For every 1inch increase in diameter the predicted volume of a tree goes up by 5.066 cubic feet.

-36.943 + 5.066*14

## [1] 33.981

-36.943 + 5.066*25

## [1] 89.707

Only the prediction for a diameter of 14 is reasonable. The prediction for a tree with diameter of 25 is not reasonable as this is extrapolation.