library(ggplot2)
library(dplyr)
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=T, sep=",")
Happy = data.frame(data)
Happy$Region = as.factor(Happy$Region)
Happiness with
LifeExpectancyLet’s go ahead and try to use LifeExpectancy to predict
Happiness. Use the scatterplot below and the correlation
matrix above to answer the following questions.
# Scatterplot of Happiness vs. LifeExpectancy
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point()
Question 1A: Which of these is the explanatory and which is the response variable? Explain.
Happiness is the response since that is what we are predicting, life expectancy is the explanatory because we are using it to predict.
Question 1B: Describe the general relationship
between LifeExpectancy and Happiness.
There is a positive moderately strong linear relationship between Life Expectancy and Happiness.
Question 1C: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.
Yes. The relationship is linear, so Pearson’s correlation can be used.
Question 1D: State the value of the correlation
between LifeExpectancy and Happiness.
Interpret the value of this correlation.
The correlation is 0.833. It tells us that the linear relationship between the variables is positive and strong.
Next, we are going to fit the linear regression line to the
Happiness and LifeExpectancy scatterplot we
just saw a second ago. To do this, we are going to use the
lm() function in R. It stands for Linear Model. The syntax
that the function uses is variable1 ~ variable2 which means “predict
variable1 using variable 2”, and then we also need to tell R which data
set these variables are coming from. Take a second to read the code that
makes the linear regression line below, then use the output to answer
the following questions.
# Linear regression for HDI vs Happiness.
fit = lm(data=Happy, Happiness~LifeExpectancy)
fit
##
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = Happy)
##
## Coefficients:
## (Intercept) LifeExpectancy
## -1.1037 0.1035
# Plot regression on scatterplot
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point() +
geom_smooth(method='lm', se=F) +
geom_label(x=55, y=8, label = paste("Predicted Happiness = -1.104 + 0.104*LifeExpectancy"))
Question 2A: State the regression equation using the variable names.
Predicted Happiness = -1.104 + 0.104 \(\times\) LifeExpectancy.
Question 2B: What is the value of the slope? Interpret the value of the slope in context.
The slope value is 0.104. For every 1-year increase in life expectancy, the predicted Happiness increases by 0.104.
Question 2C: What is the value of the intercept? Would it be appropriate to interpret the y-intercept? If yes, interpret the value of the y-intercept. If not, explain why.
The value of the intercept is -1.104. It is not appropriate to interpret because happiness can’t be negative and LifeExpectancy for a country cannot reasonably be zero.
Question 2D: What is the predicted happiness for a country that has a life expectancy of 77.9 years? Show your calculation.
-1.104 + 0.104*(77.9)
## [1] 6.9976
Question 2E: What is the value of the residual for
the United States? Interpret the value of the residual. The value of the
US’s LifeExpectancy and Happiness variables
are:
Happy %>% filter(Country == "United States of America") %>% select(LifeExpectancy, Happiness)
## LifeExpectancy Happiness
## 1 77.9 7.9
# residual: e = y - y_hat = observed - predicted Happiness
7.9 - 6.9976
## [1] 0.9024
We have under-predicted the US’s happiness by 0.9024.
Question 2F: What is the value of the coefficient of determination (R^2) between the happiness of a country and its life expectancy? Interpret the value of R^2 (do not use correlation interpretation).
cor(Happy$Happiness, Happy$LifeExpectancy)^2
## [1] 0.6928597
We can square the correlation value we got earlier to get \(R^2 = 0.693\). 69.3% of variation in Happiness values for countries can be explained with our regression model using Life Expectancy as our predictor.
(The following is going to be adapting the linear regression stuff for a quantitative variable to use with categorical, much of this will be you figuring things out via context)
We are going to use linear regression using a categorical predictor
now. We will use a separate dataset for this, since the number of
categories in the Happy Planet data set is too large to practice these
ideas. Instead we will use the Iris dataset. We are going
to use the Species variable to help us predict values of
Petal.Length for the iris flowers. Use the following linear
regression results and corresponding graph to answer the following
questions.
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesversicolor Speciesvirginica
## 1.462 2.798 4.090
Question 3A: What species does the intercept correspond to in the linear regression output?
Answer: Setosa
Question 3B: Write out the linear regression equation with appropriate notation and context.
Answer: Predicted petal length = 1.462 + 2.789\(\times\)1\(_{versicolor}\) + 4.09\(\times\)1\(_{virginica}\)
Question 3C: Using the regression equation: What is the predicted petal length for a Setosa flower?
Answer: 1.462
Question 3D: Using the regression equation: What is the predicted petal length for a Versicolor flower?
Answer: 1.462+2.798 = 4.26
Question 3E: Using the regression equation: What is the predicted petal length for a Virginica flower?
Answer: 1.462+4.090 = 5.552
Question 3F: Explain how the regression equation relates to the graphs (i.e. which parts of the graphs do the slopes/intercept correspond to).
Answer: The intercept corresponds to the mean for setosa. The slopes make adjustments to the intercept to give us the means of versicolor and virginica groups.
Creating indicator variables (or One-Hot Encoding) is not necessary
for running the lm() code in the previous examples. The
following is a scenario where it could be helpful, and how to perform it
using mutate and ifelse functions in
R. Remember that ifelse first takes a
condition, then performs a specified outcome if the condition is met,
and gives the 2nd specified outcome if the condition is not
met.
# indicators for regions 1 and 7
HappyPlanet_indicators = Happy %>%
mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))
You can use indicators to subset or filter data (though this is not
necessarily efficient). Indicators allow us to specify only certain
regions to use in a regression. For example, the following code uses all
Regions 1 through 7 in the Regions variable. The result is
messy.
# regression for LifeExpectancy using Regions as predictors
# names could be a lot better
lm(data = Happy, Happiness~as.factor(Region)+0)
##
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = Happy)
##
## Coefficients:
## as.factor(Region)1 as.factor(Region)2 as.factor(Region)3 as.factor(Region)4
## 6.913 7.554 5.988 4.048
## as.factor(Region)5 as.factor(Region)6 as.factor(Region)7
## 5.586 6.317 5.737
Then we can add these variables to a regression like the following.
# regression for Happiness using Life Expectancy and Regions 1 and 2
lm(data = HappyPlanet_indicators, Happiness~LifeExpectancy+Reg1_ind+Reg7_ind)
##
## Call:
## lm(formula = Happiness ~ LifeExpectancy + Reg1_ind + Reg7_ind,
## data = HappyPlanet_indicators)
##
## Coefficients:
## (Intercept) LifeExpectancy Reg1_ind1 Reg7_ind1
## -1.0563 0.1029 0.5492 -0.5215
Question 4: Attempt to interpret the coefficients in the previous regression using Region 1 as an indicator predictor. Writing out the equation for different groups may help.
Answer: For countries in Regions 2-6, the regression
equation is predicted Happiness = -1.0563 + .1029 x LE.
The indicator is an intercept adjustment made that tells us for
countries in Region 1 we should actually be using the equation
predicted Happiness = (-1.0563-.5215) + .1029 x LE = -1.5778 + .1029 x LE.
Which means Region 1 has a lower baseline compared to Regions 2-6
Question 5: Using the Iris data, create
indicator variables for each of the 3 species. Then create a new linear
regression using the indicators to verify we get the exact same
coefficients.
iris_ind = iris %>%
mutate(setosa_ind = ifelse(Species=="setosa",1,0),
virginica_ind = ifelse(Species=="virginica",1,0),
versicolor_ind = ifelse(Species=="versicolor",1,0))
head(iris_ind) %>% select(-Species)
## Sepal.Length Sepal.Width Petal.Length Petal.Width setosa_ind virginica_ind
## 1 5.1 3.5 1.4 0.2 1 0
## 2 4.9 3.0 1.4 0.2 1 0
## 3 4.7 3.2 1.3 0.2 1 0
## 4 4.6 3.1 1.5 0.2 1 0
## 5 5.0 3.6 1.4 0.2 1 0
## 6 5.4 3.9 1.7 0.4 1 0
## versicolor_ind
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
lm(data=iris_ind, Petal.Length~setosa_ind+virginica_ind+versicolor_ind)
##
## Call:
## lm(formula = Petal.Length ~ setosa_ind + virginica_ind + versicolor_ind,
## data = iris_ind)
##
## Coefficients:
## (Intercept) setosa_ind virginica_ind versicolor_ind
## 4.260 -2.798 1.292 NA
predicted percent below poverty line = 4.60 + 2.05 * (Unemployment Rate)
For a county with no unemployment, we predict 4.60% of the population to be below the poverty line.
For every 1% increase of unemployment in a county, we predict that 2.05% more of the population will be below the poverty line.
46% of variability in percent below the poverty line of a county is explained using this linear regression with unemployment rate as a predictor.
correlation = 0.678233, make sure to double check positive sign from scatterplot (positive relationship)
sqrt(.46)
## [1] 0.678233
cherry = trees %>% rename(Diameter = Girth)
cherry %>% ggplot(aes(x = Height, y = Volume)) + geom_point()
cherry %>% ggplot(aes(x = Diameter, y = Volume)) + geom_point()
The relationship between height and volume for these trees is moderate, positive, and linear. There is an outlier for large volume.
The relationship between diameter and volume for these trees is very strong, positive, and linear. There is an outlier for large volume and diameter.
Diameter. The relationship is much stronger between volume and this variable.
lm(data = cherry, Volume~Diameter)
##
## Call:
## lm(formula = Volume ~ Diameter, data = cherry)
##
## Coefficients:
## (Intercept) Diameter
## -36.943 5.066
Intercept: For a cherry tree with a diameter of 0 we predict a volume of -37 cubic feet. Impossible. Diameter cannot be 0, volume cannot be negative!
Slope: For every 1inch increase in diameter the predicted volume of a tree goes up by 5.066 cubic feet.
-36.943 + 5.066*14
## [1] 33.981
-36.943 + 5.066*25
## [1] 89.707