According to the Happy Planet Index: 2024 Report, “The Happy Planet Index is a measure of sustainable wellbeing, evaluating countries by how efficiently they deliver long, happy lives for their residents using our limited environmental resources.” We are actually going to use the 2012 data from Happy Planet Index for formatting reasons (it’s a pain to format the data set for this lab). A link to the site is available here.
Make a new RMarkdown document and give it an appropriate name. I only want to see your questions and necessary code for this lab. Use appropriate formatting so your answers are easy to read and easy to distinguish from the questions. Copy the wording of each question into the new document as you answer it.
Copy the code below to read the 2012 Happy Planet data into R. Load the appropriate packages as well.
library(ggplot2)
library(dplyr)
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=TRUE, sep=",")
Happy = data.frame(data)
Happy$Region = as.factor(Happy$Region)
We are going to examine the following variables:
Happiness: Larger values indicate greater happiness, health, and well-being of the country’s citizens. Happiness does not have a measurement unit.
Footprint: Ecological Footprint is a measure of the (per capita) ecological impact a country has. Larger values indicate a greater ecological impact.
LifeExpectancy: The average life expectancy (in years) for a country.
GDPperCapita: The GDP (gross domestic product) of a country divided by its population, measured in US dollars.
Region: Region of the country in the world.
Happiness with LifeExpectancy
Let’s go ahead and try to use LifeExpectancy to predict Happiness. (Note: Last lab we found that this was the variable with the 2nd highest correlation with Happiness; we couldn’t use Ecological Footprint because it had a curved relationship.) Use the scatterplot below and the correlation matrix above to answer the following questions.
# Scatterplot of Happiness vs. LifeExpectancy
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point()
The following questions are mostly review but will start us off well.
Question 1A: Which of these is the explanatory and which is the response variable? Explain.
Question 1B: Describe the general relationship between LifeExpectancy and Happiness.
Question 1C: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.
Question 1D: State the value of the correlation between LifeExpectancy and Happiness. Interpret the value of this correlation.
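A correlation matrix like the one referenced above can be produced with `cor()`. The sketch below uses R's built-in `mtcars` data instead of the Happy Planet data, purely as an assumed stand-in so that it runs without downloading anything; the choice of columns is illustrative only.

```r
# Illustrative only: mtcars is a built-in data set, not part of this lab.
cars_subset = mtcars[, c("mpg", "wt", "hp")]

# Pearson correlation matrix for every pair of columns
cor_matrix = cor(cars_subset)
round(cor_matrix, 3)

# Correlation between one specific pair of variables
r_mpg_wt = cor(mtcars$wt, mtcars$mpg)
r_mpg_wt
```

The single-pair value matches the corresponding entry of the matrix, so either form can be used to look up a correlation.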
Next, we are going to fit the linear regression line to the Happiness and LifeExpectancy scatterplot we just looked at. To do this, we are going to use the lm() function in R; lm stands for Linear Model. The function takes a formula of the form variable1 ~ variable2, which means “predict variable1 using variable2”, and we also need to tell R which data set these variables come from. Take a moment to read the code below that fits the linear regression line, then use the output to answer the following questions.
# Linear regression for Happiness vs. LifeExpectancy
fit = lm(data=Happy, Happiness~LifeExpectancy)
fit
##
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = Happy)
##
## Coefficients:
## (Intercept) LifeExpectancy
## -1.1037 0.1035
# Plot the regression line on the scatterplot
ggplot(Happy, aes(x = LifeExpectancy, y = Happiness)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # annotate() draws the label once instead of once per data row
  annotate("label", x = 55, y = 8,
           label = "Predicted Happiness = -1.104 + 0.104*LifeExpectancy")
Question 2A: State the regression equation using the variable names.
Question 2B: What is the value of the slope? Interpret the value of the slope in context.
Question 2C: What is the value of the intercept? Would it be appropriate to interpret the y-intercept? If yes, interpret the value of the y-intercept. If not, explain why.
Question 2D: What is the predicted happiness for a country that has a life expectancy of 77.9 years? Show your calculation.
Question 2E: What is the value of the residual for the United States? Interpret the value of the residual. The values of the US’s LifeExpectancy and Happiness variables are:
Happy %>% filter(Country == "United States of America") %>% select(LifeExpectancy, Happiness)
## LifeExpectancy Happiness
## 1 77.9 7.9
Question 2F: What is the value of the coefficient of determination (\(R^2\)) between the happiness of a country and its life expectancy? Interpret the value of \(R^2\) (you may use the relationship between \(r\) and \(R^2\), but do not give the correlation interpretation).
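The kinds of quantities asked about in Question 2 can all be read off a fitted `lm` object. As a hedged sketch, the code below fits a regression on the built-in `mtcars` data (an assumption chosen only so the example is self-contained; it is not the lab data) and extracts the coefficients, a prediction, a residual, and \(R^2\). The names `car_fit`, `pred_by_hand`, and so on are made up for illustration.

```r
# Illustrative only: mtcars stands in for the lab data.
car_fit = lm(mpg ~ wt, data = mtcars)

# Intercept and slope as a named vector
b = coef(car_fit)
b

# Prediction by hand for a car with wt = 3 ...
pred_by_hand = unname(b["(Intercept)"] + b["wt"] * 3)
# ... and the same prediction using predict()
pred_fn = unname(predict(car_fit, newdata = data.frame(wt = 3)))

# Residual = observed - predicted, shown for the first row of mtcars
resid_first = mtcars$mpg[1] - unname(fitted(car_fit)[1])

# R^2 from the model summary; in simple linear regression it equals r^2
r_squared = summary(car_fit)$r.squared
r_from_cor = cor(mtcars$wt, mtcars$mpg)
c(r_squared, r_from_cor^2)
```

Doing the prediction both by hand and with `predict()` is a useful check that the regression equation has been written down correctly.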
(In this part we adapt linear regression, which so far has used a quantitative predictor, to work with a categorical predictor. Much of this will involve figuring things out from context.)
We are now going to fit a linear regression with a categorical predictor. We will use a separate data set for this, since the number of categories in the Happy Planet data set is too large for practicing these ideas. Instead we will use the built-in iris data set. We are going to use the Species variable to help us predict values of Petal.Length for the iris flowers. Use the following linear regression results and corresponding graph to answer the following questions.
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesversicolor Speciesvirginica
## 1.462 2.798 4.090
Question 3A: What species does the intercept correspond to in the linear regression output?
Question 3B: Write out the linear regression equation with appropriate notation and context.
Question 3C: Using the regression equation: What is the predicted petal length for a Setosa flower?
Question 3D: Using the regression equation: What is the predicted petal length for a Versicolor flower?
Question 3E: Using the regression equation: What is the predicted petal length for a Virginica flower?
Question 3F: Explain how the regression equation relates to the graphs (i.e. which parts of the graphs do the slopes/intercept correspond to).
Creating indicator variables (also called one-hot encoding) is not necessary for running the lm() code in the previous examples. The following is a scenario where it could be helpful, along with how to do it using the mutate and ifelse functions in R. Remember that ifelse first takes a condition, then returns the first specified value where the condition is met and the second specified value where it is not.
# indicators for regions 1 and 7
HappyPlanet_indicators = Happy %>%
  mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
         Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))
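To see what `ifelse` returns on its own, here is a minimal sketch on a made-up vector. The name `region_codes` and its values are hypothetical and are not taken from the lab data.

```r
# Hypothetical region codes, invented for illustration
region_codes = c(1, 7, 3, 1, 2)

# 1 where the condition is TRUE, 0 where it is FALSE
reg1_ind = ifelse(region_codes == 1, 1, 0)
reg1_ind
## [1] 1 0 0 1 0

# Wrapping in as.factor() makes R treat the indicator as categorical
reg1_factor = as.factor(reg1_ind)
levels(reg1_factor)
## [1] "0" "1"
```

Inside `mutate`, the same call is applied to a whole column at once, which is how the indicator columns are built.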
You can use indicators to subset or filter data (though this is not necessarily efficient). Indicators also allow us to specify only certain regions to use in a regression. For contrast, the following code uses all seven levels (1 through 7) of the Region variable at once. The + 0 in the formula removes the intercept, so each coefficient is the average Happiness for one region. The output is messy, but it gives us the averages for each region.
# regression for Happiness using Region as the predictor (no intercept)
# names could be a lot better
lm(data = Happy, Happiness~as.factor(Region)+0)
##
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = Happy)
##
## Coefficients:
## as.factor(Region)1  as.factor(Region)2  as.factor(Region)3  as.factor(Region)4
##              6.913               7.554               5.988               4.048
## as.factor(Region)5  as.factor(Region)6  as.factor(Region)7
##              5.586               6.317               5.737
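The statement that a no-intercept regression on a single factor returns group averages can be checked directly. The sketch below does so on the built-in `PlantGrowth` data (an assumed stand-in, not the lab data), comparing the coefficients with group means computed by `tapply`. The object names are made up for illustration.

```r
# Illustrative only: PlantGrowth is a built-in data set
# with a 3-level factor called group.

# "+ 0" drops the intercept, so each level gets its own coefficient
no_int_fit = lm(weight ~ group + 0, data = PlantGrowth)
coef(no_int_fit)

# Group means computed directly
group_means = tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means

# The two sets of numbers agree (names differ, so compare values only)
all.equal(unname(coef(no_int_fit)), as.vector(group_means))
```

The same comparison could be run on the Happy Planet data by swapping in Happiness and Region.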
We can then add an indicator variable to a regression alongside a quantitative predictor, as in the following.
# regression for Happiness using Life Expectancy and Region 1
lm(data = HappyPlanet_indicators, Happiness~LifeExpectancy+Reg1_ind)
##
## Call:
## lm(formula = Happiness ~ LifeExpectancy + Reg1_ind, data = HappyPlanet_indicators)
##
## Coefficients:
## (Intercept) LifeExpectancy Reg1_ind1
## -0.94441 0.09948 0.68509
Question 4: Attempt to interpret the coefficients in the previous regression using Region 1 as an indicator predictor. Writing out the equation for different groups may help.
Question 5: Using the Iris data, create
indicator variables for each of the 3 species. Then create a new linear
regression using the indicators to verify we get the exact same
coefficients as in our previous Iris regression.
The following code loads in a dataset corresponding to measurements made on 31 black cherry trees that were chopped down. Units: diameter is in inches, volume is in cubic feet, height is in feet.
cherry = trees %>% rename(Diameter = Girth)
head(cherry)
## Diameter Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
Question 7A: Describe the relationship between volume and height of these trees.
Question 7B: Describe the relationship between volume and diameter of these trees.
Question 7C: Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.
Question 7D: Create the linear regression to predict tree volume using the variable you identified in Question 7C as the predictor. Interpret the coefficients.
Question 7E: Predict the volume of cherry trees with diameters of 14 inches and 25 inches. Are both of these predictions reasonable to make?