Happy Planet Data

According to the Happy Planet Index: 2024 Report, “The Happy Planet Index is a measure of sustainable wellbeing, evaluating countries by how efficiently they deliver long, happy lives for their residents using our limited environmental resources.” For this lab we will use the 2012 Happy Planet Index data instead, because the newer data set is a pain to format. A link to the site is available here.

Make a new RMarkdown document and give it an appropriate name. I only want to see the questions, your answers, and the necessary code for this lab. Copy the wording of each question into the new document as you answer it, and use appropriate formatting so your answers are easy to read and to distinguish from the questions.

Copy the code below to read the 2012 Happy Planet data into R. Load the appropriate packages as well.

library(ggplot2)
library(dplyr)

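# Read the comma-separated file; the first row holds the column names
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=TRUE, sep=",")
Happy = data.frame(data)
# Treat Region as a categorical variable rather than a number
Happy$Region = as.factor(Happy$Region)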

We are going to examine the following variables:

Happiness: Larger values indicate greater happiness, health, and well-being of the country’s citizens. Happiness does not have a measurement unit.

Footprint: Ecological Footprint is a measure of the (per capita) ecological impact a country has. Larger values indicate a greater ecological impact.

LifeExpectancy: The average life expectancy (in years) for a country.

GDPperCapita: The GDP (gross domestic product) of a country divided by its population, measured in US dollars.

Region: Region of the country in the world.


Predicting Happiness with LifeExpectancy

Let’s go ahead and try to use LifeExpectancy to predict Happiness. (Note: Last lab we found this was the variable with the 2nd highest correlation with Happiness. We couldn’t use Ecological Footprint because it had a curved relationship.) Use the scatterplot below and the correlation matrix above to answer the following questions.

# Scatterplot of Happiness vs. LifeExpectancy
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point()
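
If the correlation matrix is not at hand, it can be recomputed with cor(). The sketch below is only an illustration on the built-in mtcars data (an assumption made so it runs without the download); for the lab you would swap in the Happy data frame and its quantitative columns.

```r
# Illustration on the built-in mtcars data, NOT the lab data.
# cor() on two numeric vectors returns Pearson's r by default.
r_xy <- cor(mtcars$wt, mtcars$mpg)
r_xy

# cor() on several numeric columns returns a correlation matrix.
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
round(cor_mat, 2)
```

The entry in row "wt", column "mpg" of the matrix is the same number as r_xy.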

The following questions are mostly review but will start us off well.

Question 1A: Which of these is the explanatory and which is the response variable? Explain.

Question 1B: Describe the general relationship between LifeExpectancy and Happiness.

Question 1C: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.

Question 1D: State the value of the correlation between LifeExpectancy and Happiness. Interpret the value of this correlation.


Linear Regression

Next, we are going to fit the linear regression line to the Happiness and LifeExpectancy scatterplot we just saw. To do this, we are going to use the lm() function in R, which stands for Linear Model. The function uses the syntax variable1 ~ variable2, which means “predict variable1 using variable2”, and we also need to tell R which data set these variables come from. Take a second to read the code that makes the linear regression line below, then use the output to answer the following questions.

# Linear regression predicting Happiness from LifeExpectancy.
fit = lm(Happiness ~ LifeExpectancy, data = Happy)
fit
## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = Happy)
## 
## Coefficients:
##    (Intercept)  LifeExpectancy  
##        -1.1037          0.1035
# Plot regression on scatterplot
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point() +
  geom_smooth(method='lm', se=FALSE) +
  # annotate() draws the label once instead of once per row of the data
  annotate("label", x=55, y=8, label="Predicted Happiness = -1.104 + 0.104*LifeExpectancy")
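
To see the response ~ predictor syntax in isolation, here is a minimal sketch on the built-in mtcars data (an illustration only, not part of the lab data). The names demo_fit, b0, and b1 are made up for this example.

```r
# Illustration on the built-in mtcars data, NOT the lab data.
# Read the formula as "predict mpg using wt".
demo_fit <- lm(mpg ~ wt, data = mtcars)

# coef() returns a named vector: the intercept first, then the slope.
coef(demo_fit)
b0 <- coef(demo_fit)[["(Intercept)"]]  # intercept
b1 <- coef(demo_fit)[["wt"]]           # slope for wt
```

The slope is always named after the predictor on the right-hand side of the ~.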

Question 2A: State the regression equation using the variable names.

Question 2B: What is the value of the slope? Interpret the value of the slope in context.

Question 2C: What is the value of the intercept? Would it be appropriate to interpret the y-intercept? If yes, interpret the value of the y-intercept. If not, explain why.

Question 2D: What is the predicted happiness for a country that has a life expectancy of 77.9 years? Show your calculation.

Question 2E: What is the value of the residual for the United States? Interpret the value of the residual. The values of the US’s LifeExpectancy and Happiness variables are:

Happy %>% filter(Country == "United States of America") %>% select(LifeExpectancy, Happiness)
##   LifeExpectancy Happiness
## 1           77.9       7.9
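
As a reminder of the mechanics, a prediction plugs a value into the regression equation, and a residual is observed minus predicted. The sketch below checks both on the built-in mtcars data (an illustration only; the value wt = 3 and the names demo_fit, new_car, pred, by_hand, and res_first are made up for this example).

```r
# Illustration on the built-in mtcars data, NOT the lab data.
demo_fit <- lm(mpg ~ wt, data = mtcars)

# Prediction for a hypothetical car with wt = 3, two ways.
new_car <- data.frame(wt = 3)
pred    <- predict(demo_fit, newdata = new_car)
by_hand <- coef(demo_fit)[["(Intercept)"]] + coef(demo_fit)[["wt"]] * 3

# Residual for the first car in the data: observed minus predicted.
res_first <- mtcars$mpg[1] - fitted(demo_fit)[[1]]

all.equal(unname(pred), by_hand)                # both routes agree
all.equal(res_first, residuals(demo_fit)[[1]])  # matches residuals()
```

A positive residual means the observed value sits above the regression line, and a negative residual means it sits below.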

Question 2F: What is the value of the coefficient of determination (\(R^2\)) between the happiness of a country and its life expectancy? Interpret the value of \(R^2\) (you may use the relationship between \(r\) and \(R^2\) to find it, but do not give a correlation interpretation).
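
For a regression with a single quantitative predictor, \(R^2\) is the square of the correlation \(r\). A short sketch that checks this on the built-in mtcars data (an illustration only, not the lab data):

```r
# Illustration on the built-in mtcars data, NOT the lab data.
demo_fit <- lm(mpg ~ wt, data = mtcars)

r         <- cor(mtcars$wt, mtcars$mpg)   # Pearson correlation
r_squared <- summary(demo_fit)$r.squared  # R^2 reported by lm()

all.equal(r^2, r_squared)  # TRUE: the two agree
```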


Linear Regression (Categorical Predictor)

(This section adapts what we did with a quantitative predictor to a categorical one. You will need to figure out much of it from context.)

We are now going to fit a linear regression with a categorical predictor. We will use a separate dataset for this, since the Region variable in the Happy Planet data has too many categories to practice these ideas. Instead we will use the built-in iris dataset, with the Species variable helping us predict values of Petal.Length for the iris flowers. Use the following linear regression results and corresponding graph to answer the following questions.

## 
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
## 
## Coefficients:
##       (Intercept)  Speciesversicolor   Speciesvirginica  
##             1.462              2.798              4.090
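
The call that produced this output is not shown above, but the Call line implies something like the following. The name iris_fit is made up for this sketch; iris ships with R, so nothing needs to be downloaded.

```r
# Fit Petal.Length using the categorical predictor Species.
# R builds the indicator variables for Species automatically.
iris_fit <- lm(Petal.Length ~ Species, data = iris)
iris_fit
```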

Question 3A: What species does the intercept correspond to in the linear regression output?

Question 3B: Write out the linear regression equation with appropriate notation and context.

Question 3C: Using the regression equation: What is the predicted petal length for a Setosa flower?

Question 3D: Using the regression equation: What is the predicted petal length for a Versicolor flower?

Question 3E: Using the regression equation: What is the predicted petal length for a Virginica flower?

Question 3F: Explain how the regression equation relates to the graphs (i.e. which parts of the graphs do the slopes/intercept correspond to).

More on Indicators

Creating indicator variables (or One-Hot Encoding) is not necessary for running the lm() code in the previous examples. The following is a scenario where it could be helpful, and how to perform it using mutate and ifelse functions in R. Remember that ifelse first takes a condition, then performs a specified outcome if the condition is met, and gives the 2nd specified outcome if the condition is not met.

# indicators for regions 1 and 7
HappyPlanet_indicators = Happy %>% 
  mutate(Reg1_ind = as.factor(ifelse(Region == 1, 1, 0)),
         Reg7_ind = as.factor(ifelse(Region == 7, 1, 0)))
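
To see what ifelse() does on its own, here is a tiny sketch with a made-up vector of region codes (the values and the names region and reg1_ind are hypothetical, not taken from the Happy Planet data).

```r
# Hypothetical region codes, for illustration only.
region <- c(1, 3, 7, 1, 5)

# 1 where the condition is TRUE, 0 where it is FALSE.
reg1_ind <- ifelse(region == 1, 1, 0)
reg1_ind
# → 1 0 0 1 0
```

Wrapping the result in as.factor(), as in the lab code, tells R to treat the 0/1 values as categories rather than numbers.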

You can use indicators to subset or filter data (though this is not necessarily efficient), and they let us pick out only certain regions to use in a regression. For comparison, the following code uses all seven levels of the Region variable at once. The + 0 in the formula removes the intercept, so each coefficient is the average Happiness for that region. The result is messy, but it gives us the averages for each region.

# regression for Happiness using Region as the predictor (no intercept)
# names could be a lot better
lm(data = Happy, Happiness~as.factor(Region)+0)
## 
## Call:
## lm(formula = Happiness ~ as.factor(Region) + 0, data = Happy)
## 
## Coefficients:
## as.factor(Region)1  as.factor(Region)2  as.factor(Region)3  as.factor(Region)4  
##              6.913               7.554               5.988               4.048  
## as.factor(Region)5  as.factor(Region)6  as.factor(Region)7  
##              5.586               6.317               5.737
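
The claim that a no-intercept regression on a single categorical predictor returns the group averages can be checked directly. A minimal sketch on the built-in mtcars data (an illustration only; the names no_int and group_means are made up for this example):

```r
# Illustration on the built-in mtcars data, NOT the lab data.
# + 0 drops the intercept, leaving one coefficient per level of cyl.
no_int <- lm(mpg ~ factor(cyl) + 0, data = mtcars)

# Average mpg within each cyl group, computed directly.
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

all.equal(as.numeric(coef(no_int)), as.numeric(group_means))  # TRUE
```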

We can also add one of the indicator variables we created to a regression, like the following.

# regression for Happiness using Life Expectancy and Region 1
lm(data = HappyPlanet_indicators, Happiness~LifeExpectancy+Reg1_ind)
## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy + Reg1_ind, data = HappyPlanet_indicators)
## 
## Coefficients:
##    (Intercept)  LifeExpectancy       Reg1_ind1  
##       -0.94441         0.09948         0.68509

Question 4: Attempt to interpret the coefficients in the previous regression using Region 1 as an indicator predictor. Writing out the equation for different groups may help.

Question 5: Using the Iris data, create indicator variables for each of the 3 species. Then create a new linear regression using the indicators to verify we get the exact same coefficients as in our previous Iris regression.


Question 6 (Unemployment)

Question 7 (Cherry Trees)

The following code loads in a dataset corresponding to measurements made on 31 black cherry trees that were chopped down. Units: diameter is in inches, volume is in cubic feet, height is in feet.

cherry = trees %>% rename(Diameter = Girth)
head(cherry)
##   Diameter Height Volume
## 1      8.3     70   10.3
## 2      8.6     65   10.3
## 3      8.8     63   10.2
## 4     10.5     72   16.4
## 5     10.7     81   18.8
## 6     10.8     83   19.7

Question 7A: Describe the relationship between volume and height of these trees.

Question 7B: Describe the relationship between volume and diameter of these trees.

Question 7C: Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

Question 7D: Create the linear regression to predict tree volume using the variable you identified in Question 7C as the predictor. Interpret the coefficients.

Question 7E: Predict the volume of cherry trees with diameters of 14 inches and 25 inches. Are both of these predictions reasonable to make?