Happy Planet Data

According to the Happy Planet Index: 2024 Report, “The Happy Planet Index is a measure of sustainable wellbeing, evaluating countries by how efficiently they deliver long, happy lives for their residents using our limited environmental resources.” We are going to use the 2012 data from the Happy Planet Index because the newer data set is a pain to format for this lab. A link to the site is available here.

Make a new RMarkdown document and give it an appropriate name. I only want to see the questions, your answers, and the necessary code for this lab. Use appropriate formatting so your answers are easy to read and distinguish from the questions. Copy the wording of each question into the new document as you answer it.

Copy the code below to read the 2012 Happy Planet data into R. Load the appropriate packages as well.

library(ggplot2)
library(dplyr)

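# Read the comma-separated 2012 Happy Planet data (the first row holds the column names)
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=TRUE, sep=",")
Happy = data.frame(data)
# Treat Region as a categorical variable
Happy$Region = as.factor(Happy$Region)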
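If you want to see what `header=TRUE`, `sep=","`, and `as.factor()` are doing, here is a small sketch on made-up text instead of the real file. The column values are invented for illustration, and it assumes (as the conversion above suggests) that Region is stored as a numeric code.

```r
# Toy comma-separated text standing in for the real file (all values are made up).
toy_text <- "Country,Region,Happiness
A,1,5.5
B,2,6.1
C,1,4.8"

# header = TRUE uses the first row as column names; sep = "," splits on commas.
toy <- read.table(text = toy_text, header = TRUE, sep = ",")

# Treat the region codes as categories rather than numbers.
toy$Region <- as.factor(toy$Region)

str(toy)            # column types: Region should now be listed as a factor
levels(toy$Region)  # the distinct region categories
```

Running `str(Happy)` after loading the real data gives the same kind of summary for the lab's data set.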

We are going to examine the following variables:

Happiness: Larger values indicate greater happiness, health, and well-being of the country’s citizens. Happiness does not have a measurement unit.

Footprint: Ecological Footprint is a measure of the (per capita) ecological impact a country has. Larger values indicate a greater ecological impact.

LifeExpectancy: The average life expectancy (in years) for a country.

GDPperCapita: The GDP (gross domestic product) of a country divided by its population, measured in US dollars.

Region: The region of the world in which the country is located.

Correlation

Predicting `Happiness` with `Footprint`

Suppose we want to answer the question “Is ecological footprint a linear predictor of happiness?” Look at the scatterplot below and answer the following questions.

# Scatterplot of Happiness vs. Footprint
theme_set(theme_bw())
ggplot(Happy, aes(x=Footprint, y=Happiness)) + geom_point()

Question 1: Which is the explanatory variable and which is the response variable? Explain.

Question 2: Describe the relationship between the happiness of a country and its ecological footprint.

Question 3: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.
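As background for this kind of question, keep in mind that Pearson's correlation only measures the strength of a linear relationship. The sketch below uses made-up numbers (not the Happy Planet data) to build a relationship that is perfectly predictable but curved, and then computes its Pearson correlation.

```r
# Made-up data: y is completely determined by x, but the pattern is a curve.
x <- -10:10
y <- x^2

# Pearson's correlation (the default method for cor())
r <- cor(x, y)
r   # essentially zero, even though y depends entirely on x

# A quick base-R scatterplot shows the curve that the correlation misses.
plot(x, y)
```

This is why we look at the scatterplot before trusting a correlation value.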

Correlation Matrix

Now let’s change things up and try to find a variable that does a good job predicting happiness. One way to do this is to look for variables that have a high correlation with Happiness. (This does not always work, but it can be a good place to start.)

Often, when we examine the correlation between many variables at the same time, it helps to arrange them in a table or matrix. The code below computes Pearson’s correlation coefficient for each pair of the variables Happiness, LifeExpectancy, and GDPperCapita. The entry where a row and a column intersect is the correlation between those two variables. The correlation between a variable and itself is always 1, so we can ignore those entries.

Happy %>% select(Happiness, LifeExpectancy, GDPperCapita) %>%
cor(use="complete.obs")

##                Happiness LifeExpectancy GDPperCapita
## Happiness      1.0000000      0.8334278    0.6976830
## LifeExpectancy 0.8334278      1.0000000    0.6662072
## GDPperCapita   0.6976830      0.6662072    1.0000000
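Reading a correlation matrix is just looking up a row and a column. Here is a small sketch of that lookup in code, using the built-in `mtcars` data as a stand-in (the variable names `mpg`, `wt`, and `hp` come from that data set, not from this lab).

```r
# mtcars ships with base R; it stands in for the Happy data here.
cm <- cor(mtcars[, c("mpg", "wt", "hp")], use = "complete.obs")
cm

cm["mpg", "wt"]   # correlation between mpg and wt
cm["wt", "mpg"]   # same value: the matrix is symmetric
diag(cm)          # all 1s: each variable compared with itself
```

The same row and column indexing works on the matrix produced from the Happy data if you save it to a name first.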

Question 4: Of the two other variables in the correlation matrix, which has the stronger correlation with Happiness?

Predicting `Happiness` with `LifeExpectancy`

Let’s go ahead and try to use LifeExpectancy to predict Happiness. Use the scatterplot below and the correlation matrix above to answer the following questions.

# Scatterplot of Happiness vs. LifeExpectancy
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point()

Question 5: Which of these is the explanatory and which is the response variable? Explain.

Question 6: Describe the general relationship between LifeExpectancy and Happiness.

Question 7: Is it appropriate to use Pearson’s correlation to quantify the relationship between these two variables? Explain.

Question 8: State the value of the correlation between LifeExpectancy and Happiness. Interpret the value of this correlation.

Linear Regression

Next, we are going to fit the linear regression line to the Happiness and LifeExpectancy scatterplot we just saw. To do this, we are going to use the lm() function in R, which stands for Linear Model. The function uses the syntax variable1 ~ variable2, which means “predict variable1 using variable2”, and we also need to tell R which data set these variables come from. Take a second to read the code that fits the linear regression line below, then use the output to answer the following questions.

# Linear regression for Happiness vs. LifeExpectancy
fit = lm(data=Happy, Happiness~LifeExpectancy)
fit

## 
## Call:
## lm(formula = Happiness ~ LifeExpectancy, data = Happy)
## 
## Coefficients:
##    (Intercept)  LifeExpectancy  
##        -1.1037          0.1035

# Plot regression on scatterplot
ggplot(Happy, aes(x=LifeExpectancy, y=Happiness)) + geom_point() +
  geom_smooth(method='lm', se=FALSE) + 
  annotate("label", x=55, y=8, label="Predicted Happiness = -1.104 + 0.104*LifeExpectancy")
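The fitted model object holds more than what gets printed. As a rough sketch of how to pull pieces out of it, the code below fits a line on the built-in `mtcars` data (a stand-in, not the lab data) and then computes a prediction and a residual. The names `toy_fit`, `by_hand`, and `by_predict` are made up for this example.

```r
# mtcars is a built-in data set used only for illustration.
toy_fit <- lm(mpg ~ wt, data = mtcars)   # "predict mpg using wt"

b <- coef(toy_fit)   # named vector with entries "(Intercept)" and "wt"
b

# A prediction at wt = 3, done by hand and then with predict()
by_hand <- b["(Intercept)"] + b["wt"] * 3
by_predict <- predict(toy_fit, newdata = data.frame(wt = 3))

# Residual = observed - predicted, shown for the first car in the data
obs <- mtcars$mpg[1]
pred <- predict(toy_fit, newdata = mtcars[1, ])
obs - pred
```

Doing the calculation by hand first and then checking it against predict() is a good habit for the questions below.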

Question 9: State the regression equation using the variable names.

Question 10: What is the value of the slope? Interpret the value of the slope in context.

Question 11: What is the value of the intercept? Would it be appropriate to interpret the y-intercept? If yes, interpret the value of the y-intercept. If not, explain why.

Question 12: What is the predicted happiness for a country that has a life expectancy of 77.9 years? Show your calculation.

Question 13: What is the value of the residual for the United States? Interpret the value of the residual. The values of the US’s LifeExpectancy and Happiness variables are:

Happy %>% filter(Country == "United States of America") %>% select(LifeExpectancy, Happiness)

##   LifeExpectancy Happiness
## 1           77.9       7.9

Question 14: What is the value of the coefficient of determination (R^2) between the happiness of a country and its life expectancy? Interpret the value of R^2 (do not use correlation interpretation).
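If you want to see where R^2 comes from, here is a sketch on the built-in `mtcars` data (again a stand-in, not the lab data). It compares the R^2 reported by summary() with the "proportion of variation explained" calculation and with the squared correlation. The object names are made up for this example.

```r
toy_fit <- lm(mpg ~ wt, data = mtcars)

# R^2 as reported by R
r2 <- summary(toy_fit)$r.squared

# R^2 by hand: 1 - (unexplained variation) / (total variation)
ss_res <- sum(resid(toy_fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_by_hand <- 1 - ss_res / ss_tot

# With a single quantitative predictor, R^2 equals the squared correlation
r_squared <- cor(mtcars$wt, mtcars$mpg)^2

c(r2, r2_by_hand, r_squared)   # all three agree
```

The by-hand version matches the interpretation this question asks for: the share of the variation in the response that the regression line accounts for.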

Linear Regression (Categorical Predictor)

Now we are going to fit a linear regression with a categorical predictor. We will use a separate dataset for this, since the number of categories in the Happy Planet data set is too large to practice these ideas. Instead we will use the Iris dataset that we previously used for Homework 1. This dataset is built into R, so no extra package is needed to load it. The following code shows the first few rows of the dataset.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Predicting petal length with species

We are going to use the Species variable to help us predict values of Petal.Length for the iris flowers. Use the following linear regression results and corresponding graph to answer the following questions.

# Jittered points by species, with each group's mean drawn as a red segment
ggplot(data=iris, aes(x=Species, y=Petal.Length)) + 
  geom_jitter(width=.2) + 
  annotate("segment", x = 0.7, xend = 1.3, y = 1.462, yend = 1.462, 
           color = 'tomato', linewidth = 1) +
  annotate("segment", x = 1.7, xend = 2.3, y = 4.26, yend = 4.26, 
           color = 'tomato', linewidth = 1) +
  annotate("segment", x = 2.7, xend = 3.3, y = 5.552, yend = 5.552, 
           color = 'tomato', linewidth = 1)

lm(Petal.Length ~ Species, data=iris)

## 
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
## 
## Coefficients:
##       (Intercept)  Speciesversicolor   Speciesvirginica  
##             1.462              2.798              4.090
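To see how R handles a categorical predictor behind the scenes, here is a sketch on a different built-in data set, `PlantGrowth` (plant `weight` for the groups `ctrl`, `trt1`, and `trt2`). It is only a stand-in so the iris questions stay yours to answer. The names `pg_fit` and `group_means` are made up for this example.

```r
# PlantGrowth is a built-in data set used only for illustration.
pg_fit <- lm(weight ~ group, data = PlantGrowth)
b <- coef(pg_fit)
b   # "(Intercept)", "grouptrt1", "grouptrt2"

# The 0/1 indicator columns R built from the group variable
head(model.matrix(pg_fit))

# Group means, for comparison with the coefficients
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means

# Intercept = mean of the group with no indicator column;
# intercept + a group's coefficient = that group's mean
b["(Intercept)"] + b["grouptrt1"]
```

Notice that one group never gets its own indicator column. Its mean shows up as the intercept instead.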

Question 15: What are the categories of Species? How many indicator variables could we make with this variable?

Question 16: What is the reference category in the linear regression output?

Question 17: Write the equation for the line in two different ways (see slide 10/18 from Cat. Predictor notes).

Question 18: What is the predicted petal length for a Setosa flower?

Question 19: What is the predicted petal length for a Versicolor flower?

Question 20: What is the predicted petal length for a Virginica flower?

Bonus (boxplots)

We have already seen something very similar to the jitter plot with group means above: the side-by-side boxplots in the Tables lab! With side-by-side boxplots, we usually care about comparing medians. When we use linear regression, we are comparing the means of the groups instead.

ggplot(data=iris, aes(x = Species, y = Petal.Length)) + geom_boxplot()
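To make the median versus mean distinction concrete, here is a short sketch that computes both for each group. It uses the built-in `PlantGrowth` data as a stand-in and base R's tapply() so it runs without loading any packages. The names `group_medians` and `group_means` are made up for this example.

```r
# Median of weight within each group (what the boxplot's middle line shows)
group_medians <- tapply(PlantGrowth$weight, PlantGrowth$group, median)

# Mean of weight within each group (what the regression compares)
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Side by side: the two summaries are close but generally not identical
rbind(median = group_medians, mean = group_means)
```

The two summaries drift apart most when a group is skewed or has outliers.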

Correlation and Regression Lab