This project is due (submitted to Gradescope) at 2:30pm Wed 4/01. It is meant to be an in-class project.
The goal of this project day is to explore the Happy Planet dataset and put into practice a few topics we have learned (or begun learning) this semester.
Copy the code below to load the 2012 version of this dataset. (Note: the 2012 version is in a much more user-friendly format.)
library(ggplot2)
library(dplyr)
# Read the comma-separated file; the first row holds the column names
data = read.table("https://nfriedrichsen.github.io/data/HappyPlanetIndex.txt", header=TRUE, sep=",")
Happy = data.frame(data)[,-3] # drop redundant 3rd column
Happy$Region = as.factor(Happy$Region) # treat region codes as categories, not numbers
One of the first things we should do when presented with a data set is to try to figure out what information is contained within it. In addition, it can be very helpful to understand why the data set was collected and who gathered the info.
Part A: Browse the following links (here) to get an understanding of the goals for the organization creating this “Happy Planet Index”. Why was the HPI created? What problem is it trying to solve?
Part B: More info about HPI is available on the website’s FAQ section (link here). What does HPI actually measure either mathematically or conceptually?
Part C: In your own words, give at least two critiques of using GDP as a measure of a country's development. This could help us figure out why countries have different HPI values.
Part D: Looking at the Happy Planet data in R, briefly describe some of the other variables included. What constitutes an observation in this data set? What is the sample size of this data set?
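A first look at a data frame usually relies on a handful of base R functions. The sketch below uses a small made-up data frame (`toy`, with invented values and a hypothetical `Country` column) rather than the real data; the same calls should work on `Happy` once it is loaded.

```r
# Toy data with made-up values; it only stands in for the real Happy data frame.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),   # hypothetical column name
  Region  = factor(c(1, 1, 2, 3)),
  HPI     = c(40.1, 55.3, 28.9, 47.0)
)

str(toy)             # variable names, types, and the first few values
names(toy)           # just the variable names
n_obs <- nrow(toy)   # number of rows, i.e. the number of observations
n_obs
head(toy, 3)         # first three rows
```

Each row of a data frame is one observation, so the row count is the sample size.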
We have previously talked about data collection impacting our inference. Big topics were related to random sampling and random assignment (experiments).
Part A: Does this dataset represent a random sample of countries? Why does this matter for how we analyze the data?
Part B: Even if all countries are included, does the data represent all individuals within those countries? Explain.
Part C: What sources of randomness or variability are present in this dataset? Be specific about how each variable (e.g., wellbeing, life expectancy, ecological footprint) is collected.
Part D: Based on your answers above, what types of conclusions can you reasonably make from this dataset? What types of conclusions would be inappropriate?
Part A: The following code creates a histogram of the HPI variable. Describe the distribution (e.g., skewed or symmetric, presence of outliers).
Happy %>% ggplot(aes(x=HPI)) + geom_histogram(color = 'black', fill = 'gray', bins=12)
Part B: Calculate summary statistics for HPI (mean, median, standard deviation). How do these compare to what you observed in the histogram?
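One common way to get several summary statistics at once is `summarise()` from dplyr. The sketch below runs on a made-up vector of values (not the real HPI data), so the numbers are only illustrative.

```r
library(dplyr)

# Made-up values; one large value pulls the mean above the median.
toy <- data.frame(HPI = c(30, 40, 45, 50, 85))

stats <- toy %>%
  summarise(
    mean_HPI   = mean(HPI),
    median_HPI = median(HPI),
    sd_HPI     = sd(HPI)
  )
stats
```

If a column in the real data contains missing values, adding `na.rm = TRUE` inside each function avoids an `NA` result.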
Part C: Identify at least one country with a relatively high or low HPI value and describe how it compares to the rest of the dataset.
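Sorting the rows is a quick way to find the extremes. This sketch assumes a column named `Country` (a hypothetical name; check `names(Happy)` for the real one) and uses made-up values.

```r
library(dplyr)

toy <- data.frame(
  Country = c("A", "B", "C", "D", "E"),        # hypothetical column name
  HPI     = c(40.1, 55.3, 28.9, 47.0, 61.2)    # made-up values
)

lowest  <- toy %>% arrange(HPI) %>% head(2)         # two smallest HPI values
highest <- toy %>% arrange(desc(HPI)) %>% head(2)   # two largest HPI values
lowest
highest
```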
Part D: Based on your analysis, what does a “typical” HPI value look like? Justify the type of measure you use for this.
Next, our goal is to explore the data further and see whether HPI differs across regions.
Part A: The way the Region variable is recorded is not particularly helpful. Based on the dataset, how many regions are there? List the region codes and describe any patterns you notice in which countries fall into each group.
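Because `Region` is stored as a factor, its codes can be listed and counted directly. The sketch below uses a made-up data frame with a hypothetical `Country` column; the real region codes and counts will differ.

```r
library(dplyr)

toy <- data.frame(
  Country = c("A", "B", "C", "D", "E"),   # hypothetical column name
  Region  = factor(c(1, 1, 2, 3, 3))      # made-up region codes
)

region_codes <- levels(toy$Region)    # the distinct codes, as text
n_regions    <- nlevels(toy$Region)   # how many distinct codes there are
region_codes
n_regions

toy %>% count(Region)            # number of countries per region
toy %>% filter(Region == "3")    # countries in one particular region
```

Filtering one region at a time and reading the country names is one way to spot what the codes stand for.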
Part B: Create boxplots of HPI separated by region within one graph (the previous histogram code is a good place to start).
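As a rough sketch of the general pattern, side-by-side boxplots map the grouping variable to `x` and the numeric variable to `y`, then swap `geom_histogram()` for `geom_boxplot()`. The data here are simulated, not the real HPI values.

```r
library(ggplot2)

set.seed(1)  # make the simulated values reproducible
toy <- data.frame(
  Region = factor(rep(1:3, each = 10)),
  HPI    = c(rnorm(10, 40, 5), rnorm(10, 50, 8), rnorm(10, 35, 3))  # simulated
)

# One box per level of Region, all on a shared HPI axis
p <- ggplot(toy, aes(x = Region, y = HPI)) +
  geom_boxplot(color = 'black', fill = 'gray')
p
```

The grouping variable needs to be a factor (or character) for ggplot2 to draw one box per group; a numeric `x` would be treated as a continuous variable.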
Part C: Which regions appear to have higher or lower HPI values?
Part D: Are there any regions with particularly large variability? How can you tell?
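Besides comparing box heights by eye, spread can be computed per group. A minimal sketch on made-up values, using `group_by()` with `summarise()`:

```r
library(dplyr)

toy <- data.frame(
  Region = factor(c(1, 1, 1, 2, 2, 2)),
  HPI    = c(40, 50, 60, 49, 50, 51)   # made-up values
)

spread <- toy %>%
  group_by(Region) %>%
  summarise(
    sd_HPI  = sd(HPI),    # standard deviation within each region
    iqr_HPI = IQR(HPI),   # interquartile range (the height of the box)
    n       = n()         # group size, since small groups give unstable estimates
  )
spread
```

In this toy example the first region is far more spread out than the second, even though both have the same mean.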
Part E: Do any regions have clear outliers? What might explain these?
Part F: Explain whether there is an association between region and HPI.
Based on your analysis, what conclusions can you reasonably make about HPI and regions? What conclusions would be inappropriate or too strong?