Homework 7

Due: Sunday 5/10 at 10pm

This homework is primarily focused on study design, correlation, and regression concepts.

Question 1 – Conceptual (1 pt each)

Part A (Study Design) What do we need to be true of a sample in order to generalize results to a population?
Part B (Study Design) Why does randomization of treatments to experimental units let us make causal conclusions?
Part C What form of relationship does Pearson’s correlation measure?
Part D If two variables have a correlation of r = 0.9, does that mean there must be a linear relationship between them? Explain.
Part E What does the phrase ‘correlation \(\neq\) causation’ mean in your own words?
Part F What form of relationship needs to exist between quantitative variables to perform linear regression?
Part G Explain what extrapolation is and why we should probably try to avoid it.

Question 2 – Correlation Practice

Match each correlation to its corresponding plot. Then for each plot also identify the strength of the relationship and whether fitting a linear model would be reasonable.

Question 3 – Cherry Trees

The dataset below includes information on 31 black cherry trees felled in the Allegheny National Forest, Pennsylvania. For each tree, it includes three variables: diameter (in), height (ft), and volume (cubic ft).

## Cherry tree data
cherry <- read.csv("https://collinn.github.io/data/cherry.csv")

Part A: Create two scatterplots of the data comparing diameter with volume and height with volume, in each case letting volume be the response variable. Based on these plots, which variable do you think would be a better predictor of volume?

Part B: Create two linear models, one for each of the plots created in Part A (that is, with volume as the response variable in both models). Based on the summary() output, which of these models has a higher \(R^2\) value? Is this consistent with what you decided in Part A?
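
As a syntax reminder, here is a minimal sketch of fitting two single-predictor models and pulling \(R^2\) out of summary(). It uses R's built-in mtcars data rather than the cherry data, and the names fit_wt, fit_hp, r2_wt, and r2_hp are illustrative only.

```r
# Illustrative only: uses R's built-in mtcars data, not the cherry data.
# Fit two single-predictor models that share the same response (mpg).
fit_wt <- lm(mpg ~ wt, data = mtcars)
fit_hp <- lm(mpg ~ hp, data = mtcars)

# summary() returns a list; its r.squared element holds R^2.
r2_wt <- summary(fit_wt)$r.squared
r2_hp <- summary(fit_hp)$r.squared

# Print both values side by side for comparison.
print(c(wt = r2_wt, hp = r2_hp))
```

The same pattern should carry over to any data frame once the response and predictor names are swapped in.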

Part C: Using the model with the higher \(R^2\) in Part B, write the linear equation for predicting a tree’s volume. Interpret both the slope and the intercept. Is the intercept meaningful in this case?

Part D: In one or two sentences, describe in general terms what the model in Part C is doing.

Part E: Based on the \(R^2\) value for the model in Part C, do you think a model using both variables will perform much better than the regression model using just one predictor?

Part F: Make a regression model using both predictors. Interpret the coefficients. Do you think the increase in \(R^2\) justifies the added complexity of including another variable?
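
For reference, a model with two predictors is specified by joining the predictors with + in the formula. The sketch below again uses the built-in mtcars data as a stand-in; fit_one, fit_two, r2_one, and r2_two are illustrative names.

```r
# Illustrative only: built-in mtcars data, not the cherry data.
fit_one <- lm(mpg ~ wt, data = mtcars)        # one predictor
fit_two <- lm(mpg ~ wt + hp, data = mtcars)   # two predictors

# Each slope in the two-predictor model describes the change in the
# response per unit of that predictor, holding the other predictor fixed.
print(coef(fit_two))

# Compare R^2 between the smaller and larger model.
r2_one <- summary(fit_one)$r.squared
r2_two <- summary(fit_two)$r.squared
print(c(one = r2_one, two = r2_two))
```

Adding a predictor to a least-squares model cannot lower \(R^2\), so the question is how large the gain is, not whether there is one.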

Question 4 – Cat Regression

This problem uses a dataset of 144 cats. Each observation includes the sex of the cat, its body weight (kg), and its heart weight (g).

## Read in cat data
cats <- read.csv("https://collinn.github.io/data/cats.csv")

Part A: Use lm() to create a linear model in R predicting the weight of a cat’s heart using body weight as an explanatory variable. Write the formula for the regression line in context.

Part B: Interpret the slope in context.

Part C: Interpret the intercept in context. Is the intercept meaningful?

Part D: What is the predicted heart weight of a cat that has a body weight of 3kg?

Part E: What is the residual for a cat with a body weight of 3kg that actually has a heart weight of 12g? Is this an under- or over-prediction?
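
A prediction and its residual can be computed with predict(), as in the minimal sketch below. It uses the built-in faithful data (eruption length and waiting time), and the observed value of 70 is a made-up number chosen only to show the arithmetic.

```r
# Illustrative only: built-in faithful data, not the cat data.
fit <- lm(waiting ~ eruptions, data = faithful)

# predict() needs a data frame whose column name matches the predictor.
new_obs <- data.frame(eruptions = 3)
predicted <- predict(fit, newdata = new_obs)

# Residual = observed - predicted. The observed value here is hypothetical.
observed <- 70
residual <- observed - predicted

# A positive residual means the model under-predicted the observed value;
# a negative residual means it over-predicted.
print(c(predicted = unname(predicted), residual = unname(residual)))
```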

Part F: Create a linear model in R predicting the weight of a cat’s heart using the cat’s sex as an explanatory variable. Write the formula for the regression equation. Interpret the coefficients.
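
When the explanatory variable is categorical with two levels, lm() treats one level as the reference group. The sketch below illustrates this with the built-in ToothGrowth data, where supp has the levels OJ and VC; fit_supp and group_means are illustrative names.

```r
# Illustrative only: built-in ToothGrowth data, not the cat data.
# supp is a two-level factor (OJ, VC); OJ is the reference level.
fit_supp <- lm(len ~ supp, data = ToothGrowth)
print(coef(fit_supp))

# Compare the coefficients with the group means: the intercept matches the
# reference-group mean, and the suppVC coefficient matches the difference
# between the VC mean and the OJ mean.
group_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
print(group_means)
```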

Part G: Create a third linear model, this time including the cat’s sex in addition to body weight to predict heart weight. How do we interpret the intercept in this model?

Part H: Using the model from Part G, what heart weight would you predict for:

a male cat with a body weight of 3.2kg
a female cat with a body weight of 2.4kg
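
Predictions from a model with one quantitative and one categorical predictor need a value for both. Here is a minimal sketch with the built-in ToothGrowth data; fit_both, new_obs, and preds are illustrative names, and the dose values are arbitrary.

```r
# Illustrative only: built-in ToothGrowth data, not the cat data.
fit_both <- lm(len ~ dose + supp, data = ToothGrowth)

# One row per case to predict; column names must match the model's predictors,
# and the factor must use the same levels as the original data.
new_obs <- data.frame(
  dose = c(1, 2),
  supp = factor(c("OJ", "VC"), levels = levels(ToothGrowth$supp))
)
preds <- predict(fit_both, newdata = new_obs)
print(cbind(new_obs, predicted_len = preds))
```

For the reference level (OJ here) the indicator term drops out of the equation; for the other level its coefficient is added to the prediction.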

Question 5

Question 6

Write the equation of the regression line for predicting travel time.
Interpret the slope and the intercept in this context.
Calculate \(R^2\) of the regression line for predicting travel time from distance traveled for the Coast Starlight, and interpret \(R^2\) in the context of the application.
The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to estimate the time it takes for the Starlight to travel between these two cities.
It takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles. Calculate the residual and explain the meaning of this residual value.
Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from Los Angeles. Would it be appropriate to use this linear model to predict the travel time from Los Angeles to this point?
(Challenge) What would happen to the slope coefficient if distance were measured in kilometers instead of miles?

Question 7

If the \(R^2\) for the least-squares regression line for these data is 72%, what is the correlation between lunch and helmet?
Calculate the slope and intercept for the least-squares regression line for these data.
Interpret the intercept of the least-squares regression line in the context of the application.
Interpret the slope of the least-squares regression line in the context of the application.
What would the value of the residual be for a neighborhood where 40% of the children receive reduced-fee lunches and 40% of the bike riders wear helmets? Interpret the meaning of this residual in the context of the application.
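
When a least-squares line has to be computed from summary statistics rather than raw data, the slope is r times the ratio of standard deviations (response over predictor), and the line passes through the point of means. The sketch below checks those formulas against lm() using the built-in faithful data; all variable names are illustrative.

```r
# Illustrative only: built-in faithful data, not the data for this question.
x <- faithful$eruptions
y <- faithful$waiting

# Slope and intercept from summary statistics alone.
r <- cor(x, y)
slope <- r * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

# The same line fitted directly, for comparison.
fit <- lm(waiting ~ eruptions, data = faithful)
print(rbind(by_hand = c(intercept, slope), from_lm = unname(coef(fit))))

# With a single predictor, R^2 is the square of the correlation,
# and the sign of r matches the sign of the slope.
print(c(r_squared = summary(fit)$r.squared, r_times_r = r^2))
```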

Question 8

Describe the relationship between meat consumption and life expectancy.
Why do you think the variables are positively associated?
Is the relationship between meat consumption and life expectancy stronger, similar, or weaker when broken down by income bracket in the separate plots along the bottom (as compared with the relationship when combined in the top left figure)?