“All models are wrong, but some are useful.” - George Box
A statistical model is a mathematical representation of some real-world phenomenon. We may use equations or probability concepts to help us explain how something works.
The basic premise is that we want to simplify a complicated real-world situation into something easier to understand or work with. We may want to see how certain variables are related, calculate the likelihood of rare events, or even make predictions.
Some examples may be useful. A mathematical model can be used to understand the relationship between the time it takes an object to fall, the height it falls from, and the strength of gravity.
\(t = \sqrt{\frac{2h}{g}}\) where t = time to fall, h = height, g = gravity strength.
In reality, this is over-simplified: it ignores air resistance, wind, and the fact that the strength of gravity changes with distance from Earth’s core. Still, it will give us very accurate estimates in nearly all situations we encounter in our lives.
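For example, for an object dropped from a (hypothetical) height of 10 meters near Earth’s surface, where \(g \approx 9.81\) m/s\(^2\), the model gives
\[ t = \sqrt{\frac{2(10)}{9.81}} \approx 1.43 \text{ seconds.} \]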
One of the distinctions between statistical models and mathematical models in general is that statistical models are not deterministic: outcomes can change if we re-run a scenario. Much of what you have seen in 209 consists of various statistical models. I will outline some of them below; obviously not all will be appropriate for the project.
The most basic of statistical models. These simplify scenarios so we can understand the likelihood of particular outcomes of a random process. Probability models specify the possible outcomes and their likelihoods, and they can be used to calculate probabilities of related but more complex scenarios, or to talk about ‘typical’ occurrences and how “swingy” results are (variability).
With our limited information we cannot perfectly predict coin flips, and so they are a random process. A common model for representing this is the Binomial probability distribution.
Let n = the number of independent trials in a process, and let p = the probability of a particular outcome defined as “Success”. The Binomial probability model then lets us calculate probabilities of getting a particular number of “Successes.” Let k be the particular number of “Successes” you are interested in finding the probability for.
\(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \dots, n\) where \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\)
For a fair coin with 100 flips, the probability of getting 50 heads exactly is
\(P(X = 50) = \binom{100}{50}(0.5)^{50}(0.5)^{50}\)
dbinom(50, size=100, prob=0.5)
## [1] 0.07958924
The probability of getting 50 or fewer heads is \(P(X \le 50) = \sum_{k=0}^{50}P(X = k)\).
probability <- 0
for (i in 0:50) {
  probability <- probability + dbinom(i, size = 100, prob = 0.5)
}
probability
## [1] 0.5397946
# shortcut cumulative probability function
pbinom(50, size=100, prob=0.5)
## [1] 0.5397946
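As an aside, the Binomial model also tells us what a ‘typical’ outcome looks like and how swingy results are: the mean is \(np\) and the standard deviation is \(\sqrt{np(1-p)}\). A quick sketch for the 100-flip example:
n <- 100
p <- 0.5
n * p                  # expected number of heads: 50
sqrt(n * p * (1 - p))  # standard deviation: 5 heads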
We may want to use a probability distribution to analyze agricultural examples: for a given farm, what is the probability it gets a particular yield per acre? This could be important because there may be some break-even point below which the farm actually loses money. It would be helpful for farmers to know, based on their costs, the probability they actually lose money in a particular year (not good!).
In order to specify a probability model to use we need to think about how the distribution of corn yields should look. Maybe most farms have yields very close to some average, with a small percentage having really high or really low yields. This could make the Normal distribution reasonable.
Suppose we have information collected that suggests crop yields can be represented as a Normal distribution with mean 182.545 bushels per acre and standard deviation 24.49. Then we can use the \(N(182.545, 24.49)\) distribution to give us probabilities. (values taken from paper here)
A particular farm needs a yield of 170 bushels per acre to break-even (not lose money). What is the probability they lose money?
A different farm needs a yield of 140 bushels per acre to break-even. What is the probability they lose money?
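A farm loses money when its yield falls below its break-even point, so both answers are left-tail probabilities from the Normal model above; in R these can be found with pnorm:
# P(yield < 170) for the first farm
pnorm(170, mean = 182.545, sd = 24.49)   # roughly 0.30
# P(yield < 140) for the second farm
pnorm(140, mean = 182.545, sd = 24.49)   # roughly 0.04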
The farmers in the first example, requiring 170 bpa, have a relatively high chance of losing money (roughly 30%), making this a particularly risky year. In the second example, with 140 bpa required, there is still some risk of losing money (roughly 4%), but it is nowhere near as likely.
A regression model allows us to relate several variables through a prediction equation. Specifically, it shows how explanatory variables influence a quantitative response variable.
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon \]
\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]
Goal: estimate the coefficients using data. The coefficient values tell us about the relationships between the predictor variables and the response. The overall model-fit p-value tells us whether there is a statistically discernible relationship between the set of predictors and the response.
The Normal distribution used for corn yields is greatly simplified and does not allow us to account for particular things at a farm that may factor into that farm’s corn yield for the year. Some things that could affect yield (bushels per acre) are rainfall (mm), fertilizer amount (kg/acre), soil quality (rated 1-10), and temperature (Celsius), among others.
Suppose I wanted to build a multiple regression model to handle this scenario for a population of farm plots. It would look like the following:
\[ \widehat{Yield} = \beta_0 + \beta_1 \times \text{(rain)} + \beta_2 \times \text{(fertilizer)} + \beta_3 \times \text{(soil qual)} + \beta_4 \times \text{(temp)} \]
Once we have data we can use them to estimate the population coefficients. This gives us the following model when fitted to actual data:
\[ \widehat{Yield} = \hat{\beta_0} + \hat{\beta_1} \times \text{(rain)} + \hat{\beta_2} \times \text{(fertilizer)} + \hat{\beta_3} \times \text{(soil qual)} + \hat{\beta_4} \times \text{(temp)} \]
I have some (simulated) data that shows possible regression output below. Note: these yield values do not match the last example. Also, as an aside: it’s really hard to find good examples of this in practice because any alteration to the model completely changes the coefficients.
# Fit a multiple linear regression model
model <- lm(yield ~ rainfall + temp + soil_quality + fertilizer, data = df)

# Add the model's fitted (predicted) values to the data for plotting
df <- df %>% mutate(pred = predict(model))

# Predicted vs observed values
ggplot(df, aes(x = pred, y = yield)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(
    title = "Predicted vs. Observed Corn Yield",
    x = "Predicted yield (bushels per acre)",
    y = "Observed yield (bushels per acre)"
  ) +
  theme_minimal()
# coefficients and p-values
summary(model)
##
## Call:
## lm(formula = yield ~ rainfall + temp + soil_quality + fertilizer,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68211 -0.34000 0.01498 0.37900 1.35843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.526e+02 5.028e-01 303.529 < 2e-16 ***
## rainfall 3.987e-02 3.995e-04 99.807 < 2e-16 ***
## temp -1.121e-02 1.609e-02 -0.697 0.487
## soil_quality 2.070e-01 2.356e-02 8.784 8.1e-16 ***
## fertilizer 3.050e-02 1.446e-03 21.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5518 on 195 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9816
## F-statistic: 2662 on 4 and 195 DF, p-value: < 2.2e-16
Substituting in our coefficient values, our regression model now has the form:
\[ \widehat{Yield} = 152.6 + 0.03987 \times \text{(rain)} + 0.0305 \times \text{(fertilizer)} + 0.207 \times \text{(soil qual)} - 0.0112 \times \text{(temp)} \]
This lets us see how the variables influence the yield. For example, with all predictors at 0, we get a predicted yield of 152.6 bpa. Reasonable? No. For every one mm increase in rainfall, holding the other variables constant, the predicted yield goes up 0.03987 bpa. Temp has a negative relationship with yield according to these data.
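As a quick illustration of using the fitted equation, we could plug in values for a hypothetical farm plot (the predictor values below are made up purely for illustration):
# hypothetical plot: 800 mm rain, 22 C, soil quality 7, 150 kg/acre fertilizer
new_plot <- tibble(rainfall = 800, temp = 22, soil_quality = 7, fertilizer = 150)
predict(model, newdata = new_plot)
# by hand: 152.6 + 0.03987*800 + 0.0305*150 + 0.207*7 - 0.0112*22, about 190 bpa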
The overall model fit has an F-stat of 2662 on 4 and 195 df which is testing:
\(H_0\): \(\beta_i = 0\) for i = 1, \(\dots\), 4
There is overwhelming evidence that at least one of these slope coefficients is non-zero, i.e., that at least one variable among this set is a useful predictor of yields, and so collectively they seem useful. If we look at the individual coefficients, the temp slope is small relative to its standard error and so has a correspondingly large p-value. Temp doesn’t actually seem to matter much.
Our adj-\(R^2\) is very high in this data set (\(\approx 0.982\)), suggesting most of the variation in yields is explained by this regression model using our predictors.
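As a quick check on what \(R^2\) measures: the Multiple R-squared in the output is the squared correlation between the predicted and observed yields, so it can be verified directly using the pred column created above.
# squared correlation between fitted and observed values
cor(df$pred, df$yield)^2   # should match the Multiple R-squared (~0.982)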
We may run into a scenario where we want to classify data into one of two groups. This goes back to working with probabilities like we did in 209. We could predict P(belongs in group A) and P(belongs in group B), but with only two groups each probability is the complement of the other, so we really only need one of them.
Multiple regression does not always work well for probabilities. Since we are making predictions, and we have seen in 209 that predictions can sometimes fall outside the 0 to 1 range (even negative values), this could royally mess up probability predictions.
Instead of predicting probabilities directly, a transformation is used to bound results between 0 and 1. We will use notation to appropriately describe our response as a probability, i.e., \(\widehat{p}\). The following is the transformation applied (the log base is e).
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \implies p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}} \]
Either of these forms can be used with different interpretations of coefficients.
The log of the odds is equal to \(\beta_0\) when all predictors are zero. A one-unit increase in \(X_i\), holding the other predictors constant, increases the log-odds by \(\beta_i\). Equivalently: when all predictors are zero the predicted probability is \(p = \frac{1}{1+e^{-\beta_0}}\). The direct effects of the slope coefficients on p are… much more complicated, as the function is non-linear, and they depend on the current value of p when making incremental changes.
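To illustrate that last point, a small sketch using R’s built-in plogis(), which computes the inverse-logit \(\frac{1}{1+e^{-x}}\): the same one-unit increase in the log-odds changes p by very different amounts depending on where you start.
plogis(0)   # 0.5
plogis(1)   # about 0.73 -- a one-unit log-odds increase raised p by about 0.23
plogis(3)   # about 0.95
plogis(4)   # about 0.98 -- the same one-unit increase raised p by only about 0.03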
Most often we can make a plot that allows for classification. Define Success and Failure events. Since p is the probability of success, we will predict a success for a data point if its predicted p is 0.5 or greater; if less than 0.5 we will predict a failure.
Suppose we are instead trying to predict whether a farm will produce more than 180 bushels per acre. This is now binary. Let p be the probability of success.
The following code runs a logistic regression on the data for the MLR corn yield example.
# Create a 0/1 indicator for whether a plot produced more than 180 bpa
df <- df %>% mutate(high_yield = ifelse(yield > 180, 1, 0))

# Fit the logistic regression model
logit_mod <- glm(high_yield ~ rainfall + temp + soil_quality + fertilizer,
                 data = df, family = binomial)
summary(logit_mod)
##
## Call:
## glm(formula = high_yield ~ rainfall + temp + soil_quality + fertilizer,
## family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -140.95715 39.59454 -3.560 0.000371 ***
## rainfall 0.18936 0.05130 3.691 0.000223 ***
## temp 0.31431 0.30444 1.032 0.301883
## soil_quality 1.07101 0.45928 2.332 0.019706 *
## fertilizer 0.16075 0.05219 3.080 0.002069 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 250.748 on 199 degrees of freedom
## Residual deviance: 19.587 on 195 degrees of freedom
## AIC: 29.587
##
## Number of Fisher Scoring iterations: 10
Again, the coefficients describe changes in the log-odds prediction, not additive changes in the probability itself.
The base probability of success and odds of success for all variables equal to zero are respectively:
# base probability of success
1/(1+exp(-(-140.95715)))
## [1] 6.068587e-62
# base odds of success
exp(-140.95715)
## [1] 6.068587e-62
Overall, which variables are useful matches the MLR from above, as temp still doesn’t seem that important. A positive coefficient indicates that increases in the variable are associated with an increase in the log-odds of getting more than 180 bpa (good!); negative slope coefficients decrease the log-odds (bad!). The rainfall coefficient of 0.18936 tells us that each mm of rainfall increases the log-odds of success by 0.18936, or equivalently the odds of success are multiplied by roughly 1.21 for each mm of rainfall.
exp(0.18936)
## [1] 1.208476
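As a side note, using the 0.5 cutoff rule described earlier, the fitted probabilities could be turned into 0/1 class predictions and compared to the actual outcomes; a minimal sketch (assuming df and logit_mod from above):
# predicted probability of success and 0/1 classification for each plot
df <- df %>%
  mutate(pred_prob = predict(logit_mod, type = "response"),
         pred_class = ifelse(pred_prob >= 0.5, 1, 0))

# cross-tabulate predicted vs actual high-yield status
table(predicted = df$pred_class, actual = df$high_yield)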
The following code makes a graph of the predicted probability of success versus rainfall amount, with the observed outcomes overlaid. The curve is often called a logistic curve and makes a stretched-S pattern. Note we can only plot the probability of success versus one predictor at a time.
# Generate prediction grid for rainfall, holding other vars constant
rain_grid <- tibble(
  rainfall = seq(min(df$rainfall), max(df$rainfall), length.out = 200),
  temp = mean(df$temp),
  soil_quality = mean(df$soil_quality),
  fertilizer = mean(df$fertilizer)
)

# Predicted probabilities
rain_grid <- rain_grid %>%
  mutate(pred_prob = predict(logit_mod, newdata = ., type = "response"))

# Plot observed data and fitted logistic curve
ggplot() +
  geom_point(data = df, aes(x = rainfall, y = high_yield),
             alpha = 0.4, position = position_jitter(height = 0.05), color = "steelblue") +
  geom_line(data = rain_grid, aes(x = rainfall, y = pred_prob),
            color = "firebrick", size = 1.2) +
  labs(
    title = "Predicted Probability of High Corn Yield by Rainfall",
    x = "Rainfall (mm)",
    y = "Predicted Probability of High Yield ( > 180 bpa )"
  ) +
  theme_minimal(base_size = 14)