The following is based on the article “To Explain or to Predict?” published in 2010 by statistician Galit Shmueli. A link to the article is here.
Abstract from the Article. Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.
The statistical models that we are learning about in class can actually serve two very different purposes. One purpose is explanation (understanding the underlying reasons a relationship exists). The other is prediction (forecasting what will happen for new or future observations). These two goals sound similar, but they lead to very different strategies, even though they often use the same tools like regression or ANOVA.
Explanatory modeling is driven by questions about why something happens. The focus is on identifying meaningful relationships that help us understand a process, behavior, or mechanism. For example, if we want to understand why some students perform better on exams, we might build a model to test whether study time, class attendance, or sleep patterns are associated with performance. The quality of an explanatory model is judged by how well the estimated coefficients match theoretical expectations, whether they reveal meaningful relationships, or whether the statistical assumptions (such as unbiasedness and correct model form) are justified. An explanatory model is useful when it improves our scientific understanding of the thing we are studying.
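To make the explanatory mindset concrete, here is a minimal sketch in Python with NumPy. The data are synthetic (the variable names and coefficient values are invented for illustration): we generate exam scores from known “true” effects of study and sleep, fit a linear model, and then inspect whether the estimated coefficients recover the relationships we built in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic "truth": exam score depends on study hours and sleep hours.
study = rng.uniform(0, 20, n)
sleep = rng.uniform(4, 9, n)
score = 50 + 1.5 * study + 2.0 * sleep + rng.normal(0, 5, n)

# Fit a linear model by ordinary least squares.
X = np.column_stack([np.ones(n), study, sleep])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)

# In the explanatory mindset we examine the coefficients themselves:
# do their signs and magnitudes match the theory (here, the values
# we used to generate the data)?
print(coef)  # estimates should be close to [50, 1.5, 2.0]
```

The point of the exercise is that success is judged by the coefficients, not by any forecast: an explanatory analyst asks whether the estimates are unbiased and theoretically sensible.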
Predictive modeling, in contrast, focuses on using available data to make accurate guesses about new or unseen cases. A predictive model does not need to uncover the “true cause” of an outcome. A prediction problem might investigate: given a student’s study habits and coursework, what score are they likely to earn on the final exam? The emphasis shifts from interpreting coefficients to minimizing prediction errors. A model that predicts well may include variables with no causal meaning, may drop variables that are theoretically important but statistically unstable, or may use techniques that prioritize accuracy over interpretability.
The same statistical tools, such as linear regression or ANOVA, can be used for either explanation or prediction, but the user must be clear about the goal from the start. In explanatory work, researchers carefully justify which predictors belong in the model based on theory or prior studies. In predictive work, the same researcher might use methods that check how well the model performs on data not used to fit it. One simple approach is to set aside part of the dataset and only test the model on this unused portion (a “holdout sample”). Another approach, called “cross-validation,” repeatedly splits the data into training parts and testing parts to see how consistently the model performs on new information. These methods help evaluate a model’s ability to make accurate predictions, which is the main goal of predictive modeling. Even though researchers may use the same dataset and the same regression tool, the decisions they make and how they judge the model can differ depending on whether the goal is explanation or prediction.
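Both checks are easy to sketch by hand. The following Python example (synthetic data; the split sizes and fold count are arbitrary illustrative choices) fits a simple regression on a training portion, then measures mean squared prediction error on a holdout sample and via 5-fold cross-validation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
x = rng.uniform(0, 20, n)
y = 50 + 1.5 * x + rng.normal(0, 5, n)

def fit(x, y):
    # Ordinary least squares for y = b0 + b1 * x.
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def mse(b, x, y):
    # Mean squared prediction error.
    return np.mean((y - (b[0] + b[1] * x)) ** 2)

# Holdout: fit on 70% of the data, evaluate on the untouched 30%.
idx = rng.permutation(n)
train, test = idx[:105], idx[105:]
b = fit(x[train], y[train])
holdout_mse = mse(b, x[test], y[test])

# 5-fold cross-validation: each fold takes one turn as the test set,
# and the model is refit each time on the remaining folds.
folds = np.array_split(idx, 5)
cv_mse = np.mean([
    mse(fit(np.delete(x, f), np.delete(y, f)), x[f], y[f])
    for f in folds
])
print(holdout_mse, cv_mse)
```

Because the noise standard deviation here is 5, both error estimates should land near the noise variance of 25; the key idea is that neither number uses any observation for both fitting and testing.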
Another key idea from the article is that a model that explains well does not necessarily predict well, and a model that predicts well may be poor for explanation. For example, a theoretically important variable may have a small effect in the sample, making it statistically insignificant; an explanatory model might keep it, while a predictive model might remove it to improve accuracy. A predictive model might use interaction terms or transformations that help forecasting but are difficult to interpret causally. These trade-offs illustrate why mixing the goals can create confusion or poor analysis. Sometimes researchers build models intended for explanation but evaluate them with predictive metrics, or they treat predictive accuracy as evidence of causation. Shmueli argues that this is a common source of misunderstanding in statistical practice.
This distinction matters for students learning regression and ANOVA, and in class we have actually explored both mindsets. When we fit a regression model, you can interpret the coefficients to understand how one variable relates to another (this is an explanatory mindset). But if your goal were to predict values outside of the dataset, you would approach the same model differently; you might worry less about the meaning of each coefficient and instead focus on the model’s overall predictive performance (such as prediction error on new data, rather than only in-sample measures like \(R^2\) or residual plots). Being able to tell whether a model is built “to explain” or “to predict” helps clarify what the results mean, how to judge the model’s success, and how to communicate its conclusions responsibly.
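The gap between in-sample fit and predictive fit can be seen directly. This sketch (again with invented synthetic data) computes \(R^2\) on the data used to fit the model and on a holdout portion; the in-sample value is typically the more optimistic of the two.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 3, n)

# Fit on the first 70 observations, hold out the last 30.
X = np.column_stack([np.ones(70), x[:70]])
b, *_ = np.linalg.lstsq(X, y[:70], rcond=None)

def r2(b, x, y):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    resid = y - (b[0] + b[1] * x)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

r2_in = r2(b, x[:70], y[:70])    # fit on these points
r2_out = r2(b, x[70:], y[70:])   # never seen during fitting
print(r2_in, r2_out)
```

For any single random sample the holdout \(R^2\) can come out higher or lower, but on average it is lower, which is exactly why predictive modeling insists on evaluating with data the model has not seen.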