OLS regression of a variable Y on X is likely to yield a SPURIOUS statistically significant result with high R-squared
Running an OLS regression of a variable Y on X is likely to yield a spurious statistically significant result with a high R-squared if Y and X are both non-stationary and are not causally related. The two series can look highly correlated only because of an underlying trend or common factor.
Here’s why:
- Non-stationarity: If both Y and X are non-stationary, each may trend or wander over time (for example, as a random walk or around a deterministic trend). Regressing one such variable on the other can produce a statistically significant coefficient even when there is no real connection between them. The usual t- and F-statistics no longer follow their standard distributions in this setting, so they reject the null hypothesis of no relationship far too often.
- Spurious correlation: Because both variables drift or trend over time, they can appear strongly related in the sample. The apparent relationship reflects their shared time dependence, not causality.
- High R-squared: In such cases the regression can show a high R-squared, suggesting that X explains a large share of the variation in Y, even though the relationship is meaningless from a causal perspective.
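The points above can be illustrated with a small simulation. The following is a minimal sketch, assuming NumPy is available: it generates pairs of independent random walks, regresses one on the other, and records how often the slope looks significant at the usual 5% level. The function names `ols_slope_stats` and `simulate_spurious`, and the sample sizes, are illustrative choices rather than part of any standard library.

```python
import numpy as np

def ols_slope_stats(x, y):
    """Regress y on a constant and x; return (slope t-statistic, R-squared)."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance estimate
    slope_se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1] / slope_se, r_squared

def simulate_spurious(n_obs=200, n_sims=500, seed=0):
    """Return (share of runs with |t| > 1.96, mean R-squared)
    for pairs of independent random walks."""
    rng = np.random.default_rng(seed)
    t_stats, r2_values = [], []
    for _ in range(n_sims):
        # Two random walks built from unrelated shocks: no true relationship.
        x = np.cumsum(rng.standard_normal(n_obs))
        y = np.cumsum(rng.standard_normal(n_obs))
        t_stat, r2 = ols_slope_stats(x, y)
        t_stats.append(t_stat)
        r2_values.append(r2)
    rejection_rate = float(np.mean(np.abs(t_stats) > 1.96))
    return rejection_rate, float(np.mean(r2_values))

if __name__ == "__main__":
    rate, mean_r2 = simulate_spurious()
    print(f"share of 'significant' slopes: {rate:.2f}")
    print(f"mean R-squared: {mean_r2:.2f}")
```

If the standard t-test were valid here, roughly 5% of the runs would reject. With independent random walks the share should come out far higher, even though X has no effect on Y by construction.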
Example:
- Suppose Y and X are series such as stock prices or GDP measured over time. Both may trend upward because of general economic growth, so a regression of one on the other can show a high R-squared. The relationship is spurious when no causal mechanism links X and Y directly.
To avoid this problem, first test each series for stationarity (for example, with the Augmented Dickey-Fuller test). If the series are non-stationary, either difference them before running the regression or test for cointegration, which checks whether the non-stationary series share a genuine long-run relationship.
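The differencing remedy can be sketched in the same way. This is a rough illustration, assuming NumPy and using hypothetical helper names (`slope_t_stat`, `compare_levels_and_differences`). It regresses independent random walks first in levels and then in first differences, and compares how often each regression reports a significant slope. It does not run a formal unit-root or cointegration test.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic of the slope from an OLS regression of y on a constant and x."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    slope_se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / slope_se

def compare_levels_and_differences(n_obs=200, n_sims=500, seed=0):
    """Return (rejection rate in levels, rejection rate in first differences)
    at |t| > 1.96 for pairs of independent random walks."""
    rng = np.random.default_rng(seed)
    reject_levels = 0
    reject_diffs = 0
    for _ in range(n_sims):
        x = np.cumsum(rng.standard_normal(n_obs))
        y = np.cumsum(rng.standard_normal(n_obs))
        # Levels: both series are non-stationary.
        if abs(slope_t_stat(x, y)) > 1.96:
            reject_levels += 1
        # First differences: np.diff recovers the stationary shocks.
        if abs(slope_t_stat(np.diff(x), np.diff(y))) > 1.96:
            reject_diffs += 1
    return reject_levels / n_sims, reject_diffs / n_sims

if __name__ == "__main__":
    levels, diffs = compare_levels_and_differences()
    print(f"levels:      {levels:.2f}")
    print(f"differences: {diffs:.2f}")
```

After differencing, the rejection rate should fall back to roughly the nominal 5%, because the differenced series are stationary and the standard t-test applies again. Differencing discards any long-run relationship between the levels, which is why cointegration analysis is the alternative when such a relationship is plausible.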