Karl Ho
School of Economic, Political and Policy Sciences
University of Texas at Dallas
Best subset and stepwise model selection procedures:
Let \(M_{0}\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
For \(k = 1, 2, \ldots, p\):
Fit all \(p \choose k\) (pronounced \(p\) choose \(k\) ) models that contain exactly \(k\) predictors.
Pick the best among these \(p \choose k\) models, and call it \(M_{k}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
Select a single best model from among \(M_{0}, \ldots, M_{p}\) using cross-validated prediction error, \(C_p\) (AIC), \(BIC\), or adjusted \(R^2\).
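The steps above can be sketched in Python. This is a minimal illustration on synthetic data, with a made-up helper `rss()`; it picks the best model of each size \(k\) by smallest RSS, as in the algorithm.

```python
# Best subset selection sketch: for each size k, fit all (p choose k)
# OLS models and keep the one with the smallest RSS.
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
# Synthetic truth: only predictors 0 and 2 matter.
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

def rss(X_sub, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

# M_0: the null model predicts the sample mean for every observation.
best = {0: ((), ((y - y.mean()) ** 2).sum())}

# For k = 1..p, fit all (p choose k) models and call the winner M_k.
for k in range(1, p + 1):
    best[k] = min(
        ((subset, rss(X[:, list(subset)], y)) for subset in combinations(range(p), k)),
        key=lambda pair: pair[1],
    )

for k, (subset, r) in best.items():
    print(k, subset, round(r, 1))
```

The final choice among \(M_0, \ldots, M_p\) would then use cross-validation or a criterion such as \(C_p\), since training RSS alone always favors the largest model.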
How many possible models can one choose from?
Counting all the combinations of predictors across every subset size:
$$\sum_{k=0}^{p}{p \choose k}=2^p$$
Say there are \(p=5\) predictors and we want to choose \(k=3\) of them; the number of possible models is \({5 \choose 3}=10\).
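These counts can be verified directly with Python's standard library:

```python
# Counting candidate models in best subset selection.
from math import comb

p = 5
print(comb(p, 3))                             # models with exactly k = 3 predictors
print(sum(comb(p, k) for k in range(p + 1)))  # models across all subset sizes
print(2 ** p)                                 # the same total, 2^p
```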
The more predictors, the lower the training RSS.
The more predictors, the higher the \(R^2\).
Irreducible error is the noise term in the true relationship that cannot fundamentally be reduced by any model.
Bias is the difference between the expected (or average) prediction of the model and the true value it is trying to predict.
Variance is the variability of a model prediction for a given data point.
\(Err(x) = \left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] +\sigma_e^2\)
\(Err(x) = Bias^2+ Variance + Irreducible Error\)
Source: Scott Fortmann-Roe. 2012. Understanding the Bias-Variance Tradeoff (http://scott.fortmann-roe.com/docs/BiasVariance.html)
Lowest prediction error
\(Err(x) = Bias^2+ Variance + Irreducible Error\)
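The decomposition can be verified with a small Monte Carlo simulation. The shrunken-mean estimator below is a made-up example, chosen so that all three terms are nonzero:

```python
# Monte Carlo check of Err(x) = Bias^2 + Variance + Irreducible Error,
# using a deliberately biased estimator (0.8 times the sample mean).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma_e, n, reps = 2.0, 1.0, 10, 200_000

# Each replication: draw a sample of size n and predict f(x) = mu.
samples = rng.normal(mu, sigma_e, size=(reps, n))
f_hat = 0.8 * samples.mean(axis=1)

bias_sq = (f_hat.mean() - mu) ** 2      # (E[f_hat] - f)^2
variance = f_hat.var()                  # E[(f_hat - E[f_hat])^2]
irreducible = sigma_e ** 2              # noise in the true relationship

# Direct estimate of the prediction error against fresh observations.
y_new = rng.normal(mu, sigma_e, size=reps)
err = ((f_hat - y_new) ** 2).mean()

print(round(err, 3), round(bias_sq + variance + irreducible, 3))
```

The directly simulated error and the sum of the three terms agree up to Monte Carlo noise.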
$$RSS=\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2$$
$$\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2+\lambda\sum_{j=1}^{p}\beta_{j}^2=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^2,$$
The second term, \(\lambda\sum_{j=1}^{p}\beta_{j}^2\), is a shrinkage penalty, where the tuning parameter \(\lambda\) is a non-negative value.
This has the effect of “shrinking” large values of \(\beta\)'s towards zero.
It turns out that such a penalty can improve prediction accuracy, because shrinking the coefficients can significantly reduce their variance.
Notice that when \(\lambda=0\), we get the Ordinary Least Square (OLS).
$$\underset{\beta}{\text{minimize}}\quad RSS+\lambda\sum_{j=1}^{p}\beta_{j}^2$$
As \(\lambda\) increases, the standardized coefficients shrink towards zero.
The tuning parameter \(\lambda\) controls this trade-off: increasing it reduces variance at the cost of a slight increase in bias.
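The behavior above can be sketched with ridge regression's closed-form solution, \(\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty\), on synthetic data (predictors standardized, intercept handled by centering \(y\)):

```python
# Ridge regression via its closed-form solution; lambda = 0 recovers OLS,
# and larger lambda shrinks the coefficient vector toward zero.
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the predictors
y = X @ np.array([4.0, -3.0, 2.0, 0.0]) + rng.normal(size=n)
yc = y - y.mean()                         # centering absorbs the intercept

def ridge(lmbda):
    """Ridge coefficients (X'X + lambda I)^{-1} X'y on centered data."""
    return np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ yc)

# OLS slopes for comparison (intercept column dropped from the result).
ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1:]

print(np.round(ridge(0.0), 3))                 # matches OLS when lambda = 0
for lam in (10.0, 100.0, 1000.0):
    print(lam, np.round(ridge(lam), 3))        # coefficients shrink as lambda grows
```

Printing the coefficient vectors shows the standardized coefficients moving toward zero as \(\lambda\) increases, as in the shrinkage plots.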