Social and Political Data Science: Introduction

### Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

# Linear Model Selection

## Linear models

• ### Better prediction accuracy, especially when least squares estimates have high variance

• $$p>n$$: more predictors/features than observations in the sample

## Subset selection

Best subset and stepwise model selection procedures:

1. Let $$M_{0}$$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

2. For $$k = 1,2,\ldots,p$$:

   a. Fit all $$p \choose k$$ (pronounced "$$p$$ choose $$k$$") models that contain exactly $$k$$ predictors.

   b. Pick the best among these $$p \choose k$$ models, and call it $$M_{k}$$. Here *best* is defined as having the smallest RSS, or equivalently the largest $$R^2$$.

3. Select a single best model from among $$M_{0},\ldots,M_{p}$$ using cross-validated prediction error, $$C_p$$ (AIC), BIC, or adjusted $$R^2$$.
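The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not code from the course: the variable names are made up, and adjusted $$R^2$$ stands in for the other criteria in step 3.

```python
# Best subset selection sketch (synthetic data; illustrative names).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
# The true model uses only the first two predictors.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def rss(cols):
    """RSS of the least-squares fit using the given predictor columns."""
    Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    return np.sum((y - Xk @ beta) ** 2)

tss = np.sum((y - y.mean()) ** 2)

best_per_k = {0: ()}                      # step 1: M_0, the null model
for k in range(1, p + 1):                 # step 2: best model of each size
    best_per_k[k] = min(itertools.combinations(range(p), k), key=rss)

def adj_r2(cols):
    k = len(cols)
    return 1 - (rss(cols) / (n - k - 1)) / (tss / (n - 1))

# Step 3: pick among M_0, ..., M_p using adjusted R^2.
best = max(best_per_k.values(), key=adj_r2)
print("selected predictors:", best)
```

Note that step 2 is the expensive part: it fits every one of the $$2^p$$ subsets, which is why stepwise procedures exist for larger $$p$$.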

## Subset selection

How many possible models can one choose from?

Counting every subset of the $$p$$ predictors gives the total:

## Subset selection

All the combinations?

$$\sum_{k=0}^{p}{p \choose k}=2^p$$

### $${p \choose k} = \frac{p!}{k!(p-k)!}$$

Say there are $$p=5$$ predictors and we want to choose $$k=3$$ predictors; the number of possible models is $${5 \choose 3} = \frac{5!}{3!\,2!} = 10$$.
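Both counts are easy to check with Python's standard library:

```python
import math

p = 5
# 5 choose 3: models with exactly three of the five predictors.
print(math.comb(5, 3))                              # → 10
# Total candidate models over all subset sizes, 2^p in all.
print(sum(math.comb(p, k) for k in range(p + 1)))   # → 32
```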

## Illustration: Credit models

The more predictors in the model, the higher the training $$R^2$$

## Prediction error

### Three types of prediction error (PE)

• Irreducible error is the noise term in the true relationship that cannot fundamentally be reduced by any model.

• Bias is the difference between the expected (or average) prediction of the model and the true value that it is trying to predict.

• Variance is the variability of a model prediction for a given data point.

$$Err(x) = \left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] +\sigma_e^2$$

$$Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

Source: Scott Fortmann-Roe. 2012. Understanding the Bias-Variance Tradeoff (http://scott.fortmann-roe.com/docs/BiasVariance.html)
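The decomposition can be seen numerically with a small Monte Carlo sketch. Everything below is an illustrative setup (true function, noise level, sample size), not material from the slides: a straight line is fit repeatedly to noisy draws from a sine curve, and the bias and variance of its prediction at one point are estimated.

```python
# Monte Carlo estimate of Bias^2 and Variance at a single point x0.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)          # true function (illustrative choice)
sigma = 0.3                      # noise sd, so irreducible error = sigma^2
x0, n, reps = 1.0, 30, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 2, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    # Fit a straight line: a deliberately biased model for a sine curve.
    b1, b0 = np.polyfit(x, y, 1)
    preds[r] = b0 + b1 * x0

bias2 = (preds.mean() - f(x0)) ** 2      # (E[f_hat(x0)] - f(x0))^2
variance = preds.var()                   # E[(f_hat(x0) - E[f_hat(x0)])^2]
print(f"Bias^2 ≈ {bias2:.4f}, Variance ≈ {variance:.4f}, "
      f"Irreducible ≈ {sigma**2:.4f}")
```

A more flexible model would lower the bias term and raise the variance term; the irreducible $$\sigma_e^2$$ is untouched either way.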

### Scenarios

The goal is the lowest total prediction error, trading off bias against variance:

$$Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

### One Standard Error rule

• The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.

### One Standard Error rule

• If a set of models appears to be more or less equally good, then we choose the simplest model, that is, the model with the smallest number of predictors.
• We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
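The rule is easy to sketch in code. The cross-validation errors and standard errors below are made-up numbers shaped like the Credit example (minimum at six variables), not the actual values:

```python
# One-standard-error rule on illustrative CV results.
import numpy as np

sizes = np.arange(1, 8)
cv_mean = np.array([0.30, 0.24, 0.21, 0.195, 0.192, 0.190, 0.191])
cv_se = np.array([0.010, 0.009, 0.008, 0.008, 0.008, 0.008, 0.008])

best = np.argmin(cv_mean)                  # lowest point on the curve
threshold = cv_mean[best] + cv_se[best]    # one SE above the minimum
# Smallest model whose estimated error is within one SE of the minimum:
chosen = sizes[np.argmax(cv_mean <= threshold)]
print("minimum at size", sizes[best], "-> choose size", chosen)
```

Here the curve bottoms out at six variables, but the four-variable model is already within one standard error of that minimum, so the rule picks four.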

## Ridge regression

$$RSS=\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2$$

• ### Ridge Regression uses a slightly different equation

$$\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2+\lambda\sum_{j=1}^{p}\beta_{j}^2=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^2,$$

• The effect is to add a shrinkage penalty of the form $$\lambda\sum_{j=1}^{p}\beta_{j}^2$$, where the tuning parameter $$\lambda$$ is a positive value.
This has the effect of "shrinking" large values of $$\beta_j$$ towards zero.

• It turns out that such a penalty can improve prediction, because shrinking the coefficients can significantly reduce their variance

• Notice that when $$\lambda=0$$, we recover the ordinary least squares (OLS) estimates.

## Ridge regression

$$\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2+\lambda\sum_{j=1}^{p}\beta_{j}^2=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^2$$

• As $$\lambda$$ increases, the standardized coefficients shrink towards zero.
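The shrinkage is visible directly in the closed-form solution $$\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$$. The sketch below uses synthetic standardized data (all names and values are illustrative) and prints the coefficient norm as $$\lambda$$ grows:

```python
# Closed-form ridge regression on standardized synthetic predictors.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)    # standardize predictors
y = X @ np.array([4.0, -3.0, 2.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()                  # center y so no intercept is needed

def ridge(lam):
    """beta_hat = (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 10.0, 100.0, 1000.0):
    print(f"lambda={lam:7.1f}  ||beta|| = {np.linalg.norm(ridge(lam)):.3f}")
```

At $$\lambda = 0$$ this reproduces the OLS coefficients; each larger $$\lambda$$ pulls the whole coefficient vector further towards zero, shrinking its norm.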

## Ridge regression

• The tuning parameter $$\lambda$$ controls the trade-off: a well-chosen value can substantially reduce variance at the cost of a slight increase in bias.