Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Linear Model Selection

Linear models:

  • Simple

  • Interpretable

  • Great for prediction

  • Easy to fit

Linear models


  • Prediction accuracy: especially when \(p>n\), i.e., more predictors/features than the sample size, the least squares estimates have high variance

  • Model interpretability: removing irrelevant features yields a simpler, more easily interpreted model

Alternatives to Least Squares

  1. Subset Selection

  2. Shrinkage (Regularization)

  3. Dimension Reduction

Three classes of methods

Identify a subset of the \(p \) predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

Subset selection

Best subset and stepwise model selection procedures:

  1. Let \(M_{0}\) denote the null model, which contains no predictors. This model simply predicts the constant or sample mean for each observation.

  2. For \(k = 1,2,...p\):

    1. Fit all \(p \choose k\) (pronounced \(p\) choose \(k\) ) models that contain exactly \(k\) predictors.

    2. Pick the best among these \(p \choose k\) models, and call it \(M_{k}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).

  3. Select a single best model from among \(M_{0}\), . . . , \(M_{p}\) using cross-validated prediction error, \(C_p\) (AIC), \(BIC\), or adjusted \(R^2\).
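Step 2 of the procedure can be sketched in Python with NumPy; the function name and toy data below are illustrative, not part of the slides:

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Fit all C(p, k) models with exactly k predictors by least squares;
    return the column indices and RSS of the one with the smallest RSS."""
    n, p = X.shape
    best = (None, np.inf)
    for idx in itertools.combinations(range(p), k):
        A = np.column_stack([np.ones(n), X[:, idx]])   # intercept + chosen columns
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = np.sum((y - A @ beta) ** 2)                # residual sum of squares
        if r < best[1]:
            best = (idx, r)
    return best

# Toy data: y depends only on columns 0 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=100)
idx, rss = best_subset(X, y, k=2)
```

With this data the size-2 search recovers the two truly relevant predictors.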

Subset selection

How many possible models can one choose from?

To find out, we count the number of combinations:

Subset selection

All the combinations?


$$  {p \choose k} = \frac{p!}{k!(p-k)!}   $$

Say there are \(p=5\) predictors and we want to choose \(k=3\) predictors; the number of possible models is:

$$  {5 \choose 3} = \frac{5!}{3!(5-3)!} = 10   $$
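A quick check of this arithmetic with Python's standard library:

```python
import math

# Number of k-variable models chosen from p predictors
assert math.comb(5, 3) == 10
assert math.comb(5, 3) == math.factorial(5) // (math.factorial(3) * math.factorial(2))

# Best subset selection searches over every subset size, so it fits 2^p models in total
assert sum(math.comb(5, k) for k in range(6)) == 2 ** 5   # 32 models for p = 5
```

The \(2^p\) total is why best subset selection becomes computationally infeasible as \(p\) grows.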

Illustration: Credit models

The more predictors, the lower the RSS

The more predictors, the higher the \(R^2\)

  • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.

  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

  • Nested model, plus additional one.

  • Considers far fewer models: \(1+p+(p-1)+\cdots+1 = 1+p(p+1)/2\)

Forward Stepwise Selection

  • Best subset selection is computationally expensive but not necessarily cost effective 

  • Plus, an enormous search space can lead to overfitting and high variance of the coefficient estimates.

  • Stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

Forward Stepwise Selection

Forward Stepwise Selection

The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ.

  1. Let \(M_{0}\) denote the null model, which contains no predictors.

  2. For \(k=0,\ldots,p-1:\)

    1. Consider all \(p-k\) models that augment the predictors in \(M_{k}\) with one additional predictor.

    2. Choose the best among these \(p-k\)  models, and call it \(M_{k+1}\). Here best is defined as having smallest \(RSS\) or highest \(R^2\).
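The greedy procedure above can be sketched in Python with NumPy; the helper and toy data are illustrative:

```python
import numpy as np

def rss(cols, X, y):
    """RSS of the least squares fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def forward_stepwise(X, y):
    """Return the order in which predictors enter the model."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        # add the predictor giving the greatest additional improvement (lowest RSS)
        best_j = min(remaining, key=lambda j: rss(chosen + [j], X, y))
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 3] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
order = forward_stepwise(X, y)
```

Because the signal on column 3 is strongest, it enters first, followed by column 1.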

Forward Stepwise Selection

  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.

  • However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time.

Backward Stepwise Selection

  1. Let \(M_{p}\) denote the full model, which contains all \(p\) predictors.

  2. For \(k=p,p−1,...,1:\)

    1. Consider all \(k\) models that contain all but one of the predictors in \(M_{k}\), for a total of \(k-1\) predictors.

    2. Choose the best among these \(k\)  models, and call it \(M_{k-1}\). Here best is defined as having smallest \(RSS\) or highest \(R^2\).
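The same greedy idea run in reverse can be sketched in NumPy (illustrative names and data):

```python
import numpy as np

def rss(cols, X, y):
    """RSS of the least squares fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

def backward_stepwise(X, y):
    """Start from the full model; return predictors in the order removed."""
    active = list(range(X.shape[1]))
    removed = []
    while active:
        # drop the predictor whose removal increases RSS the least
        worst = min(active, key=lambda j: rss([i for i in active if i != j], X, y))
        active.remove(worst)
        removed.append(worst)
    return removed

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 4 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
removed = backward_stepwise(X, y)
```

The pure noise columns 1 and 3 are eliminated first; the strongest predictor survives longest.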

Backward Stepwise Selection

  • For both forward and backward stepwise selections:

    Select a single best model from among \(M_0\), . . . , \(M_p\) using cross-validated prediction error, \(C_p\) (AIC), \(BIC\), or adjusted \(R^2\).

Model Selection

  • Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p + 1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection.

  • Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors.

Backward Stepwise Selection

  • Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n < p\), and so is the only viable subset method when \(p\) is very large.

Backward Stepwise Selection

  • The model containing all of the predictors will always have the smallest \(RSS\) and the largest \(R^2\), since these quantities are related to the training error.

  • We choose a model with low test error, not a model with low training error. Training error is usually a poor estimate of test error.

  • Hence, \(RSS\) and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.

Model Selection

Three types of prediction error (PE)

  • Irreducible error is the noise term in the true relationship that cannot fundamentally be reduced by any model. 

  • Bias is the difference between the expected (or average) prediction of the model and the true value it is trying to predict.

  • Variance is the variability of a model prediction for a given data point.

\(Err(x) = \left(E[\hat{f}(x)]-f(x)\right)^2 + E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^2\right] +\sigma_e^2\)

\(Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)
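The decomposition can be verified numerically. The shrinkage estimator below is a hypothetical example, chosen so the bias term is nonzero; all quantities are Monte Carlo estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
f_x, sigma = 2.0, 1.0        # true value f(x); sd of the irreducible noise
reps, n = 20000, 10

# Estimator: sample mean of n noisy observations of f(x), shrunk by 0.8.
# Shrinkage introduces bias but lowers variance (as in ridge regression).
preds = 0.8 * (f_x + sigma * rng.normal(size=(reps, n))).mean(axis=1)

bias2 = (preds.mean() - f_x) ** 2    # (E[f̂(x)] - f(x))^2, theoretically 0.16
var = preds.var()                    # E[(f̂(x) - E[f̂(x)])^2], theoretically 0.064

# Expected prediction error for a fresh observation y = f(x) + ε
y_new = f_x + sigma * rng.normal(size=reps)
err = np.mean((y_new - preds) ** 2)  # should match bias2 + var + sigma**2
```

Up to simulation noise, `err` equals the sum of the three components.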

Bias-variance tradeoff

Source: Scott Fortmann-Roe. 2012. Understanding the Bias-Variance Tradeoff.

  • If \(n \gg p\) (read as \(n\) much greater than \(p\))
    that is, if \(n\), the number of observations, is much larger than \(p\), the number of variables, then the least squares estimates tend to also have low variance, and hence will perform well on test observations.


  • If \(n\) is not much greater than \(p\),
    there can be a lot of variability in the least squares fit, resulting in overfitting and consequently poor predictions on future observations not used in model training.


  • If \(n\) < \(p\)
    there is no longer a unique least squares coefficient estimate:

    the variance is infinite so the method cannot be used at all.


Bias-variance tradeoff

Lowest prediction error

\(Err(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\)

$$ C_p=\dfrac{1}{n} (RSS+2d\hat{\sigma}^2) $$

  • where \(d\) is the total # of parameters used and \(\hat{\sigma}^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement.

Mallow’s \(C_p\):

The AIC criterion is defined for a large class of models fit by maximum likelihood:

$$ AIC = -2 \log L + 2d $$
where \(L\) is the maximized value of the likelihood function for the estimated model.

Akaike Information criterion (AIC)

$$ BIC=\dfrac{1}{n} (RSS+\log(n)\,d\hat{\sigma}^2) $$

  • Like \(C_p\), the \(BIC\) will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest \(BIC\) value.

  • Notice that \(BIC\) replaces the \(2d\hat{\sigma}^2\) used by \(C_p\) with a \(\log(n)\,d\hat{\sigma}^2\) term, where \(n\) is the number of observations.

Bayesian Information criterion (BIC)

Since \(\log n > 2\) for any \(n > 7\), the \(BIC\) statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\).
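The two formulas above can be compared directly; the numbers below are made up for illustration:

```python
import numpy as np

def cp_bic(rss, n, d, sigma2_hat):
    """Mallow's C_p and BIC as defined above; d is the number of parameters."""
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    return cp, bic

# Hypothetical fits: a 3-variable model vs a 10-variable model with slightly lower RSS
n, sigma2 = 100, 1.0
cp_small, bic_small = cp_bic(rss=50.0, n=n, d=3, sigma2_hat=sigma2)
cp_big, bic_big = cp_bic(rss=48.0, n=n, d=10, sigma2_hat=sigma2)

# Since log(100) > 2, BIC penalizes the extra 7 variables more heavily than C_p
```

Both criteria here prefer the smaller model, but the BIC gap between the two models is larger.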

Bayesian Information criterion (BIC) (continued)

  • For a least squares model with d variables, the adjusted \(R^2\) statistic is calculated as:

$$ \text{Adjusted } R^2 = 1-\frac{RSS/(n-d-1)}{TSS/(n-1)} $$
    where \(TSS\) is the total sum of squares.

  • Unlike \(C_p\), \(AIC\), and \(BIC\), for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error.
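A small numerical illustration of the formula (the RSS and TSS values are hypothetical):

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 as defined above."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

n, tss = 100, 200.0
# Adding a sixth, useless predictor barely lowers RSS (180 -> 179), but d rises
r2_5 = 1 - 180.0 / tss           # plain R^2 with 5 predictors
r2_6 = 1 - 179.0 / tss           # plain R^2 always increases
adj_5 = adjusted_r2(180.0, tss, n, d=5)
adj_6 = adjusted_r2(179.0, tss, n, d=6)
```

Plain \(R^2\) rises with the extra variable, but adjusted \(R^2\) falls: it pays a price for the unnecessary predictor.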

Adjusted \(R^2\)

  • Maximizing the adjusted \(R^2\) is equivalent to minimizing \(RSS/(n-d-1)\). While \(RSS\) always decreases as the number of variables in the model increases, \(RSS/(n-d-1)\) may increase or decrease, due to the presence of \(d\) in the denominator.

  • Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.

Adjusted \(R^2\)


regsubsets plots

One Standard Error rule

  • The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model.
  • However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.

One Standard Error rule

  • If a set of models appear to be more or less equally good, then we choose the simplest model, that is, the model with the smallest number of predictors.
  • We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.
  • Each of these procedures returns a sequence of models \(M_k\) indexed by model size \(k = 0,1,2,\ldots\); we then select \(\hat{k}\) and return model \(M_{\hat{k}}\).

  • Compute the validation set error or the cross-validation error for each model \(M_k\), then select the \(k\) with the smallest test error.

  • This procedure has an advantage over \(AIC\), \(BIC\), \(C_p\), and adjusted \(R^2\): it provides a direct estimate of the test error and does not require an estimate of the error variance \(\sigma^2\).

  • Can be applied in cases where error variance \(\sigma^2\) is hard to estimate.
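The one-standard-error rule can be sketched directly; the cross-validation errors below are hypothetical numbers, not the Credit data results:

```python
import numpy as np

def one_se_rule(mean_err, se_err):
    """Pick the smallest model whose CV error is within one standard
    error of the minimum. Arrays are ordered from smallest model to largest."""
    best = int(np.argmin(mean_err))
    threshold = mean_err[best] + se_err[best]
    for k, m in enumerate(mean_err):
        if m <= threshold:
            return k

# Hypothetical 10-fold CV errors for model sizes k = 1..6 (index 0..5)
mean_err = np.array([10.0, 7.0, 5.2, 5.0, 4.9, 4.95])
se_err   = np.array([0.5, 0.5, 0.4, 0.4, 0.4, 0.4])
k = one_se_rule(mean_err, se_err)
```

The minimum is at the five-variable model, but the three-variable model is within one standard error of it, so the simpler model is chosen.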

Validation and Cross-Validation

Fit a model involving all \(p\) predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.


The shrinkage methods can be applied to very large data where the number of variables might be in the thousands or even millions. 

  • By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias. 


  • When the number of predictors is larger than the sample size, \(p>n\)

  • Multicollinearity, i.e. predictors are highly collinear

Ridge regression

  • Least squares estimates the coefficients by minimizing:

Ridge regression

$$RSS=\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2$$

  • Ridge Regression uses a slightly different equation

$$\sum_{i=1}^{n}\left( y_i - \beta_0-\sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2+\lambda\sum_{j=1}^{p}\beta_{j}^2=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^2,$$

  • The effect of this equation is to add a shrinkage penalty of the form \(\lambda\sum_{j=1}^{p}\beta_{j}^2\), where the tuning parameter \(\lambda\) is a non-negative value.
    This has the effect of “shrinking” large values of the \(\beta_j\)'s towards zero.

  • It turns out that such a constraint can improve the fit, because shrinking the coefficients can significantly reduce their variance.

  • Notice that when  \(\lambda=0\), we get the Ordinary Least Square (OLS).
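Because the ridge criterion is quadratic, it has a closed-form solution, which makes the \(\lambda=0\) case easy to check. A minimal NumPy sketch, assuming the predictors are standardized and the response centered so no intercept is needed:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients: (X'X + lam*I)^{-1} X'y.
    Assumes X is standardized and y is centered (no intercept term)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize predictors
y = 3 * X[:, 0] + rng.normal(size=100)
y = y - y.mean()                               # center the response

b_ols = ridge(X, y, lam=0.0)      # lambda = 0 recovers ordinary least squares
b_10 = ridge(X, y, lam=10.0)      # moderate shrinkage
b_1000 = ridge(X, y, lam=1000.0)  # heavy shrinkage, coefficients near zero
```

As \(\lambda\) grows, the coefficient vector's norm shrinks monotonically towards zero.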

Ridge regression


  • As \(\lambda\) increases, the standardized coefficients shrink towards zero.

Ridge regression

  • The tuning parameter \(\lambda\) can be chosen to reduce variance at the cost of a slight increase in bias.

Tuning parameter

Project \(p\) predictors into a \(M\)-dimensional subspace, where \(M < p\). This is achieved by computing \(M\) different linear combinations, or projections, of the variables. Then these \(M\) projections are used as predictors to fit a linear regression model by least squares.

Dimension Reduction
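One common instance of this idea is principal components regression; a minimal NumPy sketch, with illustrative data:

```python
import numpy as np

def pcr(X, y, M):
    """Principal components regression: regress y on the first M
    principal components of the centered predictor matrix X."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by variance explained
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:M].T                          # the M linear combinations (scores)
    A = np.column_stack([np.ones(len(y)), Z])  # intercept + M projections
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta                               # M + 1 coefficients

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)
theta = pcr(X, y, M=3)                         # p = 6 predictors reduced to M = 3
```

The fitted model has only \(M+1\) coefficients instead of \(p+1\), which is the source of the variance reduction.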