Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Supervised Learning: Resampling methods

Resampling involves repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. 


Such an approach allows us to obtain information that would not be available from fitting the model only once using the original training sample.

These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.


Resampling provides estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.

Test error is the average error that results from using a statistical learning method to predict the response on new observations that were not used in training the method.

Test error

Training error is the error calculated by applying the statistical learning method to the observations used in its training.

Training error

The training error rate is often quite different from the test error rate, and the former can dramatically underestimate the latter.

Training- versus Test-Set Performance

Source: ISLR Figure 2.7, p. 25

Again, beware of the evil of overfitting!

Flexibility vs. Interpretability 

Source: ISLR Figure 2.7, p. 25

In general, as the flexibility of a method increases, its interpretability decreases.

Cross-validation estimates the test error associated with a given statistical learning method to evaluate performance, or to select the appropriate level of flexibility. 

Cross-validation (CV)

The process of evaluating a model’s performance is known as model assessment.

The process of selecting the proper level of flexibility for a model is called model selection. 

  1. Randomly divide the sample into two parts: a training set and a hold-out set (validation).

  2. Fit the model on the training set, then use the fitted model to predict the responses for the observations in the validation set.

  3. The resulting validation-set error provides an estimate of the test error (using \(MSE\)). 

Validation-set approach
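The three steps above can be sketched in Python with NumPy. The quadratic data-generating process, sample size, and seed below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends quadratically on x, plus noise.
n = 100
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 0.5, n)

# Step 1: randomly divide the sample into a training set and a hold-out set.
idx = rng.permutation(n)
train, valid = idx[: n // 2], idx[n // 2:]

# Step 2: fit on the training set (a simple linear fit here).
coefs = np.polyfit(x[train], y[train], deg=1)

# Step 3: the validation-set MSE provides an estimate of the test error.
pred = np.polyval(coefs, x[valid])
validation_mse = np.mean((y[valid] - pred) ** 2)
```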

$$ MSE=\frac{1}{n} \sum_{i=1}^n(y_{i}-\hat{f}(x_{i}))^2 $$

Mean Squared Error (MSE)

Mean Squared Error measures how close the predicted response value for a given observation is to the true response value for that observation.

where \(\hat{f}(x_{i})\) is the prediction that \(\hat{f}\) gives for the \(i^{th}\) observation. 

The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.
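A minimal numerical check of the formula, using made-up toy values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed responses
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # predictions from a fitted model

# MSE = (1/n) * sum of (y_i - f_hat(x_i))^2
mse = np.mean((y_true - y_pred) ** 2)      # 0.375 for these toy values
```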

Validation-set approach

Random splitting into two halves: left part is training set, right part is validation set

Illustration: linear vs. polynomial

Linear vs. Quadratic vs. Cubic 

Validation 10 times 

Conclusion: the quadratic fit performs best.
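A sketch of this comparison on simulated data (the true quadratic relationship, noise level, and seed are arbitrary assumptions): averaging the validation MSE over 10 random splits for polynomial degrees 1–3 typically favors the quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with a true quadratic relationship.
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x**2 + rng.normal(0, 0.5, n)

# Repeat the random split 10 times and average the validation MSE
# for linear (deg 1), quadratic (deg 2), and cubic (deg 3) fits.
avg_mse = {}
for deg in (1, 2, 3):
    errs = []
    for _ in range(10):
        idx = rng.permutation(n)
        tr, va = idx[: n // 2], idx[n // 2:]
        coefs = np.polyfit(x[tr], y[tr], deg)
        errs.append(np.mean((y[va] - np.polyval(coefs, x[va])) ** 2))
    avg_mse[deg] = np.mean(errs)
```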

  1. The validation estimate of the test error rate can be highly variable, depending on which observations happen to be included in the training set (e.g., whether extreme observations are included).

  2. Because only a subset of the observations is used to fit the model, the validation-set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Validation-set approach drawbacks

Leave-One-Out Cross-Validation

A set of \(n\) data points is repeatedly split into a training set containing all but one observation, and a validation set that contains only that observation. The test error is then estimated by averaging the \(n\) resulting \(MSE\)’s. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.
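The \(n\) leave-one-out fits can be sketched directly in NumPy (the linear data-generating process and seed here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Small hypothetical data set for a linear fit.
n = 30
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 0.2, n)

# LOOCV: n fits, each leaving out exactly one observation.
errors = []
for i in range(n):
    mask = np.arange(n) != i          # train on all but observation i
    coefs = np.polyfit(x[mask], y[mask], deg=1)
    pred_i = np.polyval(coefs, x[i])  # predict the held-out observation
    errors.append((y[i] - pred_i) ** 2)

cv_loocv = np.mean(errors)            # average of the n resulting MSEs
```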

K-fold Cross-validation

A set of \(n\) observations is randomly split into \(K\) non-overlapping groups (here, \(K = 5\)). Each of these folds in turn acts as a validation set, with the remaining \(K-1\) folds as the training set. The test error is estimated by averaging the \(K\) resulting \(MSE\) estimates.

K-fold Cross-validation
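The same idea with \(K = 5\) folds, as a NumPy sketch (simulated data; all modeling choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

n, k = 100, 5
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 0.2, n)

# Randomly partition the n observations into k non-overlapping folds.
idx = rng.permutation(n)
folds = np.array_split(idx, k)

fold_mses = []
for fold in folds:
    mask = np.ones(n, dtype=bool)
    mask[fold] = False                 # fold = validation set, rest = training
    coefs = np.polyfit(x[mask], y[mask], deg=1)
    pred = np.polyval(coefs, x[fold])
    fold_mses.append(np.mean((y[fold] - pred) ** 2))

cv_kfold = np.mean(fold_mses)          # average of the k MSE estimates
```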

LOOCV is sometimes useful, but it typically doesn't shake up the data enough: the \(n\) training sets are nearly identical, so the resulting estimates are highly correlated and their average can have high variance.

Consider a simple classifier applied to some two-class data:

  1. Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels.

  2. We then apply a classifier such as logistic regression, using only these 100 predictors.

Proper way of cross-validation

  • Can we apply cross-validation in step 2, forgetting about step 1?

  • This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.

  • It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is zero!

Proper way of cross-validation
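The leakage can be demonstrated with a simulation along the lines described above. This sketch uses a nearest-centroid classifier instead of logistic regression (an assumption made to keep the code dependency-free); labels are generated independently of the predictors, so the true error rate is 50%:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical simulation: 50 samples, 5000 pure-noise predictors,
# class labels generated INDEPENDENTLY of the predictors.
n, p, keep = 50, 5000, 100
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)

def screen(X, y, keep):
    """Step 1: indices of the `keep` predictors most correlated with y."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(r)[-keep:]

def centroid_error(Xtr, ytr, Xte, yte):
    """Step 2: nearest-centroid classifier, error rate on the test fold."""
    m0, m1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - m0) ** 2).sum(axis=1)
    d1 = ((Xte - m1) ** 2).sum(axis=1)
    return np.mean((d1 < d0).astype(int) != yte)

folds = np.array_split(rng.permutation(n), 5)

def cv_error(X, y, folds, screen_inside):
    errs = []
    for f in folds:
        tr = np.setdiff1d(np.arange(len(y)), f)
        # WRONG: screen once on ALL data (the held-out labels leak in).
        # RIGHT: redo the screening inside every training fold.
        cols = screen(X[tr], y[tr], keep) if screen_inside else screen(X, y, keep)
        errs.append(centroid_error(X[tr][:, cols], y[tr], X[f][:, cols], y[f]))
    return np.mean(errs)

wrong = cv_error(X, y, folds, screen_inside=False)  # typically far below 50%
right = cv_error(X, y, folds, screen_inside=True)   # honest, near 50%
```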

  • Repeatedly sampling observations from the original data set.

  • Randomly select \(n\) observations with replacement from the data set. 

  • In other words, the same observation can occur more than once in the bootstrap data set.
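Sampling with replacement is one line in NumPy; the toy data and bootstrap replication count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data set of n = 5 observations.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(data)

# One bootstrap data set: n draws WITH replacement, so the same
# observation can occur more than once and others not at all.
boot = rng.choice(data, size=n, replace=True)

# Repeating B times gives a bootstrap distribution of any statistic,
# e.g. the sample mean, from which its standard error can be estimated.
B = 1000
boot_means = np.array([rng.choice(data, size=n, replace=True).mean()
                       for _ in range(B)])
se_boot = boot_means.std(ddof=1)
```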


The use of the term bootstrap derives from the phrase to pull oneself up by one's bootstraps (to improve one's position by one's own efforts), widely thought to be based on an episode in the 18th-century book “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe:


The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.


Each bootstrap data set contains \(n\) observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of \(\alpha\).

Bootstrap vs. Jackknife

Bootstrap: replacement resampling

Jackknife (John Tukey) is a resampling technique especially useful for variance and bias estimation.

Jackknife: resampling by leaving out observation

Bootstrap (with replacement): [1,2,3,4,5] → [2,5,4,4,1], [1,3,2,5,5], ...
Jackknife (leave one out): [1,2,3,4,5] → [2,3,4,5], [1,3,4,5], [1,2,4,5], ...
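A sketch of the jackknife for a toy data set, estimating the variance of the sample mean (for the mean, the jackknife variance reduces exactly to \(s^2/n\), which makes the result easy to verify):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(data)

# Jackknife: n leave-one-out replicates of the statistic (here, the mean).
theta_i = np.array([np.delete(data, i).mean() for i in range(n)])
theta_bar = theta_i.mean()

# Jackknife estimate of the variance of the sample mean.
var_jack = (n - 1) / n * np.sum((theta_i - theta_bar) ** 2)
```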
  • Suppose that we invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities (Portfolio dataset).

  • A fraction \(\alpha\) of the money is invested in X, and the remaining \(1-\alpha\) is invested in Y.

  • Naturally, we want to choose an \(\alpha\) that minimizes the total risk, or variance, of our investment. In other words, we want to minimize \(Var(\alpha X + (1 − \alpha) Y)\).

Illustration: Portfolio example

  • One can show that the value that minimizes the risk is given by

$$ \alpha=\frac{\sigma_{y}^2-\sigma_{xy}}{\sigma_{x}^2+\sigma_{y}^2-2\sigma_{xy}} $$

Illustration: Portfolio example

  • But the values of \(\sigma_{x}^2\), \(\sigma_{y}^2\) , and \(\sigma_{xy}\) are unknown.

  • We can compute estimates for these quantities, \(\hat{\sigma}_{x}^2\) , \(\hat{\sigma}_{y}^2\) , and \(\hat{\sigma}_{xy}\) , using a data set that contains measurements for X and Y.

  • We can then estimate the value of \(\alpha\) that minimizes the variance of our investment using:

$$ \hat{\alpha}=\frac{\hat{\sigma}_{y}^2-\hat{\sigma}_{xy}}{\hat{\sigma}_{x}^2+\hat{\sigma}_{y}^2-2\hat{\sigma}_{xy}} $$

Illustration: Portfolio example

  • To estimate the standard deviation of \(\hat{\alpha}\), we repeated the process of simulating 100 paired observations of X and Y and estimating \(\alpha\) 1,000 times.

  • We thereby obtained 1,000 estimates for \(\alpha\), which we can call \(\hat{\alpha}_{1}\) ,\(\hat{\alpha}_{2}\) ,...,\(\hat{\alpha}_{1000}\).

  • For these simulations the parameters were set to \(\sigma_{x}^2 = 1\), \(\sigma_{y}^2 = 1.25\), and \(\sigma_{xy} = 0.5\), and so we know that the true value of \(\alpha\) is 0.6.

Illustration: Portfolio example

  • Left: Histogram of the estimates of \(\alpha\) obtained by generating 1,000 simulated data sets from the true population. Center: A histogram of the estimates of \(\alpha\) obtained from 1,000 bootstrap samples from a single data set. Right: Boxplots of estimates of \(\alpha\) displayed in the left and center panels. In each panel, the pink line indicates the true value of \(\alpha\).

Illustration: Portfolio example

  • We cannot generate new samples from the original population.

  • However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.

In reality....

  • Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

  • Each of these “bootstrap data sets” is created by sampling with replacement, and is the same size as our original data set. As a result, some observations may appear more than once in a given bootstrap data set and some not at all.

In reality....
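Putting this together for the portfolio example: the sketch below simulates a stand-in for the Portfolio data using the covariances quoted earlier (\(\sigma_{x}^2 = 1\), \(\sigma_{y}^2 = 1.25\), \(\sigma_{xy} = 0.5\), so true \(\alpha = 0.6\)) and bootstraps \(\hat{\alpha}\); the seed and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated stand-in for the Portfolio data: 100 paired (X, Y) returns.
cov = np.array([[1.0, 0.5],
                [0.5, 1.25]])
XY = rng.multivariate_normal([0, 0], cov, size=100)

def alpha_hat(sample):
    """alpha = (sigma_y^2 - sigma_xy) / (sigma_x^2 + sigma_y^2 - 2 sigma_xy)."""
    c = np.cov(sample, rowvar=False)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# Bootstrap: resample the 100 rows with replacement, B times.
B = 1000
n = len(XY)
boots = np.array([alpha_hat(XY[rng.integers(0, n, n)]) for _ in range(B)])
se_alpha = boots.std(ddof=1)   # bootstrap estimate of SE(alpha_hat)
```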

Q & A
