Social and Political Data Science: Introduction

### Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

# Supervised Learning: Resampling methods

## Training error

### Training- versus Test-Set Performance

Source: ISLR Figure 2.7, p. 25

Again, beware of the evil of overfitting!

## Flexibility vs. Interpretability

Source: ISLR Figure 2.7, p. 25

In general, as the flexibility of a method increases, its interpretability decreases.

## Mean Squared Error (MSE)

The mean squared error measures how close the predicted response for each observation is to the true response for that observation:

$$MSE=\frac{1}{n} \sum_{i=1}^n(y_{i}-\hat{f}(x_{i}))^2$$

where $$\hat{f}(x_{i})$$ is the prediction that $$\hat{f}$$ gives for the $$i^{th}$$ observation.

The MSE will be small if the predicted responses are very close to the true responses, and large if, for some observations, the predicted and true responses differ substantially.
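As a quick numerical check, the formula can be computed directly with NumPy (the data below are toy values, not from the slides):

```python
import numpy as np

# Toy data: true responses y and predictions y_hat from some fitted model f-hat
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# MSE = (1/n) * sum of squared prediction errors
mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.4375
```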

## Validation-set approach

### Illustration: linear vs. polynomial

The validation-set split is repeated 10 times, each time using a different random division of the observations into training and validation halves.

Conclusion: the quadratic form consistently yields the lowest validation MSE, making it the best choice.
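A minimal sketch of the repeated validation-set procedure, using synthetic data with a quadratic signal as a stand-in for the slide's example (the data-generating process, degrees, and split size are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a quadratic signal (assumed stand-in for the example)
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - 1.5 * x ** 2 + rng.normal(scale=0.5, size=n)

def validation_mse(degree, seed):
    """One random 50/50 split; fit on the training half, score on the other."""
    idx = np.random.default_rng(seed).permutation(n)
    train, val = idx[: n // 2], idx[n // 2:]
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[val])
    return np.mean((y[val] - pred) ** 2)

# Repeat the validation split 10 times for each candidate polynomial degree
avg_mse = {d: np.mean([validation_mse(d, s) for s in range(10)])
           for d in (1, 2, 3)}
print(avg_mse)  # the quadratic (degree 2) should clearly beat the linear fit
```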

### Leave-One-Out Cross-Validation (LOOCV)

A set of $$n$$ data points is repeatedly split into a training set containing all but one observation, and a validation set that contains only that observation. The test error is then estimated by averaging the $$n$$ resulting $$MSE$$’s. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.
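The LOOCV loop described above can be sketched as follows, assuming a simple linear fit on synthetic data (both choices are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=20)
n = len(x)

errors = []
for i in range(n):
    mask = np.arange(n) != i                   # train on all but observation i
    coefs = np.polyfit(x[mask], y[mask], 1)    # simple linear fit
    pred = np.polyval(coefs, x[i])             # predict the held-out point
    errors.append((y[i] - pred) ** 2)

cv_loocv = np.mean(errors)   # average of the n resulting MSEs
print(cv_loocv)
```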

## K-fold Cross-validation

A set of $$n$$ observations is randomly split into $$K$$ non-overlapping groups (here $$K=5$$). Each of these fifths acts in turn as a validation set, with the remaining four fifths as the training set. The test error is estimated by averaging the five resulting $$MSE$$ estimates.
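The same scheme with $$K=5$$ can be sketched on synthetic data (the linear model and data-generating process are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 5
x = rng.uniform(0, 10, n)
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

# Randomly assign each observation to one of K non-overlapping folds
folds = rng.permutation(np.repeat(np.arange(K), n // K))

fold_mses = []
for k in range(K):
    train, val = folds != k, folds == k        # fold k validates, the rest train
    coefs = np.polyfit(x[train], y[train], 1)
    pred = np.polyval(coefs, x[val])
    fold_mses.append(np.mean((y[val] - pred) ** 2))

cv_k = np.mean(fold_mses)   # average of the K fold-level MSE estimates
print(cv_k)
```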

## K-fold Cross-validation

LOOCV is sometimes useful, but it typically doesn't shake up the data enough: the $$n$$ training sets are nearly identical, so the estimates from each fold are highly correlated, and the average of highly correlated quantities can have high variance.

## Bootstrap

The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

## Bootstrap

Each bootstrap data set contains $$n$$ observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of $$\alpha$$.
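A hedged sketch of this procedure for the portfolio example, using the minimum-variance allocation formula for $$\alpha$$ from ISLR Chapter 5; the simulated returns and their covariance values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical returns for two assets X and Y (assumed covariance structure)
returns = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=100)

def alpha(data):
    """Minimum-variance fraction invested in X (ISLR portfolio example):
    alpha = (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 cov(X, Y))."""
    cov = np.cov(data[:, 0], data[:, 1])
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

B = 1000
boot = np.empty(B)
for b in range(B):
    # Each bootstrap data set: n rows drawn with replacement from the original
    idx = rng.integers(0, len(returns), len(returns))
    boot[b] = alpha(returns[idx])

se_alpha = boot.std(ddof=1)   # bootstrap standard error of alpha-hat
print(se_alpha)
```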

## Bootstrap vs. Jackknife

### Jackknife: resampling by leaving out observations

Bootstrap (with replacement): [1,2,3,4,5] → [2,5,4,4,1], [1,3,2,5,5], ...

Jackknife (observations left out): [1,2,3,4,5] → [2,5,4,1], [1,2,5], ...
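The two resampling schemes above can be reproduced in a few lines (the bootstrap draw is random, so its exact values vary from run to run):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
rng = np.random.default_rng(4)

# Bootstrap: same size n, drawn WITH replacement (values may repeat)
boot_sample = rng.choice(data, size=len(data), replace=True)

# Jackknife (leave-one-out): n samples, each omitting one observation
jack_samples = [np.delete(data, i) for i in range(len(data))]

print(boot_sample)      # duplicates allowed
print(jack_samples[0])  # [2 3 4 5] -- observation 1 left out
```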

## Illustration: Portfolio example

- Left: Histogram of the estimates of $$\alpha$$ obtained by generating 1,000 simulated data sets from the true population.
- Center: Histogram of the estimates of $$\alpha$$ obtained from 1,000 bootstrap samples from a single data set.
- Right: Boxplots of the estimates of $$\alpha$$ displayed in the left and center panels. In each panel, the pink line indicates the true value of $$\alpha$$.