Social and Political Data Science: Introduction

### Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

# Non-linear Models

## Polynomial Regression

• Create new variables $$X_1 = X, X_2 = X^2$$, etc., and then treat as multiple linear regression.
• We are not really interested in the coefficients; we are more interested in the fitted function values at any value $$x_0$$:

• Since $$\hat{f}(x_0)$$ is a linear function of the $$\hat{\beta}_\ell$$, we can get a simple expression for the pointwise variance $$Var[\hat{f}(x_0)]$$ at any value $$x_0$$. In the figure we have computed the fit and pointwise standard errors on a grid of values for $$x_0$$, showing $$\hat{f}(x_0) \pm 2 \cdot se[\hat{f}(x_0)]$$.
• We either fix the degree $$d$$ at some reasonably low value, or use cross-validation to choose $$d$$.

$$\hat{f}(x_0) = \hat{\beta_0}+\hat{\beta_1}x_0+\hat{\beta_2}x_0^2+\hat{\beta_3}x_0^3+\hat{\beta_4}x_0^4$$
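A minimal sketch of this computation (using simulated data in place of the Wage data from the figure, since no dataset is loaded here): fit the degree-4 polynomial, then form the pointwise $$\pm 2 \cdot se$$ bands on a grid.

```r
# Sketch: degree-4 polynomial regression with pointwise +/- 2*SE bands.
# Simulated x and y stand in for the age and wage variables.
set.seed(1)
x <- runif(200, 20, 80)
y <- 50 + 0.8 * x - 0.008 * x^2 + rnorm(200, sd = 5)

fit <- lm(y ~ poly(x, 4))                      # orthogonal polynomial basis
grid <- seq(min(x), max(x), length.out = 100)  # grid of x0 values
pred <- predict(fit, newdata = list(x = grid), se.fit = TRUE)
upper <- pred$fit + 2 * pred$se.fit            # fhat(x0) + 2*se
lower <- pred$fit - 2 * pred$se.fit            # fhat(x0) - 2*se
```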

## Polynomial Regression

• Logistic regression follows naturally. For example, in the figure we model:

$$\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0+\beta_1 x_i+\beta_2 x_i^2+\cdots+\beta_d x_i^d)}{1+\exp(\beta_0+\beta_1 x_i+\beta_2 x_i^2+\cdots+\beta_d x_i^d)}$$

• To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert them to the probability scale.
• Can do separately on several variables: just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).
• Caveat: polynomials have notorious tail behavior, which is very bad for extrapolation.
• Can fit using $$y \sim poly(x, degree = 3)$$ in a formula.
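A sketch of the logistic version on simulated binary data: the confidence bounds are computed on the logit scale and inverted with $$plogis()$$, so they always land inside $$(0, 1)$$.

```r
# Sketch: degree-3 logistic polynomial; CIs computed on the logit scale,
# then inverted to the probability scale. Simulated binary outcome.
set.seed(1)
x <- runif(500, 20, 80)
y <- rbinom(500, 1, plogis(-6 + 0.1 * x))

fit <- glm(y ~ poly(x, 3), family = binomial)
grid <- seq(20, 80, length.out = 50)
pr <- predict(fit, newdata = list(x = grid), se.fit = TRUE)  # logit scale
lo_logit <- pr$fit - 2 * pr$se.fit
hi_logit <- pr$fit + 2 * pr$se.fit
lo_prob <- plogis(lo_logit)   # invert: guaranteed to lie in (0, 1)
hi_prob <- plogis(hi_logit)
```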

## Step functions

Cut the range of $$X$$ into bins and fit a constant in each bin, i.e. regress on the indicator (dummy) variables:

$$C_1(X) = I(X < 35),\ C_2(X) = I(35 \le X < 50),\ \ldots,\ C_K(X) = I(X \ge 65)$$
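In R, $$cut()$$ builds these indicator bins directly; a small sketch on simulated data, with the cutpoints 35/50/65 taken from the slide:

```r
# Sketch: step functions. cut() creates a factor whose dummy variables
# are exactly the indicators C_k(X); lm() then fits a constant per bin.
set.seed(1)
x <- runif(200, 20, 80)
y <- 10 + 5 * (x >= 35) + 3 * (x >= 50) + rnorm(200)

bins <- cut(x, breaks = c(-Inf, 35, 50, 65, Inf), right = FALSE)
fit <- lm(y ~ bins)   # piecewise-constant fit: one mean per bin
nlevels(bins)         # 4 bins from 3 cutpoints
```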

## Piecewise Polynomials

### Better to add constraints to the polynomials, e.g. continuity.

Piecewise polynomials are smooth and local, but without constraints they need not be continuous at the knots.

Figure panels (knot at age = 50):

• Top Left: the cubic polynomials, unconstrained.
• Top Right: the cubic polynomials constrained to be continuous at age = 50.
• Bottom Left: cubic polynomials constrained to be continuous, and to have continuous first and second derivatives.
• Bottom Right: a linear spline, constrained to be continuous.

## Linear Splines

### A linear spline with knots at $$\xi_k, k = 1,...,K$$ is a piecewise linear polynomial continuous at each knot.

$$y_{i}=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+ ... +\beta_{K+1}b_{K+1}(x_i) + \epsilon_i$$

### Here $$b_1(x_i) = x_i$$ and $$b_{k+1}(x_i) = (x_i - \xi_k)_+,\ k = 1,\ldots,K$$, where $$()_+$$ means positive part; i.e.

$$(x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k\\ 0 & \text{otherwise.}\end{cases}$$

Each truncated function starts at 0 at its knot, which enforces continuity.
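The truncated-power construction can be sketched by hand (simulated data; the knots are chosen arbitrarily for illustration):

```r
# Sketch: build the linear-spline basis b_1(x) = x, b_{k+1}(x) = (x - xi_k)_+
# by hand and fit it with lm(). The fit is continuous at each knot.
pos <- function(z) pmax(z, 0)          # the ()_+ positive-part operator
set.seed(1)
x <- runif(200, 0, 100)
y <- 2 + 0.5 * x - 0.8 * pos(x - 50) + rnorm(200)

knots <- c(35, 50, 65)
B <- cbind(x, sapply(knots, function(k) pos(x - k)))  # columns b_1..b_{K+1}
fit <- lm(y ~ B)                       # K + 2 coefficients incl. intercept
```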


## Cubic Splines

### A cubic spline with knots at  $$\xi_k, k = 1,...,K$$ is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

$$y_{i}=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+ ... +\beta_{K+3}b_{K+3}(x_i) + \epsilon_i$$

Here $$b_1(x_i) = x_i$$, $$b_2(x_i) = x_i^2$$, $$b_3(x_i) = x_i^3$$, and $$b_{k+3}(x_i) = (x_i - \xi_k)^3_+,\ k = 1,\ldots,K$$, where $$(x_i - \xi_k)^3_+$$ is the truncated power function.


A natural cubic spline extrapolates linearly beyond the boundary knots. This adds 4 = 2 × 2 extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

## Natural Cubic Splines

Natural cubic spline is better!

Adding a truncated power term $$(x - \xi)^3_+$$ to the cubic polynomial leads to a discontinuity only in the third derivative at $$\xi$$; the function remains continuous, with continuous first and second derivatives, at each of the knots.

## Cubic Splines


Fitting splines in R is easy: $$bs(x, ...)$$ for splines of any degree, and $$ns(x, ...)$$ for natural cubic splines, both in package $$splines$$.
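A short sketch of both calls on simulated data, reusing the knots from the earlier slides:

```r
# Sketch: regression splines with the splines package (ships with base R).
library(splines)
set.seed(1)
x <- runif(200, 20, 80)
y <- sin(x / 10) + rnorm(200, sd = 0.3)

fit_bs <- lm(y ~ bs(x, knots = c(35, 50, 65)))  # cubic spline, 3 knots
fit_ns <- lm(y ~ ns(x, df = 4))                 # natural cubic spline
```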

## Smoothing splines

### Consider this criterion for fitting a smooth function $$g(x)$$ to some data:

$$\underset{g \in \mathcal{S}}{\text{minimize}}\ \sum_{i=1}^n(y_i-g(x_i))^2+\lambda\int g''(t)^2\,dt$$

•  The first term is RSS, and tries to make $$g(x)$$ match the data at each $$x_i$$.
• The second term is a roughness penalty and controls how wiggly $$g(x)$$ is. It is modulated by the tuning parameter $$\lambda \geq 0$$.
• The smaller $$\lambda$$, the more wiggly the function, eventually interpolating $$y_i$$ when $$\lambda = 0$$.
• As $$\lambda \rightarrow \infty$$, the function $$g(x)$$ becomes linear.

## Smoothing splines

The solution is a natural cubic spline, with a knot at every unique value of $$x_i$$. The roughness penalty still controls the roughness via $$\lambda$$.

Some details

• Smoothing splines avoid the knot-selection issue, leaving a single $$\lambda$$ to be chosen.
• In $$R$$, the function $$smooth.spline()$$ will fit a smoothing spline.
• The vector of $$n$$ fitted values can be written as $$\hat{g}_{\lambda} = \text S_{\lambda}\text y$$, where $$\text S_{\lambda}$$ is an $$n \times n$$ matrix (determined by the $$x_i$$ and $$\lambda$$).
• The effective degrees of freedom are given by:
• $$df_{\lambda} = \sum^{n}_{i=1} \{\text S_{\lambda}\}_{ii}$$

## Smoothing splines

We can specify the effective degrees of freedom $$df$$ rather than $$\lambda$$!
In $$R$$: $$smooth.spline(age, wage, df = 10)$$

The leave-one-out (LOO) cross-validated error is given by:

$$\text {RSS}_{cv}(\lambda) = \sum^{n}_{i=1}(y_i-\hat{g}_{\lambda}^{(-i)}(x_i))^2 = \sum^{n}_{i=1}\Bigg[\frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{\text S_{\lambda}\}_{ii}}\Bigg]^2$$

This is probably the most difficult equation to understand and type in $$LaTeX$$!
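A runnable sketch of both options (simulated data in place of age and wage):

```r
# Sketch: smoothing splines via smooth.spline(). Either fix the effective
# df directly or let leave-one-out cross-validation choose lambda.
set.seed(1)
x <- runif(200, 20, 80)
y <- sin(x / 10) + rnorm(200, sd = 0.3)

fit_df <- smooth.spline(x, y, df = 10)    # specify effective df
fit_cv <- smooth.spline(x, y, cv = TRUE)  # lambda chosen by LOO CV
fit_cv$df                                 # effective df selected by CV
```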

## Generalized Additive Models (GAMs)

### Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models.

• Can fit a GAM simply using, e.g. natural splines:

• Coefficients are not that interesting; the fitted functions are. The previous plot was produced using $$plot.gam()$$.

• Can mix terms — some linear, some nonlinear — and use $$anova()$$ to compare models.

• Can use smoothing splines or local regression as well:

• GAMs are additive, although low-order interactions can be included in a natural way using, e.g. bivariate smoothers or interactions of the form $$ns(age,df=5):ns(year,df=5)$$.

lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education)
gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education)

$$\log\Bigg(\frac{p(X)}{1-p(X)}\Bigg) = \beta_0+f_1(X_1)+f_2(X_2)+\cdots+f_p(X_p)$$

gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial)
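A self-contained sketch of the additive idea using only natural splines and $$lm()$$, so no extra packages beyond base R are needed; simulated variables stand in for the Wage data:

```r
# Sketch: an additive model fit with natural splines in lm(), mirroring
# the lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + ...) call above.
library(splines)
set.seed(1)
n <- 500
year <- runif(n, 2003, 2009)
age <- runif(n, 20, 80)
wage <- 50 + 2 * (year - 2003) + 10 * sin(age / 10) + rnorm(n, sd = 5)

fit <- lm(wage ~ ns(year, df = 5) + ns(age, df = 5))
length(coef(fit))   # 1 intercept + 5 + 5 basis coefficients
```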