Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Non-linear Models 

Linear models:

  • Simple

  • Interpretable

  • Great for prediction

  • Easy to fit

Non-Linear models:

  • Simple?

  • Interpretable?

  • Great for prediction?

  • Easy to fit?

Not necessarily!

  • The truth is never linear! (Or almost never.)

  • Yet often the linearity assumption is good enough.

  • What is linearity?

 

Non-Linear models


  • Polynomials

  • Step functions

  • Splines

  • Local regression

  • Generalized additive models (GAMs)

  • These can be as simple and interpretable as linear models.

  • Other methods: isotonic regression

Polynomial Regression

$$y_{i}=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3 + ... +\beta_dx_i^d + \epsilon_i$$

Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, \(X, X^2,\) and \(X^3\), as predictors. This approach provides a simple way to produce a non-linear fit to the data.


Polynomial Regression

  • Create new variables \(X_1 = X, X_2 = X^2\), etc., and then treat as multiple linear regression.
  • We are not really interested in the coefficients; we are more interested in the fitted function values at any value \(x_0\).
  • Since \(\hat{f}(x_0)\) is a linear function of the \(\hat{\beta}_{\ell}\), we can get a simple expression for the pointwise variance \(\mathrm{Var}[\hat{f}(x_0)]\) at any value \(x_0\). In the figure the fit and pointwise standard errors were computed on a grid of values for \(x_0\), showing \(\hat{f}(x_0) \pm 2 \cdot \mathrm{se}[\hat{f}(x_0)]\) (see the R sketch after the equation below).
  • We either fix the degree \(d\) at some reasonably low value, or use cross-validation to choose \(d\).

$$ \hat{f}(x_0) = \hat{\beta_0}+\hat{\beta_1}x_0+\hat{\beta_2}x_0^2+\hat{\beta_3}x_0^3+\hat{\beta_4}x_0^4 $$
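A minimal R sketch of such a fit, assuming the Wage data from the ISLR package (age as predictor, wage as response, as used later in these slides):

# Hypothetical illustration: degree-4 polynomial fit with pointwise standard errors
library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)

age.grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit, newdata = data.frame(age = age.grid), se.fit = TRUE)

# fhat(x0) +/- 2 * se[fhat(x0)]
bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)
plot(Wage$age, Wage$wage, col = "darkgrey")
lines(age.grid, pred$fit, lwd = 2, col = "blue")
matlines(age.grid, bands, lty = 2, col = "blue")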

Polynomial Regression

  • Logistic regression follows naturally. For example, in the figure we model

$$ \Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0+\beta_1x_i+\beta_2x_i^2+\dots+\beta_dx_i^d)}{1+\exp(\beta_0+\beta_1x_i+\beta_2x_i^2+\dots+\beta_dx_i^d)} $$

  • To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert them to the probability scale (see the R sketch below).
  • Can do this separately on several variables: just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).
  • Caveat: polynomials have notorious tail behavior, which is very bad for extrapolation.
  • Can fit using \(y ∼ poly(x, degree = 3)\) in the formula.

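A minimal R sketch of the logistic polynomial fit above, assuming the ISLR Wage data:

# Hypothetical illustration: model Pr(wage > 250 | age) with a degree-4 polynomial
library(ISLR)
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)

age.grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit, newdata = data.frame(age = age.grid), se.fit = TRUE)  # logit scale

# Upper and lower bounds on the logit scale, then invert to the probability scale
logit.bands <- cbind(pred$fit - 2 * pred$se.fit, pred$fit + 2 * pred$se.fit)
prob.fit   <- exp(pred$fit) / (1 + exp(pred$fit))
prob.bands <- exp(logit.bands) / (1 + exp(logit.bands))

plot(age.grid, prob.fit, type = "l", ylim = c(0, 0.2))
matlines(age.grid, prob.bands, lty = 2)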

Step functions

Step functions cut the range of a variable into \(K\) distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

$$y_{i}=\beta_0+\beta_1C_1(x_i)+\beta_2C_2(x_i)+ \cdots +\beta_KC_K(x_i) + \epsilon_i$$

Step functions

\(C_1(X) = I(X < 35),\; C_2(X) = I(35 \le X < 50),\; \ldots,\; C_K(X) = I(X \ge 65)\)

Step functions

  • Easy to work with. Creates a series of dummy variables representing each group.

  • Useful way of creating interactions that are easy to interpret. For example, an interaction of Year and Age,

  • $$ I(\text{Year} < 2005) \cdot \text{Age}, \quad I(\text{Year} \ge 2005) \cdot \text{Age} $$ would allow for a different linear function of Age in each period (an R sketch follows this list).

  • Choice of cutpoints or knots can be problematic.
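A sketch in R, assuming the ISLR Wage data (the cutpoints are an illustrative choice):

# Hypothetical illustration: piecewise constant fit via cut()
library(ISLR)

# cut() bins age; lm() then creates one dummy variable per bin
fit.step <- lm(wage ~ cut(age, breaks = c(-Inf, 35, 50, 65, Inf)), data = Wage)
summary(fit.step)

# An interpretable interaction: a different Age slope before and after 2005
fit.int <- lm(wage ~ I(year < 2005) * age, data = Wage)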

Regression splines

Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of \(X\) into \(K\) distinct regions.

  • Within each region, a polynomial function is fit to the data. However, these polynomials are constrained so that they join smoothly at the region boundaries, or knots. Provided that the interval is divided into enough regions, this can produce an extremely flexible fit.

Piecewise Polynomials

Piecewise polynomial regression fits separate low-degree polynomials over different regions of \(X\). For example, a piecewise cubic uses

$$y_{i}=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3 +  \epsilon_i,$$

where the coefficients \(\beta_0, \beta_1, \beta_2,\) and \(\beta_3\) differ in different parts of the range of \(X\). The points where the coefficients change are called knots.

Piecewise Polynomials

Better to add constraints to the polynomials, e.g. continuity.

Figure: piecewise cubic polynomials fit with a knot at age = 50. Top Left: the cubic polynomials are unconstrained. Top Right: the cubic polynomials are constrained to be continuous at age = 50. Bottom Left: the cubic polynomials are constrained to be continuous, and to have continuous first and second derivatives. Bottom Right: a linear spline, constrained to be continuous.

Linear Splines

A linear spline with knots at \(\xi_k, k = 1,...,K\) is a piecewise linear polynomial continuous at each knot.

$$y_{i}=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+ ... +\beta_{K+1}b_{K+1}(x_i) + \epsilon_i$$

where the \(b_k\) are basis functions; for a linear spline one can take

$$ b_1(x_i) = x_i, \qquad b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \ldots, K. $$

Here the \(()_+\) means positive part; i.e.,

$$ (x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise.} \end{cases} $$

Each truncated basis function starts at 0 at its knot, which is what makes the fit continuous there.
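A minimal R sketch of a linear spline with a single knot, built directly from this truncated basis (assuming the ISLR Wage data; the knot at age = 50 echoes the earlier figure):

# Hypothetical illustration: linear spline with one knot at age = 50
library(ISLR)
library(splines)

pos <- function(z) pmax(z, 0)                      # the positive part ()_+

# Truncated-basis construction by hand
fit.hand <- lm(wage ~ age + pos(age - 50), data = Wage)

# The same model via the splines package (degree-1 spline basis)
fit.bs <- lm(wage ~ bs(age, knots = 50, degree = 1), data = Wage)

The two fits give identical fitted values; only the parameterization of the basis differs.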

Cubic Splines

A cubic spline with knots at  \(\xi_k, k = 1,...,K\) is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

$$y_{i}=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+ ... +\beta_{K+3}b_{K+3}(x_i) + \epsilon_i$$

One convenient basis is the truncated power basis: \(b_1(x) = x,\ b_2(x) = x^2,\ b_3(x) = x^3,\) and \(b_{k+3}(x) = (x - \xi_k)^3_+\) for \(k = 1, \ldots, K\).


A natural cubic spline extrapolates linearly beyond the boundary knots. This adds 4 = 2 × 2 extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

Natural Cubic Splines

Natural cubic spline is better!

Adding a truncated power term \((x - \xi_k)^3_+\) leads to a discontinuity only in the third derivative at \(\xi_k\); the function remains continuous, with continuous first and second derivatives, at each of the knots.

Cubic Splines


Fitting splines in R is easy: \(bs(x, ...)\) for splines of any degree, and \(ns(x, ...)\) for natural cubic splines, both in the package \(splines\).
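For example, a sketch assuming the ISLR Wage data (the interior knots are an illustrative choice):

# Hypothetical illustration with the splines package
library(ISLR)
library(splines)

# Cubic spline with interior knots at ages 25, 40, and 60
fit.bs <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

# Natural cubic spline; df places the knots at quantiles of age
fit.ns <- lm(wage ~ ns(age, df = 4), data = Wage)

age.grid <- seq(min(Wage$age), max(Wage$age))
pred <- predict(fit.ns, newdata = data.frame(age = age.grid), se.fit = TRUE)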

Cubic Splines

Knot placement

One strategy is to decide \(K\), the number of knots, and then place them at appropriate quantiles of the observed \(X\).

  • A cubic spline with \(K\) knots has \(K + 4\) parameters or degrees of freedom.

  • A natural spline with \(K\) knots has \(K\) degrees of freedom.

Knot placement

Comparison of a degree-14 polynomial and a natural cubic spline, each with 15 df.

In R:
ns(age, df = 14)
poly(age, degree = 14)
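A sketch of that comparison, assuming the ISLR Wage data:

# Hypothetical illustration: high-degree polynomial vs. natural spline
library(ISLR)
library(splines)

fit.poly <- lm(wage ~ poly(age, degree = 14), data = Wage)
fit.ns   <- lm(wage ~ ns(age, df = 14), data = Wage)

age.grid <- seq(min(Wage$age), max(Wage$age))
plot(Wage$age, Wage$wage, col = "darkgrey")
lines(age.grid, predict(fit.poly, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit.ns, data.frame(age = age.grid)), col = "blue", lwd = 2)
# The degree-14 polynomial is erratic near the boundaries; the natural spline is not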

Smoothing splines

Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

Smoothing splines

Consider this criterion for fitting a smooth function \(g(x)\) to some data:

$$ \underset{g \in \mathcal{S}}{\text{minimize}} \; \sum_{i=1}^n(y_i-g(x_i))^2+\lambda\int g''(t)^2\,dt $$

  • The first term is RSS, and tries to make \(g(x)\) match the data at each \(x_i\).
  • The second term is a roughness penalty and controls how wiggly \(g(x)\) is. It is modulated by the tuning parameter \(\lambda \geq 0\).
  • The smaller \(\lambda\), the more wiggly the function, eventually interpolating the \(y_i\) when \(\lambda = 0\).
  • As \(\lambda \rightarrow \infty\), the function \(g(x)\) becomes linear.

Smoothing splines

The solution is a natural cubic spline, with a knot at every unique value of \(x_i\). The penalty still controls the level of smoothness via \(\lambda\).

Some details

  • Smoothing splines avoid the knot-selection issue, leaving a single \(\lambda\) to be chosen.
  • In R, the function \(smooth.spline()\) will fit a smoothing spline.
  • The vector of \(n\) fitted values can be written as \(\hat{\mathbf{g}}_{\lambda} = \mathbf{S}_{\lambda}\mathbf{y}\), where \(\mathbf{S}_{\lambda}\) is an \(n \times n\) matrix (determined by the \(x_i\) and \(\lambda\)).
  • The effective degrees of freedom are given by
  • $$ df_{\lambda} = \sum^{n}_{i=1} \{\mathbf{S}_{\lambda}\}_{ii}. $$

Smoothing splines

We can specify the degrees of freedom \(df\) rather than \(\lambda\)!
In R: smooth.spline(age, wage, df = 10)

 

The leave-one-out (LOO) cross-validated error is given by:

 

$$ \text {RSS}_{cv}(\lambda) = \sum^{n}_{i=1}(y_i-\hat{g}_{\lambda}^{(-i)}(x_i))^2 = \sum^{n}_{i=1}\Bigg[\frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{\text S_{\lambda}\}_{ii}}\Bigg]^2$$

This is probably the most difficult equation to understand and type in LaTeX!
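A sketch in R, assuming the ISLR Wage data (smooth.spline() is in base R):

# Hypothetical illustration of smoothing splines
library(ISLR)

fit.df <- smooth.spline(Wage$age, Wage$wage, df = 10)   # fix the effective df directly
fit.cv <- smooth.spline(Wage$age, Wage$wage, cv = TRUE) # let LOO CV choose lambda
fit.cv$df                                               # effective df chosen by CV
# (with tied age values R warns that LOO CV is doubtful, but the fit still runs)

plot(Wage$age, Wage$wage, col = "darkgrey")
lines(fit.df, col = "red",  lwd = 2)
lines(fit.cv, col = "blue", lwd = 2)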

Smoothing splines

Local regression

Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.

Local regression

With a sliding weight function, we obtain separate linear fits over the range of \(X\) by weighted least squares. Use the \(loess()\) function in R.
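A sketch, assuming the ISLR Wage data:

# Hypothetical illustration: local linear regression with loess()
library(ISLR)

# span controls the fraction of the data used in each local fit
fit <- loess(wage ~ age, span = 0.5, degree = 1, data = Wage)

age.grid <- seq(min(Wage$age), max(Wage$age))
plot(Wage$age, Wage$wage, col = "darkgrey")
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)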

Generalized additive models (GAMs) accommodate different nonlinear methods to deal with multiple predictors.

Generalized additive models (GAM)


Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models.

Generalized additive models (GAM)

  • Can fit a GAM simply using, e.g., natural splines:

    lm(wage ∼ ns(year, df = 5) + ns(age, df = 5) + education)

  • Coefficients are not that interesting; the fitted functions are. The previous plot was produced using \(plot.gam\).

  • Can mix terms (some linear, some nonlinear) and use \(anova()\) to compare models.

  • Can use smoothing splines or local regression as well:

    gam(wage ∼ s(year, df = 5) + lo(age, span = .5) + education)

  • GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form \(ns(age, df = 5):ns(year, df = 5)\).
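A runnable sketch of these fits, assuming the ISLR Wage data and the gam package (the anova() comparison of a linear versus smooth year term is an illustrative choice):

# Hypothetical illustration of GAM fitting
library(ISLR)
library(splines)
library(gam)

# GAM fit by ordinary least squares with natural-spline terms
gam.ns <- lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education, data = Wage)

# GAM with a smoothing-spline term and a local-regression term
gam.sl <- gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education, data = Wage)

par(mfrow = c(1, 3))
plot(gam.sl, se = TRUE)   # one fitted function per term, with standard-error bands

# Compare a linear year term with a smooth year term
gam.lin <- gam(wage ~ year + s(age, df = 5) + education, data = Wage)
gam.sm  <- gam(wage ~ s(year, df = 4) + s(age, df = 5) + education, data = Wage)
anova(gam.lin, gam.sm, test = "F")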

Generalized additive models (GAM)

$$ \log\Bigg(\frac{p(X)}{1-p(X)}\Bigg) = \beta_0+f_1(X_1)+f_2(X_2)+\cdots+f_p(X_p)$$

gam(I(wage > 250) ∼ year + s(age, df = 5) + education, family = binomial)

Isotonic regression

Isotonic regression fits a monotone (e.g., non-decreasing) step function to the data by least squares, using the pool-adjacent-violators algorithm. It is useful when the relationship is known to be monotone but its shape is otherwise unknown.
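A minimal sketch with base R's isoreg(), assuming the ISLR Wage data:

# Hypothetical illustration: monotone (non-decreasing) fit by isotonic regression
library(ISLR)

fit <- isoreg(Wage$age, Wage$wage)   # pool-adjacent-violators fit

plot(fit)                            # data plus the monotone step fit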