Social and Political Data Science: Introduction

### Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

# Supervised Learning: Classification

## Classification

• Orange: default; blue: no default

• Overall default rate: 3%

• Individuals with higher balances tend to default more often

• Does income have any impact?

## Logistic Regression

• Linear regression vs. Logistic regression

• Using linear regression, predicted probabilities can fall outside the [0, 1] range!

## Logistic Regression

• Probability of default given balance can be written as:

Pr(default = Yes|balance).

• Predict default when p(x) > 0.5, where x is the predictor variable (e.g. balance)

• Other (e.g. lower) thresholds can also be set, such as p(x) > 0.3

## Logistic Regression

• To model p(X), we need a function that gives outputs between 0 and 1 for all values of X.

• In logistic regression, we use the logistic function:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
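The logistic function maps any real input into [0, 1], which is exactly why predictions cannot go out of bounds. A minimal Python sketch checking that boundedness (the loop values are illustrative inputs, not data):

```python
import math

def logistic(eta):
    """Numerically stable logistic function 1 / (1 + e^(-eta)).

    Splitting on the sign of eta avoids overflow in math.exp
    for large |eta|.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Unlike a linear fit, the output can never leave [0, 1]:
for eta in (-50, -5, 0, 5, 50):
    p = logistic(eta)
    assert 0.0 <= p <= 1.0
print(logistic(0))  # 0.5: the classification threshold point
```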

## Logistic Model

• Rearranging the logistic function gives

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$

• The left-hand side of this expression is called the odds

• The odds can be understood as the ratio of probabilities between the on and off cases (1, 0)

• For example, on average 1 in 5 people with an odds of 1/4 will default, since p(X) = 0.2 implies an odds of
0.2/(1-0.2) = 1/4.

## Logistic Model

• Taking the log of both sides, we have:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
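The odds and log-odds relationships can be verified numerically, using the slide's example of p(X) = 0.2:

```python
import math

p = 0.2                        # probability of default from the example
odds = p / (1 - p)             # 0.2 / 0.8 = 1/4
logit = math.log(odds)         # the log-odds
# Inverting the logit with the logistic function recovers p:
p_back = 1 / (1 + math.exp(-logit))
print(odds, round(p_back, 6))  # 0.25 0.2
```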

## Logistic Model

• The left-hand side is called the log-odds or logit, which can be estimated as a linear regression with X.

• Note, however, that p(X) is not a linear function of X.

• Logistic Regression can be estimated using Maximum Likelihood.

• We seek estimates for $$\beta_{0}$$ and $$\beta_{1}$$ such that the predicted probability $$\hat{p}(x_{i})$$ of default for each individual corresponds as closely as possible to that individual's observed default status.

• In other words, we try to find $$\beta_{0}$$ and $$\beta_{1}$$ such that plugging these estimates into the model for p(X) yields a number close to one (1) for all individuals that fulfill the on condition (e.g. default) and a number close to zero (0) for all individuals in the off condition (e.g. no default).
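The maximum-likelihood idea can be sketched numerically. The toy fit below maximizes the Bernoulli log-likelihood by plain gradient ascent (an illustration only; R's glm() actually uses iteratively reweighted least squares, and the data here are made up):

```python
import math

def fit_logistic(xs, ys, lr=0.1, iters=10000):
    """Estimate beta0, beta1 by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p        # d log-likelihood / d beta0
            g1 += (y - p) * x  # d log-likelihood / d beta1
        b0 += lr * g0 / len(xs)
        b1 += lr * g1 / len(xs)
    return b0, b1

# Overlapping classes, so the maximum-likelihood estimates stay finite:
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
# Fitted probabilities end up near 0 for low x and near 1 for high x,
# which is exactly the "close to one / close to zero" criterion above.
```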

## Logistic Regression

glm.fit = glm(default ~ balance, data = Default, family = binomial)
summary(glm.fit)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***

Suppose an individual has a balance of \$1,000; the predicted probability of default is then $$\hat{p}(1000) = \frac{e^{-10.65 + 0.0055 \times 1000}}{1 + e^{-10.65 + 0.0055 \times 1000}} \approx 0.00576$$, i.e. well under 1%.
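Plugging that balance into the fitted logistic function (a Python sketch; the coefficients are taken from the glm output above):

```python
import math

b0, b1 = -1.065e+01, 5.499e-03   # estimates from the glm output

def p_default(balance):
    """Predicted probability of default at a given balance."""
    eta = b0 + b1 * balance
    return 1 / (1 + math.exp(-eta))

print(round(p_default(1000), 5))  # ~0.006: well under 1%
print(round(p_default(2000), 3))  # ~0.586: far riskier
```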

## Logistic Regression

| Student | default = No | default = Yes | Default rate |
|---------|-------------:|--------------:|-------------:|
| No      | 6,850        | 206           | 0.029        |
| Yes     | 2,817        | 127           | 0.043        |

How about the case of students?

> table(student,default)
default
student   No  Yes
No  6850  206
Yes 2817  127

We can actually calculate by hand the rate of default among students:
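In Python, with the counts taken from the table(student, default) output above:

```python
# Counts from table(student, default)
non_students = {"No": 6850, "Yes": 206}
students     = {"No": 2817, "Yes": 127}

# Default rate = defaulters / group total
rate_non = non_students["Yes"] / sum(non_students.values())
rate_stu = students["Yes"] / sum(students.values())
print(round(rate_non, 3))  # 0.029
print(round(rate_stu, 3))  # 0.043
```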

## Logistic Regression

> glm.sfit=glm(default~student,family=binomial)
> summary(glm.sfit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
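With a single dummy predictor, the fitted model reproduces the raw group rates from the table. A Python check using the glm.sfit estimates above:

```python
import math

b0, b_student = -3.50413, 0.40489   # estimates from glm.sfit

def p_default(is_student):
    """Predicted default probability for a student or non-student."""
    eta = b0 + (b_student if is_student else 0.0)
    return 1 / (1 + math.exp(-eta))

print(round(p_default(False), 3))  # 0.029, matching 206/7,056
print(round(p_default(True), 3))   # 0.043, matching 127/2,944
```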

• We can extend logistic regression to multiple right-hand-side variables (predictors).

## Multiple Logistic Regression

> glm.nfit=glm(default~balance+income+student,data=Default,family=binomial)
> summary(glm.nfit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 

Why does the student coefficient become negative?  What does this mean?

## Multiple Logistic Regression

Confounding effect: with the other predictors in the model, the student effect reverses. For a given balance and income, a student is less likely to default than a non-student; students default more often overall only because they tend to carry higher balances.


## Bayes Theorem

Source: https://www.bayestheorem.net
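For classification with K classes, Bayes' theorem takes the form used in the next section, where $$\pi_{k} = \Pr(Y = k)$$ is the prior probability of class k and $$f_{k}(x)$$ is the density of X within class k:

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}
```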

## Discriminant Analysis

We classify a new point according to which density is highest.

The dashed line is called the Bayes decision boundary.

When the priors are different, we take them into account as well, and compare $$\pi_{k} f_{k}(x)$$. On the right, we favor the pink class: the decision boundary has shifted to the left.

## Discriminant Analysis

LDA improves prediction by modeling the (continuous) predictor x as normally distributed within each class.

(Figure: Bayes decision boundary vs. LDA decision boundary)

Why not linear regression?

## Linear Discriminant Analysis

When p = 1, the Gaussian density has the form:

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \, e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

Here $$\mu_{k}$$ is the mean, and $$\sigma_{k}^{2}$$ the variance (in class k). We will assume that all the $$\sigma_{k} = \sigma$$ are the same. Plugging this into Bayes' formula, we get a rather complex expression for $$p_{k}(x) = \Pr(Y = k \mid X = x)$$:

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu_l}{\sigma}\right)^2}}$$
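Classifying by the largest $$\pi_{k} f_{k}(x)$$ can be sketched in a few lines of Python; the means, priors, and shared $$\sigma$$ below are illustrative assumptions, not estimates from data:

```python
import math

def gaussian_density(x, mu, sigma):
    """One-dimensional Gaussian density f_k(x)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def bayes_classify(x, mus, priors, sigma):
    """Return the index k maximizing pi_k * f_k(x) (shared sigma)."""
    scores = [pi * gaussian_density(x, mu, sigma)
              for mu, pi in zip(mus, priors)]
    return scores.index(max(scores))

# Equal priors, means -1 and +1: the Bayes boundary sits at x = 0.
print(bayes_classify(-0.5, [-1.0, 1.0], [0.5, 0.5], 1.0))  # 0
print(bayes_classify(0.5,  [-1.0, 1.0], [0.5, 0.5], 1.0))  # 1
# A larger prior on class 1 shifts the boundary to the left:
print(bayes_classify(-0.2, [-1.0, 1.0], [0.1, 0.9], 1.0))  # 1
```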

## Classification methods

### Confusion matrix: Default data

(Confusion matrix comparing LDA predictions with the true default statuses; the diagonal cells are correct classifications, the off-diagonal cells are errors.)

LDA error rate: (23 + 252)/10,000 = 2.75%

Yet what matters is: of those who actually defaulted, how many were correctly predicted?

## Sensitivity and Specificity

The receiver operating characteristic (ROC) curve displays the performance of a classifier across all possible thresholds. Overall performance, summarized over all thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve hugs the top-left corner, so the larger the AUC, the better the classifier.
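Sensitivity (true positive rate), specificity (true negative rate), and the AUC can be computed from scratch. The scores below are toy values for illustration, not output from the Default data:

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity at a given classification threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn), tn / (tn + fp)

def auc(probs, labels):
    """AUC = probability a random positive scores above a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

probs = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
sens, spec = sens_spec(probs, labels, 0.5)
# Lowering the threshold trades specificity for sensitivity;
# sweeping it over all values traces out the ROC curve.
```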

### Linear Discriminant Analysis for p > 1

Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.

### Linear Discriminant Analysis for p > 1

LDA with three classes, with observations from a multivariate Gaussian distribution (p = 2) with a class-specific mean vector and a common covariance matrix. Left: Ellipses contain 95% of the probability for each of the three classes. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines.