Karl Ho
School of Economic, Political and Policy Sciences
University of Texas at Dallas
Orange: default, Blue: not
Overall default rate: 3%
Those with higher balances tend to default
Does income have any impact?
Linear regression vs. Logistic regression
Using linear regression, predictions can go out of bounds (below 0 or above 1)!
Probability of default given balance can be written as:
Pr(default = Yes|balance).
Prediction using p(x)>.5, where x is the predictor variable (e.g. balance)
Can set another, lower threshold (e.g. p(x) > .3)
To model p(X), we need a function that gives outputs between 0 and 1 for all values of X.
In logistic regression, we use the logistic function:
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\]
Rearranging, this is the same as:
\[
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}
\]
The quantity p(X)/(1 - p(X)) is called the odds, which can be understood as the ratio of probabilities between the on and off cases (1, 0).
For example, on average 1 in 5 people with an odds of 1/4 will default, since p(X) = 0.2 implies an odds of
0.2/(1-0.2) = 1/4.
Taking the log of both sides, we have:
\[
\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
\]
The left-hand side is called the log-odds or logit, which is linear in X.
Note, however, that p(X) and X are not in a linear relationship.
Logistic Regression can be estimated using Maximum Likelihood.
We seek estimates for \(\beta_{0}\) and \(\beta_{1}\) such that the predicted probability \(\hat{p}(x_{i})\) of default for each individual, computed from the logistic function above, corresponds as closely as possible to the individual's observed default status.
In other words, we try to find \(\beta_{0}\) and \(\beta_{1}\) such that plugging these estimates into the model for p(X) yields a number close to one (1) for all individuals in the on condition (e.g. default) and a number close to zero (0) for all individuals in the off condition (e.g. not default).
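Formally, the estimates are chosen to maximize the likelihood function:
\[
\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)
\]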
> glm.fit=glm(default~balance,family=binomial)
> summary(glm.fit)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
Suppose an individual has a balance of $1,000. What is the predicted probability of default?
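Plugging the estimated coefficients from the output above into the logistic function (a worked example):
\[
\hat{p}(\text{balance} = 1{,}000) = \frac{e^{-10.65 + 0.0055 \times 1000}}{1 + e^{-10.65 + 0.0055 \times 1000}} \approx 0.006,
\]
so the predicted probability of default is well under 1%. The same number can be obtained directly from the fitted model:
> predict(glm.fit, newdata=data.frame(balance=1000), type="response")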
| Student | Default: No | Default: Yes | Default rate |
|---|---|---|---|
| No | 6,850 | 206 | 0.029 |
| Yes | 2,817 | 127 | 0.043 |
How about the case of students?
> table(student,default)
default
student No Yes
No 6850 206
Yes 2817 127
We can actually calculate by hand the rate of default among students:
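Using the counts from the table above:
> 127/(2817+127)   # students
[1] 0.04313859
> 206/(6850+206)   # non-students
[1] 0.02919501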
> glm.sfit=glm(default~student,family=binomial)
> summary(glm.sfit)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
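Plugging these coefficients into the logistic function reproduces the rates computed by hand above:
\[
\hat{p}(\text{default} \mid \text{student} = \text{Yes}) = \frac{e^{-3.5041 + 0.4049}}{1 + e^{-3.5041 + 0.4049}} \approx 0.0431, \qquad
\hat{p}(\text{default} \mid \text{student} = \text{No}) = \frac{e^{-3.5041}}{1 + e^{-3.5041}} \approx 0.0292
\]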
We can extend logistic regression to multiple right-hand-side variables (multiple logistic regression).
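The log-odds then becomes a linear function of all p predictors:
\[
\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
\]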
> glm.nfit=glm(default~balance+income+student,data=Default,family=binomial)
> summary(glm.nfit)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 **
Why does the student coefficient become negative? What does this mean?
Confounding effect: when other predictors are in the model, the student effect is different. Students tend to carry higher balances, so once balance is held fixed, a student is less likely to default than a non-student with the same balance.
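A quick way to see the confounding (a minimal sketch, assuming the Default data frame from the ISLR package is loaded) is to compare balances of students and non-students; students carry higher balances, which is what drives their higher overall default rate:
> boxplot(balance ~ student, data = Default, ylab = "Credit card balance")   # students hold higher balances on average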
[Portrait of Thomas Bayes (c. 1701-1761). Source: https://www.bayestheorem.net]
We classify a new point according to which density is highest.
The dashed line is called the Bayes decision boundary.
When the priors are different, we take them into account as well and compare \(\pi_{k} f_{k}(x)\). On the right, we favor the pink class; the decision boundary has shifted to the left.
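This comparison comes from Bayes' theorem: with prior \(\pi_k\) and class density \(f_k(x)\), the posterior probability of class k is
\[
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},
\]
and we assign the observation to the class with the largest posterior.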
LDA improves prediction by modeling the continuous predictor x as normally distributed within each class.
[Figure: Bayes decision boundary vs. LDA decision boundary]
Why not linear regression?
When p = 1, the Gaussian density has the form:
\[
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^{2}}(x - \mu_k)^{2}\right)
\]
Here \(\mu_{k}\) is the mean, and \(\sigma_{k}^{2}\) the variance (in class k). We will assume that all the \(\sigma_{k} = \sigma\) are the same. Plugging this into Bayes' formula, we get a rather complex expression for \(p_{k}(x) = Pr(Y=k|X=x)\):
\[
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(x - \mu_k)^{2}\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(x - \mu_l)^{2}\right)}
\]
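Taking the log and discarding terms that do not depend on k, this is equivalent to assigning the observation to the class with the largest discriminant score:
\[
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^{2}} - \frac{\mu_k^{2}}{2\sigma^{2}} + \log(\pi_k),
\]
which is linear in x; hence linear discriminant analysis.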
[Confusion matrix for LDA predictions on the Default data: correct vs. incorrect classifications]
LDA error: (23+252)/10,000=2.75%
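A minimal sketch of fitting LDA to the Default data with the MASS package (the choice of predictors here is illustrative):
> library(MASS)
> lda.fit = lda(default ~ balance + student, data = Default)
> lda.pred = predict(lda.fit)                # returns $class and $posterior
> table(lda.pred$class, Default$default)     # confusion matrix
> mean(lda.pred$class != Default$default)    # overall error rate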
Yet what matters is: of those who actually defaulted, how many were correctly predicted?
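Continuing the sketch above, we can lower the threshold on the posterior probability of default (0.2 here is an illustrative choice) so that more of the actual defaulters are flagged, at the cost of more false positives:
> pred.20 = ifelse(lda.pred$posterior[, "Yes"] > 0.2, "Yes", "No")
> table(pred.20, Default$default)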
The receiver operating characteristic (ROC) curve displays the performance of a classifier across all possible thresholds. The overall performance, summarized over all thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve hugs the top left corner, so the larger the AUC, the better the classifier.
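One way to draw the ROC curve and compute the AUC in R is with the pROC package (a sketch, assuming pROC is installed and using the posterior probabilities from the LDA sketch above):
> library(pROC)
> roc.lda = roc(Default$default, lda.pred$posterior[, "Yes"])
> auc(roc.lda)
> plot(roc.lda)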
Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.
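For p > 1 the class densities are multivariate normal, \(X \sim N(\mu_k, \Sigma)\), with density
\[
f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)
\]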
LDA with three classes, with observations drawn from a multivariate Gaussian distribution (p = 2) with a class-specific mean vector and a common covariance matrix. Left: ellipses contain 95% of the probability for each of the three classes. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines.