### Gentle Introduction to Machine Learning

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Workshop prepared for International Society for Data Science and Analytics (ISDSA) Annual Meeting, Notre Dame University, June 2nd, 2022.

## Four important questions

### 1. Is there a relationship between the response and predictors?

1. There are a total of $$2^p$$ models that contain subsets of $$p$$ variables.
   - For example, when $$p=2$$, there are $$2^2=4$$ models:
     1. a model with no variables
     2. a model with $$X_1$$ only
     3. a model with $$X_2$$ only
     4. a model with both $$X_1$$ and $$X_2$$
2. When $$p$$ becomes large, we need to consider variable selection methods, e.g.:
   - Backward selection: start with the full model, then repeatedly remove the variable with the largest p-value (a sketch is given below).
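A minimal sketch of this backward-selection loop, assuming a hypothetical data frame `dat` with response `y` and numeric predictors (the built-in `step()` function does something similar, but drops variables by AIC rather than p-value):

```r
# Minimal sketch of backward selection by p-value.
# Assumes a hypothetical data frame `dat` with response `y` and numeric predictors.
fit <- lm(y ~ ., data = dat)
repeat {
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]  # drop the intercept row
  if (max(pvals) < 0.05) break                  # stop once every remaining p-value is small
  worst <- rownames(pvals)[which.max(pvals)]    # variable with the largest p-value
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))  # refit without it
}
summary(fit)
```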

### 2. Deciding on Important Variables

1. Residual Standard Error (RSE) and $$R^2$$, the fraction of variance explained, can be used to assess model fit. The lower the RSE, the better; the higher the $$R^2$$, the better (see the sketch below).
2. More predictors are not always better (e.g. the Advertising example).
3. The linearity assumption should be tested.
4. Graphical analysis helps identify more subtle relationships among different combinations of RHS variables, including interaction terms.
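A minimal sketch of reading these quantities off a fitted model, assuming a hypothetical Advertising-style data frame with `sales`, `TV`, and `radio`:

```r
# Sketch: extracting RSE and R^2 from a fitted linear model
# (hypothetical Advertising-style data with sales, TV, radio).
fit <- lm(sales ~ TV + radio, data = Advertising)
s <- summary(fit)
s$sigma          # Residual Standard Error (RSE): lower is better
s$r.squared      # R^2, fraction of variance explained: higher is better
s$adj.r.squared  # adjusted R^2 penalizes adding unhelpful predictors
```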

### 3. Model fit

• From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data.
• The positive residuals (those visible above the surface), tend to lie along the 45-degree line, where TV and Radio budgets are split evenly.
• The negative residuals (most not visible), tend to lie away from this line, where budgets are more lopsided.
Three kinds of uncertainty:

1. Reducible error: inaccurate coefficient estimates.
   - Confidence intervals can be used to quantify this.
2. Model bias: the linearity assumption should be tested.
3. Irreducible error: the random error $$\epsilon$$ in the model.
   - Prediction intervals account for this.
   - Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for $$f(X)$$ (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error). A sketch contrasting the two is given below.
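A sketch of obtaining both kinds of intervals for a single new observation, again assuming a hypothetical Advertising-style data frame:

```r
# Sketch: confidence vs. prediction intervals from a fitted linear model
# (hypothetical Advertising-style data).
fit <- lm(sales ~ TV + radio, data = Advertising)
new <- data.frame(TV = 100, radio = 20)
predict(fit, new, interval = "confidence")  # uncertainty about the mean response (reducible error)
predict(fit, new, interval = "prediction")  # adds the irreducible error; always wider
```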

## Training error

### Training- versus Test-Set Performance

Source: ISLR Figure 2.7, p. 25

Again, beware of the evil of overfitting!

## Flexibility vs. Interpretability

Source: ISLR Figure 2.7, p. 25

In general, as the flexibility of a method increases, its interpretability decreases.

## Mean Squared Error (MSE)

Mean Squared Error measures the extent to which the predicted response value for a given observation is close to the true response value for that observation:

$$MSE=\frac{1}{n} \sum_{i=1}^n(y_{i}-\hat{f}(x_{i}))^2$$

where $$\hat{f}(x_{i})$$ is the prediction that $$\hat{f}$$ gives for the $$i^{th}$$ observation.

The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.

## Validation-set approach

### Illustration: linear vs. polynomial

The validation split is repeated 10 times, each time with a different random division into training and validation sets.

Conclusion: the quadratic form is the best.
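A minimal sketch of this comparison, assuming the `Auto` data (mpg vs. horsepower) from the ISLR package; with different random splits, the quadratic fit is typically at or near the minimum validation MSE:

```r
# Sketch: validation-set approach comparing polynomial degrees (assumes Auto data from ISLR).
library(ISLR)
set.seed(1)
train <- sample(nrow(Auto), nrow(Auto) / 2)          # random half used for training
val.mse <- sapply(1:5, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  mean((Auto$mpg - predict(fit, Auto))[-train]^2)    # MSE on the held-out half
})
val.mse   # degree 2 (quadratic) is typically at or near the minimum
```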

### Leave-One-Out Cross-Validation (LOOCV)

A set of $$n$$ data points is repeatedly split into a training set containing all but one observation, and a validation set that contains only that observation. The test error is then estimated by averaging the $$n$$ resulting $$MSE$$’s. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.
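A sketch using `cv.glm()` from the boot package, assuming the `Auto` data from ISLR (with no `K` argument, `cv.glm()` performs LOOCV):

```r
# Sketch: LOOCV with cv.glm from the boot package (assumes the Auto data from ISLR).
library(ISLR); library(boot)
fit.auto <- glm(mpg ~ horsepower, data = Auto)  # glm with the default gaussian family = a linear fit
cv.err <- cv.glm(Auto, fit.auto)                # omitting K gives K = n, i.e. LOOCV
cv.err$delta[1]                                 # LOOCV estimate of the test MSE
```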

## K-fold Cross-validation

A set of $$n$$ observations is randomly split into $$K$$ non-overlapping groups; here $$K=5$$, so each fifth in turn acts as a validation set, with the remainder as the training set. The test error is estimated by averaging the five resulting $$MSE$$ estimates.
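The same `cv.glm()` function performs K-fold cross-validation when `K` is supplied; a sketch with five folds, again assuming the `Auto` data:

```r
# Sketch: 5-fold cross-validation for polynomial degrees 1-5 (assumes Auto data, boot package).
library(ISLR); library(boot)
set.seed(1)
cv5 <- sapply(1:5, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.glm(Auto, fit, K = 5)$delta[1]   # 5-fold CV estimate of the test MSE for degree d
})
cv5
```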

## K-fold Cross-validation

LOOCV is sometimes useful, but it typically doesn't shake up the data enough. The estimates from each fold are highly correlated, and hence their average can have high variance.

## Bootstrap

The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.

## Bootstrap

Each bootstrap data set contains $$n$$ observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of $$\alpha$$.

## Bootstrap vs. Jackknife

### Jackknife: resampling by leaving out observation

Bootstrap (sampling with replacement): [1,2,3,4,5] → [2,5,4,4,1], [1,3,2,5,5], ...
Jackknife (leaving one observation out): [1,2,3,4,5] → [2,3,4,5], [1,3,4,5], ...

## Illustration: Portfolio example

Left: Histogram of the estimates of $$\alpha$$ obtained by generating 1,000 simulated data sets from the true population. Center: A histogram of the estimates of $$\alpha$$ obtained from 1,000 bootstrap samples from a single data set. Right: Boxplots of estimates of $$\alpha$$ displayed in the left and center panels. In each panel, the pink line indicates the true value of $$\alpha$$.
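A sketch of this bootstrap, assuming the `Portfolio` data from the ISLR package and the usual minimum-variance expression for $$\alpha$$ (an assumption, since the formula is not shown above):

```r
# Sketch: bootstrap estimate of alpha (assumes the Portfolio data from ISLR and the boot package).
library(ISLR); library(boot)
alpha.fn <- function(data, index) {
  X <- data$X[index]; Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))  # minimum-variance allocation
}
set.seed(1)
boot(Portfolio, alpha.fn, R = 1000)   # 1,000 bootstrap replicates: alpha-hat and its standard error
```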

## Classification

• Dealing with qualitative response variables, such as groups, categories, or outcomes like survival vs. non-survival.
• Used in recognition methods and AI applications.

## Classification

• Orange: default, Blue: not

• Overall default rate: 3%

• Customers with higher balances tend to default.

• Does income have any impact?

## Logistic Regression

• Linear regression vs. Logistic regression

• Using linear regression, predictions can go out of bounds (below 0 or above 1)!

## Logistic Regression

• Probability of default given balance can be written as:

$$Pr(default = Yes | balance)$$.

• Prediction using $$p(X)>0.5$$, where $$X$$ is the predictor variable (e.g. balance).

• A different (e.g. lower) threshold can also be set, such as $$p(X)>0.3$$.

## Logistic Regression

• To model $$p(X)$$, we need a function that gives outputs between 0 and 1 for all values of $$X$$.

• In logistic regression, we use the logistic function:

## Logistic Model

$$p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$

• $$e \approx 2.71828$$ is a mathematical constant (Euler's number). No matter what values $$\beta_0$$, $$\beta_1$$ or $$X$$ take, $$p(X)$$ will have values between 0 and 1.

• The numerator, $$e^{\beta_0+\beta_1X}$$, is called the odds, which is the same as:

$$\frac{p(X)}{1-p(X)}$$

• The odds can be understood as the ratio of probabilities between the on and off cases (1, 0).

• For example, on average 1 in 5 people with an odds of 1/4 will default, since $$p(X) = 0.2$$ implies an odds of $$0.2/(1-0.2) = 1/4$$.

## Logistic Model

$$\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X}$$

• Taking the log of both sides, we have:

$$\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$$

## Logistic Model

• The left-hand side is called the log-odds or logit, which is linear in $$X$$ and can be estimated as a linear regression.

• Note, however, that $$p(X)$$ and $$X$$ are not in a linear relationship.

• Logistic Regression can be estimated using Maximum Likelihood.

• We seek estimates for $$\beta_{0}$$ and $$\beta_{1}$$ such that the predicted probability $$\hat{p}(x_{i})$$ of default for each individual, using $$p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$, corresponds as closely as possible to the individual's observed default status.

• In other words, we try to find $$\beta_{0}$$ and $$\beta_{1}$$ such that plugging these estimates into the model for $$p(X)$$ yields a number close to one (1) for all individuals in the on condition (e.g. default) and a number close to zero (0) for all individuals in the off condition (e.g. no default).

## Logistic Regression

• Maximum likelihood is used to estimate the parameters: $$\ell(\beta_0, \beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))$$

• This likelihood gives the probability of the observed zeros and ones in the data. We pick $$\beta_0$$ and $$\beta_1$$ to maximize the likelihood of the observed data.

• Most statistical packages can fit linear logistic regression models by maximum likelihood. In R, we use the glm function.

## Logistic Regression

glm.fit=glm(default~balance, data=Default, family=binomial)
summary(glm.fit)

Output:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***

Suppose an individual has a balance of \$1,000; what is the predicted probability of default?
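A sketch of computing that probability from the fitted model:

```r
# Sketch: predicted default probability at balance = $1,000 (assumes glm.fit from above).
predict(glm.fit, newdata = data.frame(balance = 1000), type = "response")
# Equivalently, by hand from the coefficients:
plogis(-10.65 + 0.005499 * 1000)   # approximately 0.006
```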

## Logistic Regression

How about the case of students?

> table(student, default)
        default
student    No  Yes
    No   6850  206
    Yes  2817  127

We can actually calculate by hand the rate of default among students and non-students:

| Student | Default: No | Default: Yes | Default rate |
|---------|-------------|--------------|--------------|
| No | 6,850 | 206 | 206/7,056 = 0.029 |
| Yes | 2,817 | 127 | 127/2,944 = 0.043 |
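The same calculation in R, assuming `student` and `default` are available as above (e.g. via `attach(Default)`):

```r
# Sketch: default rates by student status (row proportions of the table above).
prop.table(table(student, default), margin = 1)
# Yes row: 127 / (2817 + 127) = 0.043
# No row:  206 / (6850 + 206) = 0.029
```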

## Logistic Regression

glm.sfit=glm(default~student, data=Default, family=binomial)
summary(glm.sfit)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413    0.07071   -49.55  < 2e-16 ***
studentYes   0.40489    0.11502     3.52 0.000431 ***

• We can extend the logistic regression to multiple right-hand-side variables.
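Before moving on, converting the student-only coefficients back to probabilities reproduces the rates calculated by hand above (a sketch):

```r
# Sketch: predicted default probabilities from the student-only model.
plogis(-3.50413 + 0.40489)   # students:     about 0.043
plogis(-3.50413)             # non-students: about 0.029
```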

## Multiple Logistic Regression

> glm.nfit=glm(default~balance+income+student, data=Default, family=binomial)
> summary(glm.nfit)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 **

Why does the coefficient for student become negative? What does this mean?

## Multiple Logistic Regression

Confounding effect: when other predictors are in the model, the effect of student is different.

• Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
• But for each level of balance, students default less than non-students.
• Multiple logistic regression can show this with other variables included (controlled for).

## Discriminant Analysis

We classify a new point according to which density is highest.

The dashed line is called the Bayes decision boundary.

When the priors are different, we take them into account as well, and compare $$\pi_{k} f_{k}(x)$$. On the right, we favor the light purple class; the decision boundary has shifted to the left.

## Discriminant Analysis

LDA improves prediction by accounting for the normal distribution of a continuous $$x$$.

[Figure: Bayes decision boundary vs. LDA decision boundary.]

Why not linear regression?

## Linear Discriminant Analysis

When $$p=1$$, the Gaussian density has the familiar bell-curve form, where $$\mu_{k}$$ is the mean and $$\sigma_{k}^{2}$$ the variance in class $$k$$. We will assume that all the $$\sigma_{k}=\sigma$$ are the same. Plugging this into Bayes' formula, we get a rather complex expression for $$p_{k}(x)=Pr(Y=k|X=x)$$.
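For reference, the standard forms of the class density and the resulting posterior (as given in ISLR) are:

$$f_k(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)$$

$$p_k(x)=\frac{\pi_k\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}{\sum_{l=1}^{K}\pi_l\,\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}$$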


## Bayes Theorem

Source: https://www.bayestheorem.net
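In its standard form, the theorem states:

$$Pr(A \mid B)=\frac{Pr(B \mid A)\,Pr(A)}{Pr(B)}$$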

## Classification methods

### Linear Discriminant Analysis for p >1

Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated. Right: The two variables have a correlation of 0.7.

### Linear Discriminant Analysis for p >1

LDA with three classes, with observations from a multivariate Gaussian distribution (p = 2), with a class-specific mean vector and a common covariance matrix. Left: ellipses contain 95% of the probability for each of the three classes. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines.
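As a concrete illustration, a sketch of fitting LDA in R, assuming the `Default` data from ISLR and the MASS package:

```r
# Sketch: LDA on the Default data (assumes the ISLR and MASS packages).
library(ISLR); library(MASS)
lda.fit <- lda(default ~ balance + student, data = Default)
lda.pred <- predict(lda.fit)               # class predictions and posterior probabilities
table(lda.pred$class, Default$default)     # confusion matrix (training data)
mean(lda.pred$class != Default$default)    # overall training error rate
```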

## Sensitivity and Specificity

The Receiver Operating Characteristic (ROC) curve displays the overall performance of a classifier across all possible thresholds; this performance, summarized over all thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
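A sketch of computing an ROC curve and its AUC, assuming the `lda.pred` object from the sketch above and the pROC package:

```r
# Sketch: ROC curve and AUC for the LDA posterior probabilities (assumes the pROC package).
library(pROC)
phat <- lda.pred$posterior[, "Yes"]     # posterior probability of default
roc_obj <- roc(Default$default, phat)   # arguments: response, predictor
plot(roc_obj)                           # ROC curve over all thresholds
auc(roc_obj)                            # area under the curve; larger is better
```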

### Confusion matrix: Default data


LDA error: (23+252)/10,000=2.75%

Yet, what matters is: of those who actually defaulted, how many were correctly predicted (the sensitivity)?

## Tree-based methods

• The output of recursive binary splitting on a two-dimensional example.

• A tree corresponding to that partition.

• A perspective plot of the prediction surface corresponding to the tree.

• A partition of two-dimensional feature space that could not result from recursive binary splitting.

## Tree-building process

### Splitting the X Variables

• Generally we create the partitions by iteratively splitting one of the X variables into two regions

1. First split on $$X_1=t_1$$.
2. If $$X_1<t_1$$, split on $$X_2=t_2$$.
3. If $$X_1>t_1$$, split on $$X_1=t_3$$.
4. If $$X_1>t_3$$, split on $$X_2=t_4$$.

[Figure: the resulting partition of the $$(X_1, X_2)$$ space into regions $$R_1,\dots,R_5$$ by the split points $$t_1,\dots,t_4$$.]

### Splitting the X Variables

• When we create partitions this way we can always represent them using a tree structure.
• This provides a very simple way to explain the model to a non-expert.

## Baseball Salary data

### Baseball salary data

Here we have two predictors and three distinct regions; the tree's two split points are at 4.5 and 117.5.
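A sketch of fitting and plotting such a tree, assuming the `Hitters` data from the ISLR package and the rpart package (log salary modeled with two predictors):

```r
# Sketch: regression tree for log(Salary) (assumes Hitters data from ISLR and the rpart package).
library(ISLR); library(rpart)
Hitters <- na.omit(Hitters)                   # drop players with missing Salary
tree.fit <- rpart(log(Salary) ~ Years + Hits, data = Hitters)
plot(tree.fit); text(tree.fit)                # draw the tree with split labels
```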

## Terminology for Trees

• A classification tree is very similar to a regression tree except that we try to make a prediction for a categorical rather than continuous Y.

• For each region (or node) we predict the most common category among the training data within that region.

• The tree is grown (i.e. the splits are chosen) in exactly the same way as with a regression tree, except that minimizing MSE no longer makes sense.

• There are several different criteria we can use, such as the Gini index and cross-entropy, but the easiest one to think about is minimizing the error rate (see the formulas below).
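For reference, with $$\hat{p}_{mk}$$ denoting the proportion of training observations in region $$m$$ from class $$k$$, these criteria are commonly written as:

$$\text{Error rate} = 1-\max_k \hat{p}_{mk}, \qquad G=\sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk}), \qquad D=-\sum_{k=1}^{K}\hat{p}_{mk}\log \hat{p}_{mk}$$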

## Classification Tree

• A large tree (i.e. one with many terminal nodes) may tend to overfit the training data, in a similar way to neural networks without weight decay.

• Generally, we can improve accuracy by "pruning" the tree, i.e. cutting off some of the terminal nodes.

• How do we know how far back to prune the tree? We use cross-validation to see which tree has the lowest error rate.

## Tree Pruning

### Baseball Players’ Salaries

What tree size do we use?
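A sketch of answering this with cross-validation, assuming the `Hitters` data and rpart from the earlier sketch (rpart cross-validates over its complexity parameter `cp`):

```r
# Sketch: choosing the tree size by cross-validation (rpart reports a CV error per cp value).
set.seed(1)
big.tree <- rpart(log(Salary) ~ Years + Hits, data = Hitters,
                  control = rpart.control(cp = 0.001, minsplit = 5))  # grow a large tree
printcp(big.tree)                         # cross-validated error (xerror) for each subtree
best.cp <- big.tree$cptable[which.min(big.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big.tree, cp = best.cp)  # prune back to the size with lowest CV error
plot(pruned); text(pruned)
```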

• If the relationship between the predictors and the response is linear, then classical linear models such as linear regression would outperform regression trees.

• On the other hand, if the relationship between the predictors and the response is non-linear, then decision trees would outperform classical approaches.

## Issues

### Decision Trees

| Good | Not so good |
|------|-------------|
| Highly interpretable (even better than regression) | Prediction accuracy |
| Easy to plot | High variance |
| Handles qualitative predictors (without creating dummy variables) | |
| Works for both regression and classification | |

### Bagging: for prediction, there are two approaches

• Record the class that each bootstrapped data set predicts and take the most commonly occurring one as the overall prediction (majority vote).

• If our classifier produces probability estimates, we can average the probabilities and then predict the class with the highest probability.

### Error rates

• Green line represents a simple majority vote approach

• Purple line corresponds to averaging the probability estimates.

• Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey).

### Out-of-Bag (OOB) Error Estimation

• On average, each bagged tree makes use of around two-thirds of the observations.

• The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

• Predict the response for the $$i^{th}$$ observation using each of the trees in which that observation was OOB. This will yield around $$\frac{B}{3}$$ predictions for the $$i^{th}$$ observation, which we average.

• This estimate is essentially the Leave-One-Out (LOO) cross-validation error for bagging, if $$B$$ is large.

### Variable Importance Measure

• Bagging typically improves accuracy over prediction using a single tree, but it is now hard to interpret the model!

• We have hundreds of trees, and it is no longer clear which variables are most important to the procedure. Thus bagging improves prediction accuracy at the expense of interpretability.

• But we can still get an overall summary of the importance of each predictor using Relative Influence Plots.

### Relative Influence Plots

• Relative Influence Plots give a score for each variable.

• These scores represent the decrease in MSE when splitting on a particular variable.

• A number close to zero indicates the variable is not important and could be dropped.

• The larger the score, the more influence the variable has.

### Relative Influence Plots: Heart data

• age
• sex
• chest pain type (4 values)
• resting blood pressure
• serum cholesterol in mg/dl
• fasting blood sugar > 120 mg/dl
• resting electrocardiographic results (values 0, 1, 2)
• maximum heart rate achieved
• exercise-induced angina
• oldpeak = ST depression induced by exercise relative to rest
• the slope of the peak exercise ST segment
• number of major vessels (0-3) colored by fluoroscopy
• thal (thalassemia): 3 = normal; 6 = fixed defect; 7 = reversible defect


### Random Forests

• Why use a random sample of $$m$$ predictors instead of all $$p$$ predictors for splitting?
• Suppose that we have a very strong predictor in the data set along with a number of other moderately strong predictors. Then, in the collection of bagged trees, most or all of them will use the very strong predictor for the first split!
• All bagged trees will look similar, and hence all the predictions from the bagged trees will be highly correlated.
• Averaging many highly correlated quantities does not lead to a large variance reduction.
• Random forests "de-correlate" the bagged trees, leading to a greater reduction in variance.

### Random Forests

When $$m = p$$,  this is simply the bagging model.

### Random Forests

• Building on the idea of bagging, random forests provide an improvement because they de-correlate the trees.

• Procedure:

  - Build a number of decision trees on bootstrapped training samples.

  - When building these trees, each time a split in a tree is considered, a random sample of $$m$$ predictors is chosen as split candidates from the full set of $$p$$ predictors (usually $$m\approx\sqrt{p}$$). A sketch follows below.
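A sketch contrasting bagging and a random forest with the randomForest package, assuming a hypothetical Heart-style data frame with a binary outcome `AHD`:

```r
# Sketch: bagging vs. random forest (assumes the randomForest package and a Heart-style
# data frame with a binary factor outcome AHD; mtry = p gives bagging, mtry ~ sqrt(p) a forest).
library(randomForest)
set.seed(1)
p <- ncol(Heart) - 1                                                   # number of predictors
bag.fit <- randomForest(AHD ~ ., data = Heart, mtry = p)               # bagging
rf.fit  <- randomForest(AHD ~ ., data = Heart, mtry = floor(sqrt(p)))  # random forest
rf.fit$err.rate                                                        # includes the OOB error estimate
varImpPlot(rf.fit)                                                     # variable importance plot
```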