Shen Shen
Feb 23, 2024
(some slides adapted from Tamara Broderick and Phillip Isola)
Training choices:
- Hypothesis class
- Hyperparameters
- If/how to add regularization
- Objective (loss) functions
- Compute/optimize
Testing (predicting): new input \(x\) → new prediction \(y\)
Recap:
- OLS can have an analytical formula and an "easy" prediction mechanism (see the sketch after this list)
- Regularization
- Cross-validation
- Gradient descent
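As a reminder of the first two recap points, here is a minimal numpy sketch of the OLS closed form and its ridge-regularized variant; the data, coefficients, and \(\lambda\) below are made up for illustration, not from the lecture:

```python
import numpy as np

# made-up regression data: n = 100 points, d = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# OLS analytical formula: theta = (X^T X)^{-1} X^T y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# ridge regularization shifts the normal equations by lambda * I
lam = 0.1
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# the "easy" prediction mechanism: a dot product with a new input
x_new = np.array([0.2, -1.0, 0.7])
y_pred = theta_ols @ x_new
```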
(The demo won't embed in PDF. But the direct link below works.)
Recall: (Vanilla) Linear Classifier
Linear Logistic Regression
probabilistic interpretation: read the output \(\sigma(\theta^{\top} x + \theta_0)\) as the predicted probability that \(y = 1\) given \(x\)
Comments about sigmoid
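For reference, the standard definition: \(\sigma(z) = \frac{1}{1+e^{-z}}\), which maps any real \(z\) into \((0, 1)\), is monotonically increasing, and satisfies \(\sigma(-z) = 1 - \sigma(z)\).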
e.g., suppose we want to predict whether to bike to school.
With given parameters, how do we make a prediction? (See the sketch below the plots.)
(Plots: training data with 1 feature and with 2 features; 😍 and 🥺 mark the two classes.)
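A minimal numpy sketch of the prediction mechanism in both the 1-feature and 2-feature cases; the feature meanings (temperature, precipitation) and all numbers below are invented placeholders:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, theta, theta0, threshold=0.5):
    # g is read as P(y = 1 | x); classify as 1 when g exceeds the threshold
    g = sigmoid(theta @ x + theta0)
    return g, int(g > threshold)

# 1 feature: x is a length-1 vector (e.g., temperature)
g1, label1 = predict(np.array([20.0]), theta=np.array([0.3]), theta0=-5.0)

# 2 features: x is a length-2 vector (e.g., temperature and precipitation)
g2, label2 = predict(np.array([20.0, 0.1]),
                     theta=np.array([0.3, -8.0]), theta0=-5.0)
```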
Comments about \(J_{\text{lr}} = \frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\text{nll}}\left(\sigma\left(\theta^{\top} x^{(i)}+\theta_0\right), y^{(i)}\right)\)
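Here \(\mathcal{L}_{\text{nll}}\) is the negative log-likelihood loss, \(\mathcal{L}_{\text{nll}}(g, y) = -\left(y \log g + (1-y) \log (1-g)\right)\), for a guess \(g \in (0,1)\) and a label \(y \in \{0, 1\}\). A minimal numpy sketch of the objective (variable names are my own, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_loss(g, y):
    # negative log-likelihood for one point: guess g in (0,1), label y in {0,1}
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

def J_lr(theta, theta0, X, y):
    # X: n-by-d data matrix; y: length-n vector of 0/1 labels
    g = sigmoid(X @ theta + theta0)
    return np.mean(nll_loss(g, y))
```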
How to represent class labels?
Suppose there are \(K\) classes; then it's convenient to let \(y\) be a \(K\)-dimensional one-hot vector. For example, with \(K = 3\), a data point in class 2 gets the label \(y = [0, 1, 0]^{\top}\).
Generalize sigmoid to softmax
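Concretely, the standard definition: for scores \(z \in \mathbb{R}^K\), \(\operatorname{softmax}(z)_k = e^{z_k} / \sum_{j=1}^K e^{z_j}\). Every entry lies in \((0, 1)\) and the entries sum to 1; with \(K = 2\), softmax applied to scores \((z, 0)\) recovers \((\sigma(z), 1 - \sigma(z))\).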
Generalize NLL to NLL multi-class (NLLM, or just cross-entropy)
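Concretely (standard definition), for softmax output \(g\) and one-hot label \(y\): \(\mathcal{L}_{\text{nllm}}(g, y) = -\sum_{k=1}^{K} y_k \log g_k\). Since \(y\) is one-hot, only the log-probability assigned to the correct class survives the sum.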
Every data point incurs a scalar loss:
|                  | two classes | \(K\) classes |
|------------------|-------------|---------------|
| prediction \(g\) | scalar      | \(K\)-by-1    |
| label \(y\)      | scalar      | \(K\)-by-1    |
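A minimal numpy sketch of these multi-class pieces; the scores, the choice \(K = 3\), and all names below are my own:

```python
import numpy as np

def softmax(z):
    # subtract max(z) for numerical stability; entries are positive and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def nllm_loss(g, y):
    # cross-entropy between one-hot label y and softmax output g (both K-by-1)
    return -np.sum(y * np.log(g))

z = np.array([2.0, 0.5, -1.0])   # made-up scores for K = 3 classes
y = np.array([1.0, 0.0, 0.0])    # one-hot label: the point is in class 1
loss = nllm_loss(softmax(z), y)  # a scalar loss for this data point
```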
We'd love for you to share some lecture feedback.