
Lecture 4: Linear Classification
Shen Shen
Sept 20, 2024
Intro to Machine Learning

Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy


Recap:
[diagram adapted from Phillip Isola: training data feeds a learning algorithm (💻 compute/optimize/train), which outputs a hypothesis 🧠⚙️; key design choices: hypothesis class, hyperparameters, objective (loss) functions, regularization]



[diagram: during training we learn a hypothesis; during testing (aka inferencing, or predicting), a new feature x is mapped to a new prediction y ∈ ℝ]

Recap:
(image adapted from Phillip Isola)
Two ways to optimize the (regression) objective:
- closed-form formula
- gradient descent






Classification Setup
(image adapted from Phillip Isola)
- Same setup as regression: learn from training data, then map a new feature x to a new prediction.
- The difference: the label/prediction now lives in a discrete set, e.g. "Fish" ∈ {"Fish", "Grizzly", "Chameleon", ...}.
Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
Linear Classifier
- Each data point:
- features [x1, x2, …, xd]
- label y ∈ {positive, negative} (or {dog, cat}, {pizza, not pizza}, {+1, 0})
- A (vanilla, sign-based, binary) linear classifier is parameterized by [θ1, θ2, …, θd, θ0]
- To use a given classifier to make a prediction:
- compute the linear combination: z = (θ1x1 + θ2x2 + ⋯ + θdxd) + θ0
- predict the positive label if z > 0; otherwise, the negative label (see the sketch below)
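A minimal numpy sketch of this prediction rule (the parameter values and data point are made up for illustration):

    import numpy as np

    def predict_sign(x, theta, theta_0):
        # linear combination z = theta . x + theta_0
        z = np.dot(theta, x) + theta_0
        # positive label if z > 0; otherwise negative (encoded here as 1 / 0)
        return 1 if z > 0 else 0

    # made-up parameters and a 2-feature data point
    theta, theta_0 = np.array([1.0, -2.0]), 0.5
    print(predict_sign(np.array([3.0, 1.0]), theta, theta_0))  # z = 1.5 > 0, so prints 1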
(vanilla, sign-based, binary) Linear Classifier
- Now let's try to learn a linear classifier
- One natural loss, the 0/1 loss: L01(g, y) = 0 if the prediction g equals the label y, and 1 otherwise
- Combined with the linear classifier hypothesis: J01 = (1/n) ∑_{i=1}^n L01(sign(θ⊤x(i) + θ0), y(i))
- Very intuitive, and easy to evaluate 😍
- Induced concept: separability (the data is linearly separable if some θ, θ0 achieve J01 = 0)
- Very hard to optimize (NP-hard) 🥺
- "Flat" almost everywhere (zero gradient)
- "Jumps" elsewhere (no gradient); both illustrated numerically below

Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
Linear Logistic Classifier
- Mainly motivated to address the gradient issue in learning a "vanilla" linear classifier
- The gradient issue is caused by both the 0/1 loss and the sign function nested inside it.
- But it has a nice probabilistic interpretation too.
- As before, let's first look at how to make predictions with a given linear logistic classifier.
(Binary) Linear Logistic Classifier
- Each data point:
- features [x1, x2, …, xd]
- label y ∈ {positive, negative}
- A (binary) linear logistic classifier is parameterized by [θ1, θ2, …, θd, θ0]
- To use a given classifier to make a prediction:
- compute the linear combination: z = (θ1x1 + θ2x2 + ⋯ + θdxd) + θ0
- predict the positive label if σ(z) > 0.5; otherwise, the negative label (sketched below)
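The same sketch as before, now with the sigmoid threshold (parameters again made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_logistic(x, theta, theta_0):
        # g is the confidence that x has the positive label
        g = sigmoid(np.dot(theta, x) + theta_0)
        return 1 if g > 0.5 else 0

    theta, theta_0 = np.array([1.0, -2.0]), 0.5
    print(predict_logistic(np.array([3.0, 1.0]), theta, theta_0))  # sigmoid(1.5) ≈ 0.82 > 0.5, so 1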

Sigmoid σ(z) = 1/(1 + e^{−z}): a smooth step function
- "sandwiched" between 0 and 1 vertically (never exactly 0 or 1 mathematically)
- monotonic, with a very nice/elegant gradient (see recitation/hw; checked numerically below)
- θ, θ0 can flip, squeeze, expand, and shift it horizontally
- σ(⋅) is interpreted as the probability/confidence that feature x has the positive label. Predict positive if σ(θ⊤x + θ0) > 0.5, i.e. exactly when θ⊤x + θ0 > 0.
e.g., suppose we want to predict whether to bike to school: with given parameters, how do we make a prediction?
[plots of the classifier with 1 feature and with 2 features; image credit: Tamara Broderick]
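A quick numerical check of the properties above, including the gradient identity σ′(z) = σ(z)(1 − σ(z)) derived in recitation/hw:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-5, 5, 11)
    s = sigmoid(z)
    print(s.min() > 0 and s.max() < 1)   # True: strictly between 0 and 1
    print(np.all(np.diff(s) > 0))        # True: monotonically increasing
    # finite-difference check of the gradient identity sigma'(z) = sigma(z) * (1 - sigma(z))
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.allclose(numeric, s * (1 - s)))  # True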

Learning a logistic regression classifier
- Let the labels y ∈ {+1, 0}, and write g(i) = σ(θ⊤x(i) + θ0) for the classifier's output on training point x(i)
- If y(i) = 1: we're happy 😍 when g(i) is close to 1 and unhappy 🥺 when it's close to 0, so penalize with −log(g(i))
- If y(i) = 0: we're happy 😍 when g(i) is close to 0 and unhappy 🥺 when it's close to 1, so penalize with −log(1 − g(i))
- The two cases combine into the negative log-likelihood loss: Lnll(g, y) = −(y log(g) + (1 − y) log(1 − g)) (evaluated on a few examples below)
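A sketch of this loss on a few made-up (g, y) pairs, showing how only one of the two terms activates per point:

    import numpy as np

    def nll_loss(g, y):
        # y in {+1, 0}: exactly one of the two terms is nonzero for any given point
        return -(y * np.log(g) + (1 - y) * np.log(1 - g))

    print(nll_loss(0.9, 1))  # confident and correct: small loss, -log(0.9) ≈ 0.105
    print(nll_loss(0.1, 1))  # confident and wrong: large loss, -log(0.1) ≈ 2.303
    print(nll_loss(0.1, 0))  # small again, this time via the -log(1 - g) term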

Logistic Regression
- Minimize the negative-log-likelihood training loss: Jlr = (1/n) ∑_{i=1}^n Lnll(σ(θ⊤x(i) + θ0), y(i))
- Convex, differentiable, with nice (elegant) gradients
- Doesn't have a closed-form solution
- Can still run gradient descent (sketched below)
- But, a gotcha: when the training data is linearly separable, scaling θ, θ0 up always decreases the loss, so ∥θ∥ grows without bound and no finite minimizer exists
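A minimal gradient-descent sketch for Jlr, using the gradient ∇θ Jlr = (1/n) ∑_i (g(i) − y(i)) x(i); the learning rate, step count, and data are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gd_logistic(X, y, lr=0.1, steps=1000):
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(steps):
            g = sigmoid(X @ theta + theta_0)  # current predictions on all n points
            err = g - y                       # per-point gradient of Lnll w.r.t. z
            theta -= lr * (X.T @ err) / n     # average gradient w.r.t. theta
            theta_0 -= lr * np.mean(err)      # average gradient w.r.t. theta_0
        return theta, theta_0

    # made-up 1-feature data; note it is linearly separable, so running many more
    # steps would keep inflating theta (the "gotcha" above)
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    print(gd_logistic(X, y))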





Regularized Logistic Regression

Jlr,reg(θ, θ0) = (1/n) ∑_{i=1}^n Lnll(σ(θ⊤x(i) + θ0), y(i)) + λ∥θ∥²
- λ ≥ 0
- No regularizing θ0 (think: why?)
- Penalizes being overly certain
- Objective is still differentiable and convex, so gradient descent still applies (sketched below)
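A sketch of how the penalty changes the gradient step from the previous sketch; the λ value here is made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gd_logistic_reg(X, y, lam=0.01, lr=0.1, steps=1000):
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(steps):
            err = sigmoid(X @ theta + theta_0) - y
            # the penalty lam * ||theta||^2 contributes 2 * lam * theta to the gradient
            theta -= lr * ((X.T @ err) / n + 2 * lam * theta)
            theta_0 -= lr * np.mean(err)  # theta_0 is deliberately left unregularized
        return theta, theta_0

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    # even on separable data, theta now stays bounded instead of growing without limit
    print(gd_logistic_reg(X, y))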




Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
(image adapted from Phillip Isola)
One-hot labels
- Generalizes from binary labels
- Suppose K classes: encode the label as a vector y ∈ ℝK with a 1 in the entry of the true class and 0s everywhere else (see the sketch below)
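A minimal sketch of the encoding (the class index and K are made up):

    import numpy as np

    def one_hot(label, K):
        # a length-K vector with a 1 in the true class's entry and 0s elsewhere
        y = np.zeros(K)
        y[label] = 1.0
        return y

    print(one_hot(2, 4))  # [0. 0. 1. 0.]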
Softmax
- Two classes: z is a scalar, and σ(z) is a scalar
- K classes: z is K-by-1, and softmax(z) is K-by-1
- Generalizes sigmoid
- softmax(z)k = e^{zk} / ∑_{j=1}^K e^{zj}: exponentiates z element-wise, then normalizes so the K entries lie in (0, 1) and sum to 1 (sketched below)
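A sketch of softmax; the max-subtraction line is a standard numerical-stability trick, not part of the definition above:

    import numpy as np

    def softmax(z):
        # exponentiate each entry, then normalize so all K entries sum to 1
        e = np.exp(z - np.max(z))  # subtracting max(z) avoids overflow; result is unchanged
        return e / e.sum()

    g = softmax(np.array([2.0, 1.0, 0.1]))
    print(g, g.sum())  # entries in (0, 1), summing to 1

    # the "generalizes sigmoid" connection: with K = 2 and z = [z1, 0],
    # softmax returns [sigmoid(z1), 1 - sigmoid(z1)]
    print(softmax(np.array([1.5, 0.0]))[0], 1 / (1 + np.exp(-1.5)))  # both ≈ 0.8176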
Negative log-likelihood multi-class loss
- Two classes: Lnll(g, y) = −(y log(g) + (1 − y) log(1 − g)) appears as a sum of two terms; only one term "activates" for a single data point
- K classes: Lnllm(g, y) = −∑_{k=1}^K yk log(gk) appears as a sum of K terms; with a one-hot y, only one term "activates" for a single data point (sketched below)
- Generalizes the negative log-likelihood loss
- Also known as cross-entropy
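A sketch of the K-class loss on made-up values; note how the one-hot y picks out a single log term:

    import numpy as np

    def nllm_loss(g, y):
        # cross-entropy -sum_k y_k * log(g_k); a one-hot y keeps exactly one term
        return -np.sum(y * np.log(g))

    g = np.array([0.7, 0.2, 0.1])  # made-up softmax output
    y = np.array([1.0, 0.0, 0.0])  # one-hot true label: class 0
    print(nllm_loss(g, y))         # -log(0.7) ≈ 0.357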

(image adapted from Phillip Isola)
[diagram: feature x feeds into the current prediction g = softmax(⋅), which is compared against the true label y via the loss Lnllm(g, y) = −∑_{k=1}^K yk ⋅ log(gk)]

Classification


Image classification played a pivotal role in kicking off the current wave of AI enthusiasm.
Summary
- Classification: a supervised learning problem, similar to regression, but where the output/label is in a discrete set.
- Binary classification: only two possible label values.
- Linear binary classification: think of θ and θ0 as defining a (d−1)-dimensional hyperplane that cuts the d-dimensional feature space into two half-spaces.
- 0-1 loss: a natural loss function for classification, BUT, hard to optimize.
- Sigmoid function: motivation and properties.
- Negative-log-likelihood loss: smoother and has nice probabilistic motivations. We can optimize via (S)GD.
- Regularization is still important.
- The generalization to multi-class via one-hot encoding and the softmax mechanism.
- Other ways to generalize to multi-class (see hw/lab)
Thanks!
We'd love to hear your thoughts.