Lecture 4: Linear Classification

 

Shen Shen

Sept 20, 2024

Intro to Machine Learning

Outline

  • Recap, classification setup 
  • Linear classifiers
    • Separator, normal vector, and separability
  • Linear logistic classifiers 
    • Motivation, sigmoid, and negative log-likelihood loss
  • Multi-class classifiers
    • One-hot encoding, softmax, and cross-entropy

Recap:

  • Training: a learning algorithm 💻 (compute/optimize/train) takes in the training data, along with our choices of hypothesis class, hyperparameters, objective (loss) function, and regularization, and outputs a trained hypothesis 🧠⚙️.
  • Testing (aka inferencing, or predicting): the trained hypothesis maps a new feature \(x\) to a new prediction \(y \in \mathbb{R}\).
  • So far, we optimized either via a closed-form formula or via gradient descent.

(image adapted from Phillip Isola)

Classification Setup

  • Same supervised-learning setup as regression, except the label now lives in a discrete set.
  • e.g. at test time, a new feature \(x\) (an image) gets a new prediction "Fish" \(\in\) {"Fish", "Grizzly", "Chameleon", ...}, a discrete set.

(image adapted from Phillip Isola)

Outline

  • Recap, classification setup 
  • Linear classifiers
    • Separator, normal vector, and separability
  • Linear logistic classifiers 
    • Motivation, sigmoid, and negative log-likelihood loss
  • Multi-class classifiers
    • One-hot encoding, softmax, and cross-entropy

Linear Classifier

  • Each data point:
    • features \([x_1, x_2, \dots x_d]\) 
    • label \(y \in\) {positive, negative} (or {dog, cat}, {pizza, not pizza}, {+1, 0})
  • A (vanilla, sign-based, binary) linear classifier is parameterized by \([\theta_1, \theta_2, \dots, \theta_d, \theta_0]\)
  • To use a given classifier to make a prediction (a code sketch follows this list):
    • do the linear combination: \(z =({\theta_1}x_1 + \theta_2x_2 + \dots + \theta_dx_d) + \theta_0\)
    • predict the positive label if \(z>0\); otherwise, predict the negative label.
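
To make the prediction rule concrete, here is a minimal NumPy sketch (not the course's official code); the parameters and feature vector are made-up values, and the two labels are encoded as +1 and -1 purely for illustration.

```python
import numpy as np

def predict_linear(x, theta, theta_0):
    """Sign-based linear classifier: +1 if theta^T x + theta_0 > 0, else -1."""
    z = theta @ x + theta_0          # the linear combination z
    return 1 if z > 0 else -1

# hypothetical parameters and feature vector
theta, theta_0 = np.array([2.0, -1.0]), -0.5
x = np.array([1.0, 0.5])
print(predict_linear(x, theta, theta_0))   # z = 2*1 - 1*0.5 - 0.5 = 1 > 0, so +1
```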

(vanilla, sign-based, binary) Linear Classifier

  • Now let's try to learn a linear classifier.
  • One natural loss, the 0-1 loss:
\mathcal{L}_{01}(\text{guess}, \text{actual})=\left\{\begin{array}{ll} 0 & \text { if } \text{guess} = \text{actual} \\ 1 & \text { otherwise } \end{array}\right.
  • Combined with the linear classifier hypothesis:
\mathcal{L}_{01}(x^{(i)}, y^{(i)}; \theta, \theta_0)=\left\{\begin{array}{ll} 0 & \text { if } \operatorname{sign}\left(\theta^{\top} x^{(i)}+\theta_0\right) = y^{(i)} \\ 1 & \text { otherwise } \end{array}\right.
  • Very intuitive, and easy to evaluate 😍 (see the sketch below)
    • Induced concept: separability
  • Very hard to optimize (NP-hard) 🥺
    • "Flat" almost everywhere (zero gradient)
    • "Jumps" elsewhere (no gradient)

Outline

  • Recap, classification setup 
  • Linear classifiers
    • Separator, normal vector, and separability
  • Linear logistic classifiers
    • Motivation, sigmoid, and negative log-likelihood loss
  • Multi-class classifiers
    • One-hot encoding, softmax, and cross-entropy

Linear Logistic Classifier

  • Mainly motivated by the gradient issue in learning a "vanilla" linear classifier
    • The gradient issue is caused by both the 0/1 loss and the sign function nested inside it.

\mathcal{L}_{01}(x^{(i)}, y^{(i)}; \theta, \theta_0)=\left\{\begin{array}{ll} 0 & \text { if } \operatorname{sign}\left(\theta^{\top} x^{(i)}+\theta_0\right) = y^{(i)} \\ 1 & \text { otherwise } \end{array}\right .
  • But it has a nice probabilistic interpretation too.
  • As before, let's first look at how to make a prediction with a given linear logistic classifier.

(Binary) Linear Logistic Classifier

  • Each data point:
    • features \([x_1, x_2, \dots x_d]\)
    • label \(y \in\){positive, negative}
  • A (binary) linear logistic classifier is parameterized by \([\theta_1, \theta_2, \dots, \theta_d, \theta_0]\)
  • To use a given classifier to make a prediction (a code sketch follows):
    • do the linear combination: \(z =({\theta_1}x_1 + \theta_2x_2 + \dots + \theta_dx_d) + \theta_0\)
    • predict the positive label if
\sigma(z) = \sigma\left(\theta^{\top} x+\theta_0\right) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-\left(\theta^{\top} x+\theta_0\right)}} > 0.5
      otherwise, predict the negative label.
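
A minimal sketch of this prediction rule (the parameters, feature values, and string labels are all hypothetical):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(x, theta, theta_0):
    g = sigmoid(theta @ x + theta_0)            # confidence that the label is positive
    return ("positive" if g > 0.5 else "negative"), g

# hypothetical parameters and feature vector
theta, theta_0 = np.array([1.5, -2.0]), 0.3
x = np.array([0.4, 0.1])
print(predict_logistic(x, theta, theta_0))      # sigma(0.7) ~= 0.67 > 0.5, so "positive"
```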

Sigmoid: a smooth step function

  • "sandwiched" between 0 and 1 vertically (never 0 or 1 mathematically)

\sigma(z) = \sigma\left(\theta^{\top} x+\theta_0\right)
= \frac{1}{1+e^{-\left(\theta^{\top} x+\theta_0\right)}}
  • monotonic, very nice/elegant gradient (see recitation/hw)

  • \(\theta\), \(\theta_0\) can flip, squeeze, expand, or shift the curve horizontally
  • \(\sigma\left(\cdot\right)\) is interpreted as the probability/confidence that feature \(x\) has the positive label. Predict positive if
\sigma(z) = \sigma\left(\theta^{\top} x+\theta_0\right) > 0.5

e.g. suppose we want to predict whether to bike to school.

With given parameters, how do we make a prediction? (A numerical sketch follows the formulas below.)

1 feature: 

\begin{aligned} g(x) & =\sigma\left(\theta x+\theta_0\right) \\ & =\frac{1}{1+\exp \left\{-\left(\theta x+\theta_0\right)\right\}} \end{aligned}

2 features: 

\begin{aligned} g(x) & =\sigma\left(\theta^{\top} x+\theta_0\right) \\ & =\frac{1}{1+\exp \left\{-\left(\theta^{\top} x+\theta_0\right)\right\}} \end{aligned}

(image credit: Tamara Broderick)
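
For instance, here is a minimal numerical sketch of the 1-feature case; the feature (temperature) and the parameter values are made up for illustration, not the ones used in lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical: x = temperature (Celsius), theta = 0.5, theta_0 = -5
theta, theta_0 = 0.5, -5.0
for temp in [2.0, 10.0, 20.0]:
    g = sigmoid(theta * temp + theta_0)
    print(temp, round(g, 3), "bike" if g > 0.5 else "don't bike")
# 2.0   0.018  don't bike   (sigma(-4.0))
# 10.0  0.5    don't bike   (sigma(0.0), not strictly above 0.5)
# 20.0  0.993  bike         (sigma(5.0))
```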

Learning a logistic regression classifier

  • Let the labels \(y \in \{+1,0\}\), and write the guess for training point \(i\) as \(g^{(i)}=\sigma\left(\theta^{\top} x^{(i)}+\theta_0\right)\).
  • The negative log-likelihood loss:
\mathcal{L}_{\text {nll }}(\text { guess, actual }) = -[\text { actual } \cdot \log (\text { guess })+(1-\text { actual }) \cdot \log (1-\text { guess })] = - \left[y^{(i)} \log g^{(i)}+\left(1-y^{(i)}\right) \log \left(1-g^{(i)}\right)\right]
  • If \(y^{(i)} = 1\): the loss reduces to \(-\log g^{(i)}\), which is small when the guess \(g^{(i)}\) is near 1 😍 and grows without bound as \(g^{(i)}\) approaches 0 🥺.
  • If \(y^{(i)} = 0\): the loss reduces to \(-\log \left(1-g^{(i)}\right)\), which is small when \(g^{(i)}\) is near 0 😍 and grows without bound as \(g^{(i)}\) approaches 1 🥺.

(A quick numerical sketch follows.)
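
This sketch (with made-up guess values) shows how the loss rewards confident-and-correct guesses and punishes confident-and-wrong ones:

```python
import numpy as np

def nll_loss(guess, actual):
    """Per-example negative log-likelihood loss: actual in {1, 0}, guess in (0, 1)."""
    return -(actual * np.log(guess) + (1 - actual) * np.log(1 - guess))

print(nll_loss(0.9, 1))   # confident and correct: ~0.105 (small)
print(nll_loss(0.9, 0))   # confident and wrong:  ~2.303 (large)
print(nll_loss(0.5, 1))   # on the fence:         ~0.693
```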

Logistic Regression

  • Minimize the negative-log-likelihood training objective: \(J_{lr}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\text {nll }}\left(\sigma\left(\theta^{\top} x^{(i)}+\theta_0\right), y^{(i)}\right)\), with \(g(x)=\sigma\left(\theta^{\top} x+\theta_0\right)\)
  • Convex, differentiable, with nice (elegant) gradients
  • Doesn't have a closed-form solution
  • Can still run gradient descent (see the sketch below)
  • But, a gotcha: when the training data is linearly separable, the objective can always be decreased by scaling \(\theta\) and \(\theta_0\) up, so there is no finite optimum and \(\|\theta\|\) grows without bound (motivating regularization, next)
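
Here is a minimal batch-gradient-descent sketch on this objective (the dataset, step size, and iteration count are made up; the gradient expression \((g^{(i)}-y^{(i)}) x^{(i)}\) comes from combining sigmoid and NLL, as derived in recitation/hw):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, iters=1000):
    """Batch gradient descent on J_lr; X is n-by-d, y holds labels in {1, 0}."""
    n, d = X.shape
    theta, theta_0 = np.zeros(d), 0.0
    for _ in range(iters):
        g = sigmoid(X @ theta + theta_0)      # current guesses, shape (n,)
        theta   -= lr * (X.T @ (g - y) / n)   # dJ/dtheta   = mean of (g - y) x
        theta_0 -= lr * np.mean(g - y)        # dJ/dtheta_0 = mean of (g - y)
    return theta, theta_0

# made-up 1-feature dataset; note it is linearly separable, so running many more
# iterations would keep growing |theta| (the "gotcha" above)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
print(train_logistic(X, y))
```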

Regularized Logistic Regression 

\mathrm{J}\left(\theta, \theta_0 \right)=\left(\frac{1}{n} \sum_{i=1}^n \mathcal{L}_{\mathrm{nll}}\left(\sigma\left(\theta^{\top} x^{(i)}+\theta_0\right), y^{(i)}\right)\right)+\lambda\|\theta\|^2
  • \(\lambda \geq 0\)
  • We don't regularize \(\theta_0\) (think: why?)
  • Penalizes being overly certain
  • The objective is still differentiable and convex, so we can run gradient descent (gradients sketched below)
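
For reference (the derivation itself is left to recitation/hw), with labels in \(\{1, 0\}\) and \(g^{(i)}=\sigma\left(\theta^{\top} x^{(i)}+\theta_0\right)\), the gradients of this regularized objective work out to
\nabla_\theta \mathrm{J} = \frac{1}{n} \sum_{i=1}^n\left(g^{(i)}-y^{(i)}\right) x^{(i)}+2 \lambda \theta, \qquad \frac{\partial \mathrm{J}}{\partial \theta_0} = \frac{1}{n} \sum_{i=1}^n\left(g^{(i)}-y^{(i)}\right)
so each gradient-descent step is the unregularized step plus a pull of \(\theta\) back toward 0.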

Outline

  • Recap, classification setup 
  • Linear classifiers
    • Separator, normal vector, and separability
  • Linear logistic classifiers 
    • Motivation, sigmoid, and negative log-likelihood loss
  • Multi-class classifiers
    • One-hot encoding, softmax, and cross-entropy

One-hot labels

  • Generalizes from binary labels
  • Suppose \(K\) classes: encode the label as a vector \(y \in \mathbb{R}^{K}\) with a 1 in the entry of the true class and 0 everywhere else (e.g. with \(K=3\), the second class is encoded as \([0,1,0]^{\top}\)); a code sketch follows

(image adapted from Phillip Isola)
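
A tiny sketch of the encoding (the class index and \(K\) below are arbitrary):

```python
import numpy as np

def one_hot(label, K):
    """Encode a class index in {0, ..., K-1} as a K-dimensional one-hot vector."""
    y = np.zeros(K)
    y[label] = 1.0
    return y

print(one_hot(1, K=3))   # class 1 of 3  ->  [0. 1. 0.]
```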

Softmax

  • Two classes: \(z=\theta^{\top} x+\theta_0\) is a scalar, and so is
g = \sigma(z)=\frac{1}{1+\exp (-z)}
  • \(K\) classes: \(z=\theta^{\top} x+\theta_0\) is \(K\)-by-1 (with \(\theta\) now \(d\)-by-\(K\) and \(\theta_0\) \(K\)-by-1), and so is
g =\operatorname{softmax}(z)=\left[\begin{array}{c} \exp \left(z_1\right) / \sum_i \exp \left(z_i\right) \\ \vdots \\ \exp \left(z_K\right) / \sum_i \exp \left(z_i\right) \end{array}\right]
  • Generalizes sigmoid
  • Exponentiates each entry of \(z\), then normalizes so the entries are positive and sum to 1 (see the sketch below)
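
A minimal softmax sketch (the score vector \(z\) is made up; subtracting \(\max(z)\) before exponentiating is a standard numerical-stability trick that does not change the output, not something the slide requires):

```python
import numpy as np

def softmax(z):
    """Map a K-vector of scores to a K-vector of positive entries summing to 1."""
    e = np.exp(z - np.max(z))   # shift for numerical stability; output is unchanged
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])  # hypothetical scores for K = 3 classes
g = softmax(z)
print(g, g.sum())               # ~[0.705 0.259 0.035], sums to 1.0
```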

Negative log-likelihood multi-class loss

  • Two classes:
\mathcal{L}_{\mathrm{nll}}(g, y)= - \left(y \log g +\left(1-y \right) \log \left(1-g \right)\right)
    appears as a sum of two terms; only one term "activates" for a single data point.
  • \(K\) classes:
\mathcal{L}_{\mathrm{nllm}}(g, y)=-\sum_{k=1}^{K} y_{k} \cdot \log \left(g_{k}\right)
    appears as a sum of \(K\) terms; only one term "activates" for a single data point (the term for the true class, since \(y\) is one-hot).
  • Generalizes the negative log-likelihood loss
  • Also known as cross-entropy (a code sketch follows)
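
A sketch of evaluating this loss on a single data point (the softmax output and one-hot label below are made up):

```python
import numpy as np

def nllm_loss(g, y):
    """Multi-class NLL (cross-entropy): only the true class's term is nonzero."""
    return -np.sum(y * np.log(g))

g = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output, K = 3
y = np.array([0.0, 1.0, 0.0])   # one-hot label: the true class is class 1
print(nllm_loss(g, y))          # = -log(0.7) ~= 0.357
```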

(Figure, adapted from Phillip Isola: a feature \(x\) is mapped to the current prediction \(g=\text{softmax}(\cdot)\); \(\log(g)\) is compared against the one-hot true label \(y\), e.g. \([0,0,0,0,0,1,0,0, \ldots]\), via the loss \(\mathcal{L}_{\mathrm{nllm}}(g, y)=-\sum_{k=1}^{K} y_{k} \cdot \log \left(g_{k}\right)\).)

Classification

Image classification played a pivotal role in kicking off the current wave of AI enthusiasm.

Summary

  • Classification: a supervised learning problem, similar to regression, but where the output/label is in a discrete set.
  • Binary classification: only two possible label values.
  • Linear binary classification: think of \(\theta\) and \(\theta_0\) as defining a \((d-1)\)-dimensional hyperplane that cuts the \(d\)-dimensional feature space into two half-spaces.
  • 0-1 loss: a natural loss function for classification, BUT, hard to optimize.
  • Sigmoid function: motivation and properties.
  • Negative-log-likelihood loss: smoother and has nice probabilistic motivations. We can optimize via (S)GD. 
  • Regularization is still important.
  • The generalization to multi-class via one-hot encoding and the softmax mechanism.
  • Other ways to generalize to multi-class (see hw/lab)

Thanks!

We'd love to hear your thoughts.