
Lecture 4: Linear Classification
Shen Shen
Sept 20, 2024
Intro to Machine Learning

Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy


Recap:
[diagram adapted from Phillip Isola: training data feeds a learning algorithm (💻 compute/optimize/train), which outputs a hypothesis 🧠⚙️; key design choices: hypothesis class, hyperparameters, objective (loss) functions, regularization]



[diagram: during training we learn a hypothesis; during testing (aka inferencing, or predicting), a new feature x is mapped to a new prediction y ∈ ℝ]

Recap:
(image adapted from Phillip Isola)
Two ways to optimize the (regression) objective:
- closed-form formula
- gradient descent






Classification Setup
(image adapted from Phillip Isola)
- Same setup as regression: learn from training data, then map a new feature x to a new prediction.
- The difference: the label/prediction now lives in a discrete set, e.g. "Fish" ∈ {"Fish", "Grizzly", "Chameleon", ...}.
Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
Linear Classifier
- Each data point:
- features [x1, x2, …, xd]
- label y ∈ {positive, negative} (or {dog, cat}, {pizza, not pizza}, {+1, 0})
- A (vanilla, sign-based, binary) linear classifier is parameterized by [θ1, θ2, …, θd, θ0]
- To use a given classifier to make a prediction:
- compute the linear combination: z = (θ1x1 + θ2x2 + ⋯ + θdxd) + θ0
- predict the positive label if z > 0; otherwise, the negative label (see the sketch below)
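A minimal numpy sketch of this prediction rule (the parameter values and data point are made up for illustration):

    import numpy as np

    def predict_sign(x, theta, theta_0):
        # linear combination z = theta . x + theta_0
        z = np.dot(theta, x) + theta_0
        # positive label if z > 0; otherwise negative (encoded here as 1 / 0)
        return 1 if z > 0 else 0

    # made-up parameters and a 2-feature data point
    theta, theta_0 = np.array([1.0, -2.0]), 0.5
    print(predict_sign(np.array([3.0, 1.0]), theta, theta_0))  # z = 1.5 > 0, so prints 1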
(vanilla, sign-based, binary) Linear Classifier
- Now let's try to learn a linear classifier
- One natural loss, the 0/1 loss: L01(g, y) = 0 if the prediction g equals the label y, and 1 otherwise
- Combined with the linear classifier hypothesis: J01 = (1/n) ∑_{i=1}^n L01(sign(θ⊤x(i) + θ0), y(i))
- Very intuitive, and easy to evaluate 😍
- Induced concept: separability (the data is linearly separable if some θ, θ0 achieve J01 = 0)
- Very hard to optimize (NP-hard) 🥺
- "Flat" almost everywhere (zero gradient)
- "Jumps" elsewhere (no gradient); both illustrated numerically below

Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
Linear Logistic Classifier
- Mainly motivated to address the gradient issue in learning a "vanilla" linear classifier
- The gradient issue is caused by both the 0/1 loss and the sign function nested inside it.
- But it has a nice probabilistic interpretation too.
- As before, let's first look at how to make predictions with a given linear logistic classifier.
(Binary) Linear Logistic Classifier
- Each data point:
- features [x1, x2, …, xd]
- label y ∈ {positive, negative}
- A (binary) linear logistic classifier is parameterized by [θ1, θ2, …, θd, θ0]
- To use a given classifier to make a prediction:
- compute the linear combination: z = (θ1x1 + θ2x2 + ⋯ + θdxd) + θ0
- predict the positive label if σ(z) > 0.5; otherwise, the negative label (sketched below)
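The same sketch as before, now with the sigmoid threshold (parameters again made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_logistic(x, theta, theta_0):
        # g is the confidence that x has the positive label
        g = sigmoid(np.dot(theta, x) + theta_0)
        return 1 if g > 0.5 else 0

    theta, theta_0 = np.array([1.0, -2.0]), 0.5
    print(predict_logistic(np.array([3.0, 1.0]), theta, theta_0))  # sigmoid(1.5) ≈ 0.82 > 0.5, so 1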

Sigmoid σ(z) = 1/(1 + e^{−z}): a smooth step function
- "sandwiched" between 0 and 1 vertically (never exactly 0 or 1 mathematically)
- monotonic, with a very nice/elegant gradient (see recitation/hw; checked numerically below)
- θ, θ0 can flip, squeeze, expand, and shift it horizontally
- σ(⋅) is interpreted as the probability/confidence that feature x has the positive label. Predict positive if σ(θ⊤x + θ0) > 0.5, i.e. exactly when θ⊤x + θ0 > 0.
e.g., suppose we want to predict whether to bike to school: with given parameters, how do we make a prediction?
[plots of the classifier with 1 feature and with 2 features; image credit: Tamara Broderick]
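A quick numerical check of the properties above, including the gradient identity σ′(z) = σ(z)(1 − σ(z)) derived in recitation/hw:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-5, 5, 11)
    s = sigmoid(z)
    print(s.min() > 0 and s.max() < 1)   # True: strictly between 0 and 1
    print(np.all(np.diff(s) > 0))        # True: monotonically increasing
    # finite-difference check of the gradient identity sigma'(z) = sigma(z) * (1 - sigma(z))
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.allclose(numeric, s * (1 - s)))  # True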

Learning a logistic regression classifier
- Let the labels y ∈ {+1, 0}, and write g(i) = σ(θ⊤x(i) + θ0) for the classifier's output on training point x(i)
- If y(i) = 1: we're happy 😍 when g(i) is close to 1 and unhappy 🥺 when it's close to 0, so penalize with −log(g(i))
- If y(i) = 0: we're happy 😍 when g(i) is close to 0 and unhappy 🥺 when it's close to 1, so penalize with −log(1 − g(i))
- The two cases combine into the negative log-likelihood loss: Lnll(g, y) = −(y log(g) + (1 − y) log(1 − g)) (evaluated on a few examples below)
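A sketch of this loss on a few made-up (g, y) pairs, showing how only one of the two terms activates per point:

    import numpy as np

    def nll_loss(g, y):
        # y in {+1, 0}: exactly one of the two terms is nonzero for any given point
        return -(y * np.log(g) + (1 - y) * np.log(1 - g))

    print(nll_loss(0.9, 1))  # confident and correct: small loss, -log(0.9) ≈ 0.105
    print(nll_loss(0.1, 1))  # confident and wrong: large loss, -log(0.1) ≈ 2.303
    print(nll_loss(0.1, 0))  # small again, this time via the -log(1 - g) term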

Logistic Regression
- Minimize the negative-log-likelihood training loss: Jlr = (1/n) ∑_{i=1}^n Lnll(σ(θ⊤x(i) + θ0), y(i))
- Convex, differentiable, with nice (elegant) gradients
- Doesn't have a closed-form solution
- Can still run gradient descent (sketched below)
- But, a gotcha: when the training data is linearly separable, scaling θ, θ0 up always decreases the loss, so ∥θ∥ grows without bound and no finite minimizer exists
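A minimal gradient-descent sketch for Jlr, using the gradient ∇θ Jlr = (1/n) ∑_i (g(i) − y(i)) x(i); the learning rate, step count, and data are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gd_logistic(X, y, lr=0.1, steps=1000):
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(steps):
            g = sigmoid(X @ theta + theta_0)  # current predictions on all n points
            err = g - y                       # per-point gradient of Lnll w.r.t. z
            theta -= lr * (X.T @ err) / n     # average gradient w.r.t. theta
            theta_0 -= lr * np.mean(err)      # average gradient w.r.t. theta_0
        return theta, theta_0

    # made-up 1-feature data; note it is linearly separable, so running many more
    # steps would keep inflating theta (the "gotcha" above)
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    print(gd_logistic(X, y))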





Regularized Logistic Regression

Jlr,reg(θ, θ0) = (1/n) ∑_{i=1}^n Lnll(σ(θ⊤x(i) + θ0), y(i)) + λ∥θ∥²
- λ ≥ 0
- No regularizing θ0 (think: why?)
- Penalizes being overly certain
- Objective is still differentiable and convex, so gradient descent still applies (sketched below)
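A sketch of how the penalty changes the gradient step from the previous sketch; the λ value here is made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gd_logistic_reg(X, y, lam=0.01, lr=0.1, steps=1000):
        n, d = X.shape
        theta, theta_0 = np.zeros(d), 0.0
        for _ in range(steps):
            err = sigmoid(X @ theta + theta_0) - y
            # the penalty lam * ||theta||^2 contributes 2 * lam * theta to the gradient
            theta -= lr * ((X.T @ err) / n + 2 * lam * theta)
            theta_0 -= lr * np.mean(err)  # theta_0 is deliberately left unregularized
        return theta, theta_0

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    # even on separable data, theta now stays bounded instead of growing without limit
    print(gd_logistic_reg(X, y))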




Outline
- Recap, classification setup
- Linear classifiers
- Separator, normal vector, and separability
- Linear logistic classifiers
- Motivation, sigmoid, and negative log-likelihood loss
- Multi-class classifiers
- One-hot encoding, softmax, and cross-entropy
(image adapted from Phillip Isola)
One-hot labels
- Generalizes from binary labels
- Suppose K classes: encode the label as a vector y ∈ ℝK with a 1 in the entry of the true class and 0s everywhere else (see the sketch below)
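A minimal sketch of the encoding (the class index and K are made up):

    import numpy as np

    def one_hot(label, K):
        # a length-K vector with a 1 in the true class's entry and 0s elsewhere
        y = np.zeros(K)
        y[label] = 1.0
        return y

    print(one_hot(2, 4))  # [0. 0. 1. 0.]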
Softmax
- Two classes: z is a scalar, and σ(z) is a scalar
- K classes: z is K-by-1, and softmax(z) is K-by-1
- Generalizes sigmoid
- softmax(z)k = e^{zk} / ∑_{j=1}^K e^{zj}: exponentiates z element-wise, then normalizes so the K entries lie in (0, 1) and sum to 1 (sketched below)
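A sketch of softmax; the max-subtraction line is a standard numerical-stability trick, not part of the definition above:

    import numpy as np

    def softmax(z):
        # exponentiate each entry, then normalize so all K entries sum to 1
        e = np.exp(z - np.max(z))  # subtracting max(z) avoids overflow; result is unchanged
        return e / e.sum()

    g = softmax(np.array([2.0, 1.0, 0.1]))
    print(g, g.sum())  # entries in (0, 1), summing to 1

    # the "generalizes sigmoid" connection: with K = 2 and z = [z1, 0],
    # softmax returns [sigmoid(z1), 1 - sigmoid(z1)]
    print(softmax(np.array([1.5, 0.0]))[0], 1 / (1 + np.exp(-1.5)))  # both ≈ 0.8176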
Negative log-likelihood multi-class loss
- Two classes: Lnll(g, y) = −(y log(g) + (1 − y) log(1 − g)) appears as a sum of two terms; only one term "activates" for a single data point
- K classes: Lnllm(g, y) = −∑_{k=1}^K yk log(gk) appears as a sum of K terms; with a one-hot y, only one term "activates" for a single data point (sketched below)
- Generalizes the negative log-likelihood loss
- Also known as cross-entropy
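A sketch of the K-class loss on made-up values; note how the one-hot y picks out a single log term:

    import numpy as np

    def nllm_loss(g, y):
        # cross-entropy -sum_k y_k * log(g_k); a one-hot y keeps exactly one term
        return -np.sum(y * np.log(g))

    g = np.array([0.7, 0.2, 0.1])  # made-up softmax output
    y = np.array([1.0, 0.0, 0.0])  # one-hot true label: class 0
    print(nllm_loss(g, y))         # -log(0.7) ≈ 0.357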

(image adapted from Phillip Isola)
[diagram: feature x feeds into the current prediction g = softmax(⋅), which is compared against the true label y via the loss Lnllm(g, y) = −∑_{k=1}^K yk ⋅ log(gk)]

Classification


Image classification played a pivotal role in kicking off the current wave of AI enthusiasm.
Summary
- Classification: a supervised learning problem, similar to regression, but where the output/label is in a discrete set.
- Binary classification: only two possible label values.
- Linear binary classification: think of θ and θ0 as defining a (d−1)-dimensional hyperplane that cuts the d-dimensional feature space into two half-spaces.
- 0-1 loss: a natural loss function for classification, BUT, hard to optimize.
- Sigmoid function: motivation and properties.
- Negative-log-likelihood loss: smoother and has nice probabilistic motivations. We can optimize via (S)GD.
- Regularization is still important.
- The generalization to multi-class via one-hot encoding and the softmax mechanism.
- Other ways to generalize to multi-class (see hw/lab)
Thanks!
We'd love to hear your thoughts.