Shen Shen
March 1, 2024
(some slides adapted from Tamara Broderick and Phillip Isola)
Testing (predicting): the trained model maps a new input \(x\) to a new prediction \(y\).
Recap:
- OLS has an analytical (closed-form) solution and an "easy" prediction mechanism
- Regularization
- Cross-validation
- Gradient descent (vanilla, sign-based)
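As a refresher, the two gradient-descent variants mentioned above can be sketched as follows (a minimal sketch; the function names and the toy objective are illustrative, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, steps=100):
    """Vanilla gradient descent: step along the negative gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_f(theta)
    return theta

def sign_gradient_descent(grad_f, theta0, eta=0.1, steps=100):
    """Sign-based variant: use only the sign of each gradient coordinate,
    so every update moves each coordinate by a fixed amount eta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * np.sign(grad_f(theta))
    return theta

# Toy objective: f(theta) = (theta - 3)^2, with gradient 2*(theta - 3)
grad = lambda th: 2 * (th - 3)
print(gradient_descent(grad, [0.0]))       # converges toward [3.]
print(sign_gradient_descent(grad, [0.0]))  # fixed-size steps toward 3
```

The sign-based variant is less sensitive to the gradient's magnitude, but cannot settle closer to the optimum than its step size allows.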
linear classifier → linear logistic regression (classifier)
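A minimal sketch of how linear logistic regression turns a linear combination into a probability and then a class label (the helper names and the toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta, theta0):
    """Logistic regression: sigmoid of the linear combination gives P(y=1 | x)."""
    return sigmoid(X @ theta + theta0)

def predict_label(X, theta, theta0):
    """Threshold the probability at 0.5 -- equivalently,
    take the sign of theta . x + theta0, as a linear classifier does."""
    return (predict_proba(X, theta, theta0) >= 0.5).astype(int)

X = np.array([[1.0, 2.0], [-1.0, -2.0]])
theta = np.array([1.0, 1.0])
print(predict_label(X, theta, 0.0))  # → [1 0]
```

The decision boundary is still linear; the sigmoid only changes how confident the outputs are, which is what makes the negative log likelihood a smooth, differentiable loss.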
An aside:
The idea of "distance" has already appeared.
- It will play a central role in later weeks (starting next week).
- It will play a central role in fundamental algorithms we won't discuss.
The (vanilla, sign-based) linear classifier using polynomial feature transformation,
e.g., polynomial features of order 3.
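A sketch of an order-3 polynomial feature transformation for a one-dimensional input (the function name is illustrative; libraries such as scikit-learn provide richer multivariate versions):

```python
import numpy as np

def poly_features_1d(x, order=3):
    """Map each scalar x to polynomial features [1, x, x^2, ..., x^order]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**k for k in range(order + 1)], axis=-1)

# A 1-D dataset becomes a 4-column feature matrix for order 3,
# letting a *linear* classifier draw nonlinear boundaries in the original x.
print(poly_features_1d([2.0], order=3))  # → [[1. 2. 4. 8.]]
```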
Underfitting | Appropriate model | Overfitting |
---|---|---|
high error on train set | low error on train set | very low error on train set |
high error on test set | low error on test set | very high error on test set |
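The three regimes in the table can be reproduced with a toy experiment: fit polynomials of increasing degree to noisy quadratic data and compare train and test error (the dataset, degrees, and noise level are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
y_train = x_train**2 + 0.1 * rng.normal(size=x_train.size)  # noisy quadratic
x_test = np.linspace(-1, 1, 50)
y_test = x_test**2                                          # noise-free truth

def mse(deg):
    """Fit a degree-`deg` polynomial on the train set; report train/test MSE."""
    coefs = np.polyfit(x_train, y_train, deg)
    err = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    return err(x_train, y_train), err(x_test, y_test)

for deg in [0, 2, 9]:  # underfit, appropriate, overfit
    tr, te = mse(deg)
    print(f"degree {deg}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Because the model classes are nested, train error can only go down as the degree grows; test error is what exposes the overfit model.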
(Example goal: diagnose if people have heart disease based on their available info.)
(Example: logistic regression. Loss: negative log likelihood. Regularizer: ridge penalty)
(Example: analytical/closed-form optimization, stochastic gradient descent (SGD))
Identify relevant info and encode it as real numbers.
Encode it in a way that's sensible for the task.
What about jobs?
What about medicine?
Recall: if we used one-hot encoding, we would need the exact combination in the data to learn the corresponding parameter.
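A minimal one-hot encoding sketch illustrating that point (the category list is a made-up example):

```python
def one_hot(value, categories):
    """Encode a categorical value as an indicator vector over `categories`."""
    return [1 if value == c else 0 for c in categories]

jobs = ["doctor", "engineer", "teacher"]
print(one_hot("engineer", jobs))  # → [0, 1, 0]

# Each category gets its own parameter, so a category that never
# appears in the training data leaves its parameter unlearned.
print(one_hot("nurse", jobs))     # → [0, 0, 0]
```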
Encode the Likert response as a single ordinal number:

Strongly disagree | Disagree | Neutral | Agree | Strongly agree |
---|---|---|---|---|
1 | 2 | 3 | 4 | 5 |
Or use a thermometer (cumulative) encoding:

Strongly disagree | Disagree | Neutral | Agree | Strongly agree |
---|---|---|---|---|
1,0,0,0,0 | 1,1,0,0,0 | 1,1,1,0,0 | 1,1,1,1,0 | 1,1,1,1,1 |
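The thermometer encoding in the table above can be sketched as follows (the function name is illustrative):

```python
def thermometer(level, num_levels=5):
    """Thermometer encoding of an ordinal level in 1..num_levels:
    set the first `level` entries to 1, the rest to 0."""
    return [1 if k < level else 0 for k in range(num_levels)]

# Likert scale: 1 = strongly disagree, ..., 5 = strongly agree
print(thermometer(3))  # → [1, 1, 1, 0, 0]
```

Unlike one-hot, this encoding preserves the ordering of the levels: adjacent responses differ in exactly one coordinate.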
We'd love it for you to share some lecture feedback.