Intro to Machine Learning


Lecture 9: Non-parametric Models
Shen Shen
April 12, 2024
(many slides adapted from Tamara Broderick)
Outline
- Recap (transformers)
- Non-parametric models
  - interpretability
  - ease of use/simplicity
- Decision tree
  - Terminologies
  - Learn via the BuildTree algorithm
    - Regression
    - Classification
- Nearest neighbor

Enduring principles:
- Chop up signal into patches (divide and conquer)
- Process each patch identically (and in parallel)
Lessons from CNNs

CNN
- Importantly, all these learned projection weights W are shared along the token sequence.
- Same "operation" repeated.
[Figure: example token sequence 命 運 我 操 縱 ("I control my fate")]
Transformers


Interpretability
Non-parametric models
- "Non-parametric" does not mean "no parameters".
- There are still parameters to be learned to build a hypothesis/model.
- It is just that the model/hypothesis does not have a fixed parameterization (e.g. even the number of parameters can change).
- Decision trees and nearest neighbor are the classical examples of non-parametric models.

features:
x1: date
x2: age
x3: height
x4: weight
x5: sinus tachycardia?
x6: min systolic bp, 24h
x7: latest diastolic bp
labels:
1: high risk
-1: low risk

Root node
Internal (decision) node
Leaf (terminal) node

Split dimension
Split value
A node can be specified by
Node(split dim, split value, left child, right child)

A leaf can be specified by
Leaf(leaf value)
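The Node and Leaf specifications above can be sketched as two small Python dataclasses. This is a minimal illustration rather than the lecture's own code: the field names, the `predict` helper, the example tree values, and the convention that the right child handles feature values at or above the split value are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    leaf_value: float  # value predicted for every point that reaches this leaf


@dataclass
class Node:
    split_dim: int      # index of the feature tested at this node
    split_value: float  # threshold the feature is compared against
    left: "Tree"        # subtree for x[split_dim] <  split_value (assumed convention)
    right: "Tree"       # subtree for x[split_dim] >= split_value (assumed convention)


Tree = Union[Node, Leaf]


def predict(tree: Tree, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return that leaf's value."""
    while isinstance(tree, Node):
        if x[tree.split_dim] >= tree.split_value:
            tree = tree.right
        else:
            tree = tree.left
    return tree.leaf_value


# Hypothetical tree: split on feature 0 at 20.0, then on feature 1 at 0.5.
example_tree = Node(0, 20.0, Leaf(2.0), Node(1, 0.5, Leaf(8.0), Leaf(3.0)))
```

Calling `predict(example_tree, [25.0, 0.1])` follows the right branch at the root and the left branch at the second node.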
features:
- x1: temperature (deg C)
- x2: precipitation (cm/hr)
labels:
y: km run
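A tree defines an axis-aligned “partition” of the feature space: each leaf corresponds to one box-shaped region, and every point in that region gets that leaf's value.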












How to learn a tree?

Recall: familiar "recipe"
- Choose how to predict label (given features & parameters)
- Choose a loss (between guess & actual label)
- Choose parameters by trying to minimize the training loss
Here, we need:
- For each internal node:
  - split dimension
  - split value
  - child nodes
- For each leaf node:
  - label
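BuildTree(I; k)
- Input $I$: set of indices of training data points
- $k$: hyperparameter, the maximum leaf "size", i.e. the largest number of training data points allowed to end up in a leaf node
- $j$: split dimension
- $s$: split value
- $\hat{y}$: (intermediate) prediction

The algorithm, with numbered lines:
- (1) if $|I| > k$
  - (2) for each split dim $j$ and split value $s$
    - (3) Set $I_{j,s}^{+} = \{ i \in I \mid x_j^{(i)} \ge s \}$
    - (4) Set $I_{j,s}^{-} = \{ i \in I \mid x_j^{(i)} < s \}$
    - (5) Set $\hat{y}_{j,s}^{+} = \text{average}_{i \in I_{j,s}^{+}} \, y^{(i)}$
    - (6) Set $\hat{y}_{j,s}^{-} = \text{average}_{i \in I_{j,s}^{-}} \, y^{(i)}$
    - (7) Set $E_{j,s} = \sum_{i \in I_{j,s}^{+}} (y^{(i)} - \hat{y}_{j,s}^{+})^2 + \sum_{i \in I_{j,s}^{-}} (y^{(i)} - \hat{y}_{j,s}^{-})^2$
  - (8) Set $(j^*, s^*) = \arg\min_{j,s} E_{j,s}$
- (9) else
  - (10) Set $\hat{y} = \text{average}_{i \in I} \, y^{(i)}$
  - (11) return Leaf(leaf_value $= \hat{y}$)
- (12) return Node($j^*$, $s^*$, BuildTree($I_{j^*,s^*}^{-}$; $k$), BuildTree($I_{j^*,s^*}^{+}$; $k$))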
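The BuildTree pseudocode above could be sketched in Python roughly as follows. This is an illustrative sketch under several assumptions, not the lecture's implementation: indices are 0-based, candidate split values are taken as midpoints between neighboring distinct feature values, ties are broken by keeping the first best split (rather than at random), and an extra guard returns a leaf when no split is possible. The names `build_tree`, `candidate_splits`, and `sse` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    leaf_value: float


@dataclass
class Node:
    split_dim: int
    split_value: float
    left: "Tree"   # built from I^- (feature value <  split value)
    right: "Tree"  # built from I^+ (feature value >= split value)


Tree = Union[Node, Leaf]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def sse(y: List[float], idx: List[int]) -> float:
    """Sum of squared errors around the average label of the points in idx."""
    y_hat = mean([y[i] for i in idx])
    return sum((y[i] - y_hat) ** 2 for i in idx)


def candidate_splits(X: List[List[float]], I: List[int]):
    """Yield (j, s) pairs with s halfway between neighboring distinct values."""
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in I})
        for a, b in zip(values, values[1:]):
            yield j, (a + b) / 2


def build_tree(X: List[List[float]], y: List[float], I: List[int], k: int) -> Tree:
    splits = list(candidate_splits(X, I))
    if len(I) > k and splits:  # "and splits" is an added guard (assumption)
        best = None
        for j, s in splits:
            I_plus = [i for i in I if X[i][j] >= s]
            I_minus = [i for i in I if X[i][j] < s]
            E = sse(y, I_plus) + sse(y, I_minus)
            if best is None or E < best[0]:  # first-found tie-breaking
                best = (E, j, s, I_minus, I_plus)
        _, j_star, s_star, I_minus, I_plus = best
        return Node(j_star, s_star,
                    build_tree(X, y, I_minus, k),
                    build_tree(X, y, I_plus, k))
    return Leaf(mean([y[i] for i in I]))


# Hypothetical one-feature data set with three points.
X_demo = [[1.0], [2.0], [3.0]]
y_demo = [0.0, 5.0, 5.0]
tree_demo = build_tree(X_demo, y_demo, [0, 1, 2], k=2)
```

With `k=2` the three-point set is split once, and each side is small enough to become a leaf.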


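- Choose $k = 2$
- BuildTree($\{1, 2, 3\}$; 2)
  - Line 1 is true, since $|I| = 3 > 2$
  - Consider a fixed $(j, s)$:
    - $I_{j,s}^{+} = \{2, 3\}$
    - $I_{j,s}^{-} = \{1\}$
    - $\hat{y}_{j,s}^{+} = 5$
    - $\hat{y}_{j,s}^{-} = 0$
    - $E_{j,s} = 0$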
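The arithmetic for the fixed split above can be checked with a few lines of Python. Since the error is 0, every label must equal its group's average, so the labels are $y^{(1)} = 0$ and $y^{(2)} = y^{(3)} = 5$. The function name `split_error` and the contrasting second split are hypothetical additions for illustration.

```python
def split_error(y_plus, y_minus):
    """Sum of squared errors of each side of a split around its own average."""
    def sse(values):
        avg = sum(values) / len(values)
        return sum((v - avg) ** 2 for v in values)
    return sse(y_plus) + sse(y_minus)


# Labels implied by the example: y(1) = 0, y(2) = y(3) = 5.
y = {1: 0.0, 2: 5.0, 3: 5.0}

# The split from the example: I+ = {2, 3}, I- = {1}.
E_example = split_error([y[2], y[3]], [y[1]])

# A hypothetical alternative split for contrast: I+ = {3}, I- = {1, 2}.
E_alternative = split_error([y[3]], [y[1], y[2]])
```

The alternative split mixes labels 0 and 5 on one side, so its error is strictly larger than that of the example split.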



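- For line 2, a finite number of $(j, s)$ combinations suffices: only splits that fall in between data points need to be considered, since any two split values between the same pair of neighboring points produce the same partition.
- Line 8 picks the "best" among these finitely many combinations (with random tie-breaking).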
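A small sketch of why finitely many split values are enough for one feature. Choosing the midpoint between neighboring distinct values is an assumption made here for concreteness (any value strictly in between gives the same partition), and the function names are hypothetical.

```python
def candidate_split_values(values):
    """One representative split value between each pair of neighboring distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def partition(values, s):
    """Indices with value >= s and indices with value < s, as in lines 3 and 4."""
    plus = [i for i, v in enumerate(values) if v >= s]
    minus = [i for i, v in enumerate(values) if v < s]
    return plus, minus


feature_values = [3.0, 1.0, 2.0, 2.0]
splits = candidate_split_values(feature_values)

# Two different split values between the same neighbors give the same partition.
same_partition = partition(feature_values, 1.2) == partition(feature_values, 1.8)
```

A feature with $m$ distinct values therefore contributes at most $m - 1$ candidate splits.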


Suppose line 8 sets $(j^*, s^*) = (1, 1.7)$. Line 12 then recurses on the two resulting index sets, $I_{1,1.7}^{-}$ and $I_{1,1.7}^{+}$.






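BuildTree(I; k) for classification, with numbered lines:
- (1) if $|I| > k$
  - (2) for each split dim $j$ and split value $s$
    - (3) Set $I_{j,s}^{+} = \{ i \in I \mid x_j^{(i)} \ge s \}$
    - (4) Set $I_{j,s}^{-} = \{ i \in I \mid x_j^{(i)} < s \}$
    - (5) Set $\hat{y}_{j,s}^{+} = \text{majority}_{i \in I_{j,s}^{+}} \, y^{(i)}$
    - (6) Set $\hat{y}_{j,s}^{-} = \text{majority}_{i \in I_{j,s}^{-}} \, y^{(i)}$
    - (7) Set $E_{j,s} = \frac{|I_{j,s}^{-}|}{|I|} \cdot H(I_{j,s}^{-}) + \frac{|I_{j,s}^{+}|}{|I|} \cdot H(I_{j,s}^{+})$
  - (8) Set $(j^*, s^*) = \arg\min_{j,s} E_{j,s}$
- (9) else
  - (10) Set $\hat{y} = \text{majority}_{i \in I} \, y^{(i)}$
  - (11) return Leaf(leaf_value $= \hat{y}$)
- (12) return Node($j^*$, $s^*$, BuildTree($I_{j^*,s^*}^{-}$; $k$), BuildTree($I_{j^*,s^*}^{+}$; $k$))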
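The only changes from regression to classification:
- Lines 5, 6, and 10: the average becomes a majority vote
- Line 7: the error is more involved, a weighted average of the entropies of the two sides of the split:
$E_{j,s} = \frac{|I_{j,s}^{-}|}{|I|} \cdot H(I_{j,s}^{-}) + \frac{|I_{j,s}^{+}|}{|I|} \cdot H(I_{j,s}^{+})$
where the entropy is $H = -\sum_{\text{class } c} \hat{P}_c \log_2 \hat{P}_c$ and $\hat{P}_c$ is the fraction of points in the set that belong to class $c$.
Example:
- $|I| = 9$, $|I_{j,s}^{-}| = 6$, $|I_{j,s}^{+}| = 3$
- So $E_{j,s} = \frac{6}{9} H(I_{j,s}^{-}) + \frac{3}{9} H(I_{j,s}^{+})$
- $H(I_{j,s}^{-}) = -\left[ \frac{3}{6} \log_2\left(\frac{3}{6}\right) + \frac{2}{6} \log_2\left(\frac{2}{6}\right) + \frac{1}{6} \log_2\left(\frac{1}{6}\right) \right]$
- $H(I_{j,s}^{+}) = -\left[ \frac{1}{3} \log_2\left(\frac{1}{3}\right) + \frac{0}{3} \log_2\left(\frac{0}{3}\right) + \frac{2}{3} \log_2\left(\frac{2}{3}\right) \right]$, using the convention $0 \log_2 0 = 0$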
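The weighted-entropy example above can be evaluated numerically with a short sketch. The class counts (3, 2, 1 on the minus side and 1, 0, 2 on the plus side) are read off the fractions in the example; the function names are hypothetical, and empty classes are skipped to implement the convention $0 \log_2 0 = 0$.

```python
import math


def entropy(class_counts):
    """H = -sum_c P_c * log2(P_c), skipping classes with zero count."""
    n = sum(class_counts)
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)


def weighted_entropy(counts_minus, counts_plus):
    """Line 7 for classification: size-weighted average of the two entropies."""
    n_minus, n_plus = sum(counts_minus), sum(counts_plus)
    n = n_minus + n_plus
    return (n_minus / n) * entropy(counts_minus) + (n_plus / n) * entropy(counts_plus)


H_minus = entropy([3, 2, 1])   # 6 points on the minus side
H_plus = entropy([1, 0, 2])    # 3 points on the plus side
E = weighted_entropy([3, 2, 1], [1, 0, 2])
```

A side containing a single class has entropy 0, so a split that separates the classes perfectly drives this error to 0.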





Bagging
- One of multiple ways to make and use an ensemble
- Bagging = Bootstrap aggregating
- Training data $\mathcal{D}_n$
- For $b = 1, \dots, B$:
  - Draw a new "data set" $\tilde{\mathcal{D}}_n^{(b)}$ of size $n$ by sampling with replacement from $\mathcal{D}_n$
  - Train a predictor $\hat{f}^{(b)}$ on $\tilde{\mathcal{D}}_n^{(b)}$
- Return
  - For regression: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{(b)}(x)$
  - For classification: the predictor at a point is the class with the highest vote count at that point
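The bagging steps above might look like this in Python. This is a sketch under stated assumptions: `fit` stands for any training routine that takes a data set and returns a predictor, `fit_mean` is a deliberately trivial hypothetical base learner used only so the example runs, and vote ties go to the class seen first rather than being broken at random.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) points from data, sampling with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def bag(data, fit, B, seed=0):
    """Train B predictors, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(B)]


def predict_regression(predictors, x):
    """Average the B predictions."""
    return sum(f(x) for f in predictors) / len(predictors)


def predict_classification(predictors, x):
    """Return the class with the highest vote count."""
    votes = Counter(f(x) for f in predictors)
    return votes.most_common(1)[0][0]


def fit_mean(sample):
    """Hypothetical base learner: always predict the sample's average label."""
    avg = sum(label for _, label in sample) / len(sample)
    return lambda x: avg


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # (feature, label) pairs
predictors = bag(data, fit_mean, B=20)
bagged_prediction = predict_regression(predictors, 1.0)
```

In practice the base learner would be something like a decision tree, which benefits more from averaging than this constant predictor does.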

Nearest neighbor classifier
Training: None (or rather: memorize the entire training data)
Predicting/testing:
- For a new data point $x_{\text{new}}$:
  - Find the $k$ points in the training data nearest to $x_{\text{new}}$
  - For classification: predict the label $\hat{y}_{\text{new}}$ for $x_{\text{new}}$ by taking a majority vote of the $k$ neighbors' labels $y$
  - For regression: predict the label $\hat{y}_{\text{new}}$ for $x_{\text{new}}$ by taking an average over the $k$ neighbors' labels $y$
- Hyperparameter: $k$
- Also need:
  - A distance metric (typically Euclidean or Manhattan distance)
  - A tie-breaking scheme (typically at random)
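The prediction procedure above can be sketched in plain Python. The function names and the toy data are hypothetical, and for simplicity this sketch breaks ties deterministically (equally distant points by training-set order, vote ties by the first class seen) instead of at random.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.dist(a, b)  # available in Python 3.8+


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def k_nearest(X, x_new, k, distance=euclidean):
    """Indices of the k training points closest to x_new."""
    order = sorted(range(len(X)), key=lambda i: distance(X[i], x_new))
    return order[:k]


def knn_classify(X, y, x_new, k, distance=euclidean):
    """Majority vote over the k neighbors' labels."""
    votes = Counter(y[i] for i in k_nearest(X, x_new, k, distance))
    return votes.most_common(1)[0][0]


def knn_regress(X, y, x_new, k, distance=euclidean):
    """Average of the k neighbors' labels."""
    neighbors = k_nearest(X, x_new, k, distance)
    return sum(y[i] for i in neighbors) / k


# Hypothetical training data: two well-separated clusters.
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
y_class = [1, 1, 1, -1, -1]
y_value = [0.0, 2.0, 4.0, 10.0, 12.0]
```

There is no training step here: all the work happens at prediction time, which is why the whole training set has to be kept around.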




Thanks!
We'd love it for you to share some lecture feedback.