The Perceptron

Cornell CS 3/5780 · Spring 2026


0. The Curse of Dimensionality

Dimension (\(d\))   Edge Length (\(\ell\))
2                   0.1
10                  0.63
100                 0.955
1000                0.9954

For \(k=10\), \(n=1000\)

(Figure: the distributions of all pairwise distances between randomly distributed points within \(d\)-dimensional unit cubes.)

0. The Curse of Dimensionality

  • Points drawn from a probability distribution tend not to be close together in high dimensions.
  • Volume Analysis: For uniformly distributed features, capturing \(k\) of the \(n\) points in a sub-cube of the unit cube \([0,1]^d\) requires an edge length \(\ell\) satisfying \(\ell^d \approx k/n\)
  • Question:
    • What happens to \(\ell\) for \(k/n\) fixed and \(d\) getting big?
    • How big does \(n\) need to get to keep \(\ell\) constant?
  • Mitigating the Curse
    • Linear Separation: Pairwise distances between points grow with dimensionality, but distances to hyperplanes do not.
    • Low Dimensional Structure: Data often lies on low-dimensional manifolds despite a high-dimensional \(d\).
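Solving the volume relation for \(\ell\) gives \(\ell = (k/n)^{1/d}\), which reproduces the table above; a minimal sketch in Python:

```python
# Edge length of the sub-cube of [0,1]^d expected to contain k of n
# uniformly distributed points: ell^d ≈ k/n, so ell ≈ (k/n)^(1/d).
def edge_length(d, k=10, n=1000):
    return (k / n) ** (1 / d)

for d in [2, 10, 100, 1000]:
    print(d, round(edge_length(d), 4))  # 0.1, 0.631, 0.955, 0.9954
```

Even for moderate \(d\), the sub-cube must span nearly the whole unit cube to capture a fixed fraction of the points.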

1. Perceptron Classifier

$$ h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b) $$

hyperplane: \(\mathbf{w}^\top \mathbf{x}_i + b =0\)

(Figure: a point \(\bf x\), the hyperplane with normal vector \(\bf w\), and its offset \(-b\) from the origin.)

1. Perceptron Classifier

  • Core Assumption: Binary classification with \(y_i \in \{-1, +1\}\) and data that is linearly separable
  • Classification Rule: Determined by which side of a hyperplane the input \(\mathbf x\) is on.
  • Formally: given by direction \(\mathbf{w}\) and bias \(b\) $$ h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b) $$
  • Without the bias term, the hyperplane that \(\mathbf{w}\) defines would always have to go through the origin.
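As a sketch, the classification rule is one line of code (the weights, bias, and input below are illustrative, not from the lecture):

```python
import numpy as np

def predict(w, b, x):
    """Classify x by which side of the hyperplane w^T x + b = 0 it lies on."""
    return 1 if w @ x + b > 0 else -1

w, b = np.array([1.0, -2.0]), 0.5            # example hyperplane parameters
print(predict(w, b, np.array([3.0, 1.0])))   # 1*3 - 2*1 + 0.5 = 1.5 > 0, so +1
```

Points exactly on the hyperplane get an arbitrary label here; the algorithm below treats them as mistakes.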

2. Simplified Formulation

 $$ \mathbf{x}_i \hspace{0.1in} \rightarrow \hspace{0.1in} \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix},\qquad \mathbf{w} \hspace{0.1in} \rightarrow \hspace{0.1in} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} $$

2. Simplified Formulation

  • Absorbing bias term: add one additional constant dimension $$ \mathbf{x}_i \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix},\qquad \mathbf{w} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} $$
  • New formulation: under the new definitions of \(\mathbf x_i\) and \(\mathbf w\), $$ h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i) $$
  • Key Observation: Note that \(\mathbf{x}_i\) is classified correctly (i.e. on the correct side of the hyperplane) if $$ y_i(\mathbf{w}^\top \mathbf{x}_i) > 0 $$
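The equivalence is easy to verify numerically; a sketch assuming NumPy arrays (the data values are illustrative):

```python
import numpy as np

def absorb_bias(X, w, b):
    """Append a constant-1 feature to each row of X and fold b into w."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w_aug = np.append(w, b)
    return X_aug, w_aug

X = np.array([[3.0, 1.0], [-1.0, 2.0]])
w, b = np.array([1.0, -2.0]), 0.5
X_aug, w_aug = absorb_bias(X, w, b)
# The augmented inner product equals the original affine score:
print(np.allclose(X_aug @ w_aug, X @ w + b))  # True
```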

3. Perceptron Algorithm

Input: Training data \(D = \{(\mathbf{x}_1, y_1), ..., (\mathbf{x}_n, y_n)\}\)

Initialize: \(\mathbf{w} = \mathbf{0}\)

While TRUE:

  1. set \(m=0\)
  2. for \((\mathbf{x}_i, y_i) \in D\)
    • if \(y_i(\mathbf{w}^\top \mathbf{x}_i) \leq 0\)
      • \(\mathbf{w} = \mathbf{w} + y_i \mathbf{x}_i\)
      • \(m = m+1\)
  3. if \(m=0\): break
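The pseudocode above translates almost line-for-line into Python; a sketch assuming NumPy arrays with the constant-1 feature already appended (`max_epochs` is an added safety cap, not part of the algorithm):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron training on bias-augmented data.

    X: (n, d) array, y: (n,) array of +/-1 labels.
    Loops until a full pass makes no mistakes (requires separable data).
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        m = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                w = w + y_i * x_i         # the Perceptron update
                m += 1
        if m == 0:                        # an epoch with no mistakes: done
            break
    return w

# Tiny separable example (constant-1 feature already appended):
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # True: every point is classified correctly
```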

(In-class activity: form a group of \(d=9\); each group member tracks one weight. Two examples are shown as \(3\times 3\) grids of the nine feature indices: \(\bf x_1\) with \(y_1=+1\) and \(\bf x_2\) with \(y_2=-1\).)


4. Perceptron Convergence

Setup: \(\|\mathbf{w}^*\| = 1\), \(\|\mathbf{x}_i\| \le 1~~\forall~ \mathbf{x}_i \in D\),
margin \( \gamma = \min_{(\mathbf{x}_i, y_i) \in D}|\mathbf{x}_i^\top \mathbf{w}^* | \)

(Figure: 1. the hyperplane misclassifies one red (−1) and one blue (+1) point; 2. \(\bf x\) is chosen and used for an update; 3. the updated hyperplane separates the two classes.)

4. Perceptron Convergence

  • Guarantee: If a data set is linearly separable, Perceptron finds a separating hyperplane in finite steps.
  • Separability: \(\exists \mathbf{w}^*\) such that \(y_i(\mathbf{x}_i^\top \mathbf{w}^* ) > 0\) for all \((\mathbf{x}_i, y_i) \in D\).
  • Rescaling: rescale the weights and features so that \(\|\mathbf{w}^*\| = 1\) and \(\|\mathbf{x}_i\| \le 1~~\forall~ \mathbf{x}_i \in D\)
  • Margin: the distance \(\gamma\) from the hyperplane to the closest data point: $$ \gamma = \min_{(\mathbf{x}_i, y_i) \in D}|\mathbf{x}_i^\top \mathbf{w}^* | $$
  • Key Observation: For all \(\mathbf{x}\) we must have \(y(\mathbf{x}^\top \mathbf{w}^*)=|\mathbf{x}^\top \mathbf{w}^*|\geq \gamma\).
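Under this rescaling, the margin is a one-line computation; a sketch with illustrative data:

```python
import numpy as np

def margin(X, w_star):
    """Margin of a unit-norm separator w_star over rows of X: min_i |x_i^T w*|."""
    w_star = w_star / np.linalg.norm(w_star)   # enforce ||w*|| = 1
    return np.min(np.abs(X @ w_star))

X = np.array([[0.5, 0.5], [0.2, -0.1], [-0.4, -0.3]])  # all with ||x_i|| <= 1
print(margin(X, np.array([1.0, 1.0])))  # min(0.7071, 0.0707, 0.4950) ≈ 0.0707
```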

5. Convergence Theorem

(Figure: x's and o's separated by hyperplanes with normal vectors \(\bf w\), with differing margins.)

5. Convergence Theorem

  • Theorem: For separable data with margin \(\gamma\), the Perceptron algorithm makes at most \(1 / \gamma^2\) mistakes.
  • Question: What is more desirable, a large margin or a small margin? When will the Perceptron converge quickly?
     
  • Fact 1: for misclassified \(\mathbf x\), we have \(y( \mathbf{x}^\top \mathbf{w})\leq 0\)
  • Fact 2: for any \(\mathbf x\), we have \(y( \mathbf{x}^\top \mathbf{w}^*)\geq\gamma\) due to the margin (previous slide)

6. Convergence Proof 

  • Proof idea: show \(\bf w\) becomes close to \(\bf w^\star\)
  • Recall the cosine formula for the dot product: $$\cos\left(\theta\right) = \frac{\bf w^\star\cdot \bf w }{\|\bf w^\star\| \|\bf w\|}$$

(Figure: the angle \(\theta\) between vectors \(\vec u\) and \(\vec v\): \(\vec u \cdot \vec v > 0\) when \(\theta\) is acute, \(\vec u \cdot \vec v = 0\) when they are perpendicular, and \(\vec u \cdot \vec v < 0\) when \(\theta\) is obtuse.)

6. Convergence Proof part 1

  • Consider the effect of an update on \(\mathbf{w}^\top \mathbf{w}^*\): $$\begin{align*} (\mathbf{w} + y\mathbf{x})^\top \mathbf{w}^* &= \mathbf{w}^\top \mathbf{w}^* + y(\mathbf{x}^\top \mathbf{w}^*) \\ &\ge \mathbf{w}^\top \mathbf{w}^* + \gamma \end{align*} $$
  • Consider the effect of an update on \(\mathbf{w}^\top \mathbf{w}\): $$ \begin{align*}(\mathbf{w} + y\mathbf{x})^\top (\mathbf{w} + y\mathbf{x}) &= \mathbf{w}^\top \mathbf{w} +2y\mathbf{w}^\top \mathbf{x} +y^2(\mathbf{x}^\top \mathbf{x}) \\&\le \mathbf{w}^\top \mathbf{w} + y^2(\mathbf{x}^\top \mathbf{x})  \\&\le \mathbf{w}^\top \mathbf{w} + 1 \end{align*}$$
  • This means that for each update, \(\mathbf{w}^\top \mathbf{w}^* \) grows by at least \(\gamma\) and \(\mathbf{w}^\top \mathbf{w}\) grows by at most 1.
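Both per-update facts can be checked numerically; the vectors below are made up for illustration (note \(\|\mathbf w^*\|=1\) and \(\|\mathbf x_i\|\le 1\) hold by construction):

```python
import numpy as np

# Numeric check of the two per-update facts from the proof.
w_star = np.array([0.6, 0.8, 0.0])            # unit-norm separator, ||w*|| = 1
X = np.array([[0.3, 0.4, 0.5],
              [0.1, 0.1, 0.0]])               # both rows have ||x_i|| <= 1
y = np.sign(X @ w_star)                       # labels given by w*
gamma = np.min(np.abs(X @ w_star))            # dataset margin

w = np.array([-0.2, -0.5, 0.1])               # a w that misclassifies X[0]
x0, y0 = X[0], y[0]
assert y0 * (w @ x0) <= 0                     # Fact 1 holds: misclassified

w_new = w + y0 * x0                           # one Perceptron update
print(w_new @ w_star >= w @ w_star + gamma)   # True: alignment grows >= gamma
print(w_new @ w_new <= w @ w + 1)             # True: squared norm grows <= 1
```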

7. Convergence Proof part 2

  • We initialize \(\mathbf{w}=\mathbf{0}\). Hence, initially \(\mathbf{w}^\top\mathbf{w}=0\) and \(\mathbf{w}^\top\mathbf{w}^*=0\).
  • After \(M\) updates, (1) \(\mathbf{w}^\top\mathbf{w}^*\geq M\gamma\) and (2) \(\mathbf{w}^\top \mathbf{w}\leq M\)

  • Starting with (1) and ending with (2) $$ \begin{align*} M\gamma &\le \mathbf{w}^\top \mathbf{w}^*  \\ &=\|\mathbf{w}\|\|\mathbf{w}^*\|\cos(\theta) \\ &\leq \|\mathbf{w}\|  \\ &= \sqrt{\mathbf{w}^\top \mathbf{w}} \le \sqrt{M} \end{align*} $$

  • Rearranging \(M\gamma \le \sqrt{M}\), we conclude \(M \le {1}/{\gamma^2}\)
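The bound can be checked empirically; a sketch on synthetic separable data (the data construction is illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable data: label by the sign of coordinate 0, keeping points
# away from the boundary so the margin is strictly positive.
n, d = 200, 3
X = rng.uniform(-1, 1, size=(n, d))
X = X[np.abs(X[:, 0]) > 0.2]                  # enforce a gap along coordinate 0
X = X / np.max(np.linalg.norm(X, axis=1))     # rescale so ||x_i|| <= 1
y = np.sign(X[:, 0])

w_star = np.array([1.0, 0.0, 0.0])            # unit-norm separator
gamma = np.min(np.abs(X @ w_star))            # margin of the dataset

# Run the Perceptron, counting total mistakes M.
w, M = np.zeros(d), 0
while True:
    m = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w, m = w + y_i * x_i, m + 1
    M += m
    if m == 0:
        break

print(M <= 1 / gamma**2)  # True: total mistakes never exceed the bound
```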

8. History

Frank Rosenblatt

New York Times, 1958

IBM 704

MARK I
Perceptron, 1960

Minsky & Papert

Perceptrons, 1969

(Figure: the Mark I Perceptron as a pipeline: input features \(\vec x\) and weights \(\vec w\) produce the hypothesis \(h_{\vec w}\), with updates driven by the labels \(\vec y\).)
No good algorithm for multiple layers (yet)

Fundamental limitations of linear classifiers (XOR)

9. Summary

  • The Perceptron is a binary linear classifier
  • We absorb the bias term by adding a constant feature dimension
  • Guaranteed to converge if data is linearly separable
    • Number of mistakes bounded by \(1/\gamma^2\) where \(\gamma\) is the margin
    • Larger margins lead to faster convergence
  • Cannot solve non-linearly separable problems (like XOR)
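The XOR limitation is easy to demonstrate; a sketch that caps the training loop, since on non-separable data it would never exit:

```python
import numpy as np

# XOR with the constant-1 feature appended: no linear separator exists,
# so the Perceptron can never complete a mistake-free pass.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w = np.zeros(3)
for _ in range(100):                     # cap the loop: it would run forever
    m = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w = w + y_i * x_i
            m += 1
    if m == 0:
        break
print(m > 0)  # True: after 100 epochs, some point is still misclassified
```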

The Perceptron

By Sarah Dean
