From Linear Systems to NNs

Aadharsh Aadhithya A

3x+4y=12

What does it mean?

Set of all points (x,y) that satisfy the relation 3x+4y=12. 

How many solutions are possible?

\infty

In matrix form, this is

\begin{bmatrix} 3 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 12
c^T x = b

Now, from these infinite solutions, give me the solution that is closest to the origin!

\begin{bmatrix} 3 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = 12

The solution lies on the perpendicular from the origin to the line.

But what is the direction perpendicular to that line?

y = -\frac{3}{4}x+3
y=mx+c
m = \frac{\Delta y}{\Delta x} = -\frac{3}{4}

So a step of \Delta x = 4, \Delta y = -3 stays on the line: a direction along the line is

\begin{bmatrix} 4 \\ -3 \end{bmatrix}

We need a vector that is perpendicular to [4, -3]:

\begin{bmatrix} 4\\ -3 \end{bmatrix} \cdot v = 0
v = \begin{bmatrix} 3\\ 4 \end{bmatrix}

Then this solution is of the form

c \cdot \begin{bmatrix} 3 \\ 4 \end{bmatrix}

Exercise: Find c

Least Norm Solution
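
A minimal numpy sketch of this, using the example above: the pseudoinverse returns exactly this least norm solution, a multiple of [3, 4].

```python
import numpy as np

A = np.array([[3.0, 4.0]])        # the single equation 3x + 4y = 12
b = np.array([12.0])

x_ln = np.linalg.pinv(A) @ b      # least norm solution; same as A.T @ inv(A @ A.T) @ b here

print(x_ln)                        # [1.44 1.92] = (12/25) * [3, 4], so c = 12/25
print(A @ x_ln)                    # [12.]: it satisfies the equation
```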

Now, from these infinite solutions, give me the solution that is closest to the origin!

Least Norm Solution

\begin{bmatrix} 3 & 4 & 5 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = 12

This is a plane in 3D.

Infinitely many solutions are possible. The "least-norm" solution is again in the direction perpendicular to this plane.

The direction perpendicular to the plane is:

\begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}

So the least norm solution is of the form

c \cdot \begin{bmatrix} 3 \\ 4 \\ 5 \end{bmatrix}

Now, from these infinite solutions, give me the solution that is closest to the origin!

\begin{bmatrix} 3 & 4 & 5 \\ 6 & 7 & 8 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 12 \\ 13 \end{bmatrix}

What does this mean geometrically?

The intersection of two planes is a line.

Infinite solutions!

Infinitely many points; find the point closest to the origin.

The point lies on a line orthogonal to the line of intersection of the two planes.

But wait, there are skew lines in 3D! To pin it down: the point lies on the plane through the origin that is perpendicular to the line of intersection, i.e. the plane spanned by the two normal vectors. So it is of the form

c_1 \cdot \begin{bmatrix} 3 \\4\\5 \end{bmatrix} + c_2 \cdot \begin{bmatrix} 6 \\7\\8 \end{bmatrix}

Theorem

The least norm solution always lies in the row space of A.

Representer Theorem
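
A quick numpy check of the theorem on this example; pinv is used only to obtain the least norm solution, which is then verified to be a combination of the rows.

```python
import numpy as np

A = np.array([[3.0, 4.0, 5.0],
              [6.0, 7.0, 8.0]])
b = np.array([12.0, 13.0])

x = np.linalg.pinv(A) @ b                       # least norm solution of Ax = b

# Check it is c1*[3,4,5] + c2*[6,7,8]: solve for the coefficients
c, res, rank, _ = np.linalg.lstsq(A.T, x, rcond=None)
print(c)                                        # the coefficients c1, c2
print(np.allclose(A.T @ c, x))                  # True: x lies in the row space of A
print(np.allclose(A @ x, b))                    # True: x solves the system
```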

Ax=b

The least norm solution lies in the row space of A. In other words, it is a linear combination of the rows of A.

It is of the form

x = A^Tc

Substituting,

A(A^T c) = b
c = (AA^T)^{-1} b
x = A^T (AA^T)^{-1} b

What if AA^T is not invertible?

Resort to the pseudoinverse (pinv).
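
A small numpy sketch of the formula above, assuming a fat A with full row rank so that AA^T is invertible; it agrees with the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))                # 2 equations, 5 unknowns: full row rank (generically)
b = rng.standard_normal(2)

x_formula = A.T @ np.linalg.inv(A @ A.T) @ b   # x = A^T (A A^T)^{-1} b
x_pinv = np.linalg.pinv(A) @ b                 # same thing via the pseudoinverse

print(np.allclose(x_formula, x_pinv))          # True
print(np.allclose(A @ x_pinv, b))              # True: it solves Ax = b
```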

Pseudoinverse

A = U \Sigma V^T

(Figure: the four fundamental subspaces of A, i.e. column space, row space, and the null spaces, for an 8x8 A with rank 4. The least norm solution (LNS) lives in the row space; b may or may not lie in the column space.)

Ax=b
(U \Sigma V^T) x=b
U^TU=UU^T = I
U^T(U \Sigma V^T) x= U^T b
\Sigma V^T x= U^T b

\Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 & 0& 0 & 0 & 0 \\ 0 & 0 & \sigma_3 & \cdots & 0 & 0& 0 & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & \cdots & \cdots& \sigma_r & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ \end{bmatrix}
\Sigma^{\dagger} = \begin{bmatrix} \frac{1}{\sigma_1} & 0 & 0 & \cdots & 0 & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2} & 0 & \cdots & 0 & 0& 0 & 0 & 0 \\ 0 & 0 & \frac{1}{\sigma_3} & \cdots & 0 & 0& 0 & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & \cdots & \cdots & \cdots& \frac{1}{\sigma_r} & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ 0 & 0 & 0 & \cdots& \cdots & \cdots & 0 & 0 & 0 \\ \end{bmatrix}

Only the nonzero singular values are inverted; the zero rows and columns stay zero.

V^T x= \Sigma^{\dagger}(U^T b)
x= V \Sigma^{\dagger}(U^T b)

Your solution is a linear combination of the rows of A: it lies in the row space.

Pinv gives: the Least Norm Solution.

If b is not in the column space, it gives the Least Squares and Least Norm solution.
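
A numpy sketch of the construction above (an illustration, not library internals): build A^† from the SVD and compare it with np.linalg.pinv, for a rank-deficient A and a b outside the column space.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4)) @ rng.standard_normal((4, 8))   # 8x8 with rank 4, as in the figure
b = rng.standard_normal(8)                                       # generically not in the column space

# Build A^dagger from the SVD: invert only the nonzero singular values
U, s, Vt = np.linalg.svd(A)
s_dag = np.zeros_like(s)
s_dag[s > 1e-10] = 1.0 / s[s > 1e-10]
A_dag = Vt.T @ np.diag(s_dag) @ U.T

print(np.allclose(A_dag, np.linalg.pinv(A)))                     # True
x = A_dag @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]                   # minimum-norm least-squares solution
print(np.allclose(x, x_lstsq))                                   # True
```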

Linear Regression

y=mx+c

Given data points (x_i,y_i):

\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ \vdots & \vdots \\ x_n & 1 \\ \end{bmatrix} \cdot \begin{bmatrix} m\\ c \end{bmatrix} \approx \begin{bmatrix} y_1\\ y_2 \\ y_3 \\ \vdots \\ y_n \\ \end{bmatrix}

X w \approx Y

\min_w || Y - Xw ||_2^2 = \min_w \sum_{i=1}^N (y_i - w \cdot x_i)^2

w = X^{\dagger} Y
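
A minimal numpy sketch with synthetic data (the line y = 2x + 0.5 plus noise is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)   # noisy line, true m = 2, c = 0.5

X = np.column_stack([x, np.ones_like(x)])           # design matrix with rows [x_i, 1]
w = np.linalg.pinv(X) @ y                           # least-squares solution w = X^dagger y

m, c = w
print(m, c)                                          # close to 2.0 and 0.5
```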

Kernel Regression

Perform linear regression after mapping the data.

https://xavierbourretsicotte.github.io/Kernel_feature_map.html

\psi: \mathbb{R}^d \rightarrow \mathbb{R}^k

Linear Regression:
L(w) = \sum_{i=1}^N (y_i - w \cdot x_i)^2

Kernel Regression:
L(w) = \sum_{i=1}^N (y_i - w \cdot \psi(x_i))^2

Typically we map to higher dimensions, i.e. k > d, so in \mathbb{R}^k we have k unknowns.

The least norm solution is of the form

w = \sum_j^{N} \alpha_j \psi(x_j)

Substituting,

L(\alpha) = \sum_{i=1}^N \left (y_i - \left ( \sum_j^{N} \alpha_j \psi(x_j) \right ) \cdot \psi(x_i) \right)^2

The inner term is

\alpha_1 \psi(x_1) \cdot \psi(x_i) + \alpha_2 \psi(x_2) \cdot \psi(x_i) + \cdots + \alpha_N \psi(x_N) \cdot \psi(x_i) = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_N \end{bmatrix}\begin{bmatrix} \psi(x_1) \cdot \psi(x_i) \\ \psi(x_2) \cdot \psi(x_i) \\ \vdots\\ \psi(x_N) \cdot \psi(x_i) \end{bmatrix} = \alpha^T K(X,x_i)

L(\alpha) = \sum_{i=1}^N \left (y_i - \alpha^T K(X,x_i) \right)^2

N unknowns.

Even if we map to "infinite dimensions", we still have to determine only N unknowns.

Stacking the N equations,

L(\alpha) = || y - \hat{K} \alpha ||_2^2

This is exactly linear regression over \alpha, with the kernel (Gram) matrix

\hat{K} = \begin{bmatrix} \psi(x_1) \cdot \psi(x_1) & \psi(x_1) \cdot \psi(x_2) & \cdots & \psi(x_1) \cdot \psi(x_N) \\ \psi(x_2) \cdot \psi(x_1) & \psi(x_2) \cdot \psi(x_2) & \cdots & \psi(x_2) \cdot \psi(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ \psi(x_N) \cdot \psi(x_1) & \psi(x_N) \cdot \psi(x_2) & \cdots & \psi(x_N) \cdot \psi(x_N) \\ \end{bmatrix}

\hat{K} \alpha = y
\alpha = \hat{K}^{\dagger} y

w = \sum_j^{N} \alpha_j \psi(x_j)
f(x) = \alpha^T K(X,x) = y^T \hat{K}^{\dagger} K(X,x)
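
A compact numpy sketch of kernel regression with an explicit feature map; the degree-5 polynomial map and the sine target are assumptions chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)                       # N = 30 points in d = 1
y = np.sin(3 * x) + 0.05 * rng.standard_normal(30)

def psi(x):
    # Explicit feature map R^1 -> R^6 (degree-5 polynomial features), for illustration
    return np.column_stack([x**p for p in range(6)])

Phi = psi(x)                                          # N x k feature matrix
K_hat = Phi @ Phi.T                                   # Gram matrix, K_hat[i, j] = psi(x_i) . psi(x_j)

alpha = np.linalg.pinv(K_hat) @ y                     # alpha = K_hat^dagger y

def f(x_new):
    k_vec = Phi @ psi(np.atleast_1d(x_new)).T         # K(X, x_new), shape (N, 1)
    return (alpha @ k_vec)[0]                         # alpha^T K(X, x)

print(f(0.3), np.sin(0.9))                            # prediction vs. true value (should be close)
```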

NNGP

Neural Network Gaussian Process

f(x) = \frac{1}{\sqrt{k}} A \phi(B \cdot x)

x \in \mathbb{R}^d, \quad B \in \mathbb{R}^{k \times d}, \quad A \in \mathbb{R}^{1 \times k}, \quad \phi \text{ is the activation}, \quad k \text{ is the width}

B maps x to \mathbb{R}^k. B is initialized randomly, B_{ij} \sim \mathcal{N}(0,1), and then kept fixed; we update A only.

L(A) = \sum_{i=1}^N \left(y_i - \frac{1}{\sqrt{k}} A \phi(B x_i)\right)^2

Map, then do linear regression, i.e. kernel regression:

L(\alpha) = || y - \hat{K} \alpha ||_2^2, \qquad \hat{K} \alpha = y, \qquad \alpha = \hat{K}^{\dagger} y

(A finite-width sketch of this recipe follows below.) What happens in the limit \lim_{k\rightarrow \infty} of \frac{1}{\sqrt{k}} A \phi(B \cdot x)?
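
A short numpy sketch of the finite-width recipe: B is random and fixed, only A is fit by least squares. The ReLU activation and the synthetic target are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, N = 2, 512, 100                         # input dim, width, number of samples
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]           # some target function, for illustration

B = rng.standard_normal((k, d))               # B_ij ~ N(0,1), fixed after initialization
phi = lambda z: np.maximum(z, 0.0)            # ReLU activation (an assumption)

Psi = phi(X @ B.T) / np.sqrt(k)               # psi(x_i) = (1/sqrt(k)) * phi(B x_i), shape (N, k)
A = np.linalg.pinv(Psi) @ y                   # train the outer layer only: least squares / least norm

f = lambda Xnew: (phi(Xnew @ B.T) / np.sqrt(k)) @ A
print(np.mean((f(X) - y) ** 2))               # training error (typically ~0, since k > N)
```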

How to calculate the kernel matrix K?

\psi(x) = \frac{1}{\sqrt{k}} \phi(B \cdot x)

\hat{K}_{ij} = \psi(x_i) \cdot \psi(x_j), and for a pair of inputs x, \tilde{x}:

K(x,\tilde{x}) = \left \langle \frac{1}{\sqrt{k}} \phi(B \cdot x), \frac{1}{\sqrt{k}} \phi(B \cdot \tilde{x}) \right \rangle
K(x,\tilde{x}) = \frac{1}{k} \left \langle \phi(B \cdot x), \phi(B \cdot \tilde{x}) \right \rangle
K(x,\tilde{x}) = \frac{1}{k} \sum_i^{k} \phi(B_{i,:} \cdot x ) \cdot \phi(B_{i,:} \cdot \tilde{x})

DETOUR: SAMPLING FROM DISTRIBUTIONS AND ESTIMATION OF MEAN

The B_{i,:} \cdot x are samples from normal distributions, so each term \phi(B_{i,:} \cdot x) \, \phi(B_{i,:} \cdot \tilde{x}) is a sample of "a function of a random variable", and we are averaging k such samples.

As we take larger and larger k, we get a closer and closer estimate of the mean/expectation. Writing w \sim \mathcal{N}(0,I) for a generic row of B,

K(x,\tilde{x}) = \mathop{\mathbb{E}}_{w \sim \mathcal{N}(0,I)} \left [ \phi(w^T x) \phi(w^T \tilde{x}) \right ]
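
A small numpy check of this law-of-large-numbers argument (ReLU is just an example activation): the finite-k average stabilizes as k grows.

```python
import numpy as np

rng = np.random.default_rng(5)
phi = lambda z: np.maximum(z, 0.0)             # ReLU, as an example activation

x  = np.array([1.0, 0.0])
xt = np.array([0.6, 0.8])                      # both unit norm

for k in [10, 100, 10_000, 1_000_000]:
    B = rng.standard_normal((k, 2))
    K_hat = np.mean(phi(B @ x) * phi(B @ xt))  # (1/k) * sum_i phi(B_i . x) phi(B_i . x~)
    print(k, K_hat)                             # estimates stabilize as k grows
```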

Write u = w^T x and v = w^T \tilde{x}, so that

K(x,\tilde{x}) = \mathop{\mathbb{E}} \left [ \phi(u) \phi(v) \right ]

u and v come from different (but jointly Gaussian) distributions:

\mathbb{E}[u] = \mathbb{E}[w_1x_1 + w_2x_2 + \cdots + w_d x_d] = \mathbb{E}[w_1x_1] + \mathbb{E}[w_2x_2] + \cdots + \mathbb{E}[w_d x_d] = 0
\mathbb{E}[v] = 0
Var(u) = Var(w_1x_1 + \cdots + w_d x_d) = x_1^2 Var(w_1) + \cdots + x_d^2 Var(w_d) = x_1^2 + \cdots + x_d^2 = ||x||_2^2
Var(v) = ||\tilde{x}||_2^2
cov(u,v) = \langle x, \tilde{x} \rangle

So

(u,v) \sim \mathcal{N}(0,\Lambda), \qquad \Lambda = \begin{bmatrix} ||x||_2^2 & \langle x, \tilde{x} \rangle \\ \langle x, \tilde{x} \rangle & ||\tilde{x}||_2^2 \end{bmatrix}

DUAL Activations

The dual activation \check{\phi} of \phi is defined by

\check{\phi}(\epsilon) = \mathop{\mathbb{E}} \left [ \phi(u) \phi(v) \right ], \qquad (u,v) \sim \mathcal{N}(0,\Lambda), \qquad \Lambda = \begin{bmatrix} 1 & \epsilon\\ \epsilon & 1 \end{bmatrix}

If the data is normalized (||x||_2 = ||\tilde{x}||_2 = 1), then

K(x,\tilde{x}) = \check{\phi}(\langle x, \tilde{x} \rangle)
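
A Monte Carlo sketch of the dual activation for ReLU (an example choice of phi); the correlated Gaussians are generated directly rather than through a library sampler.

```python
import numpy as np

rng = np.random.default_rng(6)
phi = lambda z: np.maximum(z, 0.0)              # ReLU, as an example activation

def dual_phi(eps, n_samples=2_000_000):
    # Monte Carlo estimate of E[phi(u) phi(v)] with (u, v) ~ N(0, [[1, eps], [eps, 1]])
    z1 = rng.standard_normal(n_samples)
    z2 = rng.standard_normal(n_samples)
    u = z1
    v = eps * z1 + np.sqrt(1.0 - eps**2) * z2   # gives Var(v) = 1 and cov(u, v) = eps
    return np.mean(phi(u) * phi(v))

for eps in [-1.0, 0.0, 0.5, 1.0]:
    print(eps, dual_phi(eps))                    # e.g. dual_phi(1.0) ~ E[phi(u)^2] = 1/2 for ReLU
```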


NTK

Neural Tangent Kernel

f(w; x) = \frac{1}{\sqrt{k}} A \phi(Bx)
w = \begin{bmatrix} A_{11} & \cdots & A_{1k} \;|\; B_{11} & \cdots & B_{kd} \end{bmatrix}
w \in \mathbb{R}^{k+kd}

"Linearize f around some initialization w^0"

f_x(w) = f_x(w^0) + \nabla f_x(w^0)^T (w-w^0)
\nabla f_x = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} \\ \frac{\partial f}{\partial A_{12}} \\ \vdots \\ \frac{\partial f}{\partial B_{kd}} \\ \end{bmatrix}

f_x(w^0) is a constant (it can be taken to be 0), and (w - w^0) carries the unknown weights. Then this is a linear model in w:

L(w) = \sum_{i=1}^N (y_i - \langle w , \nabla f_{x_i} (w^0)\rangle)^2

This can be interpreted as mapping the data and solving a linear model around w^0, with the feature map

\psi: x \rightarrow \nabla f_x (w^0)

and the corresponding kernel

K(x,\tilde{x}) = \langle \nabla f_x(w^0) , \nabla f_{\tilde{x}}(w^0) \rangle

This is the Neural Tangent Kernel.

For the two-layer network,

f(w;x) = \frac{1}{\sqrt{k}} A \phi(Bx) = \frac{1}{\sqrt{k}} \sum_i^k A_{1i} \phi(B_{i,:} x)
\frac{\partial f}{\partial A_{1i}} = \frac{1}{\sqrt{k}} \phi(B_{i,:} x)
\frac{\partial f}{\partial B_{ij}} = \frac{1}{\sqrt{k}} A_{1i} \phi'(B_{i,:} x)\, x_j
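
A hand-rolled numpy sketch of this empirical NTK for the two-layer ReLU network (not a library API), using the gradient formulas above:

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 3, 256
A = rng.standard_normal((1, k))                 # output weights at initialization w^0
B = rng.standard_normal((k, d))                 # hidden weights at initialization w^0
phi  = lambda z: np.maximum(z, 0.0)             # ReLU (assumed activation)
dphi = lambda z: (z > 0).astype(float)          # its derivative

def grad_f(x):
    # Gradient of f(w; x) = (1/sqrt(k)) * A phi(Bx) with respect to all weights, flattened
    z = B @ x
    dA = phi(z) / np.sqrt(k)                                      # df/dA_{1i}
    dB = (A[0] * dphi(z))[:, None] * x[None, :] / np.sqrt(k)      # df/dB_{ij}
    return np.concatenate([dA, dB.ravel()])

def ntk(x, x_tilde):
    # K(x, x~) = < grad f_x(w^0), grad f_{x~}(w^0) >
    return grad_f(x) @ grad_f(x_tilde)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(ntk(x1, x2))
```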
