Online Kernel matrix factorization

Andrés Esteban Páez Torres

MindLab

Universidad Nacional de Colombia

Motivation

  • In recent years, the growth of information production has made its processing and analysis infeasible at scale.
  • 300 hours of video are uploaded to YouTube every minute.
  • 350,000 tweets are published on Twitter every minute.

Motivation

  • There are many well-studied and well-tested matrix factorization (MF) methods.
  • These MF methods are especially well suited for dimensionality reduction, manifold learning, dictionary learning, and clustering tasks.

Motivation

  • Kernel matrix factorization (KMF) methods are very useful in data analysis and machine learning tasks.
  • However, KMF methods have a high computational cost.

Problem

The challenge is to devise effective and efficient mechanisms to perform matrix factorization in high dimensional feature spaces implicitly defined by kernels.

Method

Let's consider the following factorization

\Phi(X)_{p\times n}=\Phi(X)_{p\times n}W_{n\times r}H_{r\times n}

And its reconstruction error

\Vert\Phi(X)-\Phi(X)WH\Vert^2

Method

Which we can express as

Tr(\Phi(X)^T\Phi(X)-2\Phi(X)^T\Phi(X)WH+H^TW^T\Phi(X)^T\Phi(X)WH)

Using the kernel function k(X,X)=\Phi(X)^T\Phi(X), this becomes

Tr(\color{red}{k(X,X)}-2\color{red}{k(X,X)}WH+H^TW^T\color{red}{k(X,X)}WH)
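The identity above can be verified numerically with a linear kernel, where \Phi(X)=X and k(X,X)=X^TX (a sanity check, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 5, 8, 3
X = rng.standard_normal((p, n))   # data matrix, columns are samples
W = rng.standard_normal((n, r))
H = rng.standard_normal((r, n))

# Explicit reconstruction error ||Phi(X) - Phi(X) W H||^2
# (linear kernel: Phi is the identity map)
err_explicit = np.linalg.norm(X - X @ W @ H) ** 2

# The same error computed only from the kernel matrix k(X,X) = X^T X
K = X.T @ X
err_kernel = np.trace(K - 2 * K @ W @ H + H.T @ W.T @ K @ W @ H)
```

Both quantities agree up to floating-point error, confirming that the reconstruction error can be evaluated without an explicit feature map.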

Method

But the terms k(X,X) have dimension n\times n, which makes it infeasible to apply the factorization to a large number of samples.

Method

To address this problem, let's consider the following factorization over a smaller basis matrix B containing l samples (with l \ll n):

\Phi(X)_{p\times n}=\Phi(B)_{p\times l}W_{l\times r}H_{r\times n}

Method

Given this factorization, we pose the following per-sample optimization problem, which we solve with stochastic gradient descent (SGD):

J = \|\Phi(x_i)-\Phi(B)Wh_i\|^2 + \lambda \| W \|^2 + \alpha \| h_i \|^2
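Expanding the norm shows that J depends on \Phi only through kernel evaluations, so it can be computed without the explicit feature map. A minimal sketch (the RBF kernel choice and the function names are assumptions, not from the slides):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma ||a - b||^2) for all column pairs of A and B
    d = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d)

def objective(x_i, B, W, h_i, lam, alpha):
    # J = k(x_i,x_i) - 2 k(x_i,B) W h_i + h_i^T W^T k(B,B) W h_i
    #     + lam ||W||^2 + alpha ||h_i||^2
    x = x_i[:, None]
    kxx = rbf(x, x)[0, 0]
    kxB = rbf(x, B)          # 1 x l
    kBB = rbf(B, B)          # l x l
    rec = kxx - 2 * (kxB @ W @ h_i)[0] + h_i @ W.T @ kBB @ W @ h_i
    return rec + lam * (W**2).sum() + alpha * (h_i**2).sum()
```

Note that only k(B,B) (size l x l) and k(B,x_i) (size l) are needed per sample, never the full n x n kernel matrix.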

Method

To apply SGD we compute the gradients of J (constant factors of 2 are absorbed into the step size):

\frac{\partial J}{\partial W}=k(B,B)Wh_ih_i^T-k(B,x_i)h_i^T+\lambda W
\frac{\partial J}{\partial h_i}=W^Tk(B,B)Wh_i-W^Tk(B,x_i)+\alpha h_i
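The gradient with respect to W can be checked by finite differences on the kernelized objective (a sketch with a linear kernel; the analytic gradient carries the overall factor of 2 that the step size otherwise absorbs):

```python
import numpy as np

rng = np.random.default_rng(0)
p, l, r = 4, 6, 3
B = rng.standard_normal((p, l))
x = rng.standard_normal((p, 1))
W = rng.standard_normal((l, r))
h = rng.standard_normal((r, 1))
lam, alpha = 0.1, 0.1

# linear kernel: k(A, C) = A^T C
kBB, kBx, kxx = B.T @ B, B.T @ x, float(x.T @ x)

def J(W):
    rec = kxx - 2 * float(kBx.T @ W @ h) + float(h.T @ W.T @ kBB @ W @ h)
    return rec + lam * (W**2).sum() + alpha * (h**2).sum()

# analytic gradient: 2 * (k(B,B) W h h^T - k(B,x) h^T + lam W)
grad = 2 * (kBB @ W @ h @ h.T - kBx @ h.T + lam * W)

# central finite differences, entry by entry
num = np.zeros_like(W)
eps = 1e-6
for i in range(l):
    for j in range(r):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (J(Wp) - J(Wm)) / (2 * eps)
```

Since J is quadratic in W, the central difference matches the analytic gradient up to rounding error.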

Method

With the gradients we obtain the update rules: a gradient step for W and, since J is quadratic in h_i, a closed-form solution for the code h_i:

W = W - \gamma\frac{\partial J}{\partial W}
h_i=(W^Tk(B,B)W+\alpha I)^{-1}W^Tk(B,x_i)
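Putting the update rules together, one version of the online algorithm might look as follows (a sketch assuming an RBF kernel and a budget B sampled from the data; all names and default parameters are illustrative, not from the slides):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma ||a - b||^2) for all column pairs of A and B
    d = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d)

def okmf(X, l=10, r=5, lam=0.1, alpha=0.1, step=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    B = X[:, rng.choice(n, l, replace=False)]   # budget: l columns of X
    W = rng.standard_normal((l, r)) * 0.1
    H = np.zeros((r, n))
    kBB = rbf(B, B)                             # l x l, computed once
    for _ in range(epochs):
        for i in rng.permutation(n):
            kBx = rbf(B, X[:, [i]])[:, 0]       # l-vector k(B, x_i)
            # closed-form code: h_i = (W^T k(B,B) W + alpha I)^{-1} W^T k(B,x_i)
            A = W.T @ kBB @ W + alpha * np.eye(r)
            h = np.linalg.solve(A, W.T @ kBx)
            # SGD step on W: grad = k(B,B) W h h^T - k(B,x_i) h^T + lam W
            grad = kBB @ W @ np.outer(h, h) - np.outer(kBx, h) + lam * W
            W -= step * grad
            H[:, i] = h
    return B, W, H
```

Each iteration touches only l x l and l x 1 kernel blocks, so the cost per sample is independent of n, which is what makes the method viable online.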
