Online Kernel matrix factorization

Andrés Esteban Páez Torres

MindLab

Universidad Nacional de Colombia

Motivation

  • In recent years, the growth of information production has made its processing and analysis infeasible at scale.
  • 300 hours of video are uploaded to YouTube every minute.
  • 350,000 tweets are published on Twitter every minute.

Motivation

  • There are many well-studied and well-tested matrix factorization (MF) methods.
  • These MF methods are especially well suited for dimensionality reduction, manifold learning, dictionary learning, and clustering tasks.

Motivation

  • Kernel matrix factorization (KMF) methods are very useful in data analysis and machine learning tasks.
  • However, KMF methods have a high computational cost.

Problem

The challenge is to devise effective and efficient mechanisms to perform matrix factorization in high dimensional feature spaces implicitly defined by kernels.

Method

Let's consider the following factorization

\Phi(X)_{p\times n}=\Phi(X)_{p\times n}W_{n\times r}H_{r\times n}

And its reconstruction error

\Vert\Phi(X)-\Phi(X)WH\Vert^2

Method

Which we can express as

Tr(\Phi(X)^T\Phi(X)-2\Phi(X)^T\Phi(X)WH+H^TW^T\Phi(X)^T\Phi(X)WH)

Using the kernel function k(X,X)=\Phi(X)^T\Phi(X), this becomes

Tr(\color{red}{k(X,X)}-2\color{red}{k(X,X)}WH+H^TW^T\color{red}{k(X,X)}WH)
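The identity above can be verified numerically with a linear kernel, where \Phi(X)=X and k(X,X)=X^TX (a sanity check, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 5, 8, 3
X = rng.standard_normal((p, n))   # data matrix, columns are samples
W = rng.standard_normal((n, r))
H = rng.standard_normal((r, n))

# Explicit reconstruction error ||Phi(X) - Phi(X) W H||^2
# (linear kernel: Phi is the identity map)
err_explicit = np.linalg.norm(X - X @ W @ H) ** 2

# The same error computed only from the kernel matrix k(X,X) = X^T X
K = X.T @ X
err_kernel = np.trace(K - 2 * K @ W @ H + H.T @ W.T @ K @ W @ H)
```

Both quantities agree up to floating-point error, confirming that the reconstruction error can be evaluated without an explicit feature map.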

Method

But the terms k(X,X) have dimension n\times n, which makes it infeasible to apply the factorization to a large number of samples.

Method

To address this problem, let's consider the following factorization over a smaller basis matrix B containing l samples (with l \ll n):

\Phi(X)_{p\times n}=\Phi(B)_{p\times l}W_{l\times r}H_{r\times n}

Method

Given this factorization, we pose the following per-sample optimization problem, which we solve with stochastic gradient descent (SGD):

J = \|\Phi(x_i)-\Phi(B)Wh_i\|^2 + \lambda \| W \|^2 + \alpha \| h_i \|^2
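Expanding the norm shows that J depends on \Phi only through kernel evaluations, so it can be computed without the explicit feature map. A minimal sketch (the RBF kernel choice and the function names are assumptions, not from the slides):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma ||a - b||^2) for all column pairs of A and B
    d = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d)

def objective(x_i, B, W, h_i, lam, alpha):
    # J = k(x_i,x_i) - 2 k(x_i,B) W h_i + h_i^T W^T k(B,B) W h_i
    #     + lam ||W||^2 + alpha ||h_i||^2
    x = x_i[:, None]
    kxx = rbf(x, x)[0, 0]
    kxB = rbf(x, B)          # 1 x l
    kBB = rbf(B, B)          # l x l
    rec = kxx - 2 * (kxB @ W @ h_i)[0] + h_i @ W.T @ kBB @ W @ h_i
    return rec + lam * (W**2).sum() + alpha * (h_i**2).sum()
```

Note that only k(B,B) (size l x l) and k(B,x_i) (size l) are needed per sample, never the full n x n kernel matrix.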

Method

To apply SGD we compute the gradients of J (constant factors of 2 are absorbed into the step size):

\frac{\partial J}{\partial W}=k(B,B)Wh_ih_i^T-k(B,x_i)h_i^T+\lambda W
\frac{\partial J}{\partial h_i}=W^Tk(B,B)Wh_i-W^Tk(B,x_i)+\alpha h_i
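The gradient with respect to W can be checked by finite differences on the kernelized objective (a sketch with a linear kernel; the analytic gradient carries the overall factor of 2 that the step size otherwise absorbs):

```python
import numpy as np

rng = np.random.default_rng(0)
p, l, r = 4, 6, 3
B = rng.standard_normal((p, l))
x = rng.standard_normal((p, 1))
W = rng.standard_normal((l, r))
h = rng.standard_normal((r, 1))
lam, alpha = 0.1, 0.1

# linear kernel: k(A, C) = A^T C
kBB, kBx, kxx = B.T @ B, B.T @ x, float(x.T @ x)

def J(W):
    rec = kxx - 2 * float(kBx.T @ W @ h) + float(h.T @ W.T @ kBB @ W @ h)
    return rec + lam * (W**2).sum() + alpha * (h**2).sum()

# analytic gradient: 2 * (k(B,B) W h h^T - k(B,x) h^T + lam W)
grad = 2 * (kBB @ W @ h @ h.T - kBx @ h.T + lam * W)

# central finite differences, entry by entry
num = np.zeros_like(W)
eps = 1e-6
for i in range(l):
    for j in range(r):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (J(Wp) - J(Wm)) / (2 * eps)
```

Since J is quadratic in W, the central difference matches the analytic gradient up to rounding error.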

Method

With the gradients we obtain the update rules: a gradient step for W and, since J is quadratic in h_i, a closed-form solution for the code h_i:

W = W - \gamma\frac{\partial J}{\partial W}
h_i=(W^Tk(B,B)W+\alpha I)^{-1}W^Tk(B,x_i)
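Putting the update rules together, one version of the online algorithm might look as follows (a sketch assuming an RBF kernel and a budget B sampled from the data; all names and default parameters are illustrative, not from the slides):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma ||a - b||^2) for all column pairs of A and B
    d = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d)

def okmf(X, l=10, r=5, lam=0.1, alpha=0.1, step=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    B = X[:, rng.choice(n, l, replace=False)]   # budget: l columns of X
    W = rng.standard_normal((l, r)) * 0.1
    H = np.zeros((r, n))
    kBB = rbf(B, B)                             # l x l, computed once
    for _ in range(epochs):
        for i in rng.permutation(n):
            kBx = rbf(B, X[:, [i]])[:, 0]       # l-vector k(B, x_i)
            # closed-form code: h_i = (W^T k(B,B) W + alpha I)^{-1} W^T k(B,x_i)
            A = W.T @ kBB @ W + alpha * np.eye(r)
            h = np.linalg.solve(A, W.T @ kBx)
            # SGD step on W: grad = k(B,B) W h h^T - k(B,x_i) h^T + lam W
            grad = kBB @ W @ np.outer(h, h) - np.outer(kBx, h) + lam * W
            W -= step * grad
            H[:, i] = h
    return B, W, H
```

Each iteration touches only l x l and l x 1 kernel blocks, so the cost per sample is independent of n, which is what makes the method viable online.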
