Fast Differentiable Sorting and Ranking

ICML 2020

presented by Piotr Kozakowski


Sorting and ranking are important operations throughout machine learning.

But they are not differentiable.

Sorting is piecewise linear: continuous, but its derivatives are constant, zero, or undefined.

Ranking is piecewise constant: discontinuous, with derivatives that are zero or undefined.

Goal: construct differentiable approximations of sorting and ranking.

Definitions

\theta \in \mathbb{R}^n~\text{- vector to sort}
\Sigma~\text{- the set of permutations of}~(1, ..., n)
\sigma(\theta) \in \Sigma~\text{- \textbf{argsort} of $\theta$, s.t. $\theta_{\sigma_1(\theta)} \geq ... \geq \theta_{\sigma_n(\theta)}$}
s(\theta) = \theta_{\sigma(\theta)} \in \mathbb{R}^n~\text{- \textbf{sort} of $\theta$}
r(\theta) = \sigma^{-1}(\theta) \in \Sigma~\text{- \textbf{rank} of $\theta$}

Example

\theta = (1.2, 0.1, 2.9)
\sigma(\theta) = (3, 1, 2)
s(\theta) = \theta_{\sigma(\theta)} = (2.9, 1.2, 0.1)
r(\theta) = \sigma^{-1}(\theta) = (2, 3, 1)
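The same example, checked in a few lines of NumPy (a small sketch added here; NumPy indices are 0-based, so we add 1 to match the 1-indexed definitions above):

import numpy as np

theta = np.array([1.2, 0.1, 2.9])

sigma = np.argsort(-theta)        # argsort in descending order (0-based)
s = theta[sigma]                  # sort: theta in descending order
r = np.empty_like(sigma)
r[sigma] = np.arange(len(theta))  # rank: the inverse permutation of sigma

print(sigma + 1)  # [3 1 2]
print(s)          # [2.9 1.2 0.1]
print(r + 1)      # [2 3 1]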

Discrete optimization formulations

s(\theta) = \theta_{\sigma^*},~\sigma^* = \mathrm{argmax}_{\sigma \in \Sigma} \langle \theta_\sigma, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \Sigma} \langle \theta, \rho_\pi \rangle \\
\text{where}~\rho = (n, n - 1, ..., 1)
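To see why these inner products recover sorting and ranking, note that pairing larger entries of \theta with larger entries of \rho maximizes the inner product (the rearrangement inequality). For the running example \theta = (1.2, 0.1, 2.9):

\langle \theta_{(3, 1, 2)}, \rho \rangle = 2.9 \cdot 3 + 1.2 \cdot 2 + 0.1 \cdot 1 = 11.2
\langle \theta_{(1, 2, 3)}, \rho \rangle = 1.2 \cdot 3 + 0.1 \cdot 2 + 2.9 \cdot 1 = 6.7

so the argsort \sigma(\theta) = (3, 1, 2) attains the maximum.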

Permutahedron

\mathcal{P}(w) = \mathrm{conv}(w_\sigma : \sigma \in \Sigma)

The convex hull of all permutations of w.

Permutahedron in 3D

(figure; source: the paper)

Permutahedron in 4D

(figure; source: Wikipedia)

Linear programming formulations

s(\theta) = \theta_{\sigma^*},~\sigma^* = \mathrm{argmax}_{\sigma \in \Sigma} \langle \theta_\sigma, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \Sigma} \langle \theta, \rho_\pi \rangle \\
\text{where}~\rho = (n, n - 1, ..., 1)
\Downarrow
s(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\theta)} \langle y, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\rho)} \langle y, -\theta \rangle

Both formulations have the same solutions, by the Fundamental Theorem of Linear Programming: a linear program attains its optimum at a vertex of its feasible polytope, and the vertices of \mathcal{P}(\theta) and \mathcal{P}(\rho) are exactly the permutations of \theta and \rho.

Generalization

s(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\theta)} \langle y, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\rho)} \langle y, -\theta \rangle
\Downarrow
P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle \\ s(\theta) = P(\rho, \theta) \\ r(\theta) = P(-\theta, \rho)
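A brute-force sketch of P(z, w) (an illustration of mine, exponential in n, not the paper's algorithm): since a linear program attains its maximum at a vertex, and the vertices of \mathcal{P}(w) are the permutations of w, for small n we can simply enumerate them:

import itertools
import numpy as np

def P(z, w):
    # argmax over the vertices of P(w), i.e. over all permutations of w
    vertices = (np.array(p) for p in itertools.permutations(w))
    return max(vertices, key=lambda mu: mu @ z)

theta = np.array([1.2, 0.1, 2.9])
rho = np.array([3.0, 2.0, 1.0])  # rho = (n, n-1, ..., 1)

print(P(rho, theta))   # s(theta) = [2.9 1.2 0.1]
print(P(-theta, rho))  # r(theta) = [2. 3. 1.]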


Regularization

P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle
\Downarrow
P_Q(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - \frac{1}{2}||\mu||^2
= \mathrm{argmin}_{\mu \in \mathcal{P}(w)} ||\mu - z||^2

(the two objectives agree up to the constant \frac{1}{2}||z||^2, since \frac{1}{2}||\mu - z||^2 = \frac{1}{2}||\mu||^2 - \langle \mu, z \rangle + \frac{1}{2}||z||^2)

The Euclidean projection of z onto the permutahedron!

Regularization

P_{\epsilon Q}(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - \frac{\epsilon}{2}||\mu||^2
= \mathrm{argmin}_{\mu \in \mathcal{P}(w)} ||\mu - z/\epsilon||^2
\epsilon~\text{- regularization strength}
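Written out, the rescaling behind the second equality: dividing the objective by \epsilon does not change the argmax, and completing the square turns it into a projection.

\mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - \frac{\epsilon}{2}||\mu||^2 = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z/\epsilon \rangle - \frac{1}{2}||\mu||^2 = \mathrm{argmin}_{\mu \in \mathcal{P}(w)} \frac{1}{2}||\mu - z/\epsilon||^2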

Regularization

P_{\epsilon E}(z, w) = \log \mathrm{argmax}_{\mu \in \mathcal{P}(e^w)} \langle \mu, z \rangle - \epsilon \langle \mu, \log \mu - 1 \rangle

(not going to talk about that)

How does it work?

r_{\epsilon Q}(\theta) = P_{\epsilon Q}(-\theta, \rho) = \mathrm{argmin}_{\mu \in \mathcal{P}(\rho)} ||\mu - (-\theta)/\epsilon||^2 \\ \rho = (3, 2, 1)
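A worked example for n = 2 (my own illustration, not from the slides): \mathcal{P}(\rho) with \rho = (2, 1) is the segment between (2, 1) and (1, 2). Writing \mu = (3/2 + t, 3/2 - t) with t \in [-1/2, 1/2], minimizing ||\mu - (-\theta)/\epsilon||^2 over t gives

r_{\epsilon Q}(\theta) = \left( \frac{3}{2} + t^*, \frac{3}{2} - t^* \right), \quad t^* = \mathrm{clip}\left( \frac{\theta_2 - \theta_1}{2\epsilon}, -\frac{1}{2}, \frac{1}{2} \right)

For \theta = (1, 2): \epsilon = 1 gives the hard ranks (2, 1); \epsilon = 4 gives the soft ranks (1.625, 1.375); as \epsilon \to \infty both entries collapse to the average rank 3/2. In between, the soft ranks vary continuously and differentiably with \theta.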

Properties

Effect of regularization strength

\theta = (0, 3, 1, 2) \\ s(\theta) = (3, 2, 1, 0) \\ r(\theta) = (4, 1, 3, 2)

Reduction to isotonic regression

TL;DR: we can pose the problem as isotonic regression

and solve it in O(n log n) time and O(n) space.

We can also multiply by the Jacobian in O(n) time:

it is block diagonal, with equal entries within each block, so it is sparse.
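A minimal sketch of this reduction for the quadratic case (my own illustration with my own function names; the paper's official implementation is at github.com/google-research/fast-soft-sort). Following the paper's reduction, the projection onto \mathcal{P}(w) is obtained by sorting the input, running isotonic regression on the difference with w via the Pool Adjacent Violators (PAV) algorithm, and unsorting:

import numpy as np

def isotonic_regression_decreasing(y):
    # PAV: argmin ||v - y||^2 subject to v_1 >= ... >= v_n, in O(n).
    n = len(y)
    sums, counts, means = list(y), [1] * n, list(y)
    blocks = 0  # index of the last active block
    for i in range(1, n):
        blocks += 1
        sums[blocks], counts[blocks], means[blocks] = y[i], 1, y[i]
        # Merge blocks while the decreasing constraint is violated.
        while blocks > 0 and means[blocks - 1] < means[blocks]:
            sums[blocks - 1] += sums[blocks]
            counts[blocks - 1] += counts[blocks]
            means[blocks - 1] = sums[blocks - 1] / counts[blocks - 1]
            blocks -= 1
    return np.repeat(means[:blocks + 1], counts[:blocks + 1])

def project_onto_permutahedron(z, w):
    # Euclidean projection of z onto P(w), for w sorted in decreasing order:
    # sort z, run decreasing isotonic regression on z_sorted - w, subtract, unsort.
    sigma = np.argsort(-z)
    v = isotonic_regression_decreasing(z[sigma] - w)
    out = np.empty_like(z)
    out[sigma] = z[sigma] - v
    return out

def soft_rank(theta, epsilon=1.0):
    # r_{eps Q}(theta) = argmin_{mu in P(rho)} ||mu - (-theta / eps)||^2
    rho = np.arange(len(theta), 0, -1, dtype=float)  # (n, n-1, ..., 1)
    return project_onto_permutahedron(-theta / epsilon, rho)

theta = np.array([0.0, 3.0, 1.0, 2.0])
print(soft_rank(theta, epsilon=0.1))   # close to the hard ranks (4, 1, 3, 2)
print(soft_rank(theta, epsilon=10.0))  # close to the average rank (2.5, ..., 2.5)

The sort dominates the cost, giving the O(n log n) total above; PAV itself is linear.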

Experiment: top-1 classification on CIFAR

Experiment: robust regression

Idea: sort the per-example losses and ignore the k largest of them.
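A schematic of the idea in NumPy (a hard-threshold sketch of mine, not the paper's training code; `trimmed_loss` is a hypothetical helper name):

import numpy as np

def trimmed_loss(losses, k):
    # Average the per-example losses after discarding the k largest,
    # which presumably come from outliers.
    sorted_losses = np.sort(losses)  # ascending
    return sorted_losses[:len(losses) - k].mean()

losses = np.array([0.1, 0.3, 0.2, 8.0, 0.15])  # one clear outlier
print(trimmed_loss(losses, k=1))  # 0.1875, the mean of the four smallest

In the paper's experiment, the hard sort is replaced by its soft counterpart, keeping the trimmed objective differentiable end to end.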
