Serif

Fast Differentiable Sorting and Ranking

ICML 2020

presented by Piotr Kozakowski

Sorting and ranking are important.

But not differentiable.

Sorting and ranking are important.

But not differentiable.

Sorting: piecewise linear - continuous, derivatives constant, zero or undefined.

Sorting and ranking are important.

But not differentiable.

Sorting: piecewise linear - continuous, derivatives constant, zero or undefined.

Ranking: piecewise constant - discontinuous, derivatives zero or undefined.

Goal: construct differentiable approximations of sorting and ranking.

Definitions

\theta \in \mathbb{R}^n~\text{- vector to sort}

\sigma(\theta) \in \Sigma~\text{- \textbf{argsort} of $\theta$, s.t. $\theta_{\sigma_1(\theta)} \geq ... \geq \theta_{\sigma_n(\theta)}$}

s(\theta) = \theta_{\sigma(\theta)} \in \mathbb{R}^n~\text{- \textbf{sort} of $\theta$}

r(\theta) = \sigma^{-1}(\theta) \in \Sigma~\text{- \textbf{rank} of $\theta$}

Example

\theta = (1.2, 0.1, 2.9)

\sigma(\theta) = (3, 1, 2)

s(\theta) = \theta_{\sigma(\theta)} = (2.9, 1.2, 0.1)

r(\theta) = \sigma^{-1}(\theta) = (2, 3, 1)

Discrete optimization formulations

s(\theta) = \mathrm{argmax}_{\sigma \in \Sigma} \langle \theta_\sigma, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \Sigma} \langle \theta, \rho_\pi \rangle \\

\text{where}~\rho = (n, n - 1, ..., 1)

Permutahedron

\mathcal{P}(w) = \mathrm{conv}(w_\sigma : \sigma \in \Sigma)

Convex hull of permutations on w.

Permutahedron in 3d

source: the paper

Permutahedron in 4d

source: Wikipedia

Linear programming formulations

s(\theta) = \mathrm{argmax}_{\sigma \in \Sigma} \langle \theta_\sigma, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \Sigma} \langle \theta, \rho_\pi \rangle \\

\text{where}~\rho = (n, n - 1, ..., 1)

\Downarrow

s(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\theta)} \langle y, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \mathcal{P}(\rho)} \langle y, -\theta \rangle \\

Linear programming formulations

s(\theta) = \mathrm{argmax}_{\sigma \in \Sigma} \langle \theta_\sigma, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \Sigma} \langle \theta, \rho_\pi \rangle \\

\text{where}~\rho = (n, n - 1, ..., 1)

\Downarrow

s(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\theta)} \langle y, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \mathcal{P}(\rho)} \langle y, -\theta \rangle \\

same solutions

from the Fundamental Theorem of Linear Programming

Generalization

\text{where}~\rho = (n, n - 1, ..., 1)

\Downarrow

s(\theta) = \mathrm{argmax}_{y \in \mathcal{P}(\theta)} \langle y, \rho \rangle \\ r(\theta) = \mathrm{argmax}_{\pi \in \mathcal{P}(\rho)} \langle y, -\theta \rangle \\

P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle \\ s(\theta) = P(\rho, \theta) \\ r(\theta) = P(-\theta, \rho)

Regularization

P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle

\Downarrow

P_Q(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - ||z||^2

Regularization

P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle

\Downarrow

P_Q(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - ||z||^2

= \mathrm{argmin}_{\mu \in \mathcal{P}(w)} ||\mu - z||^2

Regularization

P(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle

\Downarrow

P_Q(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - ||z||^2

= \mathrm{argmin}_{\mu \in \mathcal{P}(w)} ||\mu - z||^2

Euclidean projection of z onto the permutahedron!

Regularization

P_{\epsilon Q}(z, w) = \mathrm{argmax}_{\mu \in \mathcal{P}(w)} \langle \mu, z \rangle - \epsilon ||z||^2

= \mathrm{argmin}_{\mu \in \mathcal{P}(w)} ||\mu - z/\epsilon||^2

\epsilon~\text{- regularization strength}

Regularization

P_{\epsilon E}(z, w) = \log \mathrm{argmax}_{\mu \in \mathcal{P}(e^w)} \langle \mu, z \rangle - \epsilon \langle \mu, \log \mu - 1 \rangle

(not going to talk about that)

How does it work?

r_{\epsilon Q}(\theta) = P_{\epsilon Q}(-\theta, \rho) = \mathrm{argmin}_{\mu \in \mathcal{P}(\rho)} ||\mu - (-\theta)/\epsilon||^2 \\ \rho = (3, 2, 1)

Properties

Effect of regularization strength

\theta = (0, 3, 1, 2) \\ s(\theta) = (3, 2, 1, 0) \\ r(\theta) = (4, 1, 3, 2)

Reduction to isotonic regression

TLDR: we can pose the problem as isotonic regression

and solve it in O(n log n) time and O(n) space.

And we can multiply with the Jacobian in O(n).

(it's sparse)

Experiment: top-1 classification on CIFAR

Experiment: robust regression

Idea: sort the losses and ignore the top k of them.