Introduction to automatic differentiation

Pierre Ablin

Introduction to automatic differentiation

[Baydin et al., 2015, Automatic differentiation in machine learning: a survey]

https://arxiv.org/abs/1502.05767

Automatic differentiation?

- A method to compute the derivatives of a function with a computer

Input

def f(x):
  return x ** 2


f(1.)
>>>> 1.

Output

g = grad(f)


g(1.)
>>>> 2.0

Automatic differentiation?

- A method to compute the derivatives of a function with a computer

Input

def f(x):
  return np.log(1 + x ** 2) / x


f(1.)
>>>> 0.6931471805599453

Output

g = grad(f)


g(1.)
>>>> 0.3068528194400547

Prototypical case 

\(f\) defined recursively:

  • \(f_0(x) = x\)
  • \(f_{k+1}(x)  = 4 f_{k}(x) (1 - f_k(x))\)

Input

def f(x, n=4):
  v = x
  for i in range(n):
    v = 4 * v * (1 - v)
  return v


f(0.25)
>>>> 0.75

Output

g = grad(f)


g(0.25)
>>>> -16.0

Automatic differentiation is not...

Numerical differentiation 

$$f'(x) \simeq \frac{f(x + h) - f(x)}{h}$$

In higher dimension:

$$ \frac{\partial f} {\partial x_i} (\mathbf{x}) \simeq \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h}$$

Drawbacks:

  • Computing \(\nabla f = [\frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_n}]\) requires \(n\) extra evaluations of \(f\)
  • Inexact method
  • How to choose \(h\)?

Automatic differentiation is not...

Numerical differentiation 

Example:  

from scipy.optimize import approx_fprime


approx_fprime(0.25, f, 1e-7)
>>>> -16.00001599
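
To see the last two drawbacks concretely, here is a small sketch (the step sizes are arbitrary choices for illustration), reusing the logistic-map \(f\): when \(h\) is too large the approximation error dominates, and when \(h\) is too small floating-point cancellation takes over.

def f(x, n=4):
  v = x
  for i in range(n):
    v = 4 * v * (1 - v)
  return v


# the exact derivative at x = 0.25 is -16
for h in [1e-1, 1e-5, 1e-9, 1e-13]:
  print(h, (f(0.25 + h) - f(0.25)) / h)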

Automatic differentiation is not...

Symbolic differentiation

- Takes as input a function specified as symbolic operations

- Apply the usual rules of differentiation to give the derivative as symbolic operations

 

Example:

\(f_3(x) = 64x(1-x)(1-2x)^2 (1-8x+8x^2)^2\), so:

\(f_3'(x) = 128x(1 - x)(-8 + 16x)(1 - 2x)^2(1 - 8x + 8x^2) + 64(1 - x)(1 - 2x)^2(1 - 8x + 8x^2)^2 - 64x(1 - 2x)^2(1 - 8x + 8x^2)^2 - 256x(1 - x)(1 - 2x)(1 - 8x + 8x^2)^2\)

Then, evaluate \(f_3'(x)\) at the point of interest

Automatic differentiation is not...

Symbolic differentiation

- Exact

- Expression swell: derivatives can have many more terms than the base function

 

[Table: \(f_n\), \(f'_n\), and \(f'_n\) (simplified) — the raw symbolic derivative has far more terms than \(f_n\) or its simplified form]
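
This swell can be observed with a symbolic package; the sketch below uses SymPy (an assumption of this example, not used elsewhere in these slides) to build \(f_n\) symbolically and compare its size with the size of its raw derivative.

import sympy

x = sympy.symbols('x')
expr = x
for n in range(1, 5):
  expr = 4 * expr * (1 - expr)        # f_n as a symbolic expression
  derivative = sympy.diff(expr, x)    # symbolic differentiation
  # number of operations in f_n vs. in its raw derivative
  print(n, sympy.count_ops(expr), sympy.count_ops(derivative))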

Automatic differentiation:

Apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function.

 

- Function = graph of elementary operations

- Follow the graph and differentiate each operation using differentiation rules (linearity, chain rule, ...)

def f(x, n=4):
  v = x
  for i in range(n):
    v = 4 * v * (1 - v)
  return v


f(0.25)
>>>> 0.75
def g(x, n=4):
  # carry the value v and its derivative dv in lockstep
  v, dv = x, 1.
  for i in range(n):
    # product rule applied to 4 * v * (1 - v)
    v, dv = 4 * v * (1 - v), 4 * dv * (1 - v) - 4 * v * dv
  return dv


g(0.25)
>>>> -16.0

Forward automatic differentiation:

Apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function.

 

- Function = graph of elementary operations

- Follow the graph and differentiate each operation using differentiation rules (linearity, chain rule, ...)

 

- If \(f:\mathbb{R}\to \mathbb{R}^m\): need one pass to compute all derivatives :)

- If \(f:\mathbb{R}^n \to \mathbb{R}\): need \(n\) passes to compute all derivatives :( 

- Bad for ML, where losses are scalar functions of many parameters (see the dual-number sketch below)
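
As a minimal sketch of forward mode, one can propagate (value, derivative) pairs, often called dual numbers. The class name `Dual` and the restriction to the operations needed by the logistic-map example are choices made here for illustration, not part of the slides' code.

class Dual:
  # carries a value and its derivative with respect to the input
  def __init__(self, val, dot):
    self.val, self.dot = val, dot

  def __mul__(self, other):
    other = other if isinstance(other, Dual) else Dual(other, 0.)
    # product rule
    return Dual(self.val * other.val,
                self.dot * other.val + self.val * other.dot)

  __rmul__ = __mul__

  def __rsub__(self, other):
    # handles expressions like 1 - v
    return Dual(other - self.val, -self.dot)


def f(x, n=4):
  v = x
  for i in range(n):
    v = 4 * v * (1 - v)
  return v


out = f(Dual(0.25, 1.))  # one forward pass gives the value and the derivative
out.val, out.dot
>>>> (0.75, -16.0)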

Reverse automatic differentiation: Backprop

- Function = graph of elementary operations

- Evaluate the graph and store its intermediate values

- Go through the graph backwards to compute the derivatives

def f(x, n=4):
  v = x
  for i in range(n):
    v = 4 * v * (1 - v)
  return v


f(0.25)
>>>> 0.75
def g(x, n=4):
  v = x
  memory = []
  # forward pass: evaluate f and store the intermediate values
  for i in range(n):
    memory.append(v)
    v = 4 * v * (1 - v)
  # backward pass: go through the stored values in reverse order
  dv = 1
  for v in memory[::-1]:
    dv = 4 * dv * (1 - v) - 4 * dv * v
  return dv


g(0.25)
>>>> -16.0

Reverse automatic differentiation: Backprop

- Function = graph of elementary operations

- Evaluate the graph and store its intermediate values

- Go through the graph backwards to compute the derivatives

 

- Only one pass to compute gradients of functions \(\mathbb{R}^n \to \mathbb{R}\) :) (see the example below)
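
For instance, with the autograd package used at the end of these slides (the quadratic loss below is just a placeholder), a single call returns all \(n\) partial derivatives:

import autograd.numpy as np
from autograd import grad

def loss(w):
  # scalar function of n variables
  return np.sum(w ** 2)

g = grad(loss)   # reverse-mode gradient


g(np.array([1., 2., 3.]))
>>>> array([2., 4., 6.])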

Example on a 2d function

$$f(x, y) = yx^2, \enspace x = y= 1$$

Function

 \(x =1\) 

 \(y = 1\)

 

 \(v_1 = x^2 = 1\)

 \(v_2 =yv_1 = 1\)

 

\(f = v_2 =1\)

Forward AD (w.r.t. \(x\))

 \(\frac{dx}{dx} = 1\)

 \(\frac{dy}{dx} = 0\)

 \(\frac{dv_1}{dx} = 2x \frac{dx}{dx} = 2\)

 \(\frac{dv_2}{dx} = y\frac{dv_1}{dx} + v_1 \frac{dy}{dx} = 2\)

\(\frac{df}{dx} = \frac{dv_2}{dx} = 2\)

Backprop

 \(\frac{df}{dv_2} = 1\)

 \(\frac{df}{dy} = \frac{df}{dv_2}\frac{dv_2}{dy} = \frac{df}{dv_2} v_1 = 1\)

 \(\frac{df}{dv_1} = y\frac{df}{dv_2} = 1\)

\(\frac{df}{dx} = 2x \frac{df}{dv_1} = 2\)
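
The two passes above translate directly into code; this is only a hand-written trace of this particular example, mirroring the backprop column:

# forward pass for f(x, y) = y * x**2, storing the intermediates
x, y = 1., 1.
v1 = x ** 2
v2 = y * v1

# backward pass: start from df/dv2 = 1 and apply the chain rule
dv2 = 1.
dy = dv2 * v1        # df/dy
dv1 = dv2 * y        # df/dv1
dx = dv1 * 2 * x     # df/dx

dx, dy
>>>> (2.0, 1.0)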

Your turn!

$$f(x, y) =\frac{x^2 + 2y^2}{xy}, \enspace x = y= 1$$

Function

 \(x =1\) 

 \(y = 1\)

 

 \(v_1 = x^2 + 2y^2 = 3\)

 \(v_2 =xy= 1\)

 

\(f = \frac{v_1}{v_2} =3\)

Backprop

 \(\frac{df}{dv_1}= \frac1{v_2} = 1\)

 \(\frac{df}{dv_2} = -\frac{v_1}{v_2^2}=-3\) 

\(\frac{df}{dx} = \frac{df}{dv_2}\frac{dv_2}{dx} + \frac{df}{dv_1}\frac{dv_1}{dx}= -1\)

\(\frac{df}{dy} = \frac{df}{dv_2}\frac{dv_2}{dy} + \frac{df}{dv_1}\frac{dv_1}{dy}= 1\)
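
Both hand computations can be checked with an automatic-differentiation library; the sketch below assumes the autograd package (shown on the next slide), whose `grad` takes an argument index selecting the variable of differentiation.

from autograd import grad

def f1(x, y):
  return y * x ** 2

def f2(x, y):
  return (x ** 2 + 2 * y ** 2) / (x * y)


# derivative w.r.t. the first, then the second argument
grad(f1, 0)(1., 1.), grad(f1, 1)(1., 1.)
>>>> (2.0, 1.0)
grad(f2, 0)(1., 1.), grad(f2, 1)(1., 1.)
>>>> (-1.0, 1.0)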

Automatic differentiation:

- Exact

- Computing the gradient takes about the same time as evaluating the function

- Requires memory: need to store intermediate variables 

- Easy to use

- Available in PyTorch, TensorFlow, and the autograd package for NumPy

from autograd import grad

g = grad(f)


g(0.25)
>>>> -16.0
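
The same derivative can also be obtained with PyTorch, sketched below (reusing the logistic-map f defined above; PyTorch is one of the libraries listed on this slide).

import torch

x = torch.tensor(0.25, requires_grad=True)
f(x).backward()   # reverse pass through the operations recorded by f
x.grad
>>>> tensor(-16.)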
