Introduction to automatic differentiation
Pierre Ablin
Introduction to automatic differentiation
[Baydin et al., 2015, Automatic differentiation in machine learning: a survey]
Automatic differentiation?
- Method to compute the differential of a function using a computer
Input
def f(x):
return x ** 2
f(1.)
>>>> 1.0
Output
g = grad(f)
g(1.)
>>>> 2.0
Automatic differentiation?
- Method to compute the differential of a function using a computer
Input
def f(x):
return np.log(1 + x ** 2) / x
f(1.)
>>>> 0.6931471805599453
Output
g = grad(f)
g(1.)
>>>> 0.3068528194400547
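As a sanity check (not on the original slide), the derivative can be worked out by hand: \(f'(x) = \frac{2}{1+x^2} - \frac{\log(1+x^2)}{x^2}\), which equals \(1 - \log 2\) at \(x = 1\):

```python
import numpy as np

x = 1.0
# hand-derived: d/dx [log(1 + x**2) / x] = 2 / (1 + x**2) - log(1 + x**2) / x**2
manual = 2 / (1 + x ** 2) - np.log(1 + x ** 2) / x ** 2
print(manual)  # 1 - log(2) = 0.3068528194400547
```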
Prototypical case
\(f\) defined recursively:
- \(f_0(x) = x\)
- \(f_{k+1}(x) = 4 f_{k}(x) (1 - f_k(x))\)
Input
def f(x, n=4):
v = x
for i in range(n):
v = 4 * v * (1 - v)
return v
f(0.25)
>>>> 0.75
Output
g = grad(f)
g(0.25)
>>>> -16.0
Automatic differentiation is not...
Numerical differentiation
$$f'(x) \simeq \frac{f(x + h) - f(x)}{h}$$
In higher dimension:
$$ \frac{\partial f} {\partial x_i} (\mathbf{x}) \simeq \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h}$$
Drawbacks:
- Computing \(\nabla f = [\frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_n}]\) takes \(n\) evaluations of \(f\), one per coordinate
- Inexact method
- How to choose \(h\)?
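To illustrate the choice of \(h\), a quick numerical sketch (mine, not from the slides) on the recursive function used later in these slides, whose true derivative at \(0.25\) is \(-16\):

```python
def f(x, n=4):
    v = x
    for i in range(n):
        v = 4 * v * (1 - v)
    return v

# True derivative at x = 0.25 is -16 (computed exactly with autodiff later).
for h in [1e-2, 1e-7, 1e-12]:
    approx = (f(0.25 + h) - f(0.25)) / h
    # error: large h -> truncation error, tiny h -> floating-point rounding
    print(h, abs(approx + 16.0))
```

The error is smallest for intermediate values of \(h\): too large and the first-order Taylor approximation is poor, too small and floating-point cancellation dominates.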
Automatic differentiation is not...
Numerical differentiation
Example:
from scipy.optimize import approx_fprime
approx_fprime(0.25, f, 1e-7)
>>>> -16.00001599
Automatic differentiation is not...
Symbolic differentiation
- Takes as input a function specified as symbolic operations
- Apply the usual rules of differentiation to give the derivative as symbolic operations
Example:
\(f_4(x) = 64x(1-x)(1-2x)^2(1-8x+8x^2)^2\); applying the differentiation rules symbolically yields a long closed-form expression for \(f'_4\).
Then, evaluate \(f'_4(x)\) numerically at the point of interest.
Automatic differentiation is not...
Symbolic differentiation
- Exact
- Expression swell: derivatives can have many more terms than the base function
Table comparing \(f_n\), \(f'_n\), and \(f'_n\) (simplified): the raw derivative expressions quickly become much longer than \(f_n\) itself.
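This expression swell can be reproduced with SymPy (SymPy is not used elsewhere in the course; this is just an illustrative sketch):

```python
import sympy as sp

x = sp.symbols('x')
f = x
for _ in range(4):
    f = 4 * f * (1 - f)   # build f_4 symbolically
df = sp.diff(f, x)        # symbolic derivative via the product rule

# The raw derivative has many more operations than the function itself.
print(sp.count_ops(f), sp.count_ops(df))
# Exact symbolic evaluation still recovers f'_4(1/4) = -16.
print(df.subs(x, sp.Rational(1, 4)))
```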
Automatic differentiation:
Apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function.
- Function = graph of elementary operations
- Follow the graph and differentiate each operation using differentiation rules (linearity, chain rule, ...)
def f(x, n=4):
v = x
for i in range(n):
v = 4 * v * (1 - v)
return v
f(0.25)
>>>> 0.75
def g(x, n=4):
v, dv = x, 1.
for i in range(n):
v, dv = 4 * v * (1 - v), 4 * dv * (1 - v) - 4 * v * dv
return dv
g(0.25)
>>>> -16.0
Forward automatic differentiation:
Apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function.
- Function = graph of elementary operations
- Follow the graph and differentiate each operation using differentiation rules (linearity, chain rule, ...)
- If \(f:\mathbb{R}\to \mathbb{R}^m\): need one pass to compute all derivatives :)
- If \(f:\mathbb{R}^n \to \mathbb{R}\): need \(n\) passes to compute all derivatives :(
- Bad for ML
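Forward mode is often implemented with dual numbers: each variable carries a pair (value, derivative), and every elementary operation updates both. A minimal sketch (the `Dual` class and its limited operator support are assumptions here, not from the slides):

```python
class Dual:
    """A pair (value, derivative); behaves like x + eps * dx with eps**2 = 0."""
    def __init__(self, x, dx):
        self.x, self.dx = x, dx

    def __mul__(self, other):
        # product rule: (u v)' = u' v + u v'
        return Dual(self.x * other.x,
                    self.dx * other.x + self.x * other.dx)

    def __rmul__(self, c):   # c * self, for a constant c
        return Dual(c * self.x, c * self.dx)

    def __rsub__(self, c):   # c - self, for a constant c
        return Dual(c - self.x, -self.dx)

def logistic(v, n=4):
    # same recursion as f, but works on Dual numbers unchanged
    for i in range(n):
        v = 4 * v * (1 - v)
    return v

d = logistic(Dual(0.25, 1.0))  # seed derivative dx/dx = 1
print(d.x, d.dx)               # value 0.75, derivative -16.0
```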
Reverse automatic differentiation: Backprop
- Function = graph of elementary operations
- Compute the graph and its elements
- Go through the graph backwards to compute the derivatives
def f(x, n=4):
v = x
for i in range(n):
v = 4 * v * (1 - v)
return v
f(0.25)
>>>> 0.75
def g(x, n=4):
v = x
memory = []
for i in range(n):
memory.append(v)
v = 4 * v * (1 - v)
dv = 1
for v in memory[::-1]:
dv = 4 * dv * (1 - v) - 4 * dv * v
return dv
g(0.25)
>>>> -16.0
Reverse automatic differentiation: Backprop
- Function = graph of elementary operations
- Compute the graph and its elements
- Go through the graph backwards to compute the derivatives
- Only one pass to compute gradients of functions \(\mathbb{R}^n \to \mathbb{R}\) :)
Example on a 2d function
$$f(x, y) = y x^2, \enspace x = y = 1$$
Function
\(x =1\)
\(y = 1\)
\(v_1 = x^2 = 1\)
\(v_2 =yv_1 = 1\)
\(f = v_2 =1\)
Forward AD (w.r.t. \(x\))
Backprop
\(\frac{dx}{dx} =1\)
\(\frac{dy}{dx} = 0\)
\(\frac{dv_1}{dx} =2x \frac{dx}{dx} = 2\)
\(\frac{dv_2}{dx} =y\frac{dv_1}{dx} +v_1 \frac{dy}{dx} =2\)
\(\frac{df}{dx} = \frac{dv_2}{dx}=2\)
\(\frac{df}{dv_2}= 1\)
\(\frac{df}{dy} = \frac{df}{dv_2 }\frac{dv_2}{dy} = \frac{df}{dv_2 }v_1 = 1\)
\(\frac{df}{dv_1} =y\frac{df}{ dv_2} = 1\)
\(\frac{df}{dx} = 2 x \frac{df}{dv_1} = 2\)
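The backprop column above can be replayed in plain Python; the backward pass reuses the stored intermediate \(v_1\):

```python
x, y = 1.0, 1.0

# forward pass, storing intermediates
v1 = x ** 2   # v1 = 1
v2 = y * v1   # f = v2 = 1

# backward pass, in reverse order of the graph
df_dv2 = 1.0
df_dy = df_dv2 * v1       # = 1
df_dv1 = df_dv2 * y       # = 1
df_dx = df_dv1 * 2 * x    # = 2

print(df_dx, df_dy)  # 2.0 1.0
```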
Your turn!
$$f(x, y) = \frac{x^2 + 2y^2}{xy}, \enspace x = y = 1$$
Function
\(x =1\)
\(y = 1\)
\(v_1 = x^2 + 2y^2 = 3\)
\(v_2 =xy= 1\)
\(f = \frac{v_1}{v_2} =3\)
Backprop
\(\frac{df}{dv_1}= \frac1{v_2} = 1\)
\(\frac{df}{dv_2} = -\frac{v_1}{v_2^2}=-3\)
\(\frac{df}{dx} = \frac{df}{dv_2}\frac{dv_2}{dx} + \frac{df}{dv_1}\frac{dv_1}{dx}= -1\)
\(\frac{df}{dy} = \frac{df}{dv_2}\frac{dv_2}{dy} + \frac{df}{dv_1}\frac{dv_1}{dy}= 1\)
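A plain-Python check of this backward pass, in the same style as the worked example above:

```python
x, y = 1.0, 1.0

# forward pass, storing intermediates
v1 = x ** 2 + 2 * y ** 2   # v1 = 3
v2 = x * y                 # v2 = 1
f = v1 / v2                # f = 3

# backward pass
df_dv1 = 1 / v2            # = 1
df_dv2 = -v1 / v2 ** 2     # = -3
df_dx = df_dv2 * y + df_dv1 * 2 * x   # = -1
df_dy = df_dv2 * x + df_dv1 * 4 * y   # = 1

print(df_dx, df_dy)  # -1.0 1.0
```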
Automatic differentiation:
- Exact
- Takes about the same time to compute the gradient and the function
- Requires memory: need to store intermediate variables
- Easy to use
- Available in PyTorch, TensorFlow, and the autograd package for NumPy
from autograd import grad
g = grad(f)
g(0.25)
>>>> -16.0