Robert Salomone

How did I get here?

I am not a computer scientist.

Application Areas

Derivative information is used in algorithms in...

• Machine Learning (Optimization)
• Computational Statistics (Sampling)

Goals

• Understand what Automatic Differentiation is and isn't
• Demystify why Pytorch and Tensorflow are (seemingly) so weird.
• Demonstrate Autodiff in Python with some "advanced" features using Pytorch and Autograd.

The Good News

You can get (reasonably) far without knowing much.

The Not So Good News

You can get only get so far without knowing much.

The Plan

First Half: A gentle yet complete introduction to what AD is and what it is doing.

(key to understanding any AD software framework)

Second Half: Tour through two popular AD frameworks in Python, including some nice "tricks" to ease the learning curve.

Reference

Baydin et al. 2017. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18, 1.

Finite Differences

\frac{\partial}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i, \ldots, x_n)}{h},
\frac{\partial f}{\partial x_i} := \lim_{h\rightarrow 0}\frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i, \ldots, x_n)}{h}

For small $$h$$, we have the forward difference approximation --- $$O(h)$$:

\frac{\partial}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i-h, \ldots, x_n)}{h}.

and  central difference approximation --- $$O(h^2)$$:

Remember what a (partial) derivative is....

Finite differences are not only inaccurate, but also inefficient.

*from [1]

Expression Swell

* from [1]

• Scalar-Valued Function $$f: \mathbb{R^n} \rightarrow \mathbb{R}$$​, i.e., $$y = f(\mathbf{x})$$.
\nabla f(\mathbf{x}) := \begin{pmatrix} \frac{dy}{dx_1} \\ \vdots \\ \frac{dy}{dx_n} \end{pmatrix}

Finite Difference: Approximates $$\nabla f(\mathbf{x})$$, requiring (at least) $$n+1$$ function evaluations for each desired $$\mathbf{x}$$.

Symbolic: Computes $$\nabla f$$ as its own function object. Then, requires evaluation of this object with numerical values as input.

Automatic Differentiation: Computes exactly $$\nabla f(\mathbf{x})$$ and $$f(\mathbf{x})$$. If done using the correct AD mode, it requires only (roughly) two function evaluations for each $$\mathbf{x}$$.

Why is PyTorch / Tensorflow so confusing?

(2) Autodiff isn't really taking partial derivatives or gradients: it is doing something else.

(1) Automatic Differentiation relies on an abstract concept called a computational graph

Solution: Understand these concepts first, then use the software.

Computation Graphs and Primal Traces

All functions are made up of more elementary functions...



f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

$$v_3 = v_0 v_1$$

$$x_1$$

$$v_4 = v_2 + v_3$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

Primal Trace

Computation Graph for $$f(x_1, x_2) = 2x_1 + x_1 x_2$$

$$f$$

$$y$$

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3

Exercise 1 (Graphs and Primal Trace)

Draw the computation graph and list the primal trace for the function

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

Note: The graph and primal trace in general are not unique (different order / labels).

Hint: Pretend you are actually evaluating the function, and keep a running journal of what you do.

Solution (from Reference)

A graph is all you need...

• Autodiff only requires a computational graph for the input of interest
• Can include recursions, conditionals (if statements), loops, etc.
• When you run the function, it just needs to take a trace of what it does with that input!
• This can include "fancy" stuff like...
• ODE Solver (Runge-Kutta, Implicit Euler)
• Fast Fourier Transform
• Matrix Cholesky Factorization
• Matrix (log-)Determinant

Reminder: Multivariate Chain Rule

$$z_2(x_1, x_2) = x_1 x_2$$

\frac{\partial y}{\partial x_1}=
\frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1}
+\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1}

$$x_1$$

$$y(z_1(x_1), z_2(x_1, x_2))$$

$$\qquad= 2x_1 + z_2(x_1, x_2)$$

$$x_2$$

$$z_1(x_1) = 2 x_1$$

f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Multivariate Chain Rule

$$z_2(x_1, x_2) = x_1 x_2$$

\frac{\partial y}{\partial x_2}=
\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_2}

$$x_1$$

$$y(z_1(x_1), z_2(x_1, x_2))$$

$$\qquad= 2x_1 + z_2(x_1, x_2)$$

$$x_2$$

$$z_1(x_1) = 2 x_1$$

f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

\dot{v}

$$v_3 = v_0 v_1$$

$$x_1$$

$$v_4 = v_2 + v_3$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

$$f$$

$$y$$

f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Primal Trace

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3

$$\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_{i}}$$

$$v_3 = v_0 v_1$$

$$x_1$$

$$v_4 = v_2 + v_3$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

$$f$$

$$y$$

(Partial) Tangent Trace: $$\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_{i}}$$

$$\dot{v}_{0,1} = \frac{\partial v_0}{\partial x_1}=1$$

$$\dot{v}_{2,1} = \frac{\partial v_2}{\partial x_1}=\frac{\partial v_2}{\partial v_0}\frac{\partial v_0}{\partial x_1}= \frac{\partial v_2}{\partial v_0}\dot{v}_{0,1}$$

$$\dot{v}_{3,1} = \frac{\partial v_3}{\partial x_1}=\frac{\partial v_3}{\partial v_0}\frac{\partial v_0}{\partial x_1}= \frac{\partial v_3}{\partial v_0}\dot{v}_{0,1}$$

$$\dot{v}_{4,1} = \frac{\partial v_4}{\partial x_1}=\frac{\partial v_4}{\partial v_2}\frac{\partial v_2}{\partial x_1} + \frac{\partial v_4}{\partial v_3}\frac{\partial v_3}{\partial x_1}= \frac{\partial v_4}{\partial v_3}\dot{v}_{2,1} + \frac{\partial v_4}{\partial v_3}\dot{v}_{3,1}$$

Key Aspect: only expand the chain rule once at each stage, reusing the (forward) accumulated derivatives.

Intuition

Primal Trace

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3
f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Differentiability

In practice, we only need the function to be differentiable at the points we evaluate it along the tangent trace.

So e.g., the absolute value function $$|\cdot |$$

is not problematic in practice so long as we don't input zero!

Let's generalize a little...

\dot{v}_k := \sum_{v_i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i.

The above is are elements of what is called the tangent trace

Note: Initializing $$\dot{\mathbf{x}} = (1,0)$$ or $$\dot{\mathbf{x}}=(0,1)$$ makes $$\dot{y}$$ be the derivative with respect to $$x_1$$ or $$x_2$$ respectively.

Let's now consider

$$v_3 = v_0 v_1$$

$$x_1$$

$$v_4 = v_2 + v_3$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

$$f$$

$$y$$

$$\dot{v}_0 = \dot{x}_1$$

$$\dot{v}_1 = \dot{x}_2$$

$$\dot{v}_3 = \dot{v}_0 v_1 + v_0 \dot{v}_1$$

$$\dot{v}_2 = 2\dot{v}_0$$

$$\dot{v}_4 = \dot{v}_2 + \dot{v}_3$$

$$\dot{y}$$

\dot{v}_k := \sum_{v_i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i

Exercise 2 (Forward Mode)

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i

With the function from the last example, compute the tangent trace for each graph node.

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

Then, use it to evaluate $$\frac{\partial f}{\partial x_1}$$ at (2,5) by setting $$\dot \mathbf{x} = (1,0)$$

Name that Mathematician

Carl Gustav Jacob Jacobi

Most famous for:

theory of elliptic functions

(also that matrix of derivatives from second-year calculus)

Jacobian Matrix

\mathrm{J}_{\mathbf{f}} := \begin{bmatrix} \frac{dy_1}{dx_1} & \cdots & \frac{dy_1}{dx_n} \\ \vdots& \ddots& \vdots \\ \frac{dy_m}{dx_1} & \cdots & \frac{dy_m}{dx_n} \\ \end{bmatrix}

Quiz: The $$k$$-th row of $$\mathrm{J}_{\mathbf{f}}(\mathbf{x})$$ is equal to

\nabla y_k(\mathbf{x})
\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) = \mathbf{y}

$$\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$$

\mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n}

Jacobian-Vector Product (JVP)

\mathrm{J}_{\mathbf{f}}(\mathbf{x})\, \dot{\mathbf{x}} = \begin{bmatrix} \frac{dy_1(\mathbf{x})}{dx_1} & \cdots & \frac{dy_1(\mathbf{x})}{dx_n} \\ \vdots& \ddots& \vdots \\ \frac{dy_m(\mathbf{x})}{dx_1} & \cdots & \frac{dy_m(\mathbf{x})}{dx_n} \\ \end{bmatrix} \begin{bmatrix} \dot{x}_1 \\ \vdots \\ \dot{x}_n \end{bmatrix} =\begin{bmatrix} s_1 \\ \vdots \\ s_m \end{bmatrix}

Forward Mode actually evaluates this for arbitrary $$\dot{\mathbf{x}}$$.

The Jacobian of a scalar-valued function $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$ is its ...

gradient transposed: $$\nabla f(\cdot)^\top$$.

In this case, the Jacobian-Vector product is equal to

\nabla f(\mathbf{x})^\top \mathbf{\dot{x}} = s,

i.e., the

Directional Derivative of $$f$$($$\mathbf{x}$$)  in the direction $$\mathbf{\dot{x}}$$.

Look at one part...

$$v_3 = v_0 v_1$$

$$x_1$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

$$\dot{v}_0 = \dot{x}_1$$

$$\dot{v}_1 = \dot{x}_2$$

$$\dot{v}_3 = \dot{v}_0 v_1 + v_0 \dot{v}_1$$

$$\dot{v}_2 = 2\dot{v}_0$$

\mathrm{J}_{v_3} = \big[\frac{\partial v_3}{\partial v_0}, \frac{\partial v_3}{\partial v_1}\big] = \big[ v_0, v_1\big]
\dot{v_3} = \mathrm{J}_{v_3}\begin{pmatrix} \dot{v}_0 \\ \dot{v}_1 \end{pmatrix}

So...

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i = \mathrm{J}_{v_k}\dot{v}_{{\rm parents}(v_k)}

Forward accumulation computes JVPs and passes them forward...

Note that we never need to form the full Jacobian matrix/matrices!

The "Magic"

We can initialize $$\dot{\mathbf{x}} = \mathbf{r}$$ and compute any JVP: $$\mathrm{J}_f(\mathbf{x}) \mathbf{r}$$ in only one pass through the computation graph!

For $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$, initializing with $$\dot{\mathbf{x}}=\mathbf{e}_i$$ gives the $$i$$-th partial derivative.

This involves roughly the same effort as computing the initial function twice!

More generally...

\mathrm{J}_{\mathbf{f}}(\mathbf{x}) \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{dy_1}{dx_1} & \cdots & \frac{dy_1}{dx_n} \\ \vdots& \ddots& \vdots \\ \frac{dy_m}{dx_1} & \cdots & \frac{dy_m}{dx_n} \\ \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} =
\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) = \mathbf{y}

$$\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$$

One forward mode sweep can evaluate a column of the Jacobian at $$\mathbf{x}$$!

Setting $$\dot{\mathbf{x}} = \mathbf{e}_i$$ evaluates the $$i$$-th column.

\begin{bmatrix} \frac{dy_1}{dx_m} \\ \vdots \\ \frac{dy_m}{dx_n} \\ \end{bmatrix}

Quiz

Forward mode requires      sweeps of the graph to evaluate the full Jacobian.

For $$m=1$$, $$\nabla f(\mathbf{x}) = \mathrm{J}_f (\bf x)^\top$$           :(

\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m,\quad \mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n} \text{ (Jacobian)}

For $$n=1, \mathrm{J}_f(\mathbf{x}) = \mathrm{J}_{f}(\mathbf{x})\times 1$$        :D

n

\overline{v}

Back to Multivariate Chain Rule

$$z_2(x_1, x_2) = x_1 x_2$$

\frac{\partial y}{\partial x_1}=
\frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1}
+\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1}

$$x_1$$

$$y(z_1(x_1), z_2(x_1, x_2))$$

$$\qquad= 2x_1 + z_2(x_1, x_2)$$

$$x_2$$

$$z_1(x_1) = 2 x_1$$

f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Reverse Mode is Generalized Backpropagation

\widehat{y} = {\rm W}_2 \, {\rm max}\left\{\mathbf{0},{\rm W}_1 \mathbf{x} + b_1\right\} + b_2
\text{Loss}(\widehat{y},y)

$$\widehat{y}$$

$$y$$

$$b_1$$

$$b_2$$

$$x_1$$

$$x_2$$

$$x_3$$

Everything Else:

\overline{v}_{k,i} = \frac{\partial y_i}{\partial v_k}

compare with (Partial) Tangents:

\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_i}
\overline{v}_k := \sum_{i \in {\rm children}(v_k)}\textcolor{blue}{\frac{\partial v_i}{\partial v_k}}\overline{v}_i
\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\textcolor{red}{\frac{\partial v_k}{\partial v_i}}\dot{v}_i

\overline{v}
\dot{v}

Exercise 3 (Reverse Mode)

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i

With the function from the last exercises, compute the adjoint trace for each graph node.

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

Vector Jacobian Product (VJP)

(\mathbf{r}^\top \mathrm{J}_f)^\top = \mathrm{J}_f^\top \mathbf{r} = \begin{bmatrix} \frac{dy_1}{dx_1} & \cdots & \frac{dy_1}{dx_n} \\ \vdots& \ddots& \vdots \\ \frac{dy_m}{dx_1} & \cdots & \frac{dy_m}{dx_n} \\ \end{bmatrix}^\top \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix} =\begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix}^\top

Evaluate the VJP by initializing the reverse phase with $$\overline{\mathbf{y}} = \mathbf{r}.$$

\overline{v}_k := \sum_{i \in {\rm children}(v_k)}\textcolor{blue}{\frac{\partial v_i}{\partial v_k}}\overline{v}_i

Key Observation

This single node "backpropagation" is a Vector-Jacobian product!

$$v_3 = v_0 v_1$$

$$x_1$$

$$x_2$$

$$v_2 = 2 v_0$$

$$v_0$$

$$v_1$$

$$\overline{v}_0$$

$$\overline{v}_1 = \overline{x}_2$$

$$\overline{v}_3$$

$$\overline{v}_2$$

$$v_3 = {\rm SOLVE}_{\mathbf{y}} (\mathrm{v_0}\mathbf{y}=v_1)$$

$$\mathrm{A}$$

$$v_4 = v_2 v_3$$

$$\mathbf{x}$$

$$v_2 = \mathrm{A}^{1/2}$$

$$v_0$$

$$v_1$$

Graph nodes need not contain elementary operations: you can compress it so long as you know a nodes VJP function!

$$f$$

$$\mathbf{y}$$

Primatives

Recall that Forward Mode with unit vectors for $$\dot{\mathbf{x}}$$ evaluates a column of the Jacobian.

Backward Mode with unit vectors for $$\dot{\mathbf{x}}$$ evaluates a row of the Jacobian.

Quiz (Reverse Mode)

For $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$, we can evaluate the gradient with     backward sweep(s).

1

For $$f: \mathbb{R} \rightarrow \mathbb{R}^m$$, we can obtain the Jacobian with       backward sweep(s).

m

$$\nabla(\mathbf{x})^\top = 1 \times {\rm J_f}(\mathbf{x})$$

(one row each time)

Automatic Differentiation: Summary

Forward JVP:

Column
Backward VJP:  Row

\mathbb{R}^n \rightarrow \mathbb{R}^m
\mathbb{R}^m \rightarrow \mathbb{R}^n
\mathrm{J}_{\mathbf{f}}[i,:](\mathbf{x})
\mathrm{J}_{\mathbf{f}}[:,i](\mathbf{x})
(\dot{\mathbf{x}} \in \mathbb{R}^n)
(\overline{\mathbf{y}} \in \mathbb{R}^m)
\dot{\mathbf{x}}/\overline{\mathbf{y}} = \mathbf{e}_i
\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m,\quad \mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n} \text{ (Jacobian)}
\big(\nabla f(\mathbf{x})^\top \text{ for } m=1\big)
\overline{\mathbf{y}}^\top \mathrm{J}_{\mathbf{f}} = (\mathrm{J}_{\mathbf{f}}^\top \overline{\mathbf{y}})^\top
\mathrm{J}_{\mathbf{f}}\dot{\mathbf{x}}

Can you combine the two in smart ways?

Yes

(I'll give you the most useful example...)

Hessians

\mathrm{H}_f(\mathbf{x}) := \begin{bmatrix} \frac{dy}{\partial x_1 \partial x_1} & \cdots & \frac{dy}{\partial x_1 \partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{dy}{\partial x_n \partial x_1} & \cdots & \frac{dy_m}{dx_n dx_n} \\ \end{bmatrix}
• Suppose that $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$. Then, the Hessian of $$f$$ is

$${\rm H}_{f}(\mathbf{x}) = \mathrm{J_{\nabla f}}(\mathbf{x}) = \textcolor{red}{\nabla} \big( \nabla f(\mathbf{x})\big)$$

• The Hessian is the                 of the               of $$f$$

Jacobian

• VJP (Forward Mode) to obtain trace of (*) $$\mathbf{u}^\top\nabla f(\mathbf{x})$$
• Take Gradient of (*) (Reverse Mode):
\nabla\left(\mathbf{r}^\top \nabla f(\mathbf{x})\right) = \mathbf{r}^\top \nabla\nabla f(\mathbf{x}) = \mathbf{r}^\top {\rm H} = ({\rm H}\mathbf{r})^\top

Hessian Vector Product (HVP)

Requires effort of (roughly) only four function evaluations!

Inverse-Hessian Vector Product

Can even evaluate $$\mathrm{H}^{-1}\mathbf{r}$$  by touching only HVPs (using the Conjugate Gradient method).

e.g., Neural Net with with $$10^6$$ parameters. Hessian has $$10^{12}$$ elements: 8 terabytes!

Question and Discussion Time!

Part II : Practice

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions.

Graphics Processing Units (GPUs)

Good at doing many linear algebra operations (VJP/JVP) at once.

$$v_3 = {\rm SOLVE}_{\mathbf{y}} (\mathrm{v_0}\mathbf{y}=v_1)$$

$$\mathrm{A}$$

$$v_4 = v_2 v_3$$

$$\mathbf{x}$$

$$v_2 = \mathrm{A}^{1/2}$$

$$v_0$$

$$v_1$$

$$f$$

$$\mathbf{y}$$

$$\overline{v}_1$$:JVP-2

$$\overline{v}_0$$:JVP-1

Parallelization

Forward Mode Reverse Mode GPU
Tensorflow

PyTorch

Jax
\checkmark
\checkmark
\checkmark
\checkmark
\checkmark
\checkmark
\checkmark
\checkmark
\checkmark
\text{\sf X}
\text{\sf X}
\text{\sf X}

Python Frameworks

Generally reverse mode will be enough, but beware!

Where's the VJP?!

Using software that doesn't have forward mode?

You can get a JVP using a VJP of a VJP.

By Rob Salomone

• 451