Robert Salomone

# How did I get here?

I am not a computer scientist.

# Application Areas

Derivative information is used in algorithms in...

• Machine Learning (Optimization)
• Computational Statistics (Sampling)

# Goals

• Understand what Automatic Differentiation is and isn't
• Demystify why PyTorch and TensorFlow are (seemingly) so weird.
• Demonstrate autodiff in Python, with some "advanced" features, using PyTorch and Autograd.

# The Good News

You can get (reasonably) far without knowing much.

# The Not So Good News

You can only get so far without knowing much.

# The Plan

First Half: A gentle yet complete introduction to what AD is and what it is doing.

(key to understanding any AD software framework)

Second Half: Tour through two popular AD frameworks in Python, including some nice "tricks" to ease the learning curve.

# Reference

Baydin et al. 2017. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18, 1.

# Finite Differences

Remember what a (partial) derivative is:

\frac{\partial f}{\partial x_i} := \lim_{h\rightarrow 0}\frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i, \ldots, x_n)}{h}

For small $$h$$, we have the forward difference approximation --- $$O(h)$$:

\frac{\partial f}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i, \ldots, x_n)}{h},

and the central difference approximation --- $$O(h^2)$$:

\frac{\partial f}{\partial x_i} \approx \frac{f(x_1, \ldots, x_i + h,\ldots, x_n) - f(x_1,\ldots, x_i-h, \ldots, x_n)}{2h}.

Finite differences are not only inaccurate, but also inefficient.
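As a concrete illustration of the two formulas above, here is a minimal NumPy sketch (`forward_diff` and `central_diff` are just illustrative helper names, not part of any library):

```python
import numpy as np

def forward_diff(f, x, i, h=1e-6):
    # O(h) forward difference approximation of df/dx_i at x
    e = np.zeros_like(x)
    e[i] = h
    return (f(x + e) - f(x)) / h

def central_diff(f, x, i, h=1e-6):
    # O(h^2) central difference approximation of df/dx_i at x
    e = np.zeros_like(x)
    e[i] = h
    return (f(x + e) - f(x - e)) / (2 * h)

f = lambda x: 2 * x[0] + x[0] * x[1]   # f(x1, x2) = 2*x1 + x1*x2
x = np.array([2.0, 5.0])
print(forward_diff(f, x, 0), central_diff(f, x, 0))   # both roughly 7.0 = 2 + x2
```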


# Expression Swell

(Table illustrating expression swell in symbolic differentiation; see the Reference.)

• Scalar-Valued Function $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$, i.e., $$y = f(\mathbf{x})$$.

\nabla f(\mathbf{x}) := \begin{pmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{pmatrix}

Finite Difference: Approximates $$\nabla f(\mathbf{x})$$, requiring (at least) $$n+1$$ function evaluations for each desired $$\mathbf{x}$$.

Symbolic: Computes $$\nabla f$$ as its own function object. Then, requires evaluation of this object with numerical values as input.

Automatic Differentiation: Computes exactly $$\nabla f(\mathbf{x})$$ and $$f(\mathbf{x})$$. If done using the correct AD mode, it requires only (roughly) two function evaluations for each $$\mathbf{x}$$.

# Why is PyTorch / Tensorflow so confusing?

(1) Automatic Differentiation relies on an abstract concept called a computational graph.

(2) Autodiff isn't really taking partial derivatives or gradients: it is doing something else.

Solution: Understand these concepts first, then use the software.

# Computation Graphs and Primal Traces

## All functions are made up of more elementary functions...



f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

### Computation Graph for $$f(x_1, x_2) = 2x_1 + x_1 x_2$$

Inputs $$v_0 = x_1$$ and $$v_1 = x_2$$ feed the intermediate nodes $$v_2 = 2 v_0$$ and $$v_3 = v_0 v_1$$, which feed the output node $$v_4 = v_2 + v_3$$, giving $$y = f$$.

Primal Trace

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3
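In code, evaluating the function already produces a primal trace. A minimal Python sketch, with variable names mirroring the $$v_k$$ above:

```python
def f_with_trace(x1, x2):
    # Evaluate f(x1, x2) = 2*x1 + x1*x2, recording each intermediate value
    v0 = x1          # v0 = x1
    v1 = x2          # v1 = x2
    v2 = 2 * v0      # v2 = 2*v0
    v3 = v0 * v1     # v3 = v0*v1
    v4 = v2 + v3     # v4 = v2 + v3  (the output y)
    return v4, [v0, v1, v2, v3, v4]

y, trace = f_with_trace(2.0, 5.0)
print(y)      # 14.0
print(trace)  # [2.0, 5.0, 4.0, 10.0, 14.0]
```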

## Exercise 1 (Graphs and Primal Trace)

Draw the computation graph and list the primal trace for the function

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

Note: The graph and primal trace in general are not unique (different order / labels).

Hint: Pretend you are actually evaluating the function, and keep a running journal of what you do.

Solution: see the Reference (Baydin et al. 2017).

# A graph is all you need...

• Autodiff only requires a computational graph for the input of interest
• Can include recursions, conditionals (if statements), loops, etc. (see the sketch after this list)
• When you run the function, it just needs to take a trace of what it does with that input!
• This can include "fancy" stuff like...
• ODE Solver (Runge-Kutta, Implicit Euler)
• Fast Fourier Transform
• Matrix Cholesky Factorization
• Matrix (log-)Determinant
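For instance, a minimal PyTorch sketch of reverse-mode AD through a Python loop and an `if` statement; the graph is simply whatever happens to be executed for this particular input:

```python
import torch

def f(x, n_steps):
    # Ordinary Python control flow; autodiff only sees the operations
    # that are actually executed for this particular input.
    y = x
    for _ in range(n_steps):
        if y.sum() > 0:
            y = torch.sin(y) + y
        else:
            y = 2 * y
    return y.sum()

x = torch.tensor([0.5, -0.2], requires_grad=True)
f(x, n_steps=3).backward()   # reverse mode over the recorded trace
print(x.grad)                # gradient of f at this particular x
```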

# Reminder: Multivariate Chain Rule

Write $$f$$ as a composition: with $$z_1(x_1) = 2 x_1$$ and $$z_2(x_1, x_2) = x_1 x_2$$,

f(x_1, x_2) = y(z_1(x_1), z_2(x_1,x_2)) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Then the multivariate chain rule gives

\frac{\partial y}{\partial x_1}=
\frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1}
+\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1}

# Multivariate Chain Rule

With the same decomposition ($$z_1(x_1) = 2x_1$$, $$z_2(x_1, x_2) = x_1 x_2$$), note that $$z_1$$ does not depend on $$x_2$$, so the chain rule has only one term:

\frac{\partial y}{\partial x_2}=
\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_2}

# Forward Mode: $$\dot{v}$$

Recall the computation graph and primal trace for

f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Primal Trace

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3

# $$\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_{i}}$$

(Partial) Tangent Trace: $$\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_{i}}$$

For the graph of $$f(x_1, x_2) = 2x_1 + x_1 x_2$$:

$$\dot{v}_{0,1} = \frac{\partial v_0}{\partial x_1}=1$$

$$\dot{v}_{2,1} = \frac{\partial v_2}{\partial x_1}=\frac{\partial v_2}{\partial v_0}\frac{\partial v_0}{\partial x_1}= \frac{\partial v_2}{\partial v_0}\dot{v}_{0,1}$$

$$\dot{v}_{3,1} = \frac{\partial v_3}{\partial x_1}=\frac{\partial v_3}{\partial v_0}\frac{\partial v_0}{\partial x_1}= \frac{\partial v_3}{\partial v_0}\dot{v}_{0,1}$$

$$\dot{v}_{4,1} = \frac{\partial v_4}{\partial x_1}=\frac{\partial v_4}{\partial v_2}\frac{\partial v_2}{\partial x_1} + \frac{\partial v_4}{\partial v_3}\frac{\partial v_3}{\partial x_1}= \frac{\partial v_4}{\partial v_2}\dot{v}_{2,1} + \frac{\partial v_4}{\partial v_3}\dot{v}_{3,1}$$

Key Aspect: only expand the chain rule once at each stage, reusing the (forward) accumulated derivatives.

# Intuition

Primal Trace

v_0 = x_1 \\ v_1 = x_2 \\ v_2 = 2v_0 \\ v_3 = v_0 v_1 \\ v_4 = v_2 + v_3
f(x_1, x_2) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

Differentiability

In practice, we only need the function to be differentiable at the points we evaluate it along the tangent trace. So, e.g., the absolute value function $$|\cdot|$$ is not problematic in practice, so long as we don't input zero!

# Let's generalize a little...

\dot{v}_k := \sum_{v_i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i.

The above are the elements of what is called the tangent trace.

Note: Initializing $$\dot{\mathbf{x}} = (1,0)$$ or $$\dot{\mathbf{x}}=(0,1)$$ makes $$\dot{y}$$ the derivative with respect to $$x_1$$ or $$x_2$$, respectively.

Let's now consider the tangent trace for the graph of $$f(x_1, x_2) = 2x_1 + x_1 x_2$$:

$$\dot{v}_0 = \dot{x}_1$$

$$\dot{v}_1 = \dot{x}_2$$

$$\dot{v}_2 = 2\dot{v}_0$$

$$\dot{v}_3 = \dot{v}_0 v_1 + v_0 \dot{v}_1$$

$$\dot{v}_4 = \dot{v}_2 + \dot{v}_3$$

$$\dot{y} = \dot{v}_4$$

where each line is an instance of

\dot{v}_k := \sum_{v_i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i
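A minimal Python sketch of forward mode for this graph: each primal value $$v_k$$ is carried together with its tangent $$\dot{v}_k$$ (written `dv0`, `dv1`, ... below):

```python
def f_forward_mode(x1, x2, dx1, dx2):
    # Propagate primal values v and tangents dv through the graph together
    v0, dv0 = x1, dx1
    v1, dv1 = x2, dx2
    v2, dv2 = 2 * v0, 2 * dv0
    v3, dv3 = v0 * v1, dv0 * v1 + v0 * dv1   # product rule
    v4, dv4 = v2 + v3, dv2 + dv3
    return v4, dv4                           # y and dy

# Seed (1, 0): dy = df/dx1 = 2 + x2 = 7 at (2, 5)
print(f_forward_mode(2.0, 5.0, 1.0, 0.0))    # (14.0, 7.0)
# Seed (0, 1): dy = df/dx2 = x1 = 2
print(f_forward_mode(2.0, 5.0, 0.0, 1.0))    # (14.0, 2.0)
```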

# Exercise 2 (Forward Mode)

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i

With the function from the last example, compute the tangent trace for each graph node.

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

Then, use it to evaluate $$\frac{\partial f}{\partial x_1}$$ at $$(2,5)$$ by setting $$\dot{\mathbf{x}} = (1,0)$$.

# Name that Mathematician

Carl Gustav Jacob Jacobi

Most famous for:

theory of elliptic functions

(also that matrix of derivatives from second-year calculus)

# Jacobian Matrix

$$\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$$

\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) = \mathbf{y}

\mathrm{J}_{\mathbf{f}} := \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix}, \qquad \mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n}

Quiz: The $$k$$-th row of $$\mathrm{J}_{\mathbf{f}}(\mathbf{x})$$ is equal to

\nabla y_k(\mathbf{x})^\top

# Jacobian-Vector Product (JVP)

\mathrm{J}_{\mathbf{f}}(\mathbf{x})\, \dot{\mathbf{x}} = \begin{bmatrix} \frac{\partial y_1(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial y_1(\mathbf{x})}{\partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{\partial y_m(\mathbf{x})}{\partial x_1} & \cdots & \frac{\partial y_m(\mathbf{x})}{\partial x_n} \\ \end{bmatrix} \begin{bmatrix} \dot{x}_1 \\ \vdots \\ \dot{x}_n \end{bmatrix} =\begin{bmatrix} s_1 \\ \vdots \\ s_m \end{bmatrix}

Forward Mode actually evaluates this for arbitrary $$\dot{\mathbf{x}}$$.

The Jacobian of a scalar-valued function $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$ is its ...

gradient transposed: $$\nabla f(\cdot)^\top$$.

In this case, the Jacobian-Vector product is equal to

\nabla f(\mathbf{x})^\top \mathbf{\dot{x}} = s,

i.e., the

Directional derivative of $$f$$ at $$\mathbf{x}$$ in the direction $$\dot{\mathbf{x}}$$.
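A minimal PyTorch sketch of this, using `torch.autograd.functional.jvp` to evaluate the directional derivative (depending on the PyTorch version, this routine may compute the JVP internally via the double-VJP trick mentioned at the end of the talk):

```python
import torch
from torch.autograd.functional import jvp

def f(x):
    # f: R^2 -> R, f(x) = 2*x1 + x1*x2
    return 2 * x[0] + x[0] * x[1]

x = torch.tensor([2.0, 5.0])
r = torch.tensor([1.0, 0.0])        # direction (the seed x-dot)
value, dirderiv = jvp(f, x, r)      # returns f(x) and J_f(x) @ r
print(value, dirderiv)              # tensor(14.) tensor(7.)  (df/dx1 at x)
```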

# Look at one part...

At the node $$v_3 = v_0 v_1$$, with parent tangents $$\dot{v}_0$$ and $$\dot{v}_1$$ coming in, the tangent update is

$$\dot{v}_3 = \dot{v}_0 v_1 + v_0 \dot{v}_1$$

Viewing $$v_3$$ as a function of its parents, its (local) Jacobian is

\mathrm{J}_{v_3} = \big[\frac{\partial v_3}{\partial v_0}, \frac{\partial v_3}{\partial v_1}\big] = \big[ v_1, v_0\big]

so the update is a (local) Jacobian-vector product:

\dot{v_3} = \mathrm{J}_{v_3}\begin{pmatrix} \dot{v}_0 \\ \dot{v}_1 \end{pmatrix}

# So...

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\frac{\partial v_k}{\partial v_i}\dot{v}_i = \mathrm{J}_{v_k}\dot{v}_{{\rm parents}(v_k)}

Forward accumulation computes JVPs and passes them forward...

Note that we never need to form the full Jacobian matrix/matrices!

# The "Magic"

We can initialize $$\dot{\mathbf{x}} = \mathbf{r}$$ and compute any JVP: $$\mathrm{J}_f(\mathbf{x}) \mathbf{r}$$ in only one pass through the computation graph!

For $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$, initializing with $$\dot{\mathbf{x}}=\mathbf{e}_i$$ gives the $$i$$-th partial derivative.

This involves roughly the same effort as evaluating the original function twice!

# More generally...

$$\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m$$

\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_m(\mathbf{x})) = \mathbf{y}

\mathrm{J}_{\mathbf{f}}(\mathbf{x}) \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_n} \\ \vdots \\ \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix}

One forward mode sweep can evaluate a column of the Jacobian at $$\mathbf{x}$$!

Setting $$\dot{\mathbf{x}} = \mathbf{e}_i$$ evaluates the $$i$$-th column.

# Quiz

Forward mode requires $$n$$ sweeps of the graph to evaluate the full Jacobian.

\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m,\quad \mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n} \text{ (Jacobian)}

For $$m=1$$, $$\nabla f(\mathbf{x}) = \mathrm{J}_f (\bf x)^\top$$, so the gradient still needs $$n$$ sweeps.  :(

For $$n=1$$, $$\mathrm{J}_f(\mathbf{x}) = \mathrm{J}_{f}(\mathbf{x})\times 1$$, so a single sweep gives the whole Jacobian.  :D

# Reverse Mode: $$\overline{v}$$

# Back to Multivariate Chain Rule

Write $$f$$ as a composition again: with $$z_1(x_1) = 2 x_1$$ and $$z_2(x_1, x_2) = x_1 x_2$$,

f(x_1, x_2) = y(z_1(x_1), z_2(x_1,x_2)) = z_1(x_1) + z_2(x_1,x_2) = 2x_1 + x_1 x_2

\frac{\partial y}{\partial x_1}=
\frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x_1}
+\frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x_1}

Reverse mode applies the same chain rule, but accumulates the terms starting from the output $$y$$ and working back towards the inputs.

## Reverse Mode is Generalized Backpropagation

\widehat{y} = {\rm W}_2 \, {\rm max}\left\{\mathbf{0},{\rm W}_1 \mathbf{x} + b_1\right\} + b_2

\text{Loss}(\widehat{y},y)

(Computation graph of the loss, with inputs $$x_1, x_2, x_3$$, biases $$b_1, b_2$$, prediction $$\widehat{y}$$ and target $$y$$ as nodes.)

## Everything Else:

Adjoints: $$\overline{v}_{k,i} := \frac{\partial y_i}{\partial v_k}$$, compare with (partial) tangents: $$\dot{v}_{k,i} := \frac{\partial v_k}{\partial x_i}$$

Reverse (adjoint) recursion:

\overline{v}_k := \sum_{i \in {\rm children}(v_k)}\textcolor{blue}{\frac{\partial v_i}{\partial v_k}}\overline{v}_i

Forward (tangent) recursion:

\dot{v}_k := \sum_{i \in {\rm parents}(v_k)}\textcolor{red}{\frac{\partial v_k}{\partial v_i}}\dot{v}_i

# Exercise 3 (Reverse Mode)

\overline{v}_k := \sum_{i \in {\rm children}(v_k)}\frac{\partial v_i}{\partial v_k}\overline{v}_i

With the function from the last exercises, compute the adjoint trace for each graph node.

$$f(x_1, x_2) = \log(x_1) + x_1 x_2 - \sin(x_2)$$

# Vector Jacobian Product (VJP)

(\mathbf{r}^\top \mathrm{J}_f)^\top = \mathrm{J}_f^\top \mathbf{r} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix}^\top \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix} =\begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix}

Evaluate the VJP by initializing the reverse phase with $$\overline{\mathbf{y}} = \mathbf{r}.$$

\overline{v}_k := \sum_{i \in {\rm children}(v_k)}\textcolor{blue}{\frac{\partial v_i}{\partial v_k}}\overline{v}_i
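A minimal PyTorch sketch: `torch.autograd.functional.vjp` evaluates exactly this, given the function, the input, and the seed $$\overline{\mathbf{y}} = \mathbf{r}$$:

```python
import torch
from torch.autograd.functional import vjp

def f(x):
    # f: R^2 -> R^2 with y1 = 2*x1 + x1*x2 and y2 = x1*x2
    return torch.stack([2 * x[0] + x[0] * x[1], x[0] * x[1]])

x = torch.tensor([2.0, 5.0])
r = torch.tensor([1.0, 0.0])      # seed ybar for the reverse phase
value, pullback = vjp(f, x, r)    # returns f(x) and J_f(x)^T @ r
print(pullback)                   # tensor([7., 2.]) = first row of J_f(x)
```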

# Key Observation

This single node "backpropagation" is a Vector-Jacobian product!

For example, at the node $$v_3 = v_0 v_1$$, the incoming adjoint $$\overline{v}_3$$ is passed back to the parents $$v_0$$ and $$v_1$$ by multiplying it with the node's local Jacobian $$\mathrm{J}_{v_3} = [v_1, v_0]$$.

The same picture holds for graphs with much richer nodes, e.g.

$$v_2 = \mathrm{A}^{1/2}, \qquad v_3 = {\rm SOLVE}_{\mathbf{y}} (v_0\mathbf{y}=v_1), \qquad v_4 = v_2 v_3$$

Graph nodes need not contain elementary operations: you can compress the graph so long as you know a node's VJP function!
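A minimal PyTorch sketch of such a compressed node, using the standard `torch.autograd.Function` mechanism (the class name `LinearSolve` and the gradient formulas below are written out here purely for illustration): the forward pass can be anything, as long as `backward` returns the VJPs for the node's parents.

```python
import torch

class LinearSolve(torch.autograd.Function):
    """A single 'compressed' graph node y = solve(A, b)."""

    @staticmethod
    def forward(ctx, A, b):
        y = torch.linalg.solve(A, b)
        ctx.save_for_backward(A, y)
        return y

    @staticmethod
    def backward(ctx, y_bar):
        # VJPs for the node's two parents:
        #   b_bar = A^{-T} y_bar   and   A_bar = -b_bar y^T
        A, y = ctx.saved_tensors
        b_bar = torch.linalg.solve(A.T, y_bar)
        A_bar = -b_bar.unsqueeze(1) @ y.unsqueeze(0)
        return A_bar, b_bar

A = torch.tensor([[3.0, 1.0], [1.0, 2.0]], requires_grad=True)
b = torch.tensor([1.0, 0.0], requires_grad=True)
LinearSolve.apply(A, b).sum().backward()
print(A.grad, b.grad)
```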

# Primitives

Recall that Forward Mode with unit vectors for $$\dot{\mathbf{x}}$$ evaluates a column of the Jacobian.

Backward Mode with unit vectors for $$\overline{\mathbf{y}}$$ evaluates a row of the Jacobian.

# Quiz (Reverse Mode)

For $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$, we can evaluate the gradient with 1 backward sweep:

$$\nabla f(\mathbf{x})^\top = 1 \times {\rm J_f}(\mathbf{x})$$

For $$f: \mathbb{R} \rightarrow \mathbb{R}^m$$, we can obtain the Jacobian with $$m$$ backward sweeps (one row each time).
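A minimal PyTorch check of the $$m=1$$ case, using the exercise function: one call to `.backward()` (a single reverse sweep, seeded with $$\overline{y}=1$$) fills in the entire gradient.

```python
import torch

x = torch.tensor([2.0, 5.0], requires_grad=True)
y = torch.log(x[0]) + x[0] * x[1] - torch.sin(x[1])   # scalar output (m = 1)
y.backward()        # a single reverse sweep, seeded with ybar = 1
print(x.grad)       # tensor([5.5000, 1.7163]) = gradient at (2, 5)
```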

# Automatic Differentiation: Summary

\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^m,\quad \mathrm{J}_{\mathbf{f}} \in \mathbb{R}^{m \times n} \text{ (Jacobian)}

|  | Forward (JVP) | Backward (VJP) |
| --- | --- | --- |
| Computes | $$\mathrm{J}_{\mathbf{f}}\dot{\mathbf{x}}$$ | $$\overline{\mathbf{y}}^\top \mathrm{J}_{\mathbf{f}} = (\mathrm{J}_{\mathbf{f}}^\top \overline{\mathbf{y}})^\top$$ |
| Seed | $$\dot{\mathbf{x}} \in \mathbb{R}^n$$ | $$\overline{\mathbf{y}} \in \mathbb{R}^m$$ |
| Maps | $$\mathbb{R}^n \rightarrow \mathbb{R}^m$$ | $$\mathbb{R}^m \rightarrow \mathbb{R}^n$$ |
| With seed $$\mathbf{e}_i$$ | Column $$\mathrm{J}_{\mathbf{f}}[:,i](\mathbf{x})$$ | Row $$\mathrm{J}_{\mathbf{f}}[i,:](\mathbf{x})$$ $$\big(\nabla f(\mathbf{x})^\top \text{ for } m=1\big)$$ |

# Can you combine the two in smart ways?

Yes

(I'll give you the most useful example...)

# Hessians

• Suppose that $$f: \mathbb{R}^n \rightarrow \mathbb{R}$$. Then, the Hessian of $$f$$ is

\mathrm{H}_f(\mathbf{x}) := \begin{bmatrix} \frac{\partial^2 y}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 y}{\partial x_1 \partial x_n} \\ \vdots& \ddots& \vdots \\ \frac{\partial^2 y}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 y}{\partial x_n \partial x_n} \\ \end{bmatrix}

$${\rm H}_{f}(\mathbf{x}) = \mathrm{J_{\nabla f}}(\mathbf{x}) = \textcolor{red}{\nabla} \big( \nabla f(\mathbf{x})\big)$$

• The Hessian is the Jacobian of the gradient of $$f$$.

• JVP (Forward Mode) to obtain a trace of (*) $$\mathbf{r}^\top\nabla f(\mathbf{x})$$
• Take the gradient of (*) (Reverse Mode):

\nabla\left(\mathbf{r}^\top \nabla f(\mathbf{x})\right) = \mathbf{r}^\top \nabla\nabla f(\mathbf{x}) = \mathbf{r}^\top {\rm H} = ({\rm H}\mathbf{r})^\top

Hessian Vector Product (HVP)

Requires effort of (roughly) only four function evaluations!

Inverse-Hessian Vector Product

Can even evaluate $$\mathrm{H}^{-1}\mathbf{r}$$  by touching only HVPs (using the Conjugate Gradient method).

e.g., a Neural Net with $$10^6$$ parameters: the Hessian has $$10^{12}$$ elements, i.e., 8 terabytes!
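Forward mode isn't strictly needed to try this out: a minimal PyTorch sketch that gets $$\mathrm{H}\mathbf{r}$$ from two reverse sweeps instead (differentiating the scalar $$\mathbf{r}^\top \nabla f(\mathbf{x})$$), so the Hessian is never formed:

```python
import torch

def f(x):
    return torch.log(x[0]) + x[0] * x[1] - torch.sin(x[1])

x = torch.tensor([2.0, 5.0], requires_grad=True)
r = torch.tensor([1.0, 0.0])

# First reverse sweep, kept differentiable so the gradient can itself be differentiated
(g,) = torch.autograd.grad(f(x), x, create_graph=True)
# Gradient of the scalar r^T grad f(x) is H(x) r
(hvp,) = torch.autograd.grad(g @ r, x)
print(hvp)   # tensor([-0.2500, 1.0000]) = H(x) @ r, without forming H
```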

Question and Discussion Time!

# Part II : Practice

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions.
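A minimal JAX sketch of both modes on the exercise function (`jax.grad` for reverse mode, `jax.jvp` for forward mode):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.log(x[0]) + x[0] * x[1] - jnp.sin(x[1])

x = jnp.array([2.0, 5.0])

print(jax.grad(f)(x))                                # reverse mode: full gradient
value, dirderiv = jax.jvp(f, (x,), (jnp.array([1.0, 0.0]),))
print(value, dirderiv)                               # forward mode: f(x) and df/dx1
```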

# Graphics Processing Units (GPUs)

Good at doing many linear algebra operations (VJP/JVP) at once.

For example, in the matrix-valued graph from before ($$v_2 = \mathrm{A}^{1/2}$$, $$v_3 = {\rm SOLVE}_{\mathbf{y}}(v_0\mathbf{y}=v_1)$$, $$v_4 = v_2 v_3$$), the backward step at a node produces one product per parent (one for $$\overline{v}_0$$, one for $$\overline{v}_1$$), and these can be computed in parallel (parallelization).

# Python Frameworks

|  | Forward Mode | Reverse Mode | GPU |
| --- | --- | --- | --- |
| TensorFlow | ✓ | ✓ | ✓ |
| PyTorch | ✗ | ✓ | ✓ |
| JAX | ✓ | ✓ | ✓ |

Generally reverse mode will be enough, but beware!

# Where's the JVP?!

Using software that doesn't have forward mode?

You can get a JVP using a VJP of a VJP.
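A minimal PyTorch sketch of the trick (`jvp_via_double_vjp` is just an illustrative name): take a VJP with a dummy seed `u`, keep it differentiable, and then take a VJP of that result with respect to `u`.

```python
import torch

def f(x):
    # f: R^2 -> R^2
    return torch.stack([2 * x[0] + x[0] * x[1], x[0] * x[1]])

def jvp_via_double_vjp(f, x, v):
    x = x.detach().requires_grad_(True)
    u = torch.zeros_like(f(x), requires_grad=True)    # dummy seed ybar
    # First VJP: g(u) = J_f(x)^T u, kept differentiable in u
    (g,) = torch.autograd.grad(f(x), x, grad_outputs=u, create_graph=True)
    # Second VJP, now w.r.t. u: since g is linear in u, this returns J_f(x) v
    (jvp,) = torch.autograd.grad(g, u, grad_outputs=v)
    return jvp

x = torch.tensor([2.0, 5.0])
v = torch.tensor([1.0, 0.0])
print(jvp_via_double_vjp(f, x, v))   # tensor([7., 5.]) = first column of J_f(x)
```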
