## Algorithmic/Automatic Differentiation – Part 1: Forward Mode

In this post, I’ll explain the basics of algorithmic differentiation (AD), a.k.a. automatic differentiation, which is a technique for programmatically evaluating derivatives of mathematical functions.

This is (again) the first part of a multi-part post. In this part I’ll just explain first order, “forward mode” AD. I will cover “reverse mode” AD in a second post and possibly higher order derivatives in a third. (Apologies to anybody waiting for part 2 of the sigma-algebras article. I promise I will finish that, but I can’t promise when. The topic of this article is directly relevant to something I’m currently working on.)

AD differs from the two other main ways of programmatically calculating derivatives, which are symbolic differentiation and numerical differentiation. Symbolic differentiation involves the manipulation of a symbolic expression, such as ${f(x) = x^2 + \ln x}$, applying the rules of differentiation to find an expression for the derivative, ${f'(x) = 2x + 1/x}$ in this example. Numerical differentiation instead involves invoking (multiple times) a computer program that implements the mathematical function in question, passing in slightly different input values each time and using the results to compute an approximation of the derivative.
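To make the numerical approach concrete, here is a minimal central-difference sketch for the example function above (the step size is an illustrative choice, not a recommendation):

```python
import math

# f(x) = x^2 + ln(x), the example from the text.
def f(x):
    return x * x + math.log(x)

# Central-difference approximation of f'(x). The step size h is a
# trade-off: too large gives truncation error, too small gives
# floating point cancellation error.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(f, 2.0))  # close to 4.5, the exact value of 2x + 1/x at x = 2
```

Note that the result is only an approximation, and its accuracy depends on the choice of `h`; this sensitivity is one of the things that distinguishes numerical differentiation from AD.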

AD resembles symbolic differentiation in that it involves applying the rules of differentiation rather than evaluating the function for multiple different input values. However, it also resembles numerical differentiation in that the function to be differentiated is provided in the form of executable computer code, consisting of variable assignments, function calls, etc, rather than as a symbolic expression. Of course, these comments also imply that AD differs from both of these approaches. It is a third, distinct technique.

AD offers major benefits in terms of performance, numerical stability and ease of representation. It also has certain drawbacks, but despite these it is a very attractive option for programmatic differentiation, and is often preferable to the alternatives. I’m not going to go into the pros and cons here, though. The purpose of this article is just to explain how AD works, hopefully in a way that will make sense to people whose linear algebra and vector calculus are as rusty as mine…

1. Forward Mode

There are two main approaches to AD: forward mode (a.k.a. “tangent linear mode”) and reverse mode (a.k.a. “adjoint mode”). The former is the simplest to explain, so I’ll start there.

To set the scene, imagine we have a vector function ${F: \mathbb{R}^n \rightarrow \mathbb{R}^m}$. We have a programmatic implementation of this function, i.e. some code that evaluates ${ \mathbf{y} = F(\mathbf{x})}$ for a given input vector ${ \mathbf{x}}$, and we are going to be interested in evaluating the partial derivatives of the ${m}$ elements of the output with respect to the ${n}$ elements of the input. For now, though, we’ll look at some very basic vector calculus relating to ${F}$. The relevance of this to AD will become clear a little further on.

The Jacobian matrix of ${F}$ for a given ${\mathbf{x}}$, is an ${m \times n}$ matrix of partial derivatives, which we’ll write ${J_{F,\mathbf{x}}}$:

$J_{F,\mathbf{x}} = \left( \begin{array}{cccc} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\[0.5em] \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\[0.5em] \vdots & \vdots & \ddots & \vdots \\[0.5em] \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right)$

The Jacobian matrix, being an ${m \times n}$ matrix, itself corresponds to a linear function from ${\mathbb{R}^n}$ to ${ \mathbb{R}^m}$, which I will refer to as the Jacobian function, or simply the Jacobian. I will use the same symbol for the Jacobian function as for the Jacobian matrix: ${J_{F,\mathbf{x}}: \mathbb{R}^n \rightarrow \mathbb{R}^m}$, ${J_{F,\mathbf{x}}(\mathbf{x}^\prime) = J_{F,\mathbf{x}}\mathbf{x}^\prime}$. Throughout this article, unless I explicitly state otherwise, whenever I refer to the Jacobian I will always be talking about the function, rather than the matrix. I won’t have much cause to talk about matrices.

The Jacobian function ${J_{F,\mathbf{x}}}$ can be considered a linear approximation to the original function, ${F}$, around the point ${\mathbf{x}}$. More precisely, if you move along a straight line in ${\mathbb{R}^n}$ from point ${\mathbf{x}}$ to point ${\mathbf{x} + \Delta\mathbf{x}}$ the image under ${F}$ will move along a (potentially curved) path from point ${\mathbf{y} = F(\mathbf{x})}$ to point ${ \mathbf{y} + \Delta\mathbf{y}}$, where ${\Delta\mathbf{y} = F(\mathbf{x} + \Delta\mathbf{x}) - F(\mathbf{x})}$. For a sufficiently small movement ${\Delta\mathbf{x}}$, the movement ${\Delta\mathbf{y}}$ is approximated by ${J_{F,\mathbf{x}}(\Delta\mathbf{x})}$. This is the multidimensional equivalent of moving along the tangent line as an approximation of moving along a curve in single-variable calculus.
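This can be checked numerically. Here is a small illustration of my own (not taken from any AD library) using ${F(x, y) = (xy,\ x + y)}$, whose Jacobian function at ${(x, y)}$ maps ${(\Delta x, \Delta y)}$ to ${(y\,\Delta x + x\,\Delta y,\ \Delta x + \Delta y)}$, worked out by hand:

```python
# F: R^2 -> R^2, F(x, y) = (x*y, x + y).
def F(x, y):
    return (x * y, x + y)

# The Jacobian function of F at (x, y), applied to an increment (dx, dy).
# The underlying matrix is [[y, x], [1, 1]], derived by hand.
def J_F(x, y, dx, dy):
    return (y * dx + x * dy, dx + dy)

x, y = 3.0, 4.0
dx, dy = 1e-4, 2e-4

# The actual movement in output space...
y0 = F(x, y)
y1 = F(x + dx, y + dy)
actual = (y1[0] - y0[0], y1[1] - y0[1])

# ...versus the Jacobian's linear approximation of that movement.
approx = J_F(x, y, dx, dy)
print(actual)   # agrees with approx up to second-order terms
print(approx)
```

The two results differ only by a term of order ${\|\Delta\mathbf{x}\|^2}$, which shrinks rapidly as the increment gets smaller.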

An alternative interpretation of the function ${J_{F,\mathbf{x}}}$ is as a way of obtaining directional derivatives, which are a generalisation of partial derivatives. If ${\mathbf{y} = F(\mathbf{x})}$, then the partial derivative ${\partial y_i / \partial x_j}$ is the rate of change of the element ${y_i}$ as ${\mathbf{x}}$ moves in the direction of the ${j^{\mathrm{th}}}$ axis of the input space, ${\mathbb{R}^n}$. Similarly, given any direction in ${ \mathbb{R}^n}$, the directional derivative (in that direction) of ${\mathbf{y}}$ with respect to ${\mathbf{x}}$ is a vector whose ${i^{\mathrm{th}}}$ component is the rate of change of ${y_i}$ as ${\mathbf{x}}$ moves in the given direction.

The function ${J_{F,\mathbf{x}}}$ can be used to evaluate directional derivatives by passing in unit vectors. Given a unit vector ${\mathbf{u}}$ pointing in some direction, ${J_{F,\mathbf{x}}(\mathbf{u})}$ is the directional derivative of ${F}$ at ${\mathbf{x}}$, in the direction of ${\mathbf{u}}$. (In single-variable calculus, if you move along the tangent line until you’ve moved one unit along the x axis, the distance moved on the y axis is equal to the gradient.) As you’d expect, if ${\mathbf{u}}$ lies along one of the axes, the components of the directional derivative are the partial derivatives along that axis.
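As a quick illustration (again my own hand-worked example), take the scalar function ${f(x, y) = x^2 y}$, whose Jacobian function at ${(x, y)}$ maps ${(\Delta x, \Delta y)}$ to ${2xy\,\Delta x + x^2\,\Delta y}$:

```python
import math

# f: R^2 -> R, f(x, y) = x^2 * y. Its Jacobian function at (x, y),
# applied to an increment (dx, dy), is 2*x*y*dx + x^2*dy (hand-derived).
def J_f(x, y, dx, dy):
    return 2 * x * y * dx + x * x * dy

# A unit vector in the direction (1, 1):
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

# The directional derivative of f at (1, 2) in the direction of u:
d = J_f(1.0, 2.0, u[0], u[1])
print(d)  # 5/sqrt(2), about 3.5355

# Passing the axis-aligned unit vectors recovers the partial derivatives:
print(J_f(1.0, 2.0, 1.0, 0.0))  # df/dx = 2xy = 4.0
print(J_f(1.0, 2.0, 0.0, 1.0))  # df/dy = x^2 = 1.0
```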

So, how does this relate to AD? The approach taken in forward-mode AD is to take a piece of code that evaluates the function ${F}$ and augment it, so that in addition to evaluating ${F(\mathbf{x})}$ for a particular input vector ${\mathbf{x}}$, it also evaluates the Jacobian function ${J_{F,\mathbf{x}}(\Delta\mathbf{x})}$ for some requested increment vector ${\Delta\mathbf{x}}$, at the same time. The reasons for doing this are twofold:

1. Because it gives us the derivatives! By passing in some unit vector of interest as ${\Delta\mathbf{x}}$, we can programmatically obtain the directional derivative of ${F}$ at ${\mathbf{x}}$ in the direction of that vector. By passing in unit vectors along the axes we are interested in, for example, we can obtain the partial derivatives along these axes.
2. Because we can! It turns out that we can perform this augmentation of the code without significantly altering its structure.

1.1. Program Structure

To expand on point 2 above, imagine that a computer program is composed of functions like the one pictured below, which takes some inputs and produces some outputs.

Given a function like this, we can imagine a corresponding box for evaluating the Jacobian function ${J_{F,\mathbf{x}}}$:

The first thing to note is that the second box has twice as many inputs as the first, and they appear in blue/black pairs. The black inputs are intended to receive the same values as are given to the inputs of the original function, ${F}$, i.e. they represent ${\mathbf{x}}$. Clearly this is needed because each point in “input space” carries its own, distinct, Jacobian function (we differentiate at a particular point). The corresponding blue inputs are the components of the inputs to the Jacobian function, i.e. they represent ${\Delta\mathbf{x}}$, an increment vector in input space. The blue outputs are the result of evaluating the Jacobian function, i.e. an (approximated) increment in “output space”.

Now, say there’s a second function, ${G}$, to which some of the outputs of ${F}$ are passed, and suppose that we’ve also got the corresponding box for evaluating the Jacobian of ${G}$:

How can we hook things up so that we can evaluate the overall Jacobian of the whole program? It’s easy: for each output of ${F}$ that is passed to ${G}$, we just need to pass the corresponding output of ${J_F}$ to the corresponding input of ${J_G}$. We end up with one blue arrow in the lower diagram for each black arrow in the upper diagram:

The reason for the correspondence between the two diagrams is that the Jacobian function of the composition of two functions is simply the composition of the Jacobians of the individual functions, which is one way of stating the “chain rule” from differential calculus. In other words, if

${H(\mathbf{x}) = G(F(\mathbf{x}))}$

then

${J_{H, \mathbf{x}}(\Delta \mathbf{x}) = J_{G, F(\mathbf{x})}(J_{F,\mathbf{x}}(\Delta \mathbf{x}))}$

This is unsurprising, when you think about it: the Jacobians are linear approximations of the functions themselves, so you might expect them to compose in the same way. This parallel between composing functions and composing their corresponding Jacobians is what really makes AD tick, as it means that we don’t need to change the overall function-by-function, module-by-module structure of a computer program in order to make it evaluate Jacobians.[1]
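This composition can be seen in a few lines of code. The following sketch (my own illustrative example, with hand-derived Jacobians) uses ${F(x) = x^2}$ and ${G(u) = u^2}$, so that ${H(x) = x^4}$ and ${H'(x) = 4x^3}$:

```python
# F(x) = x^2 and G(u) = u^2, so H(x) = G(F(x)) = x^4.
def F(x):
    return x * x

def J_F(x, dx):   # Jacobian function of F at x, applied to the increment dx
    return 2 * x * dx

def J_G(u, du):   # Jacobian function of G at u, applied to the increment du
    return 2 * u * du

x, dx = 3.0, 1.0
# The chain rule as composition: apply J_F at x, then J_G at F(x).
dH = J_G(F(x), J_F(x, dx))
print(dH)  # 108.0, which is H'(3) = 4 * 3^3
```

Note that ${J_G}$ is evaluated at ${F(x)}$, not at ${x}$: this is exactly the point made above about the black inputs of ${J_G}$ needing to receive the outputs of ${F}$.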

As with ${F}$, the black inputs of ${J_G}$ are supposed to receive the same inputs as ${G}$ itself, i.e. two of the outputs of ${F}$. I won’t draw those arrows in, to avoid spaghetti, but it is important to note that they are implicitly there. They are needed because we need to know the point in ${G}$‘s input space at which the Jacobian of ${G}$ is being evaluated, not just the increment in input space (blue inputs) that we are interested in.

The fact that ${F}$‘s third output is not passed to ${G}$ poses no issues. Since the original function is unaffected by this value, so is its Jacobian linear approximation, so we simply leave the corresponding output in the lower diagram similarly unconnected. “Fan in” and “fan out” similarly pose no problems:

For the “fan out” case on the left, we can simply ignore either ${G}$ or ${H}$ and the connections leading to it, and consider the other in isolation. For the “fan in” case on the right, just observe that mathematically, ${F}$ and ${G}$ can be considered together as a single function with 4 inputs and 6 outputs. In both cases, we just end up drawing the same connections in the Jacobian diagram as exist in the original.

So, when faced with a more complex program consisting of a network of functions, like the one below, given corresponding Jacobian functions it’s clear how we can proceed. As we evaluate the program itself, whenever we evaluate a function, we also evaluate the corresponding Jacobian, passing in both the parameters we give to the original function and the corresponding blue outputs from the Jacobians we have evaluated in previous steps.

(Here, any inputs with no incoming arrows should be seen as overall inputs to the whole program, and any outputs without arrows should be seen as overall outputs. Every function has 2 inputs and 3 outputs because I’m lazy.)

It’s important to note that the mathematical “functions” above and their corresponding Jacobians aren’t necessarily “functions” in the computing sense. The above picture is a mental model of the computer program, and it can map onto the reality in a variety of different ways (and often in different ways for different parts of the program). For example, one such “function” might be a single step in an iterative loop, or a single line of code performing a multiplication operation. Similarly, the “passing” of data between these “functions” might correspond to reading/writing memory locations, or even networks or external storage, rather than passing parameters to subroutines.

Equally importantly, this mental model has a “nesting” behaviour. You might picture a network like the one above representing the program at a particular level of complexity, and then “zoom in” to a particular function to reveal that it is composed of sub-functions. The whole function’s Jacobian can be evaluated based on the Jacobians of the individual sub-functions.

What is needed in order for forward-mode AD to work is a means of evaluating Jacobians at some level of detail. We might choose, in one area of code, to implement individual Jacobian functions in a very fine-grained way, and in another area to implement the Jacobian for a relatively large area of code as a single unit.

1.2. Operator Overloading

One common way to implement AD is by using the “overloading” facility provided by some programming languages, which allows calls to functions of the same name (or to operators such as + and *) to be routed to different implementations based on the types of the arguments.

Going back to our original diagram:

It’s possible to collapse this down and have a single function that evaluates both ${F}$ and ${J_F}$ simultaneously:

Now each black output is paired with its corresponding blue Jacobian output, which simplifies things somewhat. In this way, you could draw the 6 node “whole program” diagram above as a single network, with each connection propagating both the original function result and the Jacobian result that goes with it. (The reason I didn’t actually do that earlier is that I didn’t want to give the impression that you are required to use this approach. There’s no reason the code to evaluate the Jacobian can’t be separate from the original function, and there are reasons why you might deliberately choose to have this separation in particular applications. However, the approach of bundling ${F}$ and ${J_F}$ together certainly does work and it does rather simplify the diagrams.)

The operator overloading approach uses exactly this kind of bundling. It also makes use of a “fine-grained” approach to Jacobians, providing functions to evaluate the Jacobians only for the most basic operations, such as arithmetic operations and elementary maths functions like ln(), cos(), etc.

To allow Jacobians to be evaluated with minimal changes to the code, an augmented floating point type is provided, which is just a pair of floating point numbers. This type serves two purposes:

1. It allows a value and the corresponding increment (a black square and the corresponding blue square) to be carried in a single variable.
2. It allows calls to basic operations to be automatically routed to the augmented versions of their functions by means of overloading.

In languages that support it, code can often be switched over to use overloading-based AD simply by replacing all uses of the basic floating point type with the augmented type.[2]

Below is an extremely simplistic implementation of forward-mode AD by means of operator overloading in Python, purely for the purposes of illustration. I’m using Python here for convenience and simplicity, but I should give the caveat that you would be very unlikely to implement AD this way in the real world in a language like Python. The reason for this is that Python is dynamically typed, so the resolution of the calls to things like __add__ is done at run time. What would, without the AD, be a simple addition of two Python floats involves, in the code below, a blizzard of dictionary lookups and function calls. The resulting slowdown is likely to be extremely severe. In a language like C++, this resolution can be done at compile time, completely avoiding this runtime cost.

```python
class ADForwardFloat(object):
    def __init__(self, val, delta):
        # The "actual" value (black square)
        self.val = val
        # The corresponding increment (blue square)
        self.delta = delta

    # Only implementing addition and multiplication here, and assuming
    # that both operands are always ADForwardFloats (no mixing AD and
    # ordinary floats). The formulae can be checked by working out the
    # Jacobian matrix of each function on paper.
    def __add__(self, other):
        # If F(x,y) = x + y, then J_F,x,y(delta_x, delta_y) = delta_x + delta_y
        return ADForwardFloat(self.val + other.val,
                              self.delta + other.delta)

    def __mul__(self, other):
        # If F(x,y) = x * y, then J_F,x,y(delta_x, delta_y) = y*delta_x + x*delta_y
        return ADForwardFloat(self.val * other.val,
                              other.val * self.delta + self.val * other.delta)

# Now let's implement a function of three variables: x^2 + xy + xz
def func1(x, y, z):
    return x * x + x * y + x * z

# Invoking the function with "ordinary" floats just gives us the result
print(func1(3.0, 4.0, 5.0))  # prints 36.0

# Now let's get the partial derivative of func1 with respect to x. To
# do this, we need to pass in a unit vector pointing along the x axis
# as the increment to evaluate the Jacobian against, so we pass in 1
# when creating the ADForwardFloat for x, and 0 for the others.
x = ADForwardFloat(3.0, 1.0)
y = ADForwardFloat(4.0, 0.0)
z = ADForwardFloat(5.0, 0.0)

result = func1(x, y, z)
print(result.val)    # prints 36.0
print(result.delta)  # prints 15.0, which is 2x+y+z
```


1.3. Other Implementation Options

It’s a mistake to think of AD as “an operator overloading trick”. Operator overloading is just one way of evaluating Jacobians, and it’s not always the best choice. There are at least two alternatives:

1. Programmatically apply a transformation to the code itself, automatically generating code that can evaluate Jacobians.
2. Hand-write the Jacobian evaluation code.

The latter approach may sound like a pain, but it can often be the most practical. Different methods can, of course, be used to evaluate Jacobians in different sections of code within the same program, with simple function composition used to combine these into the overall Jacobian.
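To illustrate the hand-written option, here is what a manually written forward-mode derivative of the earlier example function might look like (my own sketch, not generated by any particular tool):

```python
# The original function: f(x, y, z) = x^2 + x*y + x*z.
def func1(x, y, z):
    return x * x + x * y + x * z

# Hand-written "tangent" code: evaluates both the function value and
# its Jacobian applied to the increment (dx, dy, dz) in a single pass.
# The derivative expression (2x + y + z)*dx + x*dy + x*dz was worked
# out on paper from the partial derivatives of func1.
def func1_tangent(x, y, z, dx, dy, dz):
    val = x * x + x * y + x * z
    delta = (2 * x + y + z) * dx + x * dy + x * dz
    return val, delta

# Passing a unit vector along the x axis gives the partial w.r.t. x:
val, ddx = func1_tangent(3.0, 4.0, 5.0, 1.0, 0.0, 0.0)
print(val)  # 36.0
print(ddx)  # 15.0, the partial derivative with respect to x
```

This produces exactly the same results as the overloading version, but without any runtime dispatch; the price is that the derivative code must be kept in sync with the original by hand.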

## Footnotes

1. A lot of explanations of AD describe it in terms of Jacobian matrices rather than Jacobian functions, and therefore end up writing out equations and talking about the associativity of matrix multiplication. This seems entirely pointless, as it just obscures what’s going on. Forward-mode AD is about the creation and composition of Jacobian functions. Matrices and matrix multiplications are just one way of representing/implementing these linear functions and the act of composing them, and they’re not even used all that frequently in real world forward-mode AD. Composition, for example, is more commonly achieved by simply passing the result of one function to the input of another than by performing a matrix multiplication. There’s really no benefit in using the matrix representation to explain the underlying maths.

2. Of course, in real life things aren’t so simple. Operator overloading is far from a magic wand that can be waved over existing code, and there are many reasons you might choose not to use it in a given application of AD.