Automatic Derivatives

We will now consider automatic differentiation. It is a technique that can compute exact derivatives, fast, while requiring about the same effort from the user as is needed to use numerical differentiation.

Don’t believe me? Well here goes. The following code fragment implements an automatically differentiated CostFunction for Rat43.

struct Rat43CostFunctor {
  Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {}

  template <typename T>
  bool operator()(const T* parameters, T* residuals) const {
    const T b1 = parameters[0];
    const T b2 = parameters[1];
    const T b3 = parameters[2];
    const T b4 = parameters[3];
    residuals[0] = b1 * pow(1.0 + exp(b2 -  b3 * x_), -1.0 / b4) - y_;
    return true;
  }

  private:
    const double x_;
    const double y_;
};


auto* cost_function = new AutoDiffCostFunction<Rat43CostFunctor, 1, 4>(x, y);

Notice that compared to numeric differentiation, the only difference when defining the functor for use with automatic differentiation is the signature of the operator().

In the case of numeric differentiation it was

bool operator()(const double* parameters, double* residuals) const;

and for automatic differentiation it is a templated function of the form

template <typename T> bool operator()(const T* parameters, T* residuals) const;

So what does this small change buy us? The following table compares the time it takes to evaluate the residual and the Jacobian for Rat43 using various methods.

CostFunction               Time (ns)
-----------------------    ---------
Rat43Analytic                    255
Rat43AnalyticOptimized            92
Rat43NumericDiffForward          262
Rat43NumericDiffCentral          517
Rat43NumericDiffRidders         3760
Rat43AutomaticDiff               129

We can get exact derivatives using automatic differentiation (Rat43AutomaticDiff) with about the same effort as is required to write the code for numeric differentiation, but only \(40\%\) slower than hand-optimized analytical derivatives.

So how does it work? For this we will have to learn about Dual Numbers and Jets.

Dual Numbers & Jets

Note

Reading this and the next section on implementing Jets is not necessary to use automatic differentiation in Ceres Solver. But knowing the basics of how Jets work is useful when debugging and reasoning about the performance of automatic differentiation.

Dual numbers are an extension of the real numbers analogous to complex numbers: whereas complex numbers augment the reals by introducing an imaginary unit \(\iota\) such that \(\iota^2 = -1\), dual numbers introduce an infinitesimal unit \(\epsilon\) such that \(\epsilon^2 = 0\). A dual number \(a + v\epsilon\) has two components, the real component \(a\) and the infinitesimal component \(v\).

Surprisingly, this simple change leads to a convenient method for computing exact derivatives without needing to manipulate complicated symbolic expressions.

For example, consider the function

\[f(x) = x^2,\]

Then,

\[\begin{split}\begin{align} f(10 + \epsilon) &= (10 + \epsilon)^2\\ &= 100 + 20 \epsilon + \epsilon^2\\ &= 100 + 20 \epsilon \end{align}\end{split}\]

Observe that the coefficient of \(\epsilon\) is \(Df(10) = 20\). Indeed this generalizes to functions which are not polynomial. Consider an arbitrary differentiable function \(f(x)\). Then we can evaluate \(f(x + \epsilon)\) by considering the Taylor expansion of \(f\) near \(x\), which gives us the infinite series

\[\begin{split}\begin{align} f(x + \epsilon) &= f(x) + Df(x) \epsilon + D^2f(x) \frac{\epsilon^2}{2} + D^3f(x) \frac{\epsilon^3}{6} + \cdots\\ f(x + \epsilon) &= f(x) + Df(x) \epsilon \end{align}\end{split}\]

Here we are using the fact that \(\epsilon^2 = 0\).
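As another example, consider \(f(x) = e^x\). The same Taylor argument gives

\[e^{x + \epsilon} = e^x + e^x \epsilon,\]

so the coefficient of \(\epsilon\) is \(Df(x) = e^x\); this is exactly the rule implemented by the exp overload in the Jet code further below.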

A Jet is an \(n\)-dimensional dual number, where we augment the real numbers with \(n\) infinitesimal units \(\epsilon_i,\ i=1,...,n\) with the property that \(\forall i, j\ :\epsilon_i\epsilon_j = 0\). Then a Jet consists of a real part \(a\) and an \(n\)-dimensional infinitesimal part \(\mathbf{v}\), i.e.,

\[x = a + \sum_j v_{j} \epsilon_j\]

The summation notation gets tedious, so we will also just write

\[x = a + \mathbf{v}.\]

where the \(\epsilon_i\)’s are implicit. Then, using the same Taylor series expansion used above, we can see that:

\[f(a + \mathbf{v}) = f(a) + Df(a) \mathbf{v}.\]

Similarly for a multivariate function \(f:\mathbb{R}^{n}\rightarrow \mathbb{R}^m\), evaluated on \(x_i = a_i + \mathbf{v}_i,\ \forall i = 1,...,n\):

\[f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \mathbf{v}_i\]

So if each \(\mathbf{v}_i = e_i\) were the \(i^{\text{th}}\) standard basis vector, then the above expression would simplify to

\[f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \epsilon_i\]

and we can extract the coordinates of the Jacobian by inspecting the coefficients of \(\epsilon_i\).
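As a small worked example with arbitrarily chosen values, consider \(f(x_1, x_2) = x_1 x_2\) evaluated on the Jets \(x_1 = 3 + \epsilon_1,\ x_2 = 5 + \epsilon_2\):

\[\begin{split}\begin{align} f(3 + \epsilon_1, 5 + \epsilon_2) &= (3 + \epsilon_1)(5 + \epsilon_2)\\ &= 15 + 5\epsilon_1 + 3\epsilon_2 + \epsilon_1 \epsilon_2\\ &= 15 + 5\epsilon_1 + 3\epsilon_2 \end{align}\end{split}\]

Reading off the coefficients of \(\epsilon_1\) and \(\epsilon_2\) gives \(D_1 f = 5 = x_2\) and \(D_2 f = 3 = x_1\), which is the Jacobian of \(f\) at \((3, 5)\).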

Implementing Jets

In order for the above to work in practice, we will need the ability to evaluate an arbitrary function \(f\) not just on real numbers but also on dual numbers; however, one does not usually evaluate functions by evaluating their Taylor expansions.

This is where C++ templates and operator overloading come into play. The following code fragment has a simple implementation of a Jet and some operators/functions that operate on them.

template<int N> struct Jet {
  // Constructors so that the operators below can write Jet<N>(a, v).
  Jet() : a(0.0) { v.setZero(); }
  Jet(double a, const Eigen::Matrix<double, 1, N>& v) : a(a), v(v) {}

  double a;
  Eigen::Matrix<double, 1, N> v;
};

template<int N> Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a + g.a, f.v + g.v);
}

template<int N> Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a - g.a, f.v - g.v);
}

template<int N> Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a * g.a, f.a * g.v + f.v * g.a);
}

template<int N> Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a));
}

template <int N> Jet<N> exp(const Jet<N>& f) {
  return Jet<N>(exp(f.a), exp(f.a) * f.v);
}

// This is a simple implementation for illustration purposes, the
// actual implementation of pow requires careful handling of a number
// of corner cases.
template <int N> Jet<N> pow(const Jet<N>& f, const Jet<N>& g) {
  return Jet<N>(pow(f.a, g.a),
                g.a * pow(f.a, g.a - 1.0) * f.v +
                pow(f.a, g.a) * log(f.a) * g.v);
}

With these overloaded functions in hand, we can now call Rat43CostFunctor with an array of Jets instead of doubles. Putting that together with appropriately initialized Jets allows us to compute the Jacobian as follows:

class Rat43Automatic : public ceres::SizedCostFunction<1,4> {
 public:
  Rat43Automatic(const Rat43CostFunctor* functor) : functor_(functor) {}
  virtual ~Rat43Automatic() {}
  virtual bool Evaluate(double const* const* parameters,
                        double* residuals,
                        double** jacobians) const {
    // Just evaluate the residuals if Jacobians are not required.
    if (!jacobians) return (*functor_)(parameters[0], residuals);

    // Initialize the Jets
    ceres::Jet<4> jets[4];
    for (int i = 0; i < 4; ++i) {
      jets[i].a = parameters[0][i];
      jets[i].v.setZero();
      jets[i].v[i] = 1.0;
    }

    ceres::Jet<4> result;
    (*functor_)(jets, &result);

    // Copy the values out of the Jet.
    residuals[0] = result.a;
    for (int i = 0; i < 4; ++i) {
      jacobians[0][i] = result.v[i];
    }
    return true;
  }

 private:
  std::unique_ptr<const Rat43CostFunctor> functor_;
};

Indeed, this is essentially how AutoDiffCostFunction works.
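To make the mechanics concrete, here is a small usage sketch; the observation (x, y) and the parameter values below are made up purely for illustration:

// Evaluate the residual and the Jacobian of Rat43Automatic at one point.
Rat43Automatic cost_function(new Rat43CostFunctor(9.0, 16.0));

double b[4] = {100.0, 10.0, 1.0, 1.0};
const double* parameter_blocks[1] = {b};
double residual;
double jacobian[4];
double* jacobian_blocks[1] = {jacobian};

cost_function.Evaluate(parameter_blocks, &residual, jacobian_blocks);
// residual now holds the value of the residual at b, and jacobian[i]
// holds the derivative of the residual with respect to b[i].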

Pitfalls

Automatic differentiation frees the user from the burden of computing and reasoning about the symbolic expressions for the Jacobians, but this freedom comes at a cost. For example, consider the following simple functor:

struct Functor {
  template <typename T> bool operator()(const T* x, T* residual) const {
    residual[0] = 1.0 - sqrt(x[0] * x[0] + x[1] * x[1]);
    return true;
  }
};

Looking at the code for the residual computation, one does not foresee any problems. However, if we look at the analytical expressions for the Jacobian:

\[\begin{split} y &= 1 - \sqrt{x_0^2 + x_1^2}\\ D_1y &= -\frac{x_0}{\sqrt{x_0^2 + x_1^2}},\ D_2y = -\frac{x_1}{\sqrt{x_0^2 + x_1^2}}\end{split}\]

we find that it is an indeterminate form at \(x_0 = 0, x_1 = 0\).

There is no single solution to this problem. In some cases one needs to reason explicitly about the points where indeterminacy may occur and use alternate expressions using L’Hôpital’s rule (see for example some of the conversion routines in rotation.h). In other cases, one may need to regularize the expressions to eliminate these points.
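For example, one purely illustrative way of handling the indeterminacy in the functor above is to treat the origin as a special case, so that the Jacobian does not become NaN there:

struct Functor {
  template <typename T> bool operator()(const T* x, T* residual) const {
    // Illustrative workaround: avoid differentiating sqrt at zero, where
    // the derivative is indeterminate. At the origin the residual is 1,
    // and a zero derivative is (arbitrarily) reported instead of NaN.
    const T squared_norm = x[0] * x[0] + x[1] * x[1];
    if (squared_norm > T(0.0)) {
      residual[0] = T(1.0) - sqrt(squared_norm);
    } else {
      residual[0] = T(1.0);
    }
    return true;
  }
};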