# Introduction to non-linear optimization for Image Registration

The problem of image registration consists of aligning two images taken from different perspectives in order to obtain a larger visual representation of some large structure. For instance, one can stitch several overlapping aerial pictures to obtain the map of an entire city. Another example is the creation of panoramic pictures, which are obtained by stitching pictures taken while rotating the camera. The registration problem is also present in the medical field, where imagery must be registered in order to accurately perform diagnostics, as well as to detect changes during several types of treatment.

With such a broad spectrum of applications, one can expect the registration problem to be presented with countless particularities. The signals being registered can have an arbitrary amount of dimensions; they might not even contain the same representation of the structure being mapped in the first place. The data from an MRI (representing soft tissues) can be registered with CT scans (more sensitive to high density objects); one may also attempt to register satellite imagery with cartographic maps.

The objective of this article is to gently introduce how the registration problem can be addressed with a non-linear optimization framework, from the Taylor series until the formulation of the Homography Jacobian and Hessian. In this article, we will focus on rigid registration, although the concepts herein can also be applied to the non-rigid case.

## Mathematical formulation

The image registration problem consists of finding the coefficients $\mathbf{p}_f=\{p_0, p_1, p_2, \ldots\}$ of the coordinate transforming function $w(\mathbf{x}, \mathbf{p})$ such that

$\mathbf{p}_f=\displaystyle \underset{\mathbf{p}}{\mathop{\mathrm{arg\,max}}} \; \underset{\mathbf{x} \, \in \, \mathbf{I} \cap \mathbf{I^*}}\sum \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p})), \mathbf{I^*}(\mathbf{x})). \ \ \ \ \ (I)$

Here is the explanation for each element in this formula:

• $\mathbf{I}$ and $\mathbf{I^*}$ are the images being registered.
• $\mathbf{x}$ has as many dimensions as the input images. When dealing with 2D images, $\mathbf{x}$ is a 2D vector $\langle x,y \rangle$.
• The coordinate transforming function $w$ is just a transformation for a coordinate $\mathbf{x}$. Some examples of such transforms are translation, scale, rotation, shearing, etc. Although it can be arbitrarily complex, we shall focus on the Homography transform later on.
• The vector $\mathbf{p}$ corresponds to the parameters of the function $w$, and therefore contains as many elements as the number of parameters that $w$ requires. For instance, a translation contains two parameters $\langle x,y\rangle$, while a rotation might be represented by a scalar $\alpha$ and a rotation axis $\mathbf{v}$. We shall see another example when formulating the Homography transform later.
• While the symbols $\mathbf{I}$ and $\mathbf{I^*}$ represent two images, the expressions $\mathbf{I^*}(\mathbf{x})$ and $\mathbf{I}(w(\mathbf{x}, \mathbf{p}))$ represent the evaluation of the pixel intensities at coordinate $\mathbf{x}$ of image $\mathbf{I^*}$ and coordinate $w(\mathbf{x}, \mathbf{p})$ of image $\mathbf{I}$, respectively.
• $\mathbf{I} \cap \mathbf{I^*}$ is a shorthand for the overlap region between the two images under the current transform. The summation therefore accounts for all coordinates $\mathbf{x}$ such that $\mathbf{x}$ lies within $\mathbf{I^*}$ and $w(\mathbf{x}, \mathbf{p})$ lies within $\mathbf{I}$.
• $\Psi$ is a similarity function. It is high when its parameters are similar, and low when its parameters are different. There are many ways to measure similarity between images, such as sum of absolute differences (SAD), sum of squared differences (SSD), normalized cross correlation (NCC), etc. Later on, we shall focus on the Mutual Information metric.
• Finally, the vector $\mathbf{p}_f$ corresponds to the parameters of $w$ that maximize the similarity between images $\mathbf{I}$ and $\mathbf{I^*}$. In other words, it defines the transformation to be applied to image $\mathbf{I}$ that will maximize its alignment to image $\mathbf{I^*}$.

Note that the image similarity function is expected to have a least upper bound. In other words, there is a maximum theoretical value that the similarity function may return; the images simply can’t be more similar than that. It may be the case that many different transform parameters make the similarity function return that value, but no transformation parameters should get you beyond that level of similarity.
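Before moving on, it may help to see equation $(I)$ in code. The sketch below evaluates the objective for a single candidate $\mathbf{p}$ in the simplest setting I could think of: a pure integer translation warp $w(\mathbf{x}, \mathbf{p}) = \mathbf{x} + \mathbf{p}$ and a negated sum of squared differences as $\Psi$. The function name and all implementation choices are illustrative, not taken from any library:

```python
import numpy as np

def objective(I, I_star, p):
    """Evaluate equation (I) for one candidate p, assuming an integer
    translation warp w(x, p) = x + p and Psi = negated squared
    difference, so that higher values mean better alignment."""
    dy, dx = p
    h, w = I_star.shape
    total = 0.0
    count = 0
    for y in range(h):
        for x in range(w):
            wy, wx = y + dy, x + dx            # w(x, p)
            # x always lies within I*; w(x, p) must also lie within I,
            # which implements the intersection in equation (I)
            if 0 <= wy < I.shape[0] and 0 <= wx < I.shape[1]:
                diff = float(I[wy, wx]) - float(I_star[y, x])
                total -= diff * diff           # Psi(I(w(x, p)), I*(x))
                count += 1
    # Averaged over the overlap so that small overlaps are not favored
    return total / max(count, 1)
```

Maximizing this value over $\mathbf{p}$ — by brute force, or by the iterative methods described next — is what solves the registration problem.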

## Newton Method for non-linear optimization

In order to understand the Newton algorithm for non-linear optimization, consider a 1D example function $f$ containing several minima and maxima.

As our goal is to reach the global maximum of an arbitrary similarity function $f$, it seems prudent to first have a practical tool for detecting maxima in the first place. One such tool consists of looking for points whose slope is zero.

Since the slope of $f$ is given by its derivative with respect to the parameters being optimized, $df/dx$, candidate maxima can be found among the solutions of the following equation:

$\frac{df(x)}{dx}=f'(x)=0$

In many practical applications, including the image registration problem, the solution to this system can’t be found analytically. This is where we use Newton’s method for finding roots of differentiable functions.

The idea of Newton’s method for solving $g(x)=0$ is to iteratively approach the solution $x$, starting at any initial guess $x_0$ and skipping to a better point $x_1$, which is given by the intersection between the line that is tangent to $g(x_0)$ and the horizontal axis, as shown in the figure below.

Since this line is tangent to $g(x_0)$, its slope must be $g'(x_0)$. As a line, its equation is given by $y = g'(x_0)x + b$. Since it evaluates to $g(x_0)$ at $x=x_0$, then $b=g(x_0)-g'(x_0)x_0$. Finally, since $x_1$ is located at the intersection between this line and $y=0$, we can find $x_1$:

\begin{aligned} 0 &=g'(x_0)x_1+g(x_0)-g'(x_0)x_0, \\ x_1&=\frac{g'(x_0)x_0-g(x_0)}{g'(x_0)}=x_0-\frac{g(x_0)}{g'(x_0)}. \end{aligned}

This very same procedure can be repeated to compute an $x_2$, by using $x_1$ as its initial guess, and an $x_3$ after that. Generalizing, each new point $x_{i+1}$ in the iterative scheme can be determined by

\begin{aligned} x_{i+1}=x_i-\frac{g(x_i)}{g'(x_i)}, \end{aligned}

until some convergence criterion is satisfied (for instance, until $|g(x_i)| < \epsilon$).
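As a sanity check, the iteration above fits in a few lines of code. The sketch below is a minimal, illustrative Newton root finder (the function name, tolerance, and iteration cap are arbitrary choices of mine):

```python
def newton_root(g, g_prime, x0, eps=1e-10, max_iter=50):
    """Newton's method for g(x) = 0: iterate x_{i+1} = x_i - g(x_i)/g'(x_i),
    starting at the initial guess x0, until |g(x_i)| < eps."""
    x = x0
    for _ in range(max_iter):
        if abs(g(x)) < eps:
            break
        x = x - g(x) / g_prime(x)
    return x

# Example: the positive root of g(x) = x^2 - 2 is sqrt(2)
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Starting from $x_0=1$, the iterates $1.5$, $1.41\overline{6}$, … converge to $\sqrt{2}$ in a handful of steps.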

Since $g(x)=f'(x)$, we can finally reach the formulation of each iteration of Newton’s method for finding a stationary point of $f(x)$, starting at an arbitrary initial guess $x_0$:

\begin{aligned} x_{i+1}=x_i-\frac{f'(x_i)}{f''(x_i)}. \ \ \ \ \ (II) \end{aligned}

There are a few caveats to this method, though. Notice how the example function $f$ contains both minima and maxima, and the slope of $f$ is zero at all of them. The Newton method may converge to any of these stationary points, so upon convergence you must check whether you’ve reached a maximum or a minimum by verifying the sign of $f''(x_f)$, where $x_f$ is the convergence point. If $f''(x_f)>0$, then you’ve reached a local minimum; if $f''(x_f)<0$, then you have found a local maximum. If, however, $f''(x_f)=0$, then the test is inconclusive and $x_f$ might be neither a maximum nor a minimum! For instance, consider the function $f(x)=x^3$ at $x=0$: the derivative is zero there, yet the point is no extremum at all, which makes it entirely uninteresting for our purpose.

## Taylor Series: Approximating functions by polynomials

In the next section, we will talk about the geometrical interpretation of the Newton method. Before we do so, let’s back up a little and recall the concept of Taylor Series. Assuming that the function $f$ is infinitely differentiable at $x=0$, we can write it as an infinite polynomial:

$f(x) = a_0 + a_1x + a_2x^2 + \ldots$

The coefficients of such a polynomial can be computed by taking the derivatives of $f$ and evaluating them at $x=0$:

\begin{aligned} a_0 &= f(0), \\ f'(x) &= a_1 + 2a_2x + 3a_3x^2+\ldots \implies f'(0) = a_1, \\ f''(x) &= 2a_2 + 6a_3x + \ldots \implies f''(0) = 2a_2, \\ f^{(n)}(x) &= n!\,a_n + \ldots \implies f^{(n)}(0) = n!\,a_n. \end{aligned}

Therefore, we have

\begin{aligned}f(x) &= f(0) + f'(0)x + f''(0)\frac{x^2}{2!} + f'''(0)\frac{x^3}{3!} + \ldots, \\ f(x) &= \sum_{i=0}^\infty f^{(i)}(0)\frac{x^i}{i!}. \end{aligned}

Let’s take a moment to contemplate the implications of this concept. It is clear that the Taylor series will represent our arbitrary function $f$ when all of its infinite terms are present. Nevertheless, it is interesting to see what happens in the case when only its first $n$ terms are used to describe an approximation of $f$ via a finite polynomial $g$. When $n=1$, $g(x) = f(0)$ and $g$ is only a horizontal line that crosses $f$ at $x=0$.

When $n=2$, however, something interesting starts to happen. Since in this case $g(x) = f(0) + f'(0)x$, the derivative $g'(x)$ has the same value as the derivative $f'(x)$ at $x=0$. That is, $g$ is the line tangent to $f$ at $x=0$.

By using this very same reasoning, we can say that every additional term included in our Taylor series will make $g$ smoothly approximate $f$ at $x=0$ up to one additional derivative. In other words, we can find a polynomial of arbitrary degree that smoothly approximates the original function $f$ at $x=0$.

Naturally, the real power of the Taylor series comes when it is adapted to perform such approximation at any arbitrary location of the $x$ axis, not just the origin. To achieve this adaptation, we need a polynomial $\tilde{g}$ such that

\begin{aligned} \tilde{g}(k) &= f(k), \\ \tilde{g}'(k) &= f'(k), \\ \tilde{g}''(k) &= f''(k), \\ &\ \ \vdots \\ \tilde{g}^{(n)}(k) &= f^{(n)}(k). \ \ \ \ \ (III) \end{aligned}

By making $\tilde{g}(x) = a_0 + a_1(x-k) + a_2(x-k)^2 + \ldots$ and exploiting the relationships in $(III)$, we can find $a_n$ by computing the $n$-th derivative of $\tilde{g}$ as follows:

\begin{aligned} \tilde{g}(k) &= a_0 & \implies a_0 &= f(k), \\ \tilde{g}'(k) &= a_1 & \implies a_1 &= f'(k), \\ \tilde{g}''(k) &= 2a_2 & \implies a_2 &= \frac{f''(k)}{2}, \\ &\ \ \vdots \\ \tilde{g}^{(n)}(k) &= n!\,a_n & \implies a_n &= \frac{f^{(n)}(k)}{n!}. \end{aligned}

Finally, we have that the polynomial of degree $n$ that smoothly approximates $f$ at $x=k$ is given by

\begin{aligned} \tilde{g}(x) = \sum_{i=0}^n f^{(i)}(k) \frac{(x-k)^i}{i!}. \end{aligned}
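As a quick numerical illustration of this formula, the sketch below evaluates the degree-$n$ Taylor polynomial around $x=k$ given a list of derivatives of $f$ at $k$. The example uses $f(x)=e^x$, whose derivatives at $k$ are all $e^k$ (the helper name `taylor` is, of course, just an illustration):

```python
from math import exp, factorial

def taylor(derivs_at_k, k, x):
    """Evaluate the Taylor polynomial sum_i f^(i)(k) (x-k)^i / i!,
    where derivs_at_k = [f(k), f'(k), f''(k), ...]."""
    return sum(d * (x - k) ** i / factorial(i)
               for i, d in enumerate(derivs_at_k))

# For f(x) = e^x, every derivative equals e^x, so f^(i)(k) = e^k
k = 1.0
derivs = [exp(k)] * 8            # degree-7 approximation around x = 1
approx = taylor(derivs, k, 1.5)  # should be close to exp(1.5)
```

With eight terms, the approximation at $x=1.5$ already agrees with $e^{1.5}$ to better than $10^{-5}$, even though the expansion point is $k=1$.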

## Geometrical Interpretation of the Newton method

Once again, let $f$ be a function whose minimum (or maximum) we want to find. Also, let $x_0$ be the initial guess that will be used by the Newton’s method.

Each iteration of the Newton’s method corresponds to finding the second degree Taylor polynomial $\tilde{g}$ that approximates $f$ at $x=x_0$, and taking the stationary point of that parabola as the next iterate $x_1$. Mathematically,

\begin{aligned} \tilde{g}(x) =& \ f(x_0) + f'(x_0)(x-x_0) + f''(x_0)\frac{(x-x_0)^2}{2}, \\ \tilde{g}'(x_1) =& \ 0 \implies f'(x_0) + f''(x_0)(x_1-x_0)=0, \\ \implies & x_1=x_0-\frac{f'(x_0)}{f''(x_0)}, \end{aligned}

which generalizes exactly as shown in equation $(II)$.
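To make this geometric picture explicit, the sketch below writes the degree-2 Taylor parabola in standard form $ax^2+bx+c$ and returns its vertex; by construction this is exactly the Newton step of equation $(II)$. The function is illustrative only:

```python
def newton_step_via_parabola(f_val, f_p, f_pp, x0):
    """One Newton iteration, seen geometrically: given f(x0), f'(x0) and
    f''(x0), build the degree-2 Taylor parabola of f at x0 and return
    its vertex (its stationary point)."""
    # Parabola: g(x) = f_val + f_p*(x - x0) + f_pp*(x - x0)^2 / 2.
    # Expanded into a*x^2 + b*x + c (the constant c does not move
    # the vertex), its vertex lies at x = -b / (2a):
    a = f_pp / 2.0
    b = f_p - f_pp * x0
    return -b / (2.0 * a)
```

Simplifying $-b/(2a) = (f''(x_0)x_0 - f'(x_0))/f''(x_0)$ recovers $x_0 - f'(x_0)/f''(x_0)$, i.e. the vertex of the fitted parabola and the Newton update are one and the same.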


## Formulations to Iterative Image Registration

There are several ways to adapt the equation $(I)$ for use with Newton’s method. Note that, due to the iterative nature of this method, we are always seeking to determine an incremental change to the parameters being optimized $\mathbf{p}$, which will bring us closer to the maximum of the image similarity function. The calculation of such incremental change to $\mathbf{p}$ involves several extremely computationally expensive tasks, such as calculation of image gradients, Jacobian and Hessian of the transform function (evaluated at each pixel!), and so forth.

Therefore, we need to formulate our iterative scheme carefully to avoid as much computation as possible. Below, I briefly introduce the most common ways to perform this task.

### Forward additive

The forward additive formulation is possibly the simplest approach, and combines the classical formulation of Newton’s method with the most basic Taylor series formulation, which approximates the function at the origin (that is, at $x=0$). The idea here is that, at the $i$-th iteration, the origin of the similarity function is re-centered at $\mathbf{p}_i$:

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i+\Delta \mathbf{p})), \mathbf{I}^*(\mathbf{x})). \end{aligned}

Note that the warp function is now expressed as $w(\mathbf{x}, \mathbf{p}_i + \Delta \mathbf{p})$. Here, $\Delta \mathbf{p}$ is the variable, and $\mathbf{p}_i$ is a constant. Therefore, the Taylor series of the image similarity function expanded at $\Delta \mathbf{p}=0$ will approximate it at $\mathbf{p}_i$.

At each iteration, we will find a $\Delta \mathbf{p}$ that brings us closer to the desired maximum, and then we find $\mathbf{p}_{i+1} \gets \mathbf{p}_i + \Delta \mathbf{p}$.

By using this formulation, we usually end up having to calculate some very convoluted Jacobian and Hessian matrices for the transform function $w$. The Forward compositional formulation was proposed to simplify this procedure a little bit.
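As an illustration of the forward additive scheme, the sketch below registers two hypothetical 1D signals under a translation-only warp $w(x, p) = x + p$, with $\Psi$ chosen as the negated squared difference. To keep the sketch short, the derivatives of $f$ with respect to $\Delta p$ are approximated by finite differences instead of deriving the analytic Jacobian and Hessian:

```python
import numpy as np

def forward_additive_1d(I, I_star, p0, iters=20):
    """Forward additive Newton iterations for a 1D translation warp
    w(x, p) = x + p and Psi = negated squared difference. Derivatives
    of the similarity f are taken numerically (illustration only)."""
    x = np.arange(len(I_star), dtype=float)
    xp = np.arange(len(I), dtype=float)

    def f(p):
        # I(w(x, p)) by linear interpolation, compared against I*(x)
        warped = np.interp(x + p, xp, I)
        return -float(np.sum((warped - I_star) ** 2))

    p = float(p0)
    h = 1e-3
    for _ in range(iters):
        # Numeric first and second derivatives of f at the current p
        f_p = (f(p + h) - f(p - h)) / (2.0 * h)
        f_pp = (f(p + h) - 2.0 * f(p) + f(p - h)) / (h * h)
        if abs(f_pp) < 1e-12:
            break
        dp = -f_p / f_pp        # Newton step, equation (II)
        p += dp                 # p_{i+1} <- p_i + Delta p
        if abs(dp) < 1e-8:
            break
    return p
```

Feeding it a smooth signal and a copy of that signal translated by 3 samples, and starting from $p_0=0$, the iterations recover $p_f \approx 3$.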

### Forward compositional

The idea of the forward compositional formulation is to use function composition to describe how to incrementally modify $\mathbf{p}_i$ in order to reach $\mathbf{p}_{i+1}$:

\begin{aligned} w(\mathbf{x}, \mathbf{p}_{i+1}) = w(w(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i). \end{aligned}

In order to be able to use this formulation, the image similarity function must be also reformulated:

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(w(\mathbf{x}, \Delta \mathbf{p}),\mathbf{p}_i)), \mathbf{I}^*(\mathbf{x})). \end{aligned}

The nice thing about this formulation is that when elaborating the first and second derivatives of the image similarity function $f$ (which are required for the Taylor series expansion, as described previously), you will eventually have to evaluate the Jacobian and Hessian matrices of $w(\mathbf{x}, \Delta \mathbf{p})$ at $\Delta \mathbf{p}=\mathbf{0}$. There are two good things about this: The first is that the analytical derivation of these matrices can be greatly simplified. Secondly, at least part of these matrices can be precomputed before running the iterations of Newton’s method.

All things considered, the forward compositional formulation may make our job a little easier, but the computational complexity of the optimization algorithm remains essentially unchanged. Depending on how you model your optimization framework, you can drastically decrease the computational complexity by using the Inverse compositional approach.

### Inverse compositional

The problem with the previous formulations is that, when computing the first and second derivatives of $f$ with respect to $\Delta \mathbf{p}$, we usually end up with formulas that depend on $\mathbf{p}_i$. Since $\mathbf{p}_i$ is updated at each iteration, we end up having to recompute all terms that depend on it as well.

One way to improve this situation is to modify the image similarity function so that the reference image $\mathbf{I}^*$ is the one warped by $\Delta \mathbf{p}$. Depending on the similarity function, this allows a large number of terms to be precomputed:

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i)), \mathbf{I}^*(w(\mathbf{x}, \Delta \mathbf{p}))). \end{aligned}

This adaptation requires the update step to also be reconsidered. Since $\Delta \mathbf{p}$ now describes a transform that brings $\mathbf{I}^*$ closer to $\mathbf{I}$, we need to adjust the update step in order to find $\mathbf{p}_{i+1}$ given $\mathbf{p}_i$ and $\Delta \mathbf{p}$.

This can be done as follows. We know that applying $\Delta \mathbf{p}$ to $\mathbf{I}^*$ and $\mathbf{p}_i$ to $\mathbf{I}$ must be equivalent to applying $\mathbf{0}$ to $\mathbf{I}^*$ and $\mathbf{p}_{i+1}$ to $\mathbf{I}$, that is,

\begin{aligned} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i)), \mathbf{I}^*(w(\mathbf{x},\Delta \mathbf{p}))) &= \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_{i+1})), \mathbf{I}^*(w(\mathbf{x}, \mathbf{0})))\\ &=\Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_{i+1})), \mathbf{I}^*(\mathbf{x})). \ \ \ \ \ (IV) \end{aligned}

To visualize this, compare the situation before and after a step of the Newton’s method. The point $\mathbf{p}_{i+1}$ corresponds to the optimum of the parabola that passes through $\mathbf{p}_i$ and can be reached via a step of $\Delta \mathbf{p}$. To determine how this step is applied, let’s replace $\mathbf{x}$ by $w^{-1}(\mathbf{x}, \Delta \mathbf{p})$ on the left side of equation $(IV)$:

\begin{aligned} & \ \Psi(\mathbf{I}(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i)), \mathbf{I}^*(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}),\Delta \mathbf{p}))) \\ =& \ \Psi(\mathbf{I}(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i)), \mathbf{I}^*(\mathbf{x})). \ \ \ \ \ (V) \end{aligned}

Comparing the result of equation $(V)$ with the right side of equation $(IV)$, we see that the update step can now be written as:

\begin{aligned} w(\mathbf{x}, \mathbf{p}_{i+1}) = w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i). \end{aligned}
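For concreteness: when the warp is a homography represented by a $3\times 3$ matrix acting on homogeneous coordinates, composition of warps becomes matrix multiplication, and the update step above reduces to a product with a matrix inverse. The sketch below assumes that matrix parameterization (the function name is illustrative):

```python
import numpy as np

def inverse_compositional_update(H_i, dH):
    """Inverse compositional update for a warp stored as a 3x3
    homography matrix: since w(x, p) = H x in homogeneous coordinates,
    w(x, p_{i+1}) = w(w^{-1}(x, Delta p), p_i) becomes
    H_{i+1} = H_i @ inv(Delta H)."""
    H_next = H_i @ np.linalg.inv(dH)
    return H_next / H_next[2, 2]    # fix the homogeneous scale
```

Note that when $\Delta \mathbf{p} = \mathbf{0}$ (i.e. $\Delta H$ is the identity), the update leaves $H_i$ unchanged, as expected of an identity warp.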

### Further reading

I’ve only barely scratched the surface on the topic of formulations to the image registration problem. Each of them has its own requirements in terms of what types of transform functions $w$ are supported. If these requirements are met, they are all proven to be equivalent. The curious reader is invited to read the paper Lucas-Kanade 20 Years On: A Unifying Framework, by S. Baker and I. Matthews.

At this point, some concepts evoked in that paper may still sound like black magic to the reader, especially to those who have not studied abstract algebra. For now, the most important ones are the requirements on the transform $w$:

• The forward compositional formulation requires the “set of warps to form a semigroup”. This means that the composition of transforms must be associative – $((w \circ w) \circ w) = (w \circ (w \circ w))$ – and they must obey the closure axiom, that is, the result of a composition of warps must also be a warp. Finally, although the paper doesn’t explicitly say so, there must also be an identity warp parameter $\mathbf{p}=\mathbf{0}$ with which the warp does not modify a point – that is, $w(\mathbf{x}, \mathbf{0}) = \mathbf{x}$.
• The inverse compositional formulation has all the requirements from the forward compositional formulation, as well as an additional requirement: The warp functions must also be invertible (and that inverse must also be a valid warp function). Obeying all these axioms is what makes it a group.

## Final thoughts

This was the first post of a series about how to solve the image tracking problem via a non-linear optimization framework. In this post, I introduced the general idea of the iterative Newton’s method and briefly talked about how an arbitrary image similarity function can be adapted to be used with this framework. In the next post, I’ll dive into the details of one example, using the inverse compositional formulation combined with the Mutual Information similarity function.

It is clear that Newton’s method is sensitive to the initial guess provided to it. Depending on the value provided to $\mathbf{p}_0$, there is always a risk of converging to a local maximum of the image similarity function. There are several ways to overcome this obstacle: for instance, one can use a robust estimator, such as a keypoint based registration, to determine $\mathbf{p}_0$. Furthermore, it is possible to perform the tracking algorithm at several scales of the input images, from the coarsest to the finest, using the output of one scale as the input to the next. This has been shown in the literature to vastly improve both the convergence region and the convergence rate of the iterative algorithms.

Nevertheless, depending on how you model your solution, there are probably better methods than Newton’s out there, such as Gauss-Newton and Levenberg-Marquardt. The Lucas-Kanade modelling of the image registration problem has been shown to work much better with these algorithms.
