Introduction to non-linear optimization for Image Registration

[Figure: Aerial view of Belo Horizonte. The frames were taken from Google Maps.]

The problem of image registration consists of aligning two images taken from different perspectives in order to obtain a larger visual representation of some large structure. For instance, one can stitch several overlapping aerial pictures to obtain the map of an entire city. Another example is the creation of panoramic pictures, which consists of stitching pictures taken by rotating the camera. The registration problem is also present in the medical fields. It is required to register medical imagery in order to accurately perform diagnostics as well as detect changes during several types of treatments.

With such a broad spectrum of applications, one can expect the registration problem to be presented with countless particularities. The signals being registered can have an arbitrary amount of dimensions; they might not even contain the same representation of the structure being mapped in the first place. The data from an MRI (representing soft tissues) can be registered with CT scans (more sensitive to high density objects); one may also attempt to register satellite imagery with cartographic maps.

The objective of this article is to gently introduce how the registration problem can be addressed with a non-linear optimization framework, from the Taylor series until the formulation of the Homography Jacobian and Hessian. In this article, we will focus on rigid registration, although the concepts herein can also be applied to the non-rigid case.

Mathematical formulation

The image registration problem consists of finding the coefficients \mathbf{p}_f=\{p_0, p_1, p_2, \ldots\} of the coordinate transforming function w(\mathbf{x}, \mathbf{p}) such that

\mathbf{p}_f=\displaystyle \underset{\mathbf{p}}{\mathop{\mathrm{arg\,max}}} \; \underset{\mathbf{x} \, \in \, \mathbf{I} \cap \mathbf{I^*}}\sum \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p})), \mathbf{I^*}(\mathbf{x})). \ \ \ \ \ (I)

Here is the explanation for each element in this formula:

  • \mathbf{I} and \mathbf{I^*} are the images being registered.
  • \mathbf{x} has as many dimensions as the input images. When dealing with 2D images, \mathbf{x} is a 2D vector \langle x,y \rangle.
  • The coordinate transforming function w is just a transformation applied to a coordinate \mathbf{x}. Some examples of such transforms are translation, scale, rotation, shearing, etc. Although it can be arbitrarily complex, we shall focus on the Homography transform later on.
  • The vector \mathbf{p} corresponds to the parameters of the function w, and therefore contains as many elements as the number of parameters that w requires. For instance, a translation contains two parameters \langle x,y\rangle , while a rotation might be represented by a scalar \alpha and a rotation axis \mathbf{v} . We shall see another example when formulating the Homography transform later.
  • While the symbols \mathbf{I} and \mathbf{I^*} represent two images, the expressions \mathbf{I^*}(\mathbf{x}) and \mathbf{I}(w(\mathbf{x}, \mathbf{p})) represent the evaluation of the pixel intensities at coordinate \mathbf{x} of image \mathbf{I^*} and coordinate w(\mathbf{x}, \mathbf{p}) of image \mathbf{I}, respectively.
  • \mathbf{I} \cap \mathbf{I^*} is a shorthand for the overlap region between image \mathbf{I^*} and the warped image \mathbf{I} . Therefore, the summation accounts for all coordinates \mathbf{x} that belong to this overlap. In other words, \mathbf{x} must lie within \mathbf{I^*} and w(\mathbf{x}, \mathbf{p}) must lie within \mathbf{I} .
  • \Psi is a similarity function. It is high when its parameters are similar, and low when its parameters are different. There are many ways to measure similarity between images, such as sum of absolute differences (SAD), sum of squared differences (SSD), normalized cross correlation (NCC), etc. Later on, we shall focus on the Mutual Information metric.
  • Finally, the vector \mathbf{p}_f corresponds to the parameters of w that maximize the similarity between images \mathbf{I} and \mathbf{I^*}. In other words, it defines the transformation to be applied to image \mathbf{I} that will maximize its alignment to image \mathbf{I^*}.

Note that the image similarity function is expected to have a least upper bound. In other words, there is a maximum theoretical value that the similarity function may return; the images simply can’t be more similar than that. Many different transform parameters may make the similarity function return that value, but no transformation parameters should take you beyond that level of similarity.
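To make the similarity function and its upper bound concrete, here is a minimal NumPy sketch of two of the simpler choices of \Psi mentioned above (SSD, negated so that higher means more similar, and NCC). The Mutual Information metric used later in the series is more involved, so take this only as an illustration of the bounded-similarity idea:

import numpy as np

def neg_ssd(a, b):
    # Negated sum of squared differences: bounded above by 0,
    # reached only when the two patches are identical.
    d = a.astype(np.float64) - b.astype(np.float64)
    return -np.sum(d * d)

def ncc(a, b):
    # Normalized cross correlation: bounded above by 1.0.
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))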

Newton Method for non-linear optimization

In order to understand the Newton algorithm for non-linear optimization, consider the 1D example function below:

[Figure: an example 1D similarity function]

As our goal is to reach the global maximum of an arbitrary similarity function f , it seems prudent to first have a practical tool for detecting maxima. One such tool consists of looking for points whose slope is zero:

[Figure: minima and maxima of the function, where the slope is zero]

Since the slope of f is given by its derivative with respect to the parameters being optimized, df/dx , the maxima points can be found as the solution to the following system:

\frac{df(x)}{dx}=f'(x)=0

In many practical applications, including the image registration problem, the solution to this system can’t be found analytically. This is where we use Newton’s method for finding roots of differentiable functions.

The idea of Newton’s method for solving g(x)=0 is to iteratively approach the solution x , starting at an initial guess x_0 and stepping to a better point x_1 , which is given by the intersection between the line tangent to g at x_0 and the horizontal axis, as shown in the figure below.

[Figure: one step of Newton’s method for finding a root]

Since this line is tangent to g at x_0 , its slope must be g'(x_0) . As a line, its equation is given by y = g'(x_0)x + b . Since it evaluates to g(x_0) at x=x_0 , then b=g(x_0)-g'(x_0)x_0 . Finally, since x_1 is located at the intersection between this line and y=0 , we can find x_1 :

\begin{aligned} 0 &=g'(x_0)x_1+g(x_0)-g'(x_0)x_0, \\ x_1&=\frac{g'(x_0)x_0-g(x_0)}{g'(x_0)}=x_0-\frac{g(x_0)}{g'(x_0)}. \end{aligned}

This very same procedure can be repeated to compute an x_2 , by using x_1 as the new initial guess, and an x_3 after that. Generalizing, each new point x_{i+1} in the iterative scheme can be determined by

\begin{aligned} x_{i+1}=x_i-\frac{g(x_i)}{g'(x_i)}, \end{aligned}

until some convergence criterion is satisfied (for instance, until |g(x_i)| < \epsilon ).
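As a concrete illustration, here is a minimal Python sketch of this root-finding iteration, using g(x) = x^2 - 2 (whose positive root is \sqrt{2}) as a hypothetical example:

def newton_root(g, g_prime, x0, eps=1e-10, max_iter=50):
    # Newton iteration for g(x) = 0: x_{i+1} = x_i - g(x_i) / g'(x_i)
    x = x0
    for _ in range(max_iter):
        if abs(g(x)) < eps:   # convergence criterion |g(x_i)| < eps
            break
        x = x - g(x) / g_prime(x)
    return x

# Example: the positive root of x^2 - 2 is sqrt(2) ~= 1.41421356
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)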

Since g(x)=f'(x) , we can finally reach the formulation of each iteration of Newton’s method for finding a local extremum of f(x) , starting at an arbitrary initial guess x_0 :

\begin{aligned} x_{i+1}=x_i-\frac{f'(x_i)}{f''(x_i)}. \ \ \ \ \ (II) \end{aligned}

There are a few caveats to this method, though. Notice how the example function f contains both minima and maxima, and the slope of f is zero at all of them. The Newton iteration may converge to any of these stationary points, so upon convergence you must check whether you’ve reached a maximum or a minimum by verifying the sign of f''(x_f) , where x_f is the convergence point. If f''(x_f)>0 , then you’ve reached a local minimum; if f''(x_f)<0 , then you have found a local maximum. If, however, f''(x_f)=0 , then the test is inconclusive and x_f may be neither a maximum nor a minimum! For instance, consider the function f(x)=x^3 at x=0 : this stationary point is not interesting at all for our purpose.
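The same idea in code: a minimal sketch, assuming f' and f'' are available as Python callables, that runs the iteration from equation (II) and classifies the convergence point by the sign of f'':

def newton_extremum(f_prime, f_second, x0, eps=1e-10, max_iter=50):
    # Applies x_{i+1} = x_i - f'(x_i) / f''(x_i) and reports what was found
    x = x0
    for _ in range(max_iter):
        if abs(f_prime(x)) < eps:
            break
        x = x - f_prime(x) / f_second(x)
    curvature = f_second(x)
    if curvature > 0:
        kind = 'local minimum'
    elif curvature < 0:
        kind = 'local maximum'
    else:
        kind = 'inconclusive'   # e.g. f(x) = x^3 at x = 0
    return x, kind

# Example: f(x) = -(x - 2)^2 has a maximum at x = 2
x_f, kind = newton_extremum(lambda x: -2.0 * (x - 2.0),
                            lambda x: -2.0, x0=0.5)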

Taylor Series: Approximating functions by polynomials

In the next section, we will talk about the geometrical interpretation of the Newton method. Before we do so, let’s back up a little and recall the concept of Taylor Series. Assuming that the function f is infinitely differentiable at x=0 , we can write it as an infinite polynomial:

f(x) =  a_0 + a_1x + a_2x^2 + \ldots

The coefficients of such a polynomial can be computed by taking the derivatives of f and evaluating them at x=0 :

\begin{aligned} a_0 &= f(0), \\ f'(x) &= a_1 + 2a_2x + 3a_3x^2+\ldots \implies f'(0) = a_1, \\ f''(x) &= 2!a_2 + 3!a_3x + \ldots \implies f''(0) = 2a_2, \\ f^{(n)}(x) &= n!a_n + \ldots \implies f^{(n)}(0) = n!a_n. \end{aligned}

Therefore, we have

\begin{aligned}f(x) &= f(0) + f'(0)x + f''(0)\frac{x^2}{2!} + f'''(0)\frac{x^3}{3!} + \ldots, \\ f(x) &= \sum_{i=0}^\infty f^{(i)}(0)\frac{x^i}{i!}. \end{aligned}

Let’s take a moment to contemplate the implications of this concept. It is clear that the Taylor series will represent our arbitrary function f when all of its infinite terms are present. Nevertheless, it is interesting to see what happens in the case when only its first n terms are used to describe an approximation of f via a finite polynomial g . When n=1 , g(x) = f(0) and g is only a horizontal line that crosses f at x=0 .

[Figure: Taylor approximation with n=1]

When n=2 , however, something interesting starts to happen. Since in this case g(x) = f(0) + f'(0)x , the derivative g'(x) has the same value as the derivative f'(x) at x=0 . That is, g is the tangent line to f at x=0 .

[Figure: Taylor approximation with n=2]

By the very same reasoning, every additional term included in our Taylor series makes g match one more derivative of f at x=0 . In other words, we can find a polynomial of arbitrary degree that smoothly approximates the original function f at x=0 .

[Figure: Taylor approximation with a higher-order polynomial]

Naturally, the real power of the Taylor series comes when it is adapted to perform such approximation at any arbitrary location of the x axis, not just the origin. To achieve this adaptation, we need a polynomial \tilde{g} such that

\begin{aligned} \tilde{g}(k) &= f(k), \\ \tilde{g}'(k) &= f'(k), \\ \tilde{g}''(k) &= f''(k), \\ &\ \ \vdots \\ \tilde{g}^{(n)}(k) &= f^{(n)}(k). \ \ \ \ \ (III) \\ \end{aligned}

By making \tilde{g}(x) = a_0 + a_1(x-k) + a_2(x-k)^2 + \ldots and exploring the relationships in (III) , we can find a_n by computing the n-th derivative of \tilde{g} as follows:

\begin{aligned} \tilde{g}(k) &= a_0 & \implies a_0 &= f(k), \\ \tilde{g}'(k) &= a_1 & \implies a_1 &= f'(k), \\ \tilde{g}''(k) &= 2a_2 & \implies a_2 &= \frac{f''(k)}{2}, \\ &\ \ \vdots \\ \tilde{g}^{(n)}(k) &= n!a_n & \implies a_n &= \frac{f^{(n)}(k)}{n!}. \end{aligned}

Finally, we have that the polynomial of degree n that smoothly approximates f at x=k is given by

\begin{aligned} \tilde{g}(x) = \sum_{i=0}^n f^{(i)}(k) \frac{(x-k)^i}{i!}. \end{aligned}
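As a quick sanity check, here is a small Python sketch that builds \tilde{g} from a list of derivatives evaluated at x=k ; the function and expansion point below are arbitrary examples of mine, not something from the registration problem itself:

import math

def taylor_poly(derivs_at_k, k):
    # Builds g~(x) = sum_i f^(i)(k) (x - k)^i / i! from the list
    # [f(k), f'(k), f''(k), ...] of derivatives evaluated at x = k
    def g(x):
        return sum(d * (x - k) ** i / math.factorial(i)
                   for i, d in enumerate(derivs_at_k))
    return g

# Example: degree-3 approximation of sin around k = pi/4
k = math.pi / 4
g = taylor_poly([math.sin(k), math.cos(k), -math.sin(k), -math.cos(k)], k)
print(g(1.0), math.sin(1.0))   # both approximately 0.8414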

Geometrical Interpretation of the Newton method

Once again, let f be a function whose minimum (or maximum) we want to find. Also, let x_0 be the initial guess used by Newton’s method.

Each iteration of Newton’s method corresponds to finding the second-degree Taylor polynomial \tilde{g} that approximates f at x=x_0 , and defining the next estimate x_1 as the stationary point of \tilde{g} . Mathematically,

\begin{aligned} \tilde{g}(x) =& \ f(x_0) + f'(x_0)(x-x_0) + f''(x_0)\frac{(x-x_0)^2}{2}, \\ \tilde{g}'(x_1) =& \ 0 \implies f'(x_0) + f''(x_0)(x_1-x_0)=0, \\ \implies & x_1=x_0-\frac{f'(x_0)}{f''(x_0)}, \end{aligned}

which generalizes exactly as shown in equation (II) .


Formulations to Iterative Image Registration

There are several ways to adapt the equation (I) for use with Newton’s method. Note that, due to the iterative nature of this method, we are always seeking to determine an incremental change to the parameters being optimized \mathbf{p} , which will bring us closer to the maximum of the image similarity function. The calculation of such incremental change to \mathbf{p} involves several extremely computationally expensive tasks, such as calculation of image gradients, Jacobian and Hessian of the transform function (evaluated at each pixel!), and so forth.

Therefore, we need to formulate our iterative scheme carefully to avoid as much computation as possible. Below, I briefly introduce the most common ways to perform this task.

Forward additive

The forward additive formulation is possibly the simplest approach, and combines the classical formulation of Newton’s method with the most basic Taylor series formulation, which approximates the function at the origin (that is, at x=0 ). The idea here is that, at the i -th iteration, the origin of the similarity function is re-centered at \mathbf{p}_i :

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i+\Delta \mathbf{p})), \mathbf{I}^*(\mathbf{x})). \end{aligned}

Note that the warp function is now expressed as w(\mathbf{x}, \mathbf{p}_i + \Delta \mathbf{p}) . Here, \Delta \mathbf{p} is the variable, and \mathbf{p}_i is a constant. Therefore, the Taylor series of the image similarity function expanded at \Delta \mathbf{p}=0 will approximate it at \mathbf{p}_i .

At each iteration, we will find a \Delta \mathbf{p} that brings us closer to the desired maximum, and then update \mathbf{p}_{i+1} \gets \mathbf{p}_i + \Delta \mathbf{p} .
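In code, and independently of the particular similarity function, the iteration looks roughly like the sketch below. It assumes hypothetical user-supplied functions similarity_gradient(p) and similarity_hessian(p) that return the first and second derivatives of f with respect to \Delta \mathbf{p}, evaluated at \Delta \mathbf{p}=\mathbf{0} for the current \mathbf{p}:

import numpy as np

def forward_additive_newton(p0, similarity_gradient, similarity_hessian,
                            max_iter=100, eps=1e-6):
    # Generic forward-additive loop: solve H dp = -g for the Newton
    # step and update p <- p + dp until the step becomes negligible
    p = np.asarray(p0, dtype=np.float64)
    for _ in range(max_iter):
        g = similarity_gradient(p)   # df / d(dp) at dp = 0
        H = similarity_hessian(p)    # d2f / d(dp)2 at dp = 0
        dp = np.linalg.solve(H, -g)
        p = p + dp
        if np.linalg.norm(dp) < eps:
            break
    return p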

By using this formulation, we usually end up having to calculate some very convoluted Jacobian and Hessian matrices for the transform function w. The Forward compositional formulation was proposed to simplify this procedure a little bit.

Forward compositional

The idea of the forward compositional formulation is to use function composition to describe how to incrementally modify \mathbf{p}_i in order to reach \mathbf{p}_{i+1}:

\begin{aligned} w(\mathbf{x}, \mathbf{p}_{i+1}) = w(w(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i). \end{aligned}

In order to be able to use this formulation, the image similarity function must be also reformulated:

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(w(\mathbf{x}, \Delta \mathbf{p}),\mathbf{p}_i)), \mathbf{I}^*(\mathbf{x})). \end{aligned}

The nice thing about this formulation is that when elaborating the first and second derivatives of the image similarity function f (which are required for the Taylor series expansion, as described previously), you will eventually have to evaluate the Jacobian and Hessian matrices of w(\mathbf{x}, \Delta \mathbf{p}) at \Delta \mathbf{p}=\mathbf{0} . There are two good things about this: The first is that the analytical derivation of these matrices can be greatly simplified. Secondly, at least part of these matrices can be precomputed before running the iterations of Newton’s method.

All things considered, the forward compositional formulation may make our job a little easier, but the computational complexity of the optimization algorithm remains essentially unchanged. Depending on how you model your optimization framework, you can drastically decrease the computational complexity by using the Inverse compositional approach.

Inverse compositional

The problem with the previous formulations is that, when computing the first and second derivatives of f with respect to \Delta \mathbf{p} , we usually end up with formulas that depend on \mathbf{p}_i . Since \mathbf{p}_i is updated at each iteration, we end up having to recompute all terms that depend on it as well.

One way to improve this situation is to modify the image similarity function so that the reference image \mathbf{I}^* is warped by \Delta \mathbf{p} ; depending on the similarity function, we then end up being able to precompute a large number of terms:

\begin{aligned} f(\mathbf{p}_i, \Delta \mathbf{p}) = \sum_{\mathbf{x}} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i)), \mathbf{I}^*(w(\mathbf{x}, \Delta \mathbf{p}))). \end{aligned}

This adaptation requires the update step to also be reconsidered. Since \Delta \mathbf{p} now describes a transform that brings \mathbf{I}^* closer to \mathbf{I} , we need to adjust the update step in order to find \mathbf{p}_{i+1} given \mathbf{p}_i and \Delta \mathbf{p}.

This can be done as follows. We know that applying \Delta \mathbf{p} to \mathbf{I}^* and \mathbf{p}_i to \mathbf{I} must be equivalent to applying \mathbf{0} to \mathbf{I}^* and \mathbf{p}_{i+1} to \mathbf{I} , that is,

\begin{aligned} \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_i)), \mathbf{I}^*(w(\mathbf{x},\Delta \mathbf{p}))) &= \Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_{i+1})), \mathbf{I}^*(w(\mathbf{x}, \mathbf{0})))\\ &=\Psi(\mathbf{I}(w(\mathbf{x}, \mathbf{p}_{i+1})), \mathbf{I}^*(\mathbf{x})). \ \ \ \ \ (IV) \end{aligned}

To visualize this, compare the situation before and after a step of Newton’s method. The point \mathbf{p}_{i+1} corresponds to the optimum of the parabola that passes through \mathbf{p}_i and can be reached via a step of \Delta \mathbf{p}. To determine how this step is applied, let’s replace \mathbf{x} by w^{-1}(\mathbf{x}, \Delta \mathbf{p}) in the left side of equation (IV):

\begin{aligned} & \ \Psi(\mathbf{I}(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i)), \mathbf{I}^*(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}),\Delta \mathbf{p}))) \\ =& \ \Psi(\mathbf{I}(w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i)), \mathbf{I}^*(\mathbf{x})). \ \ \ \ \ (V) \end{aligned}

Comparing the result of equation (V) with the right side of equation (IV), we see that the update step can now be written as:

\begin{aligned} w(\mathbf{x}, \mathbf{p}_{i+1}) = w(w^{-1}(\mathbf{x}, \Delta \mathbf{p}), \mathbf{p}_i) \end{aligned}
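For a concrete example, when the warp is a homography represented by a 3x3 matrix acting on homogeneous coordinates (the transform we will focus on later in the series), this update becomes a simple matrix product. Below is a minimal NumPy sketch under that assumption:

import numpy as np

def warp(H, x):
    # Applies a 3x3 homography H to a 2D point x
    xh = H @ np.array([x[0], x[1], 1.0])
    return xh[:2] / xh[2]

def inverse_compositional_update(H_i, H_dp):
    # w(x, p_{i+1}) = w(w^{-1}(x, dp), p_i)  becomes
    # H_{i+1} = H_i @ inv(H_dp) in matrix form
    H_next = H_i @ np.linalg.inv(H_dp)
    return H_next / H_next[2, 2]   # keep the usual H[2,2] = 1 normalization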

Further reading

I’ve only barely scratched the surface on the topic of formulations of the image registration problem. Each one of them has its own requirements in terms of what types of transform functions w are supported. If these requirements are met, they are all proven to be equivalent. The curious reader is invited to read the paper Lucas-Kanade 20 Years On: A Unifying Framework, by S. Baker and I. Matthews.

At this point, some concepts evoked in that paper may still sound like black magic to the reader, especially to those who have not studied real analysis. For now, the most important ones are the requirements on the transform w:

  • The forward compositional formulation requires the “set of warps to form a semigroup”. This means that the composition of warps must be associative – ((w \circ w) \circ w) = (w \circ (w \circ w)) – and they must obey the closure axiom, that is, the result of a composition of warps must also be a warp. Finally, although the paper doesn’t explicitly say so, there must also be an identity warp parameter \mathbf{p}=\mathbf{0} with which the warp does not modify a point – that is, w(\mathbf{x}, \mathbf{0}) = \mathbf{x} .
  • The inverse compositional formulation has all the requirements from the forward compositional formulation, as well as an additional requirement: The warp functions must also be invertible (and that inverse must also be a valid warp function). Obeying all these axioms is what makes it a group.

Final thoughts

This was the first post of a series about how to solve the image tracking problem via a non-linear optimization framework. In this post, I introduced the general idea of the iterative Newton’s method and briefly talked about how an arbitrary image similarity function can be adapted to be used with this framework. In the next post, I’ll dive into the details of one example, using the inverse compositional formulation combined with the Mutual Information similarity function.

It is clear that Newton’s method is sensitive to the initial guess provided to it. Depending on the value provided to \mathbf{p}_0, there is always a risk of convergence to a local maximum of the image similarity function. There are several ways to overcome this: for instance, one can use a robust estimator such as keypoint-based registration to determine \mathbf{p}_0. Furthermore, it is possible to run the tracking algorithm at several scales of the input images, from the coarsest to the finest, using the output of one scale as the input to the next. This has been shown in the literature to vastly improve both the convergence region and the convergence rate of the iterative algorithms.

Nevertheless, depending on how you model your solution, there are probably better methods than Newton’s out there, such as Gauss-Newton and Levenberg-Marquardt. The Lucas-Kanade modelling of the image registration problem has been shown to work much better with these algorithms.


Configuring GDB ImageWatch to visualize custom buffer types

[Figure: the gdb-imagewatch window]

gdb-imagewatch is a tool that can be used in conjunction with GDB to visualize in-memory buffers during debug sessions. In the most typical scenario, it’s used to visualize the contents of an OpenCV image being processed by the debugged application. To help with the debugging task, it has many features inspired by the Visual Studio ImageWatch plugin, such as pixel value visualization, support for multiple buffer types and dumping buffers to files; while also containing other helpful features such as buffer rotation.

Since the gdb-imagewatch plugin was created with extensibility in mind, it is very easy to adapt it so that it can support as many buffer types as needed by the programmer. In this post, I will show how to adapt the plugin so that it can display the contents of a custom C++ Image class.

Let’s begin with the sample code to be debugged:

// Compile this code with a C++11 compliant compiler; on
// gcc, use the -std=c++11 flag.
#include <iostream>

class Image {
public:
  Image(int width, int height, int channels) :
  width_(width), height_(height), channels_(channels) {
    contents_ = new unsigned char[width*height*channels];
  }

  // Forbid duplicating references to the same pointer
  Image(const Image&) = delete;
  Image& operator=(const Image&) = delete;

  // Transfer buffer ownership in case of move semantics
  Image(Image&& other) : contents_(nullptr) {
    *this = std::move(other);
  }
  Image& operator=(Image&& other) {
    // Release any buffer we already own before
    // taking over the other one
    delete[] contents_;

    width_ = other.width_;
    height_ = other.height_;
    channels_ = other.channels_;
    contents_ = other.contents_;

    other.contents_ = nullptr;

    return *this;
  }

  unsigned char& operator()(int row,
                            int col,
                            int channel) {
      return contents_[(row*width_ + col) * channels_ +
                        channel];
  }

  ~Image() {
    delete[] contents_;
  }

  int width() {
    return width_;
  }

  int height() {
    return height_;
  }

  int channels() {
    return channels_;
  }

  unsigned char* contents() {
    return contents_;
  }

// The visibility of the fields is irrelevant to the
// plugin
private:
  unsigned char* contents_;
  int width_;
  int height_;
  int channels_;
};

int main() {
  Image gradient(50, 50, 1);
  for(int row = 0 ; row < gradient.height(); ++row) {
      for(int col = 0; col < gradient.width(); ++col) {
          gradient(row, col, 0) = col;
      }
  }

  return 0;
}

This sample defines a very simple Image class that behaves similarly to a unique_ptr, and has no Region of Interest (ROI) functionality. In order to support it, we need to modify the script resources/giwscripts/giwtype.py. There are two functions in this file that need to be adapted in order for our new Image class to be recognized by gdb-imagewatch.

The first one is is_symbol_observable(symbol). When GDB pauses (e.g. when it hits a breakpoint), it passes control to gdb-imagewatch, which in turn inspects all visible symbols (variables) in that particular context. The parameter symbol is a GDB object with information about that symbol (such as its name and type), and the function is_symbol_observable must return True if the received symbol belongs to a buffer type (and False otherwise). For the Image class presented in this example, the function must check whether the type of the given symbol is either an Image or a pointer to one.

The second function is get_buffer_metadata(obj_name, picked_obj, debugger_bridge). Once is_symbol_observable has decided that a particular symbol is a buffer, the gdb-imagewatch plugin must also receive the relevant, buffer-specific data for that symbol. This data can be obtained by fetching the values of the Image class fields from picked_obj, which can be treated as if it were a dict. The function get_buffer_metadata must return a dict containing the following elements:

  • display_name: Name of the buffer as it must appear in the ImageWatch window. Can be customized to also show its typename, for instance.
  • buffer: Pointer to beginning of buffer data
  • width: Buffer width. Must be int.
  • height: Buffer height. Must be int.
  • channels: Number of channels. Must be int.
  • type: Numeric type of each element contained in the buffer. Must be one of the following: GIW_TYPES_UINT8, GIW_TYPES_UINT16, GIW_TYPES_INT16, GIW_TYPES_INT32, GIW_TYPES_FLOAT32, GIW_TYPES_FLOAT64
  • row_stride: Number of pixels that exist between two vertically adjacent pixels. Also known as row step. Must be int.
  • pixel_layout: String describing how internal channels should be ordered for display purposes. The default value for buffers of 3 and 4 channels is 'bgra', and 'rgba' for images of 1 and 2 channels. This string must contain exactly four characters, and each one must be one of 'r', 'g', 'b' or 'a'. Repeated channels, such as 'rrgg' are also valid.

For the sample Image class described above, this is what the script resources/giwscripts/giwtype.py should look like:

# -*- coding: utf-8 -*-

"""
This module is concerned with the analysis of each
variable found by the debugger, as well as identifying
and describing the buffers that should be plotted in the
ImageWatch window.
"""

from giwscripts import symbols


def get_buffer_metadata(obj_name,
                        picked_obj,
                        debugger_bridge):
  buffer = debugger_bridge.get_casted_pointer('char',
    picked_obj['contents_'])
  if buffer == 0x0:
    raise Exception('Received null buffer!')

  width = int(picked_obj['width_'])
  height = int(picked_obj['height_'])

  channels = int(picked_obj['channels_'])
  # For this contiguous layout, the row stride (in pixels) is the width
  row_stride = width
  pixel_layout = 'rgba'

  # Our Image class only has one buffer type: unsigned
  # char
  type_value = symbols.GIW_TYPES_UINT8

  return {
    'display_name': str(picked_obj.type) + ' ' +
                    obj_name,
    'pointer': buffer,
    'width': width,
    'height': height,
    'channels': channels,
    'type': type_value,
    'row_stride': row_stride,
    'pixel_layout': pixel_layout
  }


def is_symbol_observable(symbol):
  """
  Returns true if the given symbol is of observable type
  (the type of the buffer you are working with). Make
  sure to check for pointers of your type as well
  """
  symbol_type = str(symbol.type)
  # In practice, you should use regular expressions to
  # handle all type modifiers gracefully
  return symbol_type == 'Image' or \
         symbol_type == 'Image *'

After making any changes to resources/giwscripts/giwtype.py, you must rebuild/reinstall the plugin in order for the changes to take effect.

Final thoughts

There’s nothing preventing you from making more sophisticated changes to the resources/giwscripts/giwtype.py script. For instance, there may be multiple buffer types that you’d like to be able to inspect simultaneously, and thus is_symbol_observable(symbol) could check for all of them, while get_buffer_metadata(obj_name, picked_obj, debugger_bridge) could return different fields depending on picked_obj.type.

2D omnidirectional shadow mapping with Three.Js

A 2D scene is visualized by a camera placed on top of it. Black squares represent obstacles that light cannot pass through. In this scene, we have three light sources.

In this post, I will talk about how I implemented a 2D omnidirectional shadow mapping algorithm (that is, given a 2D world with obstacles defined by polygons and a set of point lights, render this world by taking into account the shadows produced by the interaction between the lights and the obstacles). You can check out a live demo here (WebGL support is required to run it).

Although I’ve already implemented 3D shadow maps for spot lights in the past, the present case opens up space for exciting new ways of approaching this problem. As far as I know, Three.Js’ shadow implementations are specifically designed for 3D scenes. Moreover, I haven’t seen any references to omnidirectional shadow maps in this library (which doesn’t mean it’s not there; it might just not be documented). Therefore, instead of using the default implementation of shadow maps, I decided to code my own and explore the new possibilities I had in mind, including the use of a 1D depth map.

A naive approach

Let’s start with the way it should not be done. The first idea that came to my mind was to use a single 1D depth map, where I would project the world into a circular 1D buffer, as if the camera were a circle inside the 2D plane. The equation for this is very simple: given the light position \mathbf{l} and a vertex \mathbf{p}, its projected coordinate on such a buffer is given by

\mathbf{p}'=\frac{\arctan(\mathbf{p}_y-\mathbf{l}_y,\mathbf{p}_x-\mathbf{l}_x)}{2\pi}.

This formula would go into the vertex shader, which would receive the polygon vertices \mathbf{p} and forward the projected vertex \mathbf{p}' to the rasterizer. The GPU would then interpolate straight lines between the projected vertices (note that, for this to work, the models had to be rendered in wireframe mode, which I’ll talk about later) and the fragment shader would output the depth of each pixel to a framebuffer, which would later be used to determine whether or not a point in the world is lit.
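For illustration, here is a small NumPy sketch of this projection, mapping a vertex to a texel index in the 1D buffer. The wrap into [0, 1) before indexing is my own convention for turning the angular coordinate into a buffer position:

import numpy as np

def project_to_circular_buffer(p, light, buffer_width=1024):
    # p' = atan2(py - ly, px - lx) / (2*pi), in (-0.5, 0.5]
    angle = np.arctan2(p[1] - light[1], p[0] - light[0])
    p_prime = angle / (2.0 * np.pi)
    # Wrap into [0, 1) and convert to a texel index
    return int((p_prime % 1.0) * buffer_width) % buffer_width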

Although very simple, this approach has a fundamental problem. As we are trying to model a circular entity (a hypothetical circular camera) with a non-circular buffer, the discontinuity between the buffer edges is not something our GPU is aware of. This means that polygons projected onto the discontinuous region of our circular buffer will be incorrectly rasterized, as the image below shows.

From cube mapping to triangle mapping

Another way to model an omnidirectional camera is by performing the 2D equivalent of the cube mapping algorithm: breaking down the “field of view” of the light into several sub-regions and modelling each one as a perspective camera. Similarly to the circular buffer, we only need to render each sub-region into a 1D buffer. Also, given that performance was a concern, I decided to divide the light into only three regions, since each sub-region requires one render pass.

Each perspective projection was defined by a Three.Js PerspectiveCamera object with a horizontal fov of 120°, and having the \mathbf{z} vector as its “up” vector. To move a light around, we translate these “cameras” along with it. Since they cover the whole field of view of the light, there’s no need to rotate them. In my implementation, I set the aspect ratio to 1:1 because the PerspectiveCamera constructor receives the vertical fov as its parameter, not the horizontal one. Therefore, a 1:1 aspect ratio makes both the horizontal and vertical fields of view equal to 120°.

This is how I set up these cameras:

var fboCameras = [
    new THREE.PerspectiveCamera(360/3, 1, 0.2, maxDistance),
    new THREE.PerspectiveCamera(360/3, 1, 0.2, maxDistance),
    new THREE.PerspectiveCamera(360/3, 1, 0.2, maxDistance)
];
var fboView = [];

// Projection matrix to be passed to each depth pass.
// Notice that all cameras share the same projection matrix.
var fboProjection = fboCameras[0].projectionMatrix;

for(var i = 0; i < 3; ++i) {
    fboCameras[i].position.set(0.0, 0.0, 0.0);
    // These cameras are inside our plane, which lies
    // in the XY axes at Z=0. Therefore, our up vector
    // must be the Z vector.
    fboCameras[i].up.set(0,0,1);

    // Matrix that transforms points from world coordinates
    // to light coordinates (world2camera)
    fboView.push(fboCameras[i].matrixWorldInverse);
}

// Camera [0] corresponds to camera (A) in the image below
fboCameras[0].lookAt(new THREE.Vector3(0, 1, 0));
// Camera [1] corresponds to camera (B) in the image below
fboCameras[1].lookAt(
    new THREE.Vector3(-Math.sin(120*Math.PI/180),
                      Math.cos(120*Math.PI/180), 0)
);
// Camera [2] corresponds to camera (C) in the image below
fboCameras[2].lookAt(
    new THREE.Vector3(Math.sin(120*Math.PI/180),
                      Math.cos(120*Math.PI/180), 0)
);

Each camera will require one framebuffer where it will render its depth view of the world. I have created three buffers with width of 1024 each, and had their format set to RGBA (the reason for this will be explained later).

Rasterizing a flat polygon into a 1D buffer

As you might already know, a polygon has no thickness. As such, we can’t expect our polygons to be rasterized by the GPU during the light passes, since they would be projected as lines without any thickness. In order to solve this problem, we need to give thickness to our polygons, transforming their 1D projections from lines to rectangles. I’ve thought of two ways to achieve that:

  • Render the primitives in wireframe mode. This is not very effective, as line joints may not be connected after rendering.
  • Extrude the polygons. In other words, transform them into prisms.

Considering the potential problem of unconnected lines related to the first solution, I decided to extrude all my polygons (in my demo, it was just a matter of using cubes instead of squares).

To float textures or to fake float?

In a desktop implementation of OpenGL, it would be possible to make the output of each light pass be written to a float texture, which is highly desirable since we are rendering the depth of our scene relative to each light. Unfortunately, floating point textures are not a standard WebGL feature. They require the extension OES_texture_float, which might not be available on some clients.

Instead of using this extension and running the risk of making a demo that’s not playable by some people, I decided to pack the (floating point) depth into an RGBA pixel, which is conveniently 32 bits wide. Some care needs to be taken for this to work. From the perspective of the fragment shader, each component of gl_FragColor is a float; however, each of them will be converted to an unsigned byte to be stored in the RGBA texture. Moreover, these components are assumed to be in the range [0, 1], which will be mapped to the [0, 255] range when converted to bytes. Therefore, in order to pack a depth value inside an RGBA texel, we need to make sure that it can be broken down into four floats, each lying in the range [0, 1] but with enough precision that it will not lose information after being converted to a byte.

An easy way to comply with the range restriction is by using the NDC z coordinate of your vertex, as it is (non-linearly) mapped to the [0,1] range. You can get this value from the gl_FragCoord variable inside your fragment shader program. Provided that a normalized depth is used, it can be broken down into four components (that will be stored as four bytes in the GPU), and can be recombined back into a single float during the scene render pass.

To pack a float as a vec4 in the fragment shader, I used this function proposed by Z Goddard:

vec4 pack_depth( const in float depth ) {
    const vec4 bit_shift = vec4( 256.0 * 256.0 * 256.0,
                                 256.0 * 256.0,
                                 256.0, 1.0 );
    const vec4 bit_mask  = vec4( 0.0,
                                 1.0 / 256.0,
                                 1.0 / 256.0,
                                 1.0 / 256.0 );
    vec4 res = fract( depth * bit_shift );
    res -= res.xxyz * bit_mask;
    return res;

}

To unpack a float that’s packed in a vec4, I took this function from ThreeJS’ shadow mapping library:

float unpackDepth( const in vec4 rgba_depth ) {

    const vec4 bit_shift = vec4( 1.0 /
                                 ( 256.0 * 256.0 * 256.0 ),
                                 1.0 / ( 256.0 * 256.0 ),
                                 1.0 / 256.0, 1.0 );
    float depth = dot( rgba_depth, bit_shift );
    return depth;

}

Second render pass: A light region for each fragment

During the scene render pass, we need to determine which framebuffer each fragment’s depth should be compared against. Just like in the standard shadow mapping algorithm, if the depth of a fragment is smaller than the value stored in its corresponding framebuffer, then this fragment is lit; otherwise, it is not. Given the fragment position \mathbf{f} and the light position {^w\mathbf{l}}, the index of the framebuffer that it should sample from can be determined by:

\lfloor \frac{\arctan(\mathbf{f}_y-{^w\mathbf{l}}_y, \mathbf{f}_x-{^w\mathbf{l}}_x)+7\pi/6}{4\pi/6} +1\rfloor \mod 3

This equation assumes that you are using three framebuffers (which is our case, since we have one render target for each light region). It also assumes that the subdivisions of the light are set up in such a way that region (A) has index 0, (B) has index 1 and (C) has index 2 in the render target vector.
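For reference, here is the same selection formula as a small NumPy function; in the actual demo it lives in the fragment shader, operating on the interpolated fragment position:

import numpy as np

def light_region_index(f, l):
    # Returns 0, 1 or 2 for regions (A), (B), (C), given the fragment
    # position f and the light position l, both in world coordinates
    angle = np.arctan2(f[1] - l[1], f[0] - l[0])
    return int(np.floor((angle + 7.0 * np.pi / 6.0) /
                        (4.0 * np.pi / 6.0) + 1.0)) % 3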

The light position {^w\mathbf{l}} must be specified in world coordinates (alternatively, the fragment position \mathbf{f} could be specified with respect to the light, which means we wouldn’t need the term {^w\mathbf{l}} in the equation above). The light position in world coordinates can be obtained from the view matrix of any of the light subdivisions. Given the light position {^w\mathbf{l}} and its rotation matrix \mathbf{R} (which is actually related to the rotation of the projection plane we are dealing with, not to the light itself), the view matrix of the corresponding render target transforms a point in world coordinates \mathbf{^wp} to the light reference frame \mathbf{^lp} as follows:

{^l\mathbf{p}}=\mathbf{R}^{-1}\;{^w\mathbf{p}}-\mathbf{R}^{-1}\;{^w\mathbf{l}}.

This equation tells us that the translational component of the view matrix is given by \mathbf{t}=-\mathbf{R}^{-1}\;{^w\mathbf{l}}. Therefore, the light position can be computed as {^w\mathbf{l}}=-\mathbf{R}\;\mathbf{t}, where \mathbf{R} is the inverse of the rotational part of the view matrix.

Computing the light position like this certainly feels awfully complicated and unnecessary, considering that we would only need to multiply the fragment position by the light’s view matrix in order to get the fragment position with respect to the light. However, if you consider that \mathbf{R} has a trivial form for the light region (A), you will notice that computing the light position is much less costly:

\mathbf{R}^{-1}= \left( \begin{array}{ccc}1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{array} \right), \; \mathbf{R} = \left( \begin{array}{ccc}1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{array} \right)

From this, we can derive {^w\mathbf{l}}=\langle -\mathbf{t}_x, \mathbf{t}_z, -\mathbf{t}_y \rangle, where \mathbf{t} is the translational part of the view matrix of region (A).
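The general recovery looks like the NumPy sketch below; it simply applies {^w\mathbf{l}}=-\mathbf{R}\mathbf{t} without exploiting the trivial structure of \mathbf{R} for region (A), which is what makes the shader-side version so cheap:

import numpy as np

def light_world_position(view_matrix):
    # l_w = -R t, where t is the view matrix's translational part
    # and R is the inverse of its rotational part
    R_inv = view_matrix[:3, :3]   # rotational part (world -> light)
    t = view_matrix[:3, 3]        # translational part
    return -R_inv.T @ t           # R = R_inv^T, since R_inv is a rotation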

One may argue that we still have to compute the fragment position with respect to the light in order to get its projective coordinates in the shadow map; however, as we don’t know which light region the fragment belongs to, we could end up performing two view matrix multiplications (one to determine which light region we should use, and another to project the fragment to its actual light region).

Draw the volumetric light

The impression of having a volumetric light can be given by making the light intensity saturate at fragments close to the light position, and quickly decay to an arbitrary gray shade where we’d like the edge of our light source to be. I achieved this by making the light intensity at a fragment the sum of two terms: a Gaussian probability density function (also known as the bell curve), which gives the light its shape, and a term that’s inversely proportional to the squared distance to the fragment, which is inspired by how light intensity actually decays:

\mathbf{I}=\frac{1}{\sigma \sqrt{2\pi} }e^{-\frac{d^2}{2\sigma} } + \frac{\alpha}{1 + d^2}

As it decays very quickly, the Gaussian term (the blue curve in the original plot) is responsible for rendering the light outline, while the inverse-square term (the red curve) is what mostly illuminates the environment.
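As a reference for the formula above, here is a small Python version; the σ and α values below are arbitrary placeholders, in the demo they are tuned per light:

import math

def light_intensity(d, sigma=0.05, alpha=0.01):
    # Two-term falloff: a narrow Gaussian-shaped bump that draws the
    # light outline plus an inverse-square-like term that illuminates
    # the scene
    bell = math.exp(-d * d / (2.0 * sigma)) / (sigma * math.sqrt(2.0 * math.pi))
    return bell + alpha / (1.0 + d * d)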

Rendering multiple lights

In my first attempt to add multiple lights to the scene, I created multidimensional arrays of framebuffers and view matrices (one dimension for different lights and another for the light regions). This meant a hardcoded limit for the maximum amount of lights that could be rendered simultaneously. Also, this limit turned out to be around 5 lights in my machine, which is understandable considering the amount of framebuffers created in the process, but isn’t practical.

Therefore, I decided to have four render passes for each light: three to produce the shadow maps and one to render the scene. The scene render pass takes as input the shadow maps of its corresponding light and a buffer with the scene rendered by the previous light, on top of which it accumulates its lighting effect. Since reading from and writing to the same FBO doesn’t seem to result in well-defined behavior, I had two scene framebuffers, which I alternately used as input/output during this process.

Consider an example with three lights. In the first pass, framebuffer A will be cleared to the ambient color and given as input to the shader responsible for rendering the scene with the first light. This scene will be written to framebuffer B. On the second pass, FBO B will be drawn onto FBO A, and on top of it, the second light will be rendered. Finally, during the third pass, the image on FBO A will be drawn onto the screen buffer, and on top of it, we will add the third and final light. Notice that on the last pass, there’s no need to draw on top of a framebuffer, as its output is the final result we want to display.

[Figure: diagram of the render passes used to accumulate multiple lights]

Final thoughts

The implemented solution is generic in two senses: firstly, obstacles can be modeled by any polygon, provided that they are extruded along the Z axis; secondly, there’s no theoretical limit on the number of lights (although a practical limit will be imposed by the available GPU power in the form of a frame rate drop).

In terms of shadow quality, this implementation is not concerned with smoothing the transition between shadowed and lit regions, which is still possible with a shadow mapping algorithm. Also, we can observe some undesired effects around an obstacle when the projection plane of a light region gets inside of it (which can happen when the light is either near or inside an obstacle). Although this looks like a precision issue with the depth map, I would investigate this problem further before proposing either a solution or a workaround for it.

Finally, notice that I haven’t considered the possibility of having the camera move in our scenario. Although allowing for camera motion would require some subtle modifications to the general algorithm, I assume it would be accompanied by a larger world with more obstacles, which would require special optimizations in order to cull irrelevant objects (taking into account both the light positions and the camera’s range of view). Also, the Z range of the projection matrix of each light region could be automatically adjusted so as to only consider relevant obstacles, reducing the effect of Z conflicts in the shadow map.

A mathematical overview of UVA 855

Lunch in Grid City, or UVA 855, is a problem created for the Portuguese regional of the ACM-ICPC. The premise of the problem is that you are given a set of coordinates S=\{\mathbf{p}_1,\mathbf{p}_2,\dots, \mathbf{p}_n\}, n \leq 50000, in a grid, and have to find a coordinate \mathbf{c} that minimizes the sum of the Manhattan distances between all points in S and \mathbf{c}. Formally:

\mathbf{c}=\displaystyle{\arg\!\min}_{p} \displaystyle\sum_{i=1}^{n}{|\mathbf{p}_{i_x}-\mathbf{p}_x|+|\mathbf{p}_{i_y}-\mathbf{p}_y|}

A naive approach would be to compute the distances of all points in S to all possible points in the grid, and pick the point with the smallest score. This has complexity \mathcal{O}(san), where s and a are the height and width of the grid, respectively. Since 1 \leq s,a \leq 1000, this approach would obviously not work.

Simplifying the definition

The summation that defines the score for a candidate \mathbf{p} can be broken down into two separate summations. This simplifies the problem and helps us to better analyse it:

\displaystyle\mathbf{c}={\arg\!\min}_{p} \sum_{i=1}^{n}{|\mathbf{p}_{i_x}-\mathbf{p}_x|} + \sum_{i=1}^{n}{|\mathbf{p}_{i_y}-\mathbf{p}_y|}

What this equation tells us is that we can solve the problem by separately optimizing each axis, since they are independent. Let’s focus on the x axis for the sake of clarity:

[Figure: candidate point and its neighbors along the x axis]

Each component of p belongs to S

The problem specification gives a valuable hint: it gives the impression that \mathbf{p} \in S. Although this is not entirely true, we can prove that \mathbf{p}_x = \mathbf{p}_{i_x} for some \mathbf{p}_i \in S and \mathbf{p}_y = \mathbf{p}_{j_y} for some \mathbf{p}_j \in S.

Suppose that there is a potential solution \mathbf{p} whose x coordinate does not coincide with the x coordinate of any point in S. The score on this particular dimension is given by:

\displaystyle\alpha={\!\min} \sum_{i=1}^{n}{|\mathbf{p}_{i_x}-\mathbf{p}_x|} (I)

Let k be the number of points to the left of \mathbf{p}, and l the number to its right. If we move the candidate to the x coordinate of the closest point on its left, the new score would be:

\displaystyle\alpha'_l=\alpha-k\Delta x_l + l\Delta x_l=\alpha-(k-l)\Delta x_l ,

where \Delta x_l is the difference between the x coordinate of the closest point to the left and \mathbf{p}_x. Here we notice that, if k \geq l, then \alpha'_l \leq \alpha, which means the point to the left would be preferred over our hypothetical solution. Moreover, if k < l, then picking a point to the right would give us the new (and better) score

\displaystyle\alpha'_r=\alpha-l\Delta x_r + k\Delta x_r=\alpha-(l-k)\Delta x_r ,

where \Delta x_r is the difference computed by using the closest point to the right.

From this analysis, we conclude that our search space for \mathbf{p}_x is reduced to the x coordinates of all points \mathbf{p}_i \in S. The same argument can be made for \mathbf{p}_y (where its search space is the y component of all points in S).

The solution as a median

Let \mathbf{p}_x be the x value that minimizes the distances on the x axis. Its score \alpha is computed as given by (I). If we were to pick up the left neighbor of \mathbf{p}_x, the new score would be:

\alpha'_l=\alpha-k\Delta x_l + (l+1)\Delta x_l=\alpha-(k-l-1)\Delta x_l.

Since \alpha'_l>\alpha (otherwise, \mathbf{p}_x wouldn’t be a solution), we can say that:

\alpha-(k-l-1)\Delta x_l>\alpha,\\(k-l-1)\Delta x_l<0,\\k<l+1,\\k \leq l. (II)

This tells us that the number of points to the left of \mathbf{p} should not be larger than the number of points to its right. Another relation between k and l can be found by analyzing the score \alpha'_r obtained by picking the point to the right of \mathbf{p}:

\alpha'_r=\alpha-l\Delta x_r + (k+1)\Delta x_r=\alpha-(l-k-1)\Delta x_r.

In this case, \alpha'_r \geq \alpha. By exploring this relation:

\alpha-(l-k-1)\Delta x_r \geq \alpha,\\ (l-k-1)\Delta x_r \leq 0,\\ k \geq l-1. (III)

By combining (II) and (III), we have that l-1 \leq k \leq l. Since n = k + l + 1 (counting the point located at \mathbf{p}_x itself), if the total number of points n is odd then k=l; otherwise |l-k| would have to be greater than 1 to preserve parity. On the other hand, if n is even, then k can't be equal to l. Therefore, k=l-1.

In general terms, if you have a sorted collection of x coordinates in X, the coordinate \mathbf{p}_x that gives the best score α is given by

\mathbf{p}_x=X[{\lfloor\frac{n-1}{2}\rfloor}],

assuming 0-based indexing. This is essentially the (lower) median of the collection X. Once again, the same argument also applies to the y coordinates.

Final thoughts

We conclude that, in order to solve the Lunch in Grid City problem, we have to:

  • Create two different vectors, one with all x coordinates and another with all y coordinates of the given points in S (referred to as the list of friends in the problem description);
  • Sort each vector;
  • Return \langle X[{\lfloor\frac{n-1}{2}\rfloor}], Y[{\lfloor\frac{n-1}{2}\rfloor}] \rangle .

This solution has complexity O(n\log n), where the bottleneck is the sorting algorithm.
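A minimal Python sketch of this recipe is shown below. It only covers the core computation; the actual submission still has to parse the judge's input format and print the answer in the format the problem expects, which I leave out here:

def optimal_meeting_point(points):
    # Returns the grid coordinate minimizing the total Manhattan distance
    # to all points: the (lower) median of each coordinate, computed
    # independently. `points` is a list of (x, y) pairs.
    xs = sorted(x for x, _ in points)
    ys = sorted(y for _, y in points)
    n = len(points)
    return xs[(n - 1) // 2], ys[(n - 1) // 2]

# Example: the best meeting point for (1, 1), (3, 2), (2, 5) is (2, 2)
print(optimal_meeting_point([(1, 1), (3, 2), (2, 5)]))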

Loading a skeletal animation? Linearize it!

When I first studied the principles behind skeletal animations, I decided to implement it by interpolating between poses on CPU and by having the skinning on the vertex shader for performance reasons. I was thrilled to see some computational weight being transferred from CPU to GPU, where it could be handled with heavy parallelization.

As performance was my main concern, I also tried to squeeze the most out of my CPU in the pose interpolation process by avoiding conditional logic (which, in turn, avoids branch mispredictions), interpolating quaternions with LERP instead of SLERP (the difference in the animation is not noticeable if you have a large number of samples), normalizing vectors with the invsqrt procedure and using function pointers to discriminate bones that require interpolation from the ones that do not. Eventually, I realized the last idea was not such a good one: although a procedure call may be decoded in the fetch pipeline stage, an indirect function call depends on an address that might not be available in the cache, which could cause the pipeline to stall for a (very slow) memory fetch.

When I wrote my first implementation, I found out that there was a myriad of possible interpolation methods, chosen by the designer in the modelling software: linear, constant, Bézier curves and so on. It didn’t sound like a good idea to implement all those methods, as this would clutter my code with features that fit very well in a large game engine, but not in my time constrained demo. Also, implementing some of these techniques would defeat my purpose of having a blazing fast skeletal animation demo.

The simpler, the better

Facing this issue, I decided to approximate whatever curve the designer used to animate the model by a series of line segments with equally spaced joints. This way, I would only have to implement a linear interpolation method and, at the same time, would be capable of loading and animating any model.

[Figure: an arbitrary animation curve approximated by equally spaced line segments]

As I was using Autodesk’s FBX SDK, sampling the interpolation function was straightforward, since it provides a function to sample poses from each bone. My core sampling function (which only had to be run once, after loading the model) looks like this:

for (int i = 0; i < numSamples; ++i) {
    // Fetches keyframe
    Eigen::Map<Matrix4d> pose_double((double*)
                         evaluator->GetNodeLocalTransform
                                      (node, sampleTime));
    Matrix4f pose = pose_double.cast<float>();

    // Grabs translation
    Vector3f translation = pose.block<3,1>(0,3);

    // Grabs scale
    Vector3f scale(pose.block<3,1>(0,0).norm(),
                    pose.block<3,1>(0,1).norm(),
                    pose.block<3,1>(0,2).norm());

    // Grabs rotation
    float quat_data[4];
    Map<Quaternionf> rot_quat(quat_data);
    Map<Vector4f> rot_vec4(quat_data);
    rot_quat = Quaternionf(pose.block<3,3>(0,0));

    // Pushes key frames and their corresponding time
    translations.push_back(translation);
    scales.push_back(scale);
    attitudes.push_back(rot_vec4);

    sampleTime += skipTime;
}

Interpolating between keyframes

After transforming the animation function of all bones into a sequence of linear segments, your interpolation code can be greatly simplified. This is what my implementation looks like:

// Compute current and next keyframe indices
int current_index = int(t*framerate);
int next_index = (current_index+1)%translations.size();

// Compute interpolation factor
float alpha = t*framerate-float(current_index);

// Interpolate translation
Vector3f translation = translations[current_index]+
                       (translations[next_index]-
                        translations[current_index])*alpha;

// Interpolate rotation
Vector4f attitude = attitudes[current_index]+
                    (attitudes[next_index]-
                     attitudes[current_index])*alpha;
attitude = attitude*invsqrt(attitude[0]*attitude[0] +
                            attitude[1]*attitude[1] +
                            attitude[2]*attitude[2] +
                            attitude[3]*attitude[3]);

// Interpolate scale
Vector3f scale = scales[current_index]+
                 (scales[next_index]-
                  scales[current_index])*alpha;

// Return matrix
output << 0,0,0, translation[0],
          0,0,0, translation[1],
          0,0,0, translation[2],
          0,0,0, 1;

Matrix3f rs = Map<Quaternionf>(attitude.data()).matrix();
rs.col(0) *= scale[0];
rs.col(1) *= scale[1];
rs.col(2) *= scale[2];
output.block<3,3>(0,0) = rs;

Obviously, this ignores a large amount of concepts (some of which are FBX-specific), but none fall within the scope of this article.

Final thoughts

It turns out that loading an FBX model takes quite some time, and even more so if you linearize your animations. This may also be true for many other model formats out there, including .blend. That’s one of the reasons why it’s good to have a custom model format to ship with your games, designed in such a way that you don’t need to pre-process your model before it’s ready to be rendered.

One thing I plan to study deeper is how this linearization process plays with animation blending. I currently expect this to be straightforward, but some hardcore optimization can make it a little bit tricky.

Finally, in this post I glossed over two very important libraries, Eigen and the FBX SDK. In the future, I will be writing more detailed posts about each of them.

Superior texture filter for dynamic pixel art upscaling

This article is an extension of a previously discussed topic: how to perform large texture magnification for pixelated games without aliasing between texels. My first post described a method to achieve this by assuming that your polygons remain approximately the same size on screen the whole time.

From left to right, the knight is rendered at: close distance with 15k pixels²/texel; medium distance with 49 pixels²/texel; far distance with 9 pixels²/texel. In all frames, α=0.1.

Constant pixels with variable α

The ideal value for the α parameter will depend on how many pixels are filling each texel after the texture is rendered. To have a smooth transition between texels without blurring them under extreme magnifications, nor aliasing them under small magnifications, we can keep the number of pixels at the edge of each texel constant, regardless of how far away the polygon is.


The derivative \frac{wdu}{dx} gives how many texels there are for each screen pixel, which is smaller than 1.0 when the texture is magnified. Therefore, if you want to have k pixels at the edge of your texels, your α must be k\frac{wdu}{dx}. Since we also have the v texture coordinate, the same will apply to it, which means α will be a 2D vector:

k\langle\frac{wdu}{dx},\frac{hdv}{dy}\rangle

This should give you an antialiased magnification that works independently of the polygon position, as the figure below shows:

[Figure: From left to right, the knight is rendered at: close distance with 15k pixels²/texel; medium distance with 49 pixels²/texel; far distance with 9 pixels²/texel. In all frames, k=0.7.]

WebGL implementation

The derivatives with respect to screen space can be calculated by functions dFdx and dFdy, but in OpenGL ES/WebGL they require the extension GL_OES_standard_derivatives. As of Three.js r84, this extension can be activated by setting the derivatives property of your shader material object to true.

Our vertex shader remains the same as in our previous post:

varying vec2 vUv;

void main()
{
  const float w = 32.0;
  const float h = 64.0;

  vUv = uv * vec2(w, h);
  gl_Position = projectionMatrix * modelViewMatrix *
                vec4(position, 1.0 );
}

The few modifications will come in the fragment shader:

precision highp float;

varying vec2 vUv;
uniform sampler2D texSampler;

void main(void) {
  const float w = 32.0;
  const float h = 64.0;

  // I chose this k because it looked nice in
  // my demo
  vec2 alpha = 0.7 * vec2(dFdx(vUv.x),
                          dFdy(vUv.y));
  vec2 x = fract(vUv);
  vec2 x_ = clamp(0.5 / alpha * x, 0.0, 0.5) +
            clamp(0.5 / alpha * (x - 1.0) + 0.5,
                  0.0, 0.5);

  vec2 texCoord = (floor(vUv) + x_) / vec2(w, h);
  gl_FragColor = texture2D(texSampler, texCoord);
}

To see the WebGL demo in action, click here.

Final thoughts

We have presented a more general approach to the manual texture filter previously discussed. Despite its advantage of automatically determining the best α for the polygon being rendered, it depends on an OpenGL ES extension that might not be available on some devices. Add to that the performance impact of the dFdx/dFdy functions, and the previous method will certainly be preferable if your polygons don't change size during your game.

It’s also important to note that, under small magnifications, the texture can look a bit blurry depending on the chosen value of k.

Manual texture filtering for pixelated games in WebGL

Left: linear filtering; middle: nearest-neighbor sampling; right: custom texture filter. The knight is part of the Wesnoth Frankenpack. Click here to see the texture filter demo in action.

If you’re writing a pixelated game that performs large magnification of textures, you’re probably using the nearest texture filter, so your game will look like the knight in the middle instead of the one on the left.

The problem with nearest texel sampling is that it’s susceptible to aliasing if the texels are not aligned with the screen pixels, which can happen if you apply transformations such as rotation and shearing to your textured polygons. Ideally, we would like to have smooth transitions between neighboring texels in the final image, like the knight on the right in the image above.

Manual texture filtering

One way to achieve this result is to perform linear interpolation between texels at the edges of each texel in the fragment shader, while sampling the nearest texel everywhere else. A simple way to do this is to enable WebGL's linear filtering and play with the UV coordinates so that the graphics card performs the actual interpolation between texels for you.

Texel grid.

We know that the texture coordinate t' in a nearest filter can be calculated by:

\textbf{t}'=\frac{\left \lfloor{\langle w,h\rangle\textbf{t}}\right \rfloor+\langle0.5,0.5\rangle}{\langle w,h\rangle},

where w and h are the texture width and height, respectively. The offset makes our fragment shader sample at the center of each texel, which is important since we have enabled linear filtering. In order to have smooth transitions between texels, this offset should be replaced by a function that increases linearly from 0 to 0.5 at one margin of the texel, remains constant at 0.5 over its “blocky” region, and then increases to 1.0 at the opposite margin, like this:

Plot of the offset function: linear ramps of width α at each texel margin, with a constant value of 0.5 in between.

By doing this with the UV coordinates, the video card will automatically interpolate your texels whenever they are sampled at their margins, effectively producing an anti-aliased blocky effect like the knight shown above.
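For reference, the nearest-filter equation above maps directly to GLSL. This is only a sketch to connect the formula to code (nearestTexCoord is an illustrative name; the full shader at the end of the post inlines the smoothed variant derived next instead):

// Snap the coordinate to the centre of its texel before sampling, which
// reproduces nearest-neighbour sampling on a linearly-filtered texture.
vec2 nearestTexCoord(vec2 t, vec2 texSize)
{
  return (floor(t * texSize) + 0.5) / texSize;
}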

Closed formula for the offset function

The offset function displayed in the plot above could easily be implemented with a bunch of conditional logic, but I personally steer clear of conditional statements in GLSL programs for performance and legibility reasons. Having said that, our offset function can be formulated as the sum of two clamped linear functions, illustrated below:

Derivation of the offset function as the sum of two clamped linear functions.

Here, x is the fractional part of the texture coordinate u after it is scaled from [0, 1] to [0, w]; that is, x=\text{fract}(uw). The same logic also applies to the texture coordinate v, which leads to the following formula:

\textbf{x}_{uv}=\text{fract}(\textbf{t}\langle w,h\rangle)

Meaning of the α parameter

The value of α determines how smooth the transition between texels will be, and it must lie in the open interval ]0, 0.5[. As α approaches 0, the transition between texels becomes crisp, since there is no room left for linear interpolation; the result is equivalent to the nearest filter. As α approaches 0.5, every coordinate inside the texel becomes subject to linear interpolation, which is equivalent to just using a linear filter. The ideal value for α really depends on how stretched your textures will be: the larger the stretching, the smaller your α should be.
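To make the role of α explicit, the offset function sketched in the plot above can be written per component as:

x'(x)=\begin{cases}\frac{x}{2\alpha}, & 0\le x<\alpha\\ 0.5, & \alpha\le x\le 1-\alpha\\ \frac{x-1}{2\alpha}+1, & 1-\alpha<x\le 1\end{cases}

The two ramps have width α, so the smaller the α, the narrower the interpolated band at each texel border.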

Ideally, your program should automatically determine the best value of α given the depth of the fragment and your camera parameters (including the canvas size), but that’s something I’ll talk about in the future.

Putting it all together

The final equation, which gives us UV coordinates that smooth a magnified, pixelated texture, is as follows:

\textbf{t}'=\frac{\left \lfloor{\langle w,h\rangle\textbf{t}}\right \rfloor+\textbf{x}'_{uv}}{\langle w,h\rangle},

where \textbf{x}'_{uv}=\text{clamp}(\frac{\textbf{x}_{uv}}{2\alpha},0,0.5)+\text{clamp}(\frac{\textbf{x}_{uv}-\langle 1,1\rangle}{2\alpha}+\langle 0.5,0.5\rangle,0,0.5).
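This closed formula maps almost line for line to GLSL. A small sketch (texelOffset is an illustrative helper; the shaders below inline the same expression instead of calling a function):

// Sum of two clamped ramps: rises 0 -> 0.5 over the first alpha of the texel,
// stays at 0.5 in the middle, and rises 0.5 -> 1.0 over the last alpha.
vec2 texelOffset(vec2 x, vec2 alpha)
{
  return clamp(x / (2.0 * alpha), 0.0, 0.5) +
         clamp((x - 1.0) / (2.0 * alpha) + 0.5, 0.0, 0.5);
}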

The term \langle w,h\rangle\textbf{t} can be computed in the vertex shader and passed down as a varying; note that it still has to be interpolated per fragment, so OpenGL's flat qualifier is not applicable here. In WebGL GLSL, the vertex and fragment shaders for our filter are as follows:

varying vec2 vUv;

void main()
{
  const float w = 32.0;
  const float h = 64.0;

  vUv = uv * vec2(w, h);
  gl_Position = projectionMatrix * modelViewMatrix *
                vec4(position, 1.0 );
}

And here is the fragment shader:

precision highp float;

varying vec2 vUv;
uniform sampler2D texSampler;

void main(void) {
  const float w = 32.0;
  const float h = 64.0;

  // I chose this alpha because it looked nice in
  // my demo
  vec2 alpha = vec2(0.07);
  vec2 x = fract(vUv);
  vec2 x_ = clamp(0.5 / alpha * x, 0.0, 0.5) +
            clamp(0.5 / alpha * (x - 1.0) + 0.5,
                  0.0, 0.5);

  vec2 texCoord = (floor(vUv) + x_) / vec2(w, h);
  gl_FragColor = texture2D(texSampler, texCoord);
}

Notice that some attributes and uniforms (position, uv, projectionMatrix, modelViewMatrix) are not declared in my code. That's because I used Three.js to make my demo, and it injects those declarations automatically.
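For reference, these are the declarations Three.js prepends for a ShaderMaterial that the shaders above rely on; outside Three.js you would have to declare (and feed) them yourself. A partial sketch, listing only the ones used here:

attribute vec3 position;       // vertex position, normally fed by the engine
attribute vec2 uv;             // texture coordinate of the vertex
uniform mat4 modelViewMatrix;  // model-view transform of the object
uniform mat4 projectionMatrix; // camera projection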

Final thoughts

The discussed method is indicated if your scene matches the following two criteria:

  • You perform large magnification of your texture: The knight texture used in the header of this article has 32×64 pixels, but was scaled up to cover a rectangle of size 264×413 pixels (before the rotation was applied). That’s the equivalent of taking each texel and making it more than 8 times bigger. In cases like this, linear filtering will just make the texture blurry, and nearest filtering might introduce unwanted aliasing.
  • Your objects undergo rotation or shearing: There’s a reason why I rotated the knight shown in the header of this article: if the texels were aligned with the screen pixels, then there would be no aliasing at all and a nearest filter would suffice for my purpose.

Update

This discussion has been extended here, where I talk about a way to automatically compute the best α independently of the polygon position.