Experiments in Optimization

Overview

At SightMachine we are currently trying to replicate the functionality of SimpleCV in a JavaScript/CoffeeScript library we call SimpleCV.js. We are slowly attempting to reproduce all of the functionality in the python version of SimpleCV in the JavaScript version of the library. This is great except for the fact that SimpleCV builds upon the already existing OpenCV library. OpenCV has been under development for over ten years and many of the core C/C++ algorithms have been optimized ad nauseum. In SimpleCV we take the raw python bindings provided by OpenCV, clean them up, and add our own secret sauce that makes them easy to use. Unfortunately, in CoffeeScript/JavaScript land we can't make use of these routines. There are no quick and easy was to cross compile C++ to JavaScript or CoffeeScript. To make matters worse many of the C++ routines in OpenCV are highly templated; this means it is non-trivial to take a chunk of working C++ image processing source code and modify to look like valid CoffeeScript. It would appear that our only option is to re-invent the wheel to create a lot of the functionality we want in SimpleCV.js.

Our main computer vision developers are old C/C++/Python hacks, and the transition to CoffeeScript has been a great learning experience. Today we decided to dive into some code optimization to see how we should best approach improving the run time of SimpleCV.js. This is not a definitive and final optimization as our depth of knowledge on CoffeeScript/JavaScript are still rather shallow, instead this was primarily an instructive effort that we hope could spark discussion among the community.

The Problem

For many computer vision applications you want to work with a neighborhood of pixels (e.g. the pixels above, below, left, and right of a given pixel), however to avoid memory fragmentation and improve computational efficiency we represent an image as a one dimensional array. Specifically, in SimpleCV.js we represent images as one dimensional arrays of Uint8 values where each pixel is stored sequentially in red, green, blue, alpha order. As we move from the first item in the array to the last we work through the image from left to right, top to bottom. So for an image array, let's call it IMG, IMG[0] is the red value of the pixel at (0,0), IMG[1] is the green value of the pixel at (0,0), IMG[4] is the red value at (1,0), and IMG[IMG.length-1] and the alpha value for the pixel at (width-1,height-1). While this approach is great for storing data it makes it a pain to work with a neighborhood of pixels.

In our first pass at SimpleCV.js we wanted to add a morphological dilation operation. Morphological dilation is usually used on black and white (binary) images to fill in little holes and connect regions. Wikipedia gives an overly complicated description of a morphological dilation operation. The simple explanation of dilation is that for each pixel we look at the pixels above, below, left, right, and kitty corner to the pixel. In the case of a black and white binary image if any of the pixels are white, the algorithm sets the pixel to white, otherwise it sets it to black. In the case of RGB images you take the maximum value in the window for each color channel. The image below shows an input image, the thresholded binary image, and then the application of a single dilation operation.

Dilation

The SimpleCV python code to do this is as follows:

cam = Camera()
img = cam.getImage()
binary = img.binarize()
dilate = binary.dilate()
result = img.sideBySide(binary)
result = result.sideBySide(dilate)
result.show()

Here is what we would like it to look like in CoffeeScript

Camera = new require('models/Camera')
CVImage = require('models/Image')
Display = require('models/Display')
display = new Display(".livePreview")
size = display.resolution()
console.log size
Camera.beginStream((i)=>
  i = i.resize 320,240
  i = i.grayscale().binarize().dilate()
  i.show display
)

Here is a screenshot of our demo page running our test code, feel free to try it yourself:

Running code

Getting it to work

Our first pass of replicating a dilate function in SimpleCV.js just tried to get the functionality right. Here is the commit that has all of our examples. We'll probably nuke it to clean up the library. On our first pass we didn't pay much attention to the run time (optimize last right?). Also note that this is just hacked code, not production, some features aren't implemented (e.g. working on both grayscale and color images). Our first pass naive code looked like this:

  rgbxy2idx:(x,y,w,h) =>
    bpp = 4
    return (bpp*y*w)+(bpp*x)

  ptDilate:(x,y,img,w,h,offset=0)=>
    a = img[offset+@rgbxy2idx(x+1, y-1,w,h)]
    b = img[offset+@rgbxy2idx(x,   y-1,w,h)]
    c = img[offset+@rgbxy2idx(x-1, y-1,w,h)]
    d = img[offset+@rgbxy2idx(x+1, y,  w,h)]
    e = img[offset+@rgbxy2idx(x,   y,  w,h)]
    f = img[offset+@rgbxy2idx(x-1, y,  w,h)]
    g = img[offset+@rgbxy2idx(x+1, y+1,w,h)]
    i = img[offset+@rgbxy2idx(x,   y+1,w,h)]
    j = img[offset+@rgbxy2idx(x-1, y+1,w,h)]
    r = Math.max(a,b,c,d,e,f,g,i,j)
    return r

  dilate:(iterations=1,grayscale=false)=>
    if( iterations < 1 )
      iterations = 1
    border = 1
    w = @width+(2*border)
    h = @height+(2*border) 
    # We create a one pixel border to make our operation uniform across the image.
    out = @cloneWithBorder(border)
    sz = out.length
    temp = @cloneWithBorder(border)
    
    istart = border
    istop = border+@width-1
    jstart = border
    jstop = border+@height-1
    # now go through the image, row first
    for k in [1..iterations]
      for j in [jstart..jstop] #Y
        for i in [istart..istop] #X
          out[@rgbxy2idx(i,j,w,h)]=@ptDilate(i,j,temp,w,h) # get the dilation at the point and store it
          out[1+@rgbxy2idx(i,j,w,h)]=@ptDilate(i,j,temp,w,hoffset=1) # do this for every channel
          out[2+@rgbxy2idx(i,j,w,h)]=@ptDilate(i,j,temp,w,h,offset=2) # yes this is for the RGB case
      #temp = out
    return @cropBorderCopy(out,border) # get rid of the border and return the results

First Pass Improvements

Using the chrome JavaScript profiler/debugger and our [example application] (http://demo.simplecv.org/), we noticed that this code was as slow as molasses in January. What could we do to speed it up? Do we need all the overhead of all of those function calls? The functions ptDilate and rgbxy2idx make things easier to read and maintain, but they get called an awful lot. They create a lot of stack frames that need to be allocated and then deleted. Let's see what removing them does for us. Here is the revised code.

  dilate2:(iterations=1,grayscale=false)=>
    if( iterations < 1 )
      iterations = 1
    border = 1
    w = @width+(2*border)
    h = @height+(2*border) 
    out = @cloneWithBorder(border)
    sz = out.length
    temp = @cloneWithBorder(border)
    
    istart = border
    istop = border+@width-1
    jstart = border
    jstop = border+@height-1

    bpp = 4

    for k in [1..iterations]
      for j in [jstart..jstop] #Y
        for i in [istart..istop] #X
          for offset in [0..2]
            a = temp[offset+(bpp*(j-1)*w)+(bpp*(i+1))]
            b = temp[offset+(bpp*(j-1)*w)+(bpp*(i  ))]
            c = temp[offset+(bpp*(j-1)*w)+(bpp*(i-1))]
            d = temp[offset+(bpp*(j)*w)+  (bpp*(i+1))]
            e = temp[offset+(bpp*(j)*w)+  (bpp*(i  ))]
            f = temp[offset+(bpp*(j)*w)+  (bpp*(i-1))]
            g = temp[offset+(bpp*(j+1)*w)+(bpp*(i+1))]
            h = temp[offset+(bpp*(j+1)*w)+(bpp*(i  ))]
            l = temp[offset+(bpp*(j+1)*w)+(bpp*(i-1))]
            r = Math.max(a,b,c,d,e,f,g,h,l)
            out[offset+(bpp*j*w)+(bpp*i)] = r
        #temp = out
    return @cropBorderCopy(out,border)

Running this code versus the original version showed a noticeable increase in performance. We are still getting familiar with Chrome's javascript profiler but from what we can tell it looks like the dilate2 operation is almost 3 times faster! Here is our profiler output for a short run of a few different code snippets.

Profiler output

But wait, there's more!

So our second attempt was a lot better, but we noticed that we are looking up a lot of values repeatedly. For the area to the left of the target pixel, that is the pixels NW,N,W,SW,S of our pixel, we look them up three times in rapid succession. What happens if we modify our code to avoid these look ups? We will call this the student body left approach. Basically for the first pixel on a row we look up all the values, for each successive pixel on the row we shift over the values we found on the last iteration and just look up the next three pixels on the right. Here is the code for that:

   dilate3:(iterations=1,grayscale=false)=>
    if( iterations < 1 )
      iterations = 1
    border = 1
    w = @width+(2*border)
    h = @height+(2*border) 
    out = @cloneWithBorder(border)
    sz = out.length
    temp = @cloneWithBorder(border)
    
    istart = border
    istop = border+@width-1
    jstart = border
    jstop = border+@height-1

    bpp = 4
    a = 0; b = 0; c = 0; d = 0; e = 0; f = 0; g = 0; h = 0; l = 0; r = 0
    for k in [1..iterations]
      for offset in [0..2]
        for j in [jstart..jstop] #Y
          for i in [istart..istop] #X
            if i == istart 
              a = temp[offset+(bpp*(j-1)*w)+(bpp*(i-1))]
              b = temp[offset+(bpp*(j)*w)+  (bpp*(i-1))]
              c = temp[offset+(bpp*(j+1)*w)+(bpp*(i-1))]
              
              d = temp[offset+(bpp*(j-1)*w)+(bpp*(i  ))]
              e = temp[offset+(bpp*(j)*w)+  (bpp*(i  ))]
              f = temp[offset+(bpp*(j+1)*w)+(bpp*(i  ))]
                         
              g = temp[offset+(bpp*(j-1)*w)+(bpp*(i+1))]
              h = temp[offset+(bpp*(j)*w)+  (bpp*(i+1))]
              l = temp[offset+(bpp*(j+1)*w)+(bpp*(i+1))]

              r = Math.max(a,b,c,d,e,f,g,h,l)
              out[offset+(bpp*j*w)+(bpp*i)] = r
            else
              # student body left
              a = d
              b = e
              c = f
              d = g
              e = h
              f = l
              g = temp[offset+(bpp*(j-1)*w)+(bpp*(i+1))]
              h = temp[offset+(bpp*(j)*w)+  (bpp*(i+1))]
              l = temp[offset+(bpp*(j+1)*w)+(bpp*(i+1))]
              r = Math.max(a,b,c,d,e,f,g,h,l)
              out[offset+(bpp*j*w)+(bpp*i)] = r             
        #temp = out
    return @cropBorderCopy(out,border)

Hrm. Sad Trombone. Our run time for dilate3 was only slightly faster (roughly 4% faster than dilate2). Looks like JavaScript probably does a pretty good job of caching our array in memory and therefore memory look-ups are trivial. We peeked at the JavaScript that came out of our CoffeeScript code. Here is what it looked like:

     Image.prototype.dilate3 = function(iterations, grayscale) {
        var a, b, border, bpp, c, d, e, f, g, h, i, istart, istop, j, jstart, jstop, k, l, offset, out, r, sz, temp, w;
        if (iterations == null) iterations = 1;
        if (grayscale == null) grayscale = false;
        if (iterations < 1) iterations = 1;
        border = 1;
        w = this.width + (2 * border);
        h = this.height + (2 * border);
        out = this.cloneWithBorder(border);
        sz = out.length;
        temp = this.cloneWithBorder(border);
        istart = border;
        istop = border + this.width - 1;
        jstart = border;
        jstop = border + this.height - 1;
        bpp = 4;
        a = 0; b = 0; c = 0; d = 0; e = 0; f = 0; g = 0; h = 0; l = 0; r = 0
        for (k = 1; 1 <= iterations ? k <= iterations : k >= iterations; 1 <= iterations ? k++ : k--) { 
          for (offset = 0; offset <= 2; offset++) {
            for (j = jstart; jstart <= jstop ? j <= jstop : j >= jstop; jstart <= jstop ? j++ : j--) {
              for (i = istart; istart <= istop ? i <= istop : i >= istop; istart <= istop ? i++ : i--) {
               // 8>< ----------------SNIP -------------------
               // What is this nonsense JavaScript? 
               // these should just be vanilla for loops right?
              }
            }
          }
        }
        return this.cropBorderCopy(out, border);
      };

When our CoffeeScript gets translated into JavaScript some of the for loops look a bit wonky. All of those conditional checks and updates might be costing us precious cycles. We modified our code to see if we could cajole CoffeeScript into producing something more human and VM friendly. We also noticed that we did the w*bpp calculation quite a bit. We moved that to a single variable called bppw. The result looks like this:

  dilate4:(iterations=1,grayscale=false)=>
    if( iterations < 1 )
      iterations = 1
    border = 1
    w = @width+(2*border)
    h = @height+(2*border) 
    out = @cloneWithBorder(border)
    sz = out.length
    temp = @cloneWithBorder(border)
    
    istart = border
    istop = border+@width-1
    jstart = border
    jstop = border+@height-1

    bpp = 4
    a = 0; b = 0; c = 0; d = 0; e = 0; f = 0; g = 0; h = 0; l = 0; r = 0
    bppw = bpp*w
    k = 1
    while k <= iterations
      offset = 0
      while offset <= 2
        j = jstart
        while j < jstop # y
          i = istart
          while i < istop # x

            if i == istart 
              a = temp[offset+(bppw*(j-1))+(bpp*(i-1))]
              b = temp[offset+(bppw*(j))+  (bpp*(i-1))]
              c = temp[offset+(bppw*(j+1))+(bpp*(i-1))]              
              d = temp[offset+(bppw*(j-1))+(bpp*(i  ))]
              e = temp[offset+(bppw*(j))+  (bpp*(i  ))]
              f = temp[offset+(bppw*(j+1))+(bpp*(i  ))]
                         
              g = temp[offset+(bppw*(j-1))+(bpp*(i+1))]
              h = temp[offset+(bppw*(j))+  (bpp*(i+1))]
              l = temp[offset+(bppw*(j+1))+(bpp*(i+1))]

              r = Math.max(a,b,c,d,e,f,g,h,l)
              out[offset+(bppw*j)+(bpp*i)] = r
            else
              # student body left
              a = d
              b = e
              c = f
              d = g
              e = h
              f = l
              g = temp[offset+(bppw*(j-1))+(bpp*(i+1))]
              h = temp[offset+(bppw*(j))+  (bpp*(i+1))]
              l = temp[offset+(bppw*(j+1))+(bpp*(i+1))]
              r = Math.max(a,b,c,d,e,f,g,h,l)
              out[offset+(bppw*j)+(bpp*i)] = r
            i = i + 1
          j = j + 1
        offset = offset + 1
      k = k + 1
        #temp = out
    return @cropBorderCopy(out,border)

The resulting java script looked like this:

        while (k <= iterations) { // I guess this is better
          offset = 0;
          while (offset <= 2) { // not really all that readable
            j = jstart;
            while (j < jstop) {
              i = istart;
              while (i < istop) {
                  // 8>< ----------------SNIP -------------------
                i = i + 1;
              }
              j = j + 1;
            }
            offset = offset + 1;
          }
          k = k + 1;
        }

Sad Trombone. This code didn't get us that much of a speedup either. A measly 0.91% speedup compared to dilate3. We took another look at the code and noted that in reality we were passing over the image array three times, one time for each of the r, g, and b pixel values. What if we unrolled this loop and did all three channels at the same time? Now this is a completely naive implementation hacked together using the miracle of find and replace:

  dilate5:(iterations=1,grayscale=false)=>
    # 8>< ----------------SNIP -------------------  
    a0 = 0; b0 = 0; c0 = 0; d0 = 0; e0 = 0; f0 = 0; g0 = 0; h0 = 0; l0 = 0; r0 = 0;
    a1 = 0; b1 = 0; c1 = 0; d1 = 0; e1 = 0; f1 = 0; g1 = 0; h1 = 0; l1 = 0; r1 = 0;
    a2 = 0; b2 = 0; c2 = 0; d2 = 0; e2 = 0; f2 = 0; g2 = 0; h2 = 0; l2 = 0; r2 = 0;

    bppw = bpp*w
    k = 1
    while k <= iterations
      j = jstart
      while j < jstop # y
        i = istart
        while i < istop # x
          if i == istart 
            a0 = temp[0+(bppw*(j-1))+(bpp*(i-1))]
            a1 = temp[1+(bppw*(j-1))+(bpp*(i-1))]
            a2 = temp[2+(bppw*(j-1))+(bpp*(i-1))]
            
            b0 = temp[0+(bppw*(j))+  (bpp*(i-1))]
            b1 = temp[1+(bppw*(j))+  (bpp*(i-1))]
            b2 = temp[2+(bppw*(j))+  (bpp*(i-1))]
            
            c0 = temp[0+(bppw*(j+1))+(bpp*(i-1))]
            c1 = temp[1+(bppw*(j+1))+(bpp*(i-1))]
            c2 = temp[2+(bppw*(j+1))+(bpp*(i-1))]

            d0 = temp[0+(bppw*(j-1))+(bpp*(i))]
            d1 = temp[1+(bppw*(j-1))+(bpp*(i))]
            d2 = temp[2+(bppw*(j-1))+(bpp*(i))]
            
            e0 = temp[0+(bppw*(j))+  (bpp*(i))]
            e1 = temp[1+(bppw*(j))+  (bpp*(i))]
            e2 = temp[2+(bppw*(j))+  (bpp*(i))]
            
            f0 = temp[0+(bppw*(j+1))+(bpp*(i))]
            f1 = temp[1+(bppw*(j+1))+(bpp*(i))]
            f2 = temp[2+(bppw*(j+1))+(bpp*(i))]

            g0 = temp[0+(bppw*(j-1))+(bpp*(i+1))]
            g1 = temp[1+(bppw*(j-1))+(bpp*(i+1))]
            g2 = temp[2+(bppw*(j-1))+(bpp*(i+1))]
            
            h0 = temp[0+(bppw*(j))+  (bpp*(i+1))]
            h1 = temp[1+(bppw*(j))+  (bpp*(i+1))]
            h2 = temp[2+(bppw*(j))+  (bpp*(i+1))]
            
            l0 = temp[0+(bppw*(j+1))+(bpp*(i+1))]
            l1 = temp[1+(bppw*(j+1))+(bpp*(i+1))]
            l2 = temp[2+(bppw*(j+1))+(bpp*(i+1))]
  
            r0 = Math.max(a0,b0,c0,d0,e0,f0,g0,h0,l0)
            r1 = Math.max(a1,b1,c1,d1,e1,f1,g1,h1,l1)
            r2 = Math.max(a2,b2,c2,d2,e2,f2,g2,h2,l2)
            out[0+(bppw*j)+(bpp*i)] = r0
            out[1+(bppw*j)+(bpp*i)] = r1
            out[2+(bppw*j)+(bpp*i)] = r2
          else
            # student body left
            a0 = d0
            b0 = e0
            c0 = f0
            d0 = g0
            e0 = h0
            f0 = l0
            g0 = temp[0+(bppw*(j-1))+(bpp*(i+1))]
            h0 = temp[0+(bppw*(j))+  (bpp*(i+1))]
            l0 = temp[0+(bppw*(j+1))+(bpp*(i+1))]
            r0 = Math.max(a0,b0,c0,d0,e0,f0,g0,h0,l0)
            out[0+(bppw*j)+(bpp*i)] = r0

            a1 = d1
            b1 = e1
            c1 = f1
            d1 = g1
            e1 = h1
            f1 = l1
            g1 = temp[1+(bppw*(j-1))+(bpp*(i+1))]
            h1 = temp[1+(bppw*(j))+  (bpp*(i+1))]
            l1 = temp[1+(bppw*(j+1))+(bpp*(i+1))]
            r1 = Math.max(a1,b1,c1,d1,e1,f1,g1,h1,l1)
            out[1+(bppw*j)+(bpp*i)] = r1

            a2 = d2
            b2 = e2
            c2 = f2
            d2 = g2
            e2 = h2
            f2 = l2
            g2 = temp[2+(bppw*(j-2))+(bpp*(i+2))]
            h2 = temp[2+(bppw*(j))+  (bpp*(i+2))]
            l2 = temp[2+(bppw*(j+2))+(bpp*(i+2))]
            r2 = Math.max(a2,b2,c2,d2,e2,f2,g2,h2,l2)
            out[2+(bppw*j)+(bpp*i)] = r2
                                                
          i = i + 1
        j = j + 1
      k = k + 1
        #temp = out
    return @cropBorderCopy(out,border)

Sad Trombone!!!1This actually made things worse! We actually slowed down the performance by close to 30%. Perhaps juggling all of those values local variables around in the virtual machine causes cache misses. This is probably all for the best as that code is ugly as sin.

Results Summary

Just for completeness here our raw values from the chrome javascript profiler. We calculated the speed increase between two algorithms a and b as ((a-b)/b)*100. We generally concluded that eliminating all but the most necessary function calls greatly increased performance, however, small tweaks like making sure that trying to cache re-used variables only eek out a bit more performance. All the other "optimizations" seems to have either little effect or a negative effect.

Method	Self (s)	Total (s)	Speedup versus Naive Algorithm	Speedup Versus Prior
dilate	1.36	26.78	N/A	N/A
dilate2	5.33	6.9	288%	N/A
dilate3	5.06	6.62	304%	+4.23%
dilate4	4.90	6.56	308%	+0.91%
dilate5	8.10	9.73	175%	-32.58%

Notes on Methods

This is a completely naive attempt at optimization in JavaScript. The code here is just an experiment and not really meant for production. The comments are sparse, the variable names suck, and a lot of features are left un-finished. If you have an input on how to improve SimpleCV.js we would love to hear from you.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!