
Modis USA1 averages

16 October 2010

I was going to post about CUDA here, but I got on a bit of a yak shaving expedition and ended up solving a problem I run into every few months.

Sometimes I like to average big sets of similar images. Off the top of my head, I’ve averaged Dinosaur Comics and Tower Defense maps. Another common dataset is satellite images. (I even posted one a while ago.)

Here’s an excellent day (20 September 2010) from Modis Rapid Response’s USA1 subset:

[image: USA1 subset, 20 September 2010]
Here’s a more typical day (15 October 2010) with more clouds and a more obvious seam between satellite passes with different color balances:

[image: USA1 subset, 15 October 2010]
This isn’t science-quality data for most purposes, but it’s enough to play with. I downloaded an entire year of it (if playing along at home, check for recent leap days!):

$ curl -O "http://rapidfire.sci.gsfc.nasa.gov/subsets/?subset=USA1.2009[289-365].terra.2km.jpg" &
$ curl -O "http://rapidfire.sci.gsfc.nasa.gov/subsets/?subset=USA1.2010[001-288].terra.2km.jpg"

and tried to average them with ImageMagick, timing it because I know this usually takes a while and I’m still getting a feel for the speed of my relatively new computer:

$ time convert -average *.jpg average.png
convert -average *.jpg average.png 13.58s user 20.90s system 4% cpu 12:07.91 total

and, as you see, killed it after it had spent more than ten minutes RAM-bound and swapping heavily. This is a little odd when you think about it. There are 365 images × 650 × 750 pixels × 3 channels per pixel × 1 byte per channel = a little over half a gigabyte if stored uncompressed but efficiently. I had enough RAM available that even if it was storing each channel in 32 bits it should have been fine. Evidently it wasn’t.
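(Spelled out, with the dimensions and counts from the paragraph above, that back-of-the-envelope arithmetic is:

```python
# Uncompressed size of a year of USA1 images, stored as raw 8-bit RGB.
images = 365
width, height = 650, 750
channels = 3           # R, G, B
bytes_per_channel = 1  # 8 bits

total_bytes = images * width * height * channels * bytes_per_channel
print(total_bytes)                   # 533812500
print(total_bytes / float(2 ** 20))  # roughly 509 MiB
```

– a little over half a gigabyte, as claimed.)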

When this has come up before, I’ve usually averaged in a tree: averaging the images in groups of 10 (or groups whose sequence numbers end in the same numeral, etc.) into a smaller batch of partial averages, which can then be averaged themselves. There are two problems with this. First, for numbers of images not divisible by the group size, the average ends up slightly mis-weighted. Second, ImageMagick being so slow to average is a methodological bug that no one should have to put up with in the first place. There’s no reason to load every image into RAM before averaging – if you simply keep a running average in a reasonably capacious numeric type, then for n images of dimensions d, the RAM use is on the order of 2d instead of nd. This soon matters for non-small n and d! (You could keep a running sum with ordinary bignums and divide it at the end, but this is generally slower and only a little more accurate for plausible sequence lengths.) Because I’m going to want it again, and for anyone doing a web search for ImageMagick average slow, here’s some quick and dirty Python:

#!/usr/bin/env python2.7

'''
Average image files using PIL and NumPy.
Reads input images of identical size and bit depth. Writes a PNG.

Todo:
 + check whether output file exists (a likely mistake) and refuse to clobber
 + check that input image types (e.g. 8-bit RGB) are all the same
 + output optionally in same format as input
 + quantify the drift when using smaller floats
'''

from sys import argv, exit
import Image
from numpy import *

# float128 for more accuracy and less speed, float32 for vice-versa
avgtype = float64

# We can't init the running average until we know how big the input images are.
avg = array([])

n = float(len(argv) - 2)  # -1 for our filename, -1 for the output file

if len(argv) < 3:
    exit('Usage: <input image>* <output image>. Output is a PNG.')

for imgfile in argv[1:-1]:
    try:
        img = Image.open(imgfile)
    except:
        exit('Could not read "%s"!' % (imgfile))
    if avg.shape == (0,):
        # We're on the first image; let's set up the average:
        avg = asarray(img).copy()  # copy() to get its dimensions
        avg = avg.astype(avgtype)
        avg.fill(0.0)
    else:
        if img.size != (avg.shape[1], avg.shape[0]):  # PIL sizes are x,y; NumPy shapes are y,x
            exit('"%s" is not the same shape as the earlier images!' % (imgfile))
    avg = avg + asarray(img).astype(avgtype) / n

avgimg = Image.fromarray(avg.astype(uint8))
avgimg.save(argv[-1])

In simple tests on the USA1 images, the execution time does scale linearly with the length of the image sequence, with no swapping. It averages them all in 13 seconds wall time (10 s with float32, 17 s with float128), which is about one second per minute it took me to get bored of waiting for ImageMagick, so I’m satisfied.
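(As an aside, the mis-weighting from tree-averaging mentioned earlier is easy to quantify. A toy calculation, using a hypothetical group size of 10 on this year’s 365 images:

```python
# Weight each image receives if we average 365 images in groups of 10,
# then average the resulting partial averages together.
n, group = 365, 10
full_groups, remainder = divmod(n, group)   # 36 groups of 10, one group of 5
num_groups = full_groups + 1                # 37 partial averages

weight_full = 1.0 / (num_groups * group)      # an image in a full group
weight_last = 1.0 / (num_groups * remainder)  # an image in the short last group
print(weight_last / weight_full)  # 2.0: the last five images count double
```

So the five leftover images each count twice as much as the others – small for one year of dailies, but it’s there.)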

Enough of that. Here’s the full year’s average:

[image: full year’s average]
I’m interested in the way clouds and fog seem to stick to the north sides, and avoid the south sides, of points on the California and Oregon coasts.

Here are the averages for each approximate quarter of the last year (Jan–March, etc.) in reading order:

[image: quarterly averages, reading order]
Notice, for example, that the Columbia River around the Gorge and back along the Oregon–Washington border was very clear last spring, which I assume is the effect of its famous winds.

My next step, when I have a few more hours to sink into pretty pictures, is to try to pull out a simulated cloudless view by doing something like ignoring pixels above a certain luminance and interpolating through cloudy periods. That project may actually involve CUDA.
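The masking step might look something like this – only a sketch, with a hypothetical function name and a hand-picked luminance cutoff, and without the interpolation through cloudy periods:

```python
import numpy as np


def cloudless_accumulate(images, cutoff=180):
    """Average a stack of 8-bit RGB arrays, ignoring pixels brighter
    than `cutoff` (a crude stand-in for 'cloudy')."""
    total = np.zeros(images[0].shape, dtype=np.float64)
    count = np.zeros(images[0].shape[:2], dtype=np.float64)
    for img in images:
        lum = img.astype(np.float64).mean(axis=2)  # very crude luminance
        mask = lum <= cutoff                       # keep only dim-enough pixels
        total[mask] += img[mask]
        count[mask] += 1
    count = np.maximum(count, 1)  # avoid dividing by zero where always cloudy
    return (total / count[..., np.newaxis]).astype(np.uint8)
```

Per-pixel counts replace the single 1/n weight from the averaging script above, since each pixel now sees a different number of clear days.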