Ruminations: October 2007

Sunday, October 21, 2007

Making foreach useful in the corner cases

One of the nice things about recent developments in mainstream programming languages is improved support for looping.

In a functional language with proper support for lambda functions, it might not be a problem but if you're programming in something a bit more mainstream, chances are that you're writing loops over data structures all the time. In my experience, it turns out to be a real improvement if the language offers a bit of syntactic sugar and allows you to write something like:

for element in datastructure 
    do something with element

Python's got it, C# and Java have it, last I heard there were plans to introduce it in C++ too.

Until recently I thought this particular piece of syntactic sugar was of the kind which is really useful in 90% of the cases and completely useless in the remaining 10%. For example, what if you want to extract two elements and not just one? Or loop over two datastructures? Or only loop over certain elements?

In fact, there are elegant solutions to all these problems. I've been pondering over them this week. Here's how you do in Python, but all this requires of the language is the concept of iterators and some kind of convenient support for tuples.

Extracting only part of the sequence: Slice the datastructure with a function that skips the parts you don't want.

Python has the [start:end] operator which is as succinct as it gets for extracting an interval. There's also a more generic islice function in the itertools module which works with iterators:

from itertools import islice
mylist = [1, 2, 3, 4, 5]
for x in islice(mylist, 1, 4):
    print x
(... prints 2, 3, 4)

This idea can be extended in many ways, e.g. if you need every second element, it's easy to come up with a function that only iterates over them. If you need to exclude elements based on something related to the elements themselves and not their position in the datastructure, there's always the filter function.

Extracting elements from several datastructures: Construct an iterator that iterates over the sequences in parallel, returning a tuple with one element from each.

Python has the built-in zip function which returns the complete sequences zipped up as a list of tuples. Until Python 3 is out, izip in itertools is closer to what we want:

from itertools import izip
xlist = [1, 2, 3]
ylist = ["a", "b", "c"]
for x, y in izip(xlist, ylist):
    print x, y
(... prints 1 a, 2 b, 3 c)

So it's no problem. We can still maintain our nice, succinct for loop, there's no need to manually iterate over each sequence in turn. Note that zip and izip will automatically stop when one of sequences runs out of data.

Extracting several elements from the same datastructure: Construct an iterator that returns a sliding window of the sequence as a tuple.

You could code this manually, but it's even easier to do once you know zip: just zip the datastructure with itself, with a little offset. In Python, that's zip(seq, seq[1:]), or more generally:

from itertools import izip, islice
def slidingwindow(seq, windowsize):
    t = ()
    for i in range(windowsize):
        t += (islice(seq, i, None),)
    return izip(*t)

That's all you need to get two or more elements out at the same time:

mylist = [1, 2, 4, 8]
for prev, val in slidingwindow(mylist, 2):
    print val - prev
(... prints 1, 2, 4)

No need to manually set prev to the predecessor in each iteration. You could easily add padding or let the window advance in larger steps if needed. The latter is probably easiest to do by combining the function with an appropriate slicing operation.

Which brings me to the final point. An important part of the beauty of these operations is that they can be combined in a multitude of ways.

Note: if you're going to use slidingwindow in production and want something to paste in, here's an even more general version that works with arbitrary iterator input:

from itertools import izip, tee
def slidingwindow(iterable, windowsize):
    t = tee(iterable, windowsize)
    for j in range(1, windowsize):
        for i in range(j, windowsize):
            try:
                t[i].next()
            except StopIteration:
                pass
            
    return izip(*t)

The tee function duplicates an iterator by storing the data returned by it until all its duplicates have passed it too. The code above advances each of the returned iterators by the correct amount. It seems it's a bit faster to use tee than doing the equivalent stuff by hand.

Tuesday, October 16, 2007

Landmark

Today I have lived for a quarter of a century. I was born in 1982. According to Wikipedia, it was the year the first computer virus escaped into the wild.

Friday, October 5, 2007

Django-rendertext 0.2 and a new jquery plot plugin

A quick update. I've released django-rendertext 0.2. The most notable change is support for GIF output, mostly intended for the browser that didn't support PNGs properly until very recently.

I deliberately built rendertext with PNGs first because it's the format of the future. IE7 usage is picking up (see for instance here or here), within a year or two it's probably completely dominant among IE users. Now that I whipped up the GIF support, I'm glad I didn't start with it because it was a real PITA to do with PIL (Python Imaging Library). PIL doesn't have proper documentation of the corner cases that turn out to be the most important and only sketchy support for palettes.

I've also been working on a plotting plugin for jQuery. We've not been entirely satisfied with Plotr which otherwise appeared to be the prettiest and most advanced Javascript-only plotting library out there. It's for Prototype and not for jQuery but most importantly I've found it confusing to hack on. And it's completely missing the interactive aspects.

So I have started my own project. So far I think I've discovered that the hardest part of a library for plotting is not actually drawing the stuff, but layouting and auto-adjusting the axes. In any case, in two days I've gotten pretty far. Here's a screenshot (click to get the full version):

What's missing right now is support for plotting points and columns, options for controlling the plot, legends, support for time and the interactive stuff such as zooming, highlighting graphs, tooltips for points on mouse hover. Also, the auto-adjusting algorithm is still a bit rough but I think I managed to crack it on paper last night so I just need to implement and test it.

In other news, from Werner Vogels' blog here's how Amazon manages storage (warning! scientific paper ahead). Now that's as far as you can get from N=1 thinking. As an aside, I'm fond of that guy's name, Werner Vogels (note for English native speakers: in German the surname is pronounced like [fogels] not [wogels]). It's just like der Vogelfänger bin ich ja I guess, nice rhythm.