Tuesday, November 9, 2010

Bulk inserting Django objects

I've added another Django utility member to my garden of Python, bulkops.py.

Currently, it consists of two functions, insert_many and update_many. They are intended to solve the problem of inserting a big bunch of objects in Django. If you use the standard save() or create() methods in Django, you get one SQL query per object. With thousands of objects, you'll simply drown in overhead in executing all these queries.

There are some alternatives out there, but unfortunately they are pretty complex and specific to a previous version of Django. Instead this code simple makes a list of tuples with the fields on the model and feeds that to the executemany() function in the DB API. Usage is really simple
for x in seq:
o = SomeObject()
o.foo = x
o.save()
becomes
l = []
for x in seq:
o = SomeObject()
o.foo = x
l.append(o)
insert_many(l)
It's tested to work with Django 1.2 and doesn't really touch the ORM internals so should be safe for at least some future versions.

Sadly, there's no support for associated many-to-many relationships. I actually needed that myself, but I don't think there's an efficient way of getting the primary ids when you do a bulk insert with the Python DB driver API. So one has to use a select query anyway afterwards, and then it was actually more convenient to do it by hand by rearranging the logic in the caller.

Monday, November 8, 2010

Cache busting in Django

One of the really annoying things in web development is dealing with caching. Browsers incorporate an elaborate caching scheme to speed up common browsing operations, which is a really great idea. But a side-effect is that when you release a change on a site, you risk that some people get the new HTML while their browser is still using an old CSS or Javascript file because their cache hasn't timed out yet.

It has bitten me several times during development too, some browsers have really aggressive caching schemes. Debugging problems caused by a stale CSS/JS file is just not fun.

One way to fix it is to use a framework to run the CSS and Javascript through a combiner/minifier that at the same time outputs a versioned filename, e.g. by hashing the contents. There are even a couple of Django projects going down this route to make it happen automatically.

I haven't found one I liked yet, though. You have to modify views or templates and fit your paths into the system, it doesn't help with changing images, if you use third-party code you have to modify that too, and the minification step makes it pretty hard to debug Javascript problems; even if you isolate it to live, problems happen on live too, sometimes.

Really, if we take a step back from the minify dream, the simple solution here is to append some kind of version number to the file. Then new HTML should have filenames with new version numbers so clients can't used their old cached files because the paths don't match.

I've found a really simple way to automate this transparently for a project by modifying one line in settings.py. The idea is to hook into Django with a middleware class, search the generated HTML for local URLs and replace them with URLs with version numbers. This way, we can catch all CSS and Javascript references as well as images no matter where they occur.

Since common file systems don't provide version numbers on the files they store, we'll use the modification timestamp instead. And instead of renaming files to contain the version number, we'll just append a ?_=timestamp to the URL:
  <img src="/media/images/logo.png">
becomes
  <img src="/media/images/logo.png?_=4321234">
Everything after the ? is ignored when looking up in the file system by the static file web servers I've tested so far. So it just works.

As a bonus, when this is running, you can safely modify the web server serving the static files to set an expiration date far out in the future for files requested with ?_=xxxxx. This allows an aggressive, yet perfectly safe caching scheme to be employed by any upstream cache, including the browser.

The above trick can be done in about 50 lines of Python. You can grab the little self-contained middleware module modtimeurls.py from my brand-new Garden of Python. It has been in production on several sites for over a year. The only problem I've found so far was with a PNG transparency fixer for IE6 that assumed .png files would end with .png rather than .png?_=xxxxxx.

Tuesday, November 2, 2010

Stoicism

This is philosophy in practice. One of the most interesting posts I have read this year.