Tuesday, November 9, 2010

Bulk inserting Django objects

I've added another Django utility to my garden of Python: bulkops.py.

Currently, it consists of two functions, insert_many and update_many. They are intended to solve the problem of inserting a big bunch of objects in Django. If you use the standard save() or create() methods in Django, you get one SQL query per object. With thousands of objects, you'll simply drown in the overhead of executing all these queries.

There are some alternatives out there, but unfortunately they are pretty complex and specific to a previous version of Django. Instead, this code simply makes a list of tuples with the fields on the model and feeds that to the executemany() function in the DB API. Usage is really simple:
for x in seq:
    o = SomeObject()
    o.foo = x
    o.save()
becomes
l = []
for x in seq:
    o = SomeObject()
    o.foo = x
    l.append(o)
insert_many(l)
It's tested to work with Django 1.2, and it doesn't really touch the ORM internals, so it should be safe for at least some future versions.
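
For the curious, the core idea is roughly the following (a simplified sketch, not the exact contents of bulkops.py; it assumes a plain, non-inherited model and skips the auto-increment primary key):

from django.db import connection

def insert_many_sketch(objects):
    # All objects are assumed to be unsaved instances of the same model.
    model = objects[0].__class__
    fields = [f for f in model._meta.fields if not f.primary_key]
    qn = connection.ops.quote_name

    columns = ", ".join(qn(f.column) for f in fields)
    placeholders = ", ".join(["%s"] * len(fields))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        qn(model._meta.db_table), columns, placeholders)

    # One tuple of column values per object, handed to executemany() so the
    # driver runs a single batched statement instead of one query per save().
    rows = [
        tuple(f.get_db_prep_save(f.pre_save(o, True), connection=connection)
              for f in fields)
        for o in objects
    ]
    connection.cursor().executemany(sql, rows)
    # Committing the transaction is left to the caller.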

Sadly, there's no support for associated many-to-many relationships. I actually needed that myself, but I don't think there's an efficient way of getting the primary keys back when you do a bulk insert through the Python DB driver API. So one has to run a select query afterwards anyway, and at that point it was actually more convenient to do it by hand by rearranging the logic in the caller.
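
If you do need the many-to-many rows, one way to handle it in the caller looks roughly like this (a sketch only; Article, slug, tags and tag_ids_for() are made-up names, and it assumes a unique natural key you can select the new primary keys back by):

# 1. Bulk-insert the parent objects; insert_many does not fill in their pks.
articles = [Article(slug=s, title=t) for s, t in data]
insert_many(articles)

# 2. Fetch the primary keys back with one select on the unique natural key.
pk_by_slug = dict(
    Article.objects.filter(slug__in=[a.slug for a in articles])
                   .values_list("slug", "pk"))

# 3. Bulk-insert the through-table rows for the m2m relation.
links = [
    Article.tags.through(article_id=pk_by_slug[a.slug], tag_id=tag_id)
    for a in articles
    for tag_id in tag_ids_for(a.slug)   # hypothetical lookup of tag ids
]
insert_many(links)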

10 comments:

  1. Very nice utility functions; I tested them out and they make a big difference in performance. Thanks!
    And what I like about them is that they don't really mess with the ORM internals.

    The only small issue I ran into was with column names: they should be quoted with con.ops.quote_name() to avoid issues in some cases (e.g. a column named 'order').
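
    To illustrate, something along these lines wherever the SQL is built (the column name is just the problematic example):

    from django.db import connection

    qn = connection.ops.quote_name
    quoted = qn("order")   # '"order"' on SQLite/PostgreSQL, '`order`' on MySQL
    # A column named after an SQL keyword then no longer breaks the statement.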

  2. Ah, yes, the "order" bug. Thanks for reporting it, I've fixed it now.

    I did profile the code, and if I recall correctly, most of the remaining Python time was spent somewhere inside the ORM prepare-for-save code; unfortunately, it doesn't seem the Django core developers have cared a lot about performance.

  3. This is nice. Unfortunately it doesn't work if your model uses inheritance. Any easy solutions for that?

  4. If you've got one-object-in-two-tables (multi-table) inheritance, I think it requires a bit of refactoring of the code. If it's an abstract base class that doesn't work, it's probably just a bug in the code.
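
    To make the distinction concrete (a made-up pair of models):

    from django.db import models

    class AbstractBase(models.Model):
        foo = models.IntegerField()
        class Meta:
            abstract = True   # no table of its own, fields are copied into the child

    class AbstractChild(AbstractBase):
        bar = models.IntegerField()   # one table -> should just work (or it's a bug)

    class ConcreteBase(models.Model):
        foo = models.IntegerField()   # gets a table of its own

    class MtiChild(ConcreteBase):
        bar = models.IntegerField()   # second table linked via a one-to-one pk;
                                      # this is the case that needs refactoring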

  5. Hi! I've been using both functions and the difference in speed is amazing! Thank you!

    I've run into a small problem though, while using Django 1.3.1 and SQLite 3.7.8: inserts and updates were running without any errors, but no data was changed in the DB. After googling this issue, I found out that transactions were the culprit. For reference, the problem is described here:

    http://groups.google.com/group/django-users/browse_thread/thread/7992f7a36d728ed4

    So I've created a patch that corrects this issue and gives the user exception handling for free. The patch is available here:

    http://pastebin.com/sQ9yizkb

  6. Lucas: I had a look at your patch (patches are, by the way, a bit easier to read if you make them with diff -u), and it seems you have simply added a transaction block manually?

    I think that's orthogonal to this function; conceivably you may want to bundle a call to this function with some other stuff in the same transaction. So unless I've misunderstood something, I think I'll just leave it the way it is.

  7. Right. I've switched transaction control to manual and then explicitly send a commit/rollback.

    I agree with you that it is not a problem with the function itself, as the issue is related to the use of cursors. On the other hand, it may be useful for someone who is not otherwise using cursors in their app, since a call to 'insert_many' silently fails.

    So I'll leave this just for future reference and in 'diff -u' format:

    http://pastebin.com/sEeihY3h

    Thanks again for this! It's been really useful!

  8. @Lucas, @Ole: I'm using Django 1.3 here, and I was also bitten by the missing commit and not getting anything effectively inserted/updated. Some change should be made, either to the example code in this post (because .save() takes care of calling conn.commit_unless_managed(using), so this call would be missing here), or to the code itself, adding explicit transaction management.
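
    For example, something like this at the call site (a sketch against the Django 1.2/1.3 transaction API; commit_unless_managed is gone in later versions, and the object-building part is just the example from the post):

    from django.db import transaction

    objects = []
    for x in seq:
        o = SomeObject()
        o.foo = x
        objects.append(o)

    insert_many(objects)
    # Commit explicitly unless a surrounding transaction manager is already
    # in control; otherwise SQLite may just roll the batch back.
    transaction.commit_unless_managed()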

    @Ole, thanks a lot for your code!

  9. Is this superseded by bulk_create()?

  10. Yes, more or less, although bulkops.py also has an update_many(), which I don't think has an equivalent in Django yet. You can do a SomeThing.objects.update(foo=bar) efficiently in Django, but that sets the same value on all objects; as far as I know, you cannot easily update many objects to different values efficiently.
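
    Roughly, the difference is this (the filter condition and compute_foo() below are made up):

    # Same value for every matching row: the built-in single-query update.
    SomeThing.objects.filter(active=True).update(foo=0)

    # A different value per object: load, modify, then one executemany()
    # batch via update_many instead of one UPDATE per save().
    things = list(SomeThing.objects.filter(active=True))
    for t in things:
        t.foo = compute_foo(t)
    update_many(things)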
