Friday, April 17, 2009

Test if uploaded file is JPEG, PNG or TIFF

I've been looking at some of the uploads that went wrong on YayArt lately, and it turns out that people sometimes submit images with the wrong extension, e.g. "someimage.png" when it's really a JPEG. This confuses VIPS, the image backend we're using to process large images, so it reports back an error.

I did a bit of googling, and it seems the easiest way out is to simply check the first few bytes of the file for magic numbers. So here's a bit of Python code for checking whether the file data belongs to a JPEG, PNG or TIFF image:
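def is_jpg(data):
    # every JPEG starts with the start-of-image marker FF D8
    return data[:2] == b'\xff\xd8'

def is_png(data):
    return data[:8] == b'\x89PNG\x0d\x0a\x1a\x0a'

def is_tiff(data):
    # "MM" is big-endian TIFF, "II" is little-endian TIFF
    return data[:4] == b'MM\x00\x2a' or data[:4] == b'II\x2a\x00'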
If the file is already on disk, you can grab the first few bytes with
with open("somefile.jpg", 'rb') as f:  # binary mode, so the bytes come back untouched
    data = f.read(11)
if is_jpg(data):
    ext = ".jpg"
elif is_png(data):
    ext = ".png"
Of course this won't test that the whole file is valid. But it's easier to do that afterwards with an image library once the extension is correct.

The magic numbers are documented in the specifications for the formats. You can also find some help for other formats in the source code of the file command on Unix systems.

Update: I'm liking this so much that I ended up putting it in a separate file and making a convenience function for getting an extension like '.jpg'. Grab the Python file here. I also added support for GIF. Here's another easy reference for magic file numbers.
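
A convenience function along those lines might look like the sketch below. It is not the downloadable file itself: the names SIGNATURES and guess_extension are made up for illustration, and the GIF signatures ("GIF87a" and "GIF89a") are taken from the GIF specification rather than from the code above.

```python
# Hypothetical helper, not the original module: map magic numbers to extensions.
SIGNATURES = [
    (b'\xff\xd8', '.jpg'),
    (b'\x89PNG\x0d\x0a\x1a\x0a', '.png'),
    (b'MM\x00\x2a', '.tif'),   # big-endian TIFF
    (b'II\x2a\x00', '.tif'),   # little-endian TIFF
    (b'GIF87a', '.gif'),
    (b'GIF89a', '.gif'),
]

def guess_extension(data):
    """Return an extension such as '.jpg' for the leading bytes, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    return None
```

Since it only needs the leading bytes, it can be called on the upload data before anything is written to disk, and a None result can be treated as an unsupported format.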

Second update: I've updated the code. There was a bug in detecting JPEGs from certain digital cameras that put Exif data in the first segment. It suffices to check the first two bytes of the JPEG; then the problem does not occur.
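
To illustrate the kind of bug this is: the earlier, faulty check isn't shown in this post, so is_jfif_only below is only a guess at what a too-strict test could look like. JFIF files carry the identifier "JFIF" at offset 6, while many cameras write an Exif segment first, so only the two-byte start-of-image marker is common to both. The two sample headers are hand-made for the example.

```python
def is_jpg(data):
    # Only the start-of-image marker, shared by all JPEG files.
    return data[:2] == b'\xff\xd8'

def is_jfif_only(data):
    # Hypothetical too-strict check: also insists on the JFIF identifier.
    return data[:2] == b'\xff\xd8' and data[6:10] == b'JFIF'

# First 11 bytes of a JFIF file: SOI, APP0 marker, segment length, "JFIF\0".
jfif_header = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'
# First 11 bytes of a camera file: SOI, APP1 marker, segment length, "Exif\0".
exif_header = b'\xff\xd8\xff\xe1\x1c\x45Exif\x00'

for name, header in (("JFIF", jfif_header), ("Exif", exif_header)):
    print(name, is_jpg(header), is_jfif_only(header))
```

The strict variant rejects the Exif header even though it is a perfectly good JPEG, which matches the symptom described above.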

8 comments:

  1. I guess one shouldn't reinvent the wheel... why not use the "file" command for this?

    > file IMG_0019.JPG

    IMG_0019.JPG: JPEG image data, EXIF standard

    os.system("...") should do the work

  2. Well, file is neat and I did consider it, but it won't work unless you already have the file on disk (despite my example, I would like to get the name right before I write it) and it's harder to reason about (can file crash? what can go wrong when you use os.system?).

    Also, using os.system in a web app is a bit scary, you have to double check that no user entered data can ever end up in the command, at least not unescaped.

    So that's why. :)

    I recently found out you can feed data in chunks to the Python Imaging Library, so another possibility is to feed it one chunk and see what happens.

  3. Thank you, Ole, for a very useful piece of code! It'll be in the next version of sqlpython to allow browsing of image BLOBs straight from the database.

    As for UNIX's `file`... that's nice, but I don't believe it exists on Windows, so no use for a cross-platform app!

  4. Just what I was looking for, thanks for sharing.

  5. You don't actually need to write the file to disk, you can just pass it thru `| file -`

  6. Toni: that's an interesting idea. Here's a little snippet for doing it in Python:

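    import subprocess
    with open("test.jpg", 'rb') as f:  # binary mode
        data = f.read(11)
    # stdout must be piped too, otherwise communicate() returns None for it
    p = subprocess.Popen(["/usr/bin/file", "--mime-type", "-b", "-"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    print(p.communicate(data)[0].decode().strip())
    # outputs "image/jpeg"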

  7. Translations for Ruby:

    def jpeg?(data)
      # .b (Ruby 2.0+) makes the literal binary, so it compares equal to bytes read in 'rb' mode
      data[0, 2] == "\xff\xd8".b
    end

    To read a file from disk:

    f = File.open(filename, 'rb') # read binary
    data = f.read(11)
    f.close
    if jpeg?(data)
      ext = ".jpg"
    end

    More magic numbers are listed at http://www.astro.keele.ac.uk/oldusers/rno/Computing/File_magic.html

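
On the worry raised in the comments about user-entered data ending up in a command: one common way around it is to hand the arguments over as a list, so that no shell ever parses them. The sketch below uses the Python interpreter itself as a stand-in for an external tool like file, purely so that it runs anywhere; echo_argument and the hostile filename are made up for illustration.

```python
import subprocess
import sys

def echo_argument(arg):
    """Run a child process that writes back the single argument it received."""
    # With a list and the default shell=False, arg reaches the child verbatim
    # as one argument; the semicolon and quotes are never interpreted.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", arg],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode()

# A filename that would be dangerous if pasted into an os.system() string.
hostile = 'photo.jpg; rm -rf "$HOME"'
print(echo_argument(hostile) == hostile)
```

The same list form is what the Popen snippet in the comments relies on, and feeding the data through stdin means no user-supplied name has to appear in the command at all.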