Ruminations: May 2009

Another recipe, this time for solving the problem of truncating a piece of HTML, i.e. turning "<p>Blah blah blah</p>" into "<p>Blah ...</p>". Google didn't really turn anything useful up, except for a suggestion of using a full-blown HTML parser and then simplifying the result, so I thought I would post the snippet here for Google to pick up.

The code never splits a valid tag or character entity. It should be able to cope with invalid HTML too, but note that it won't sanitize it. So for instance, if there's an unbalanced <a> in the source string, it won't fix it. Character entities are dealt with by counting them as one character.

The basic idea in the snippet is that we just skip through the string unless we encounter an opening tag. If so, we see if we can find the corresponding end tag and save it for later. When we got enough non-HTML characters, a ... is put in and any saved but not yet used end tags are added to the output.

Here's the code in Python (it's easily turned into a Django filter), I aimed for readability rather than ultra-regexp ninja tricks:

import re

tag_end_re = re.compile(r'(\w+)[^>]*>')
entity_end_re = re.compile(r'(\w+;)')

@register.filter
def truncatehtml(string, length, ellipsis='...'):
    """Truncate HTML string, preserving tag structure and character entities."""
    output_length = 0
    i = 0
    pending_close_tags = {}
    
    while output_length < length and i < len(string):
        c = string[i]
        if c == '<':
            # probably some kind of tag
            if i in pending_close_tags:
                # just pop and skip if it's closing tag we already knew about
                i += len(pending_close_tags.pop(i))
            else:
                # else maybe add tag

                i += 1
                match = tag_end_re.match(string[i:])
                if match:
                    tag = match.groups()[0]
                    i += match.end()
  
                    # save the end tag for possible later use if there is one
                    match = re.search(r'(</' + tag + '[^>]*>)', string[i:], re.IGNORECASE)
                    if match:
                        pending_close_tags[i + match.start()] = match.groups()[0]
                else:
                    output_length += 1 # some kind of garbage, but count it in
                    
        elif c == '&':
            # possible character entity, we need to skip it
            i += 1
            match = entity_end_re.match(string[i:])
            if match:
                i += match.end()

            # this is either a weird character or just '&', both count as 1
            output_length += 1
        else:
            # plain old characters
            skip_to = string.find('<', i, i + length)
            if skip_to == -1:
                skip_to = string.find('&', i, i + length)
            if skip_to == -1:
                skip_to = i + length
                
            # clamp
            delta = min(skip_to - i,
                        length - output_length,
                        len(string) - i)

            output_length += delta
            i += delta
                        
    output = [string[:i]]
    if output_length == length:
        output.append(ellipsis)

    for k in sorted(pending_close_tags.keys()):
        output.append(pending_close_tags[k])

    return "".join(output)

Ruminations

Friday, May 1, 2009

Safe truncation of HTML