Ruminations: Safe truncation of HTML

Friday, May 1, 2009

Another recipe, this time for truncating a piece of HTML, i.e. turning "<p>Blah blah blah</p>" into "<p>Blah ...</p>". Google didn't really turn up anything useful, except for a suggestion to use a full-blown HTML parser and then simplify the result, so I thought I would post the snippet here for Google to pick up.

The code never splits a valid tag or character entity. It should be able to cope with invalid HTML too, but note that it won't sanitize it. So for instance, if there's an unbalanced <a> in the source string, it won't fix it. Character entities are dealt with by counting them as one character.
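To make the counting rule concrete, here is a minimal sketch of how the visible length of a string could be measured: tags count as zero, a character entity counts as one, and everything else counts as one per character. The names `TOKEN_RE` and `visible_length` are made up for this illustration and are not part of the snippet in this post.

```python
import re

# Hypothetical helper for illustration only.
# A token is a whole tag, a whole character entity, or a single character.
TOKEN_RE = re.compile(
    r'(?P<tag><[^>]*>)|(?P<entity>&#?\w+;)|(?P<char>.)',
    re.DOTALL,
)

def visible_length(html):
    """Count visible characters: tags are free, an entity counts as one."""
    return sum(1 for m in TOKEN_RE.finditer(html) if m.lastgroup != 'tag')

print(visible_length("<p>Fish &amp; chips</p>"))
```

Under these assumptions, "<p>Fish &amp; chips</p>" has the same visible length as the plain text "Fish & chips".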

The basic idea in the snippet is that we just skip through the string unless we encounter an opening tag. If so, we see if we can find the corresponding end tag and save it for later. Once we have counted enough non-HTML characters, a ... is put in and any saved but not yet used end tags are added to the output.
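The bookkeeping for end tags can be sketched on its own, roughly like this. It is only an illustration of the idea, not the filter itself: `find_close_tag` is a hypothetical helper, and the sample string is invented. Each end tag is stored under its position in the source, so emitting the saved tags in position order closes the innermost element first.

```python
import re

def find_close_tag(html, start, tag):
    """Return (position, text) of the first end tag for `tag` at or after
    `start`, or None if there is none. Hypothetical helper for illustration."""
    match = re.search(r'</' + re.escape(tag) + r'\b[^>]*>',
                      html[start:], re.IGNORECASE)
    if match is None:
        return None
    return start + match.start(), match.group(0)

html = "<p>Some <b>bold</b> text</p>"

# remember the end tag of every opening tag, keyed by where it sits
pending = {}
for opening in re.finditer(r'<(\w+)[^>]*>', html):
    found = find_close_tag(html, opening.end(), opening.group(1))
    if found:
        position, text = found
        pending[position] = text

# end tags in source order: innermost first
closing = "".join(pending[k] for k in sorted(pending))
print(closing)
```

Keying the dictionary by position also makes it cheap to recognise an end tag when the scan reaches it, since a lookup by the current index is enough.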

Here's the code in Python (it's easily turned into a Django filter). I aimed for readability rather than ultra-regexp ninja tricks:

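import re

from django import template

register = template.Library()

# matches the rest of an opening tag once the '<' has been consumed
tag_end_re = re.compile(r'(\w+)[^>]*>')
# matches the rest of a named or numeric entity once the '&' has been consumed
entity_end_re = re.compile(r'(#?\w+;)')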

@register.filter
def truncatehtml(string, length, ellipsis='...'):
    """Truncate HTML string, preserving tag structure and character entities."""
    output_length = 0
    i = 0
    pending_close_tags = {}
    
    while output_length < length and i < len(string):
        c = string[i]
        if c == '<':
            # probably some kind of tag
            if i in pending_close_tags:
                # just pop and skip if it's closing tag we already knew about
                i += len(pending_close_tags.pop(i))
            else:
                # else maybe add tag

                i += 1
                match = tag_end_re.match(string[i:])
                if match:
                    tag = match.groups()[0]
                    i += match.end()
  
                    # save the end tag for possible later use if there is one
                    # (the \b keeps <a> from being paired with e.g. </abbr>)
                    match = re.search(r'(</' + tag + r'\b[^>]*>)', string[i:], re.IGNORECASE)
                    if match:
                        pending_close_tags[i + match.start()] = match.groups()[0]
                else:
                    output_length += 1 # some kind of garbage, but count it in
                    
        elif c == '&':
            # possible character entity, we need to skip it
            i += 1
            match = entity_end_re.match(string[i:])
            if match:
                i += match.end()

            # this is either a weird character or just '&', both count as 1
            output_length += 1
        else:
            # plain old characters
            # skip ahead to the next '<' or '&', whichever comes first
            candidates = [pos for pos in (string.find('<', i, i + length),
                                          string.find('&', i, i + length))
                          if pos != -1]
            skip_to = min(candidates) if candidates else i + length
                
            # clamp
            delta = min(skip_to - i,
                        length - output_length,
                        len(string) - i)

            output_length += delta
            i += delta
                        
    output = [string[:i]]
    if i < len(string):
        # only add the ellipsis if part of the input was actually left out
        output.append(ellipsis)

    for k in sorted(pending_close_tags.keys()):
        output.append(pending_close_tags[k])

    return "".join(output)

Comments:

Greg Allard, May 1, 2009 at 10:24 PM
This is useful. I just tried it with this:

test = " go to http://ole-laursen.blogspot.com/2009/05/safe-truncation-of-html.html "

{{test|urlize|truncatehtml:25}}

And it worked as expected.
chatrin, February 5, 2010 at 5:56 PM
thanks for the tips....
Nemesis Design, March 31, 2010 at 10:09 PM
Thanks, works nicely. I wonder why it is not in the default template tags.
Leonard, July 22, 2024 at 3:18 AM
This is a great post, thanks.