A sed script to modify the content between HTML PRE tags

By Alvin J. Alexander, devdaily.com

A long time ago I created something I called a "Source code warehouse" that would help developers learn various programming languages by letting them easily find examples from open source programming projects from around the world. I initially did this for Java programs, and later expanded it to include source code files from other languages.

I included the source code files in between HTML <pre> and </pre> tags, and wrapped some simple content around that, but one thing I forgot to do was escape characters like <, >, and & that were included in the source code files. Unintended tags like these have a way of wreaking havoc in HTML documents, and the PHP section of the source code warehouse was by far the worst offender.

Today I fixed the PHP section of the warehouse by writing a sed script that would open a file, get all the content between the <pre> and </pre> tags, and convert those offending characters to something that wouldn't mess up my HTML pages. As a programming matter, this involves starting the changes at the opening PRE tag and stopping them at the closing PRE tag. It turns out that working with a range of lines with sed (while excluding the starting and stopping tags) was harder than I expected, but I came up with a kludge that got the job done.

The source code for the sed script I created is shown here:

/<pre>/,/<\/pre>/ {

        # first convert <pre> to OPEN_PRE and </pre> to CLOSE_PRE
        s/<pre>/OPEN_PRE/
        s/<\/pre>/CLOSE_PRE/

        # now escape the HTML special characters
        # (& goes first, so the & in the new &lt; and &gt; is not escaped again)
        s/\&/\&amp;/g
        s/</\&lt;/g
        s/>/\&gt;/g

        # at the end convert my labels back to html <pre> and </pre> tags
        s/OPEN_PRE/<pre>/
        s/CLOSE_PRE/<\/pre>/

}
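
To try the script out, here is a minimal sketch of one way to run it. The file names fix_pre.sed and sample.html are hypothetical (the post doesn't name the script or its input), and the sample content is made up for illustration. The only change from the listing above is that & is left unescaped on the pattern side, which should behave the same way.

```shell
# Hypothetical file names, created in a scratch directory so nothing real is touched
workdir=$(mktemp -d)

# The script from the post, saved to a file so it can be run with sed -f
cat > "$workdir/fix_pre.sed" <<'EOF'
/<pre>/,/<\/pre>/ {
    s/<pre>/OPEN_PRE/
    s/<\/pre>/CLOSE_PRE/
    s/&/\&amp;/g
    s/</\&lt;/g
    s/>/\&gt;/g
    s/OPEN_PRE/<pre>/
    s/CLOSE_PRE/<\/pre>/
}
EOF

# A made-up sample page: one line outside the PRE block, one line inside it
cat > "$workdir/sample.html" <<'EOF'
<p>Intro & markup stay untouched</p>
<pre>
if ($a < $b && $c > 0) { echo "<b>hi</b>"; }
</pre>
EOF

# -f reads the sed commands from the script file; the result goes to stdout
result=$(sed -f "$workdir/fix_pre.sed" "$workdir/sample.html")
printf '%s\n' "$result"
```

With this sample, the <p> line and both PRE tags come through unchanged, and only the PHP line between the tags is escaped.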

My solution was to grab the range of lines beginning with the <pre> tag and ending with the </pre> tag, and then modify those. My problem was that I couldn't figure out how to select that range without also including the lines that contain the <pre> and </pre> tags themselves. So I used a "temporary-swap" kludge: I turned the HTML tags that were stuck in my pattern space (the ones I didn't want to convert) into non-HTML labels that I was pretty sure would be unique, then converted them back when I was done.

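Specifically, I convert <pre> and </pre> to the non-HTML strings OPEN_PRE and CLOSE_PRE. Then I convert all &, <, and > characters in the pattern space to their HTML character entity equivalents (&amp;, &lt;, and &gt;). At the end I change the OPEN_PRE and CLOSE_PRE labels back to <pre> and </pre>, respectively. The order of these operations matters: the tags have to be swapped out before < and > are converted, and & has to be converted before < and >, or the ampersands in the new &lt; and &gt; entities would themselves get escaped.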
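
To see why the order matters, here is a small sketch that runs the same three substitutions in both orders on a made-up line of code. The variable names line, good, and bad are just for this illustration.

```shell
# A made-up line containing all three special characters
line='if (a < b && c > d)'

# Same order as the script: escape & first, then < and >
good=$(printf '%s\n' "$line" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')

# Wrong order: & is escaped last, so the & in &lt; and &gt; gets escaped a second time
bad=$(printf '%s\n' "$line" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/&/\&amp;/g')

printf 'good: %s\n' "$good"   # → good: if (a &lt; b &amp;&amp; c &gt; d)
printf 'bad:  %s\n' "$bad"    # → bad:  if (a &amp;lt; b &amp;&amp; c &amp;gt; d)
```

A browser would render the second result as the literal text &lt; and &gt; instead of < and >.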

Call it a hack, but it got the job done. In the end I wish I'd written a small program in Ruby, but sed has usually treated me pretty well, and this is a hack I can live with.
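
For comparison, here is a sketch of a different approach (not the one I used above): inside the range, the sed b command with no label jumps to the end of the script, so lines containing a tag can be skipped before the substitutions run. This assumes every <pre> and </pre> tag sits on a line by itself; the sample text and the variable names sample and skipped are made up for the illustration.

```shell
# Made-up sample: one line outside the PRE block, one line of code inside it
sample='<p>a & b</p>
<pre>
x < y && y > z
</pre>'

skipped=$(printf '%s\n' "$sample" | sed '
/<pre>/,/<\/pre>/ {
    # skip the tag lines: b with no label branches to the end of the script
    /<pre>/b
    /<\/pre>/b
    # everything left in the range is code, so escape it (& first)
    s/&/\&amp;/g
    s/</\&lt;/g
    s/>/\&gt;/g
}')

printf '%s\n' "$skipped"
```

The trade-off is that a line like "<pre>some code" would be skipped entirely and left unescaped, whereas the temporary-swap version still escapes code that shares a line with a tag.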

Now, diving in and out of hundreds of directories to run this sed script is another matter, and I'll try to cover that in another blog post.

