2006
09.23

Every webmaster who cares about standards compliance has to keep their site validated. I strive to keep this site compliant with the XHTML 1.0 Strict specification, and that means a trip to the W3C validator every time I make a change. Two constant annoyances with XHTML Strict are character set issues and unclosed unary tags. I have eased my burden a little by adding a plugin to my blog software that fixes those problems right before the page is rendered. That means even if I mistype something in a blog post, it won’t keep the site from validating, because it gets fixed at display time.

Here is what the meat of the plugin looks like:


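  ##: Tag fixes: add the closing slash to <br> and <hr> tags that lack one
  $$body_ref =~ s/<br([^\/]*?)>/<br$1\/>/gi;
  $$body_ref =~ s/<hr([^\/]*?)>/<hr$1\/>/gi;
  
  ##: Character code fixes: Windows-1252 bytes to XHTML-safe replacements
  ##: As displayed here, the next line rewrites &#8217; to itself (a no-op);
  ##: the search side was probably a literal curly apostrophe originally
  $$body_ref =~ s/&#8217;/&#8217;/g;
  $$body_ref =~ s/\x93/&#8220;/g;   # left double quote
  $$body_ref =~ s/\x94/&#8221;/g;   # right double quote
  $$body_ref =~ s/\x92/&#8217;/g;   # right single quote / apostrophe
  $$body_ref =~ s/\x91/&#8216;/g;   # left single quote (8216, not 8217)
  $$body_ref =~ s/\x85/.../g;       # horizontal ellipsis
  $$body_ref =~ s/\x96/-/g;         # en dash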

You can get the gist of it, I hope. $$body_ref dereferences a reference to the text of the blog post currently being assembled. Each line performs a regex search/replace operation and corrects a potential markup problem. The first two lines fix the two problems I run into most often with unary tag closing: they add the closing forward slash to the br and hr HTML tags, which are the ones I forget most often. You might need to include the img tag too if it’s a problem for you. Note that the same pattern will not work unchanged for img, because [^\/] refuses to match the slashes in a typical src path.
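As a rough sketch of how the img case could be folded in, here is a standalone version that handles br, hr and img in one pass. It is not the plugin's actual code: the sub name close_empty_tags and the sample $post string are made up for illustration, and unlike the lines above it writes the tag name in lowercase and puts a space before the slash.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original plugin.
sub close_empty_tags {
    my ($body_ref) = @_;
    # \b stops "<br" from matching a longer tag name.
    # [^>]*? captures the attributes up to the first ">".
    # (?<!/) skips tags that already end in "/>".
    # \L...\E lowercases the tag name in the output.
    $$body_ref =~ s{<(br|hr|img)\b([^>]*?)(?<!/)>}{<\L$1\E$2 />}gi;
    return;
}

my $post = 'one<br>two<hr class="sep"><img src="/pics/a.png" alt=""><br />';
close_empty_tags(\$post);
print "$post\n";
```

Because the attribute part is matched with [^>] rather than [^\/], slashes inside a src path are fine. An attribute value that contains a literal > would still trip it up, so treat this as a convenience, not a parser.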

The character code fixes are needed to fix those “non SGML character number” messages that annoy the crap out of everybody. They usually sneak in when you copy text off a webpage and paste it into your blog post. If you’re not familiar with regex syntax, the \xNN in each search string is hexadecimal notation for a single character code. Each line replaces every occurrence of that character code with an equivalent that is valid in XHTML, usually a numeric character reference such as &#8220;. In my opinion, HTML entities are the easiest and most readable route to take in this situation. The ones I listed above are the most common for me, but you can find a more complete list here.
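If the list grows, a lookup table may be easier to maintain than one substitution per character. The sketch below is only an illustration under a couple of assumptions: the post is a plain byte string in Windows-1252 (which is what the \x91 to \x96 codes imply), and the names %CP1252_FIX and fix_cp1252 are invented here. It maps \x91 to &#8216;, the left single quote, which is what that byte represents in Windows-1252.

```perl
use strict;
use warnings;

# Illustrative table: Windows-1252 bytes and their XHTML-safe replacements.
my %CP1252_FIX = (
    "\x85" => '...',      # horizontal ellipsis
    "\x91" => '&#8216;',  # left single quote
    "\x92" => '&#8217;',  # right single quote / apostrophe
    "\x93" => '&#8220;',  # left double quote
    "\x94" => '&#8221;',  # right double quote
    "\x96" => '-',        # en dash
);

# Hypothetical helper, not part of the original plugin.
sub fix_cp1252 {
    my ($body_ref) = @_;
    # The character class lists exactly the keys of the table above.
    $$body_ref =~ s/([\x85\x91-\x94\x96])/$CP1252_FIX{$1}/g;
    return;
}

my $post = "\x93Don\x92t panic\x94\x85";
fix_cp1252(\$post);
print "$post\n";
```

Adding a character then means adding one row to the table and one code to the character class, instead of another regex line.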

This is not all just to make you feel better about being an open standards fanboy. It has a real purpose. For instance, if your site is listed on one of the standards compliance list sites like W3CSites, you don’t want to get delisted just because you forgot to close a br tag on one of your blog posts. This just adds another layer of protection against that happening.
