Force valid HTML with valid_node module

Submitted by Frederic Marand on Fri, 2006-09-29 23:38

Letting contributors without HTML skills enter content on a Drupal site often results in invalid HTML tag soup. Even seasoned coders occasionally make an HTML input error, which can be a problem until someone fixes the post.

So I figured I'd force valid HTML from user input, and here is the proof-of-concept valid_node module: it will force any node to be saved as an XHTML fragment.

XHTML fragments

Because the page as a whole is built by Drupal, the data input by the user is never valid XHTML on its own: it always lacks the XML prolog, head section, and body element. However, valid node content should typically be valid XHTML once wrapped in these elements.

This is exactly what the module does: wrap the node body in a prolog and epilog, validate the whole, and remove the processed prolog and epilog, leaving us with the "fragment".
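The wrap-and-strip part of this idea can be sketched in a few lines. This is only an illustration in Python, not the module's PHP code: the prolog and epilog text, the function names, and the line counts are all assumptions made for the example. The validation step in the middle is left out here.

```python
# Hypothetical prolog and epilog; the actual module's wrapper text may differ.
PROLOG = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title>fragment</title></head>\n'
    '<body>\n'
)
EPILOG = '\n</body>\n</html>\n'

# Number of lines each wrapper contributes to the full document.
PROLOG_LINES = 6
EPILOG_LINES = 2


def wrap_fragment(body: str) -> str:
    """Turn a node body into a complete document a validator can accept."""
    return PROLOG + body + EPILOG


def unwrap_fragment(document: str, top: int, bottom: int) -> str:
    """Drop `top` lines from the start and `bottom` lines from the end."""
    lines = document.splitlines()
    return "\n".join(lines[top:len(lines) - bottom])


body = "<p>Hello</p>\n<p>World</p>"
document = wrap_fragment(body)
fragment = unwrap_fragment(document, PROLOG_LINES, EPILOG_LINES)
```

Stripping by line count is fragile: it only works as long as whatever processes the document keeps the wrapper on a known number of lines, so the counts have to match the tool's actual output.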

For interesting but often overlooked background on the issues involved with the concept, I suggest you look at the W3C XML Fragment Interchange specification.

Let's do IT

OK, now what? Parsing HTML on the fly within a Drupal module is typically out of the question, because it is too much work. Parsing the fragment with the XML DOM PHP extension is not a good idea either: it will reject any invalid construct instead of correcting it. So?
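To make the problem with a strict XML parser concrete, here is a small sketch. It uses Python's xml.dom.minidom as a stand-in for the PHP DOM extension (an assumption made for illustration only), and the helper name is_well_formed is hypothetical. The point is the same: tag soup is refused outright, not repaired.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once given a single root."""
    try:
        # A fragment may have several top-level elements, so wrap it in one.
        minidom.parseString("<div>" + fragment + "</div>")
    except ExpatError:
        # The parser only reports the error; it makes no attempt to fix it.
        return False
    return True


good = is_well_formed("<p>Some <b>bold</b> text</p>")
bad = is_well_formed("<p>Some <b>bold text</p>")
```

A parser like this can tell you that a post is broken, which is useful for rejecting input, but it cannot produce the corrected markup the module is after.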

Enter tidy. Read about it, but in short, it does just what we are looking for. It is even available as a PECL extension, but since that is not part of the Drupal requirements, we can't really rely on it. However, it is also available as a command-line binary, easy to install on either Windows or Linux/UNIX, so the module uses that.

The module settings allow you to define the actual command used for tidying (possibly setting the path, adding options...), and the number of lines to trim from the top and bottom of the tidied output to remove the prolog and epilog and regenerate the fragment. Some tidy options could be added to the default -asxml to forcefully remove unwanted attributes.
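As a rough sketch of how these two settings fit together, the following Python function runs a configurable tidy command on an already wrapped document and trims a configurable number of lines from the result. It is not the module's code: the function name, parameter names, and the injectable run argument are assumptions for the example, and -q (quiet) is added to the default -asxml only to keep the output clean. Tidy exits with 0 on success, 1 when it issued warnings, and 2 when it hit errors it could not recover from.

```python
import shlex
import subprocess


def tidy_document(document, command="tidy -asxml -q",
                  trim_top=0, trim_bottom=0, run=subprocess.run):
    """Pipe `document` through the tidy command and trim the wrapper lines.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without tidy being installed.
    """
    # shlex.split follows POSIX quoting rules, so quote paths with spaces.
    args = shlex.split(command)
    result = run(args, input=document, capture_output=True, text=True)

    # Exit code 1 means warnings only: the output is still usable.
    if result.returncode >= 2:
        raise RuntimeError("tidy failed: " + result.stderr.strip())

    lines = result.stdout.splitlines()
    return "\n".join(lines[trim_top:len(lines) - trim_bottom])
```

The right values for trim_top and trim_bottom depend on what the configured command actually prints, so it is worth running the command by hand once and counting the wrapper lines before trusting the trimmed result.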

The module is available from my sandbox on drupal.org.

Note that it is only "proof of concept", and will, for instance, not validate dynamically generated content like PHP nodes.

UPDATE 2006-10-02: valid_node or htmltidy module?

I just noticed the HTMLtidy module on drupal.org. Although the goals and results are not strictly identical, the two modules have much in common.

The main differences seem to be:

HTMLtidy is older and has evolved longer, so it has more features
HTMLtidy has more settings

In both cases, the mechanism is to use hook_nodeapi(), wrap a fragment, tidy the result, and strip the wrapping. Check the sources for more details.