Force valid HTML with valid_node module

Letting contributors without HTML skills enter content on a Drupal site often leads to invalid tag soup. Even seasoned coders make the occasional HTML input error, which can remain a problem until someone fixes the post.

So I figured I'd force valid HTML from user input, and here is the proof-of-concept valid_node module: it forces any node to be saved as an XHTML fragment.

XHTML fragments

Since the page as a whole is built by Drupal, the data input by the user is never a valid XHTML document on its own: it lacks the XML prolog, the head section, and the body element. However, valid node content should typically be valid XHTML once wrapped in these elements.

This is exactly what the module does: wrap the node body in a prolog and epilog, validate the whole, and remove the processed prolog and epilog, leaving us with the "fragment".
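The wrap and strip steps can be sketched as follows. This is a minimal Python illustration, not the module's PHP code: the `PROLOG` and `EPILOG` strings and both function names are assumptions made for the example, and the validation step in the middle is left out.

```python
# Hypothetical wrapper text: the prolog and epilog used by the real
# module may differ from these.
PROLOG = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head><title>fragment</title></head>\n'
    '<body>\n'
)
EPILOG = '\n</body>\n</html>\n'


def wrap_fragment(fragment: str) -> str:
    """Turn a node body into a complete XHTML document."""
    return PROLOG + fragment + EPILOG


def unwrap_document(document: str) -> str:
    """Recover the fragment, assuming the wrapper came back unchanged."""
    if not (document.startswith(PROLOG) and document.endswith(EPILOG)):
        raise ValueError("document is not wrapped as expected")
    return document[len(PROLOG):len(document) - len(EPILOG)]
```

In practice the validator may reformat the wrapper itself, so an exact string match like this one is fragile; that is one reason to strip by line count instead.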

For interesting but often overlooked background on the issues involved with the concept, I suggest you look at the W3C XML Fragment Interchange specification.

Let's do IT

OK, now what? Parsing HTML on the fly within a Drupal module is typically out of the question, because it is too much work. Parsing the fragment with the PHP XML DOM extension is not a good idea either: it will reject any invalid construct instead of correcting it. So?

Enter tidy. Read about it, but in short, it does just what we are looking for. It is even available as a PECL extension, but since this is not part of the Drupal requirements, we can't really rely on it. However, it is also available as a command-line binary, easy to install on either Windows or Linux/UNIX, so the module uses that.

The module settings allow you to define the actual command used for tidying (possibly setting the path, adding options...), and the number of lines to trim from the top and bottom of the post in order to remove the prolog and epilog and regenerate the fragment. Some tidy options could be added to the default -asxml to forcefully remove unwanted attributes.
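As a rough illustration of these two settings, here is a Python sketch, not the module's PHP code. The function names and the default trim counts of 0 are assumptions; only the `tidy -asxml` command comes from the description above. It relies on tidy reading the document from standard input, writing the result to standard output, and exiting with status 1 when it only has warnings to report.

```python
import shlex
import subprocess


def trim_lines(text: str, top: int, bottom: int) -> str:
    """Drop `top` lines from the start and `bottom` lines from the end."""
    lines = text.splitlines()
    # Never let the end index fall before the start index.
    end = max(top, len(lines) - bottom)
    return "\n".join(lines[top:end])


def tidy_fragment(wrapped: str, command: str = "tidy -asxml",
                  top: int = 0, bottom: int = 0) -> str:
    """Run the configured tidy command, then trim the wrapper lines.

    `command`, `top` and `bottom` stand in for the module settings;
    their defaults here are placeholders, not the module's values.
    """
    result = subprocess.run(
        shlex.split(command),   # allows a custom path and extra options
        input=wrapped,
        capture_output=True,
        text=True,
    )
    if result.returncode > 1:   # 0 = clean, 1 = warnings, 2 = errors
        raise RuntimeError(result.stderr)
    return trim_lines(result.stdout, top, bottom)
```

The right values for `top` and `bottom` depend on how many lines tidy emits for the prolog and epilog with the chosen options, which is why they are settings rather than constants.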

The module is available from my sandbox on drupal.org.

Note that it is only "proof of concept", and will, for instance, not validate dynamically generated content like PHP nodes.

UPDATE 2006-10-02: valid_node or htmltidy module ?

I just noticed the HTMLtidy module on drupal.org. Although the goals and results are not strictly identical, the two modules have much in common.

The main differences seem to be:

  • HTMLtidy is older and has evolved longer, so it has more features
  • HTMLtidy has more settings

In both cases, the mechanism is to use hook_nodeapi(), wrap a fragment, tidy the result, and strip the wrapping. Check the sources for more details.