Dokuwiki vs Google

Submitted by Frederic Marand on

For some months now, I had been noticing that the Audean wiki, which I use as a live documentation site for various aspects of my sites, appeared comparatively rarely in Google search results, even though it was referenced in various places and cache:-prefixed queries showed that Google had indexed the site.

Now, the Audean Wiki is based on Splitbrain's Dokuwiki, a very convenient Open Source wiki engine which often comes up alongside Drupal for documentation purposes, and it appears there are three problems with a default Dokuwiki installation that prevent effective search engine optimization.

Here's how to clear these hurdles.

Unclean URLs

By default, Dokuwiki does not use clean URLs, but an implementation-related format based on doku.php?<path>. Although Google itself has been able to follow this type of link for years, other engines are not as efficient. It is therefore advisable to turn clean URLs on in Dokuwiki. This is achieved either through Apache configuration with mod_rewrite, for instance in a .htaccess file, or through the $conf['userewrite'] variable in conf/local.php: setting it to 1 tells Dokuwiki that the web server rewrites the URLs (the mod_rewrite route), while setting it to 2 enables "internal" processing of URLs by the wiki engine itself. Either way, the resulting URLs lose the question mark defining parameters. First hurdle crossed.
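As a sketch, the relevant setting in conf/local.php might look like this (values as documented in Dokuwiki's rewrite configuration; adjust to your server setup):

```php
<?php
// conf/local.php -- Dokuwiki local configuration
// 0 = no rewriting        (doku.php?id=ns:page)
// 1 = web server rewrites (needs mod_rewrite rules, e.g. in .htaccess)
// 2 = internal rewriting  (doku.php/ns:page, no server support needed)
$conf['userewrite'] = 1;
```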

Programmer namespaces

Any medium-size wiki will take advantage of the "namespace" feature in Dokuwiki, and Audean is no exception. However, even with rewriting enabled, namespaces use a colon (":") separator between namespace and terminal path, much as one can find in programming languages (think C++ class::member syntax, for instance). This means all URLs sit at the same depth level, which hides the site hierarchy from crawlers. Fortunately, Dokuwiki can rewrite these colons too. This is achieved using the $conf['useslash'] variable in conf/local.php. Note this will only work if the first step of enabling rewriting has already been implemented.
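Concretely, again in conf/local.php (the before/after URLs below assume rewriting is already enabled as in the previous step):

```php
<?php
// conf/local.php -- use '/' instead of ':' as the namespace separator
// in URLs, so e.g. /wiki:syntax becomes /wiki/syntax
$conf['useslash'] = 1;
```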

Nonexistent existing pages

The third problem requires going deeper into HTTP and thinking about what happens when a user agent (browser, search engine crawler) requests a non-existent page. The wiki parses the clean URL, finds the page to be missing and, being a wiki engine, offers a new page creation dialog. Fine and dandy. However, in that case, Dokuwiki considers it has actually found a page (the "new page" creation one), and returns an HTTP 200 status. To Google, and presumably other search engines as well, this means the site is trying to perform spamdexing by answering on content it doesn't hold (now you know why bots sometimes request absurd-looking URLs from your site or submit irrelevant data to your forms). You'll learn this when trying to validate your site on Google's sitemap program. They even have a specific explanation page for this problem, along with various server-dependent solutions. In Dokuwiki's case, though, the solution is simple: use the $conf['send404'] variable in conf/local.php. The new page creation dialog will now be returned along with a 404 status.
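One last fragment for conf/local.php, per Dokuwiki's send404 option:

```php
<?php
// conf/local.php -- return HTTP 404 for pages that do not exist yet,
// while still serving the "create this page" dialog in the response body
$conf['send404'] = 1;
```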

There is a catch

In this last case of new page creation, well-behaved browsers like Opera and Firefox have no problem with this and display the page creation dialog normally. However, there remains one broken browser, which ignores the RFC 2616 HTTP standard. Quoting from section 10.4, first paragraph of the RFC: "User agents SHOULD display any included entity to the user."

Granted, the semantically proper status code wouldn't be 404 but 409 Conflict. To quote RFC 2616: "This code is only allowed in situations where it is expected that the user might be able to resolve the conflict and resubmit the request. The response body SHOULD include enough information for the user to recognize the source of the conflict. Ideally, the response entity would include enough information for the user or user agent to fix the problem; however, that might not be possible and is not required." This is exactly what is happening: the user can recognize the source of the conflict (missing content) and fix it (by creating the content). But I've yet to see code 409 actually used.

Oh, well. Audean is for coders anyway, and these types don't normally use MSIE. For others, a browser detection script could be used to return the wrong results MSIE needs, and the proper results to other user agents.
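A minimal sketch of such a workaround, assuming a hypothetical point in the code where the status line is emitted (sniffing the User-Agent string is the usual, admittedly fragile, way to spot MSIE):

```php
<?php
// Hypothetical sketch: send the RFC-correct 404 to everyone except MSIE,
// which mishandles response bodies on error statuses.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($ua, 'MSIE') === false) {
    // Well-behaved agents display the included entity anyway (RFC 2616 10.4)
    header('HTTP/1.0 404 Not Found');
}
// ...then emit the "create this page" dialog as usual.
```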