Tracking the WSOD

Submitted by Frederic Marand on

At long last, I've moved the OSInet sites from shared hosting to a VDS over the 4-day Xmas period, and overall it went well.

However, with new users (dare I say bots?) eager to refresh the site after a few hours offline, I soon had to withstand about 20 hits/sec, which proved a bit too much for this small VDS. So I enlisted the help of APC, just like the drupal.org sysops do, and like I do on the group's intranet servers, and things got much faster.

Except that, after a few minutes, I entered the true realm of winter holidays: pure white screens, without any text on them. Also known as the White Screen Of Death to the Drupal community.

The apc.php dashboard supplied with APC gave some very interesting insight into what was happening: it helped me tune APC and even Apache, reducing most of the default parameters until they fit such a smallish server. The most interesting point, if you track this, appears when the cache fills up: at some point, APC empties the entries older than the TTL duration, ready for reallocation... and one can find the site with 0 items in cache, while the stats still show a full cache!

I haven't looked into the APC code yet, but I noticed this appeared to be the point at which WSODs manifested themselves, even though the server still had available RAM. Pending further examination, I suspect this means the entry lifetime set in apc.ttl has expired, but the garbage collection defined by apc.gc_ttl hasn't run yet. The latter seems to be fixed at 3600, whatever it is set to in php.ini, although its reference describes it as being settable there. Looks like mucho debugging ahead!

Restarting Apache clears the problem, luckily, as we've learnt from the day-to-day operation of drupal.org, but I hope to find a better solution.

If you're using mod_php and APC, once you get the WSOD, you'll notice in Apache's error log that Apache segfaults for every PHP page requested. On my server, this can happen once every 1-2 weeks, or sometimes multiple times a day. It has been a long-standing bug in mod_php + any PHP accelerator, and as far as I know, there is no fix for it. You can get around it by using something like this script, or you can switch to PHP via FastCGI, which causes no problems with PHP opcode caching.
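
For illustration only, here is a rough Python sketch of the kind of watchdog such a workaround might use. It is not the script referred to above. The log path, the restart command and the threshold are all assumptions to adapt to your own server, and it relies on Apache writing "exit signal Segmentation fault" for each crashed child and "resuming normal operations" after each (re)start.

```python
import subprocess

# Substrings assumed to appear in Apache's error log.
SEGFAULT_MARKER = "exit signal Segmentation fault"
RESTART_MARKER = "resuming normal operations"


def segfaults_since_restart(log_lines):
    """Count segfault notices logged after the most recent (re)start."""
    count = 0
    for line in log_lines:
        if RESTART_MARKER in line:
            # Crashes before this point belong to an earlier server run.
            count = 0
        elif SEGFAULT_MARKER in line:
            count += 1
    return count


def needs_restart(log_lines, threshold=3):
    """True once enough children have crashed since the last (re)start."""
    return segfaults_since_restart(log_lines) >= threshold


def check_and_restart(log_path="/var/log/apache2/error.log",   # assumed path
                      restart_cmd=("apachectl", "restart"),    # assumed command
                      threshold=3):                            # assumed threshold
    """Restart Apache if the log shows repeated segfaults; return True if it did."""
    with open(log_path, errors="replace") as log:
        restart = needs_restart(log, threshold)
    if restart:
        subprocess.run(restart_cmd, check=True)
    return restart
```

Calling check_and_restart() from a small script run by cron every minute or so would automate the "restart Apache" step. Because the count resets at each "resuming normal operations" line, the same crashes should not trigger a second restart on the next run.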