Static Page Caching for Drupal 4.7


This weekend I tackled coding up a Drupal feature I’ve been sorely missing on many a past project: static page caching. My yet-to-be-named module is a replacement for Drupal 4.7’s built-in caching. Instead of storing the pre-generated cached pages in the database, the module stores them in a cache directory on the file system.

Big deal, right? Actually, as I’ll demonstrate below, it makes all the difference.

What Drupal’s built-in caching does is just cut down on the code that needs to run on each page request, while reducing the database access to a single query that retrieves the cached page to display. This in itself provides a marked speed-up, sure enough, but still necessitates invoking PHP on each page request, and what’s worse, opening a connection to the backend database. Those connections become a scarce resource once the going gets tough.

In contrast, my cache module exports Drupal pages into plain old static HTML files. When a page request can be satisfied from the static cache, PHP is bypassed in its entirety, and the web server serves the cached file straight from the disk at whatever ultimate top speed it is capable of (most HTTP servers these days, including behemoth Apache, are able to saturate a 100 Mbit/s pipe when serving static files from an adequate server box).
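To make the mechanism concrete, here is a minimal Python sketch of how a request could be mapped to a file in a static cache directory. This is not the module's actual code (the module targets Drupal, which is written in PHP). The `cache` directory, the per-host subdirectory, the `index` fallback name and the function name `cache_file_for` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical location of the static cache; the real module's layout may differ.
CACHE_ROOT = Path("cache")


def cache_file_for(url: str, host: str, root: Path = CACHE_ROOT) -> Optional[Path]:
    """Return the file a static cache could hold for this URL.

    Returns None when the URL should not be mapped to a file at all
    (it carries a query string or contains "." / ".." path segments).
    """
    parts = urlsplit(url)
    if parts.query:
        # Assumption: URLs with query strings are left to the dynamic site.
        return None
    segments = [s for s in parts.path.split("/") if s]
    if any(s in (".", "..") for s in segments):
        # Never let a request path escape the cache directory.
        return None
    name = "/".join(segments) or "index"
    return root / host / (name + ".html")


def lookup(url: str, host: str, root: Path = CACHE_ROOT) -> Optional[Path]:
    """Return the cached file if it exists on disk, else None (a cache miss)."""
    candidate = cache_file_for(url, host, root)
    if candidate is not None and candidate.is_file():
        return candidate
    return None
```

In a real deployment the equivalent existence check would presumably live in the web server's own configuration, since the whole point is that no scripting language runs on a cache hit. The sketch only shows the path-to-file mapping.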

Here’s what this looks like in practice, from a quick benchmark on my PowerBook (1.67GHz PowerPC G4). The yellow bars represent page requests served from Drupal’s standard, database-backed cache storage, and the bluish bars shooting waaay to the right of them are how many page requests can be served when we throw off the yoke of PHP and shed our dependence on SQL:

Page requests per second (mean)

As you can see, even in this trivial benchmark, the performance boost was more than an order of magnitude.

(This was a makeshift benchmark on an underpowered laptop. The benchmark consisted of timing a Drupal installation’s 16K front page using Siege. This was with Drupal 4.7.1, PHP 4.4.1, Apache 2.2.2 and MySQL 4.1, the latter three binaries as compiled and installed from DarwinPorts, all running pretty much with their stock settings. Additionally, I threw eAccelerator 0.9.5b2 into the mix as well, to give PHP performance a hand, since without it, Drupal’s database-backed caching had trouble completing the higher concurrency levels.)

Requests per second isn’t the whole story, either. The above chart doesn’t show how badly page response times (i.e. the time the visitor has to wait for a page to finish loading) degraded under the standard caching as the load grew. In the chart below, shorter bars signify faster page loads:

Page response time in seconds (mean)

The best thing about caching a Drupal site as static pages is that it’s dead simple, in every respect: it’s trivial to set up, easy to manage, and straightforward to scale. Throw in a good lightweight web server (Lighttpd comes to mind), and you’ll scale to the very limits of your hardware, with few arbitrary roadblocks to hold you back. That’s not the case with Drupal’s database caching, which requires you to tune the MySQL configuration, and more often than not, mess around with operating system settings, such as file descriptor limits, to handle even moderate traffic.

As a testament to the above, it took me over an hour to configure the MySQL 4.1 daemon on my laptop in order to enable the database-backed cache to even pass the 100x concurrency level benchmark. Before I figured out how to adjust the measly limit of 256 file descriptors that Mac OS X gives processes by default, every other page request was failing with Drupal complaining of not being able to connect to the database. Unless you happen to have stocked away some sysadmin experience, a situation like this could have you scratching your head for a bit.
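For readers who have not met file descriptor limits before, the following Python sketch shows how a process can inspect its own limit on open files and raise the soft limit up to the hard limit. It is not the procedure used above to reconfigure the MySQL daemon. It relies on Python's Unix-only `resource` module, it affects only the current process and its children, and the helper names `fd_limits` and `raise_soft_limit` are made up for this example.

```python
import resource
from typing import Tuple


def fd_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target: int) -> int:
    """Try to raise the soft limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Some systems impose
    additional caps, in which case setrlimit raises an exception.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited; nothing to raise.
        return soft
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
```

Every open socket, including each database connection, counts against this limit, which is why a low default can surface as connection failures under concurrent load.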

What’s worse, on a shared host, mucking around with system-level stuff is often neither possible nor permitted. Many hosts also impose implicit, arbitrary limits on the CPU time that you are allowed to consume on a daily basis, meaning that if you run a popular site with a dynamic CMS like Drupal, you may be facing a forced upgrade to a dedicated server.

Serving out static pages, on the other hand, is what web servers do best. You can run even a very popular site, as a static version, off a shared host without receiving that dreaded e-mail from the billing department.

Come to think of it, with this module, you might e-mail them for a discount since you wouldn’t even be using, or needing, your “fair share” of server resources. Better yet, depending on what kind of site you run, you might install Drupal on your own computer, downgrade to a cheaper hosting account that doesn’t provide PHP & MySQL support in the first place, and just upload the generated HTML files via FTP or rsync. The static page caching is multisite-compatible, meaning you could keep a single, private “master” copy of Drupal to manage and publish all your sites.

So, what’s the catch, you ask? Well, the usefulness of static page caching really depends directly on what kind of site you run; namely, how “dynamic” your site is. Obviously, if you make use of forms or features that submit information from the visitor back to Drupal (e.g. comments, the feedback module, etc.), your site can’t be exported into a 100% static format, since something still needs to handle receiving the form submissions (or to put it technically, POST requests are not cached).

A similar example would be if you have, for instance, a quotes block and consider it very important that it updates with a random quote on every page request, instead of every 5 minutes (say), as would be the case using static caching.
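The "every 5 minutes" behaviour amounts to a simple age check on the cached file. Here is a minimal Python sketch of that idea, assuming the cache lifetime is compared against the file's modification time. The 300-second default mirrors the example figure above, and the function names are hypothetical rather than taken from the module.

```python
import time
from pathlib import Path
from typing import Optional


def age_is_fresh(mtime: float, now: float, max_age: float = 300.0) -> bool:
    """True if something written at `mtime` is at most `max_age` seconds old at `now`."""
    return (now - mtime) <= max_age


def is_fresh(path: Path, max_age: float = 300.0, now: Optional[float] = None) -> bool:
    """True if the cached file exists and is younger than the cache lifetime.

    A missing file counts as stale, so the caller regenerates the page.
    """
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False
    current = time.time() if now is None else now
    return age_is_fresh(mtime, current, max_age)
```

A block that must change on every request can never pass such a check, which is why it does not fit a purely static cache.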

Other than the above considerations, the module’s first and foremost current limitation, shared with Drupal’s standard caching, is that only pages served to anonymous visitors are cached; requests from logged-in users are passed through to Drupal in a normal fashion. This means that if you run a large community where users need to register and login to participate, most of the user interaction will still need to be dynamic. (I have figured out a way to keep a user-specific cache, as well, but I’m not yet sure the implementation is worth the effort or complexity. If you think otherwise, please leave a comment to that effect, or drop me a private e-mail.)
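The two rules described above (POST requests are not cached, and only anonymous visitors get cached pages) can be summed up in a short Python sketch. The `Request` type and the `is_cacheable` function are hypothetical, and how a visitor is recognised as logged in is deliberately left out, since that detail is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Just the parts of a page request that matter for the caching decision."""
    method: str       # HTTP method, e.g. "GET" or "POST"
    path: str         # requested path, e.g. "/node/42"
    logged_in: bool   # assumption: the caller already knows this


def is_cacheable(request: Request) -> bool:
    """Decide whether a request may be answered from the static cache."""
    if request.method.upper() != "GET":
        # Form submissions (POST) must always reach the dynamic site.
        return False
    if request.logged_in:
        # Logged-in users are passed through to the dynamic site as usual.
        return False
    return True
```

Treating only GET as cacheable is a conservative choice made for this sketch. Everything that fails the check simply falls through to Drupal as before.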

What the static page cache is ideal for, then, is personal blogs, corporate sites, portals, directories & the like, where most of the site is targeted at anonymous users, and only the occasional feature (say, posting a comment or sending feedback) will need to bypass the cache and be handled by Drupal. For these kinds of sites, you can probably benefit from static caching on well over 95% of your site.

I have a long list of client sites running Drupal 4.6 that I’ll be upgrading to 4.7 as soon as possible in order to let them benefit from this module. But first, I will need to test the module on a couple different shared hosts, and of course, implement it on this site. I’ll polish the module up for general distribution soon after I complete the above. (For the time being, if you wish to give it a spin, please drop me a line.)

Till then, here are a couple of screenshots to tide everyone over:

The extended cache configuration in Drupal’s settings screen.

The administrative interface for the Ajaxy cache rebuild process.

Update (2006/06): some additional details are available from my post to the Drupal development mailing list.

Update (2006/10): there’s now a project site and issue tracker on drupal.org. Please note that the appropriate place to post support requests is in the issue tracker, not as comments to this page (support request comments will get summarily ignored).

Update (2007/07): Justin Miller has a great write-up about using Boost on Drupal 5.x, with lotsa technical details and a very cool logo to boot.

Update (2007/08): Boost is now available for Drupal 5.x, too. Many thanks to Alexander Grafov for the initial porting work.

Ahh nice and something I always want in any CMS. I was thinking of this lately and have another point to add:

AJAX is a nice way to throw in dynamic content while keeping most content statically generated and cached. Before I try to explain that, think of this: most content on any site with a CMS like Drupal will become static over time, like news (which, once it has occurred, does not change in content), or descriptions, or almost anything. What remains dynamic is thus the content in what Drupal calls the “blocks”. This is something that I have seen on almost every site. There is a neat distinction between a webpage’s main content (a news article, for example) and its dynamic blocks: ads, who’s online, latest poll, etc.

Now this should make it clear what I am thinking of: cache everything as static HTML and let the blocks come through AJAX. The data for the AJAX blocks that need dynamic content can simply come from a non-cached data source. This mix will work for almost any type of site, but the module is complex.

Of course your concept is a lot simpler to implement and valuable for many sites, but most people probably use a Drupal-type CMS to get the dynamic touch.

This is just in a concept form and I haven’t yet thought as deeply about it… but someday …. :)

I am currently working on the GData module as an SoC 2006 participant.
Cheers !!!

Thanks for your comment, Sumit; I think that’s a very good idea that you outlined.

I had already been thinking of implementing Drupal’s commenting system with Ajax, but this could certainly be applied to Drupal’s blocks as well. With keep-alive HTTP connections, it should not have a very detrimental effect on the server, either. Certain types of blocks, like shoutboxes, are in fact anyway better-suited for asynchronous use in the first place.

However, I’m not yet sure how much less of a strain it would be to render a Drupal block, instead of a whole, themed Drupal page. Since blocks can make use of full Drupal functionality, I think Drupal might need to be fully bootstrapped in order to serve out even just a simple dynamic block, such as the Quotes on this site. But this certainly merits some experiments.

Congrats on getting accepted to Google’s Summer of Code! I had a look at your blog, and GData is definitely something I’ll need to read up on. May your endeavor be successful!

Interesting stuff. I sent a reply to the developer list.

For a sample of AJAX-y / Static Drupal go to: http://new.savannahnow.com/share/ and look for the ‘Neighborhood Blogs’ widget in the lower left of the page.

My generator.module uses Curl to grab pages from Drupal and fwrite them as .php files on the web server. Some details here:
http://ken.therickards.com/2006/06/04/tech-notes/

(See item #1 under ‘new stuff’).

I output .php in order to check for dynamic $_GET and $_POST requests and to duplicate some of Drupal’s session handling functions to check for user status. We serve flat pages to all users, though, and let JavaScript handle user customization on these pages.

-Ken

Hi Arto, please be aware of this static file cache project we are planning to commit to Drupal 4.8.

http://drupal.org/node/45414

Please review the latest patches and respond on the issue.

You have missed a couple of important design considerations. First, most of the benefits of serving static pages come from avoiding Apache’s instantiation of a PHP thread, which is around 20MB. Our file caching doesn’t handle this either.

The main benefit actually comes from avoiding loading Drupal to serve a page, which your system doesn’t avoid. I’d argue that once you have loaded Drupal, MySQL’s query cache, if tuned, is equivalent to a static page being served from memory, since the OS will cache the file in memory.

In your performance tests with Siege you’ll find your results are likely skewed by Drupal’s error pages, which report a DB connection failure in the page text rather than returning a legitimate HTTP error. That means Siege, or other measuring tools, will treat failed pages as successful under heavy load.

I strongly discourage testing on low-powered hardware, as operating system performance and PHP/Apache/MySQL resource consumption do not scale linearly. The result is that small limitations in hardware can be optimized away, but this will be irrelevant on full-powered production servers where those resource constraints do not exist.

Collaboration on interface design and different levels of performance will be appreciated.

Thanks,
Kieran

Ken: sounds like we’re working on the same stuff :-) ...I’ll respond to your questions to the developer mailing list.

Kieran: as I mentioned on the developer mailing list, I’m aware of the core patch you refer to, and my approach is complementary to it. However, you may want to reread my blog post, since contrary to your claim, my approach specifically invokes neither PHP nor Drupal when serving out a cached page to an anonymous visitor. That’s the whole point, and why I’m not so interested in the core patch you speak about. As for Siege’s results, as I mention in the article, I was indeed getting error pages until I tuned my MySQL settings to be able to handle the Siege attack; therefore, the results presented here do not include failed requests.

Hi Arto,

I was just wondering how your experiment/future module was coming along. Are you any closer to releasing the code? Your results look really promising.

Thanks!

The performance benefit displayed above is phenomenal. What a benefit to sites that handle 1000s of concurrent users, especially for low-budget NPOs and NGOs that cannot afford the extra expense of higher-performance servers to meet the load demands of growing communities.

I am most interested in your work for community projects we are currently working on.

Feature request… Can we have a per-node cache time setting that overrides the default time setting?

We would use this for pages that are 100% static and give them an infinite cache time or a very long cache time like 42 days. Can such pages also have a refresh cache button, just in the event of an update to that page?

Thanks for the work
One Love

This project seems just the thing I’ve been looking for in Drupal. I’m definitely interested in hearing how it’s been working out and would love to try it myself. Drop me a line when you can so I can give it a whirl.

Thanks,
Jonathan

This sounds like a great project Arto. I’ve been tracking Drupal caching issues for 4.7 for a while. Other than the blockcache.module there hasn’t been much to look forward to.

I’ll be checking your project on Drupal.org with much interest!

Great work :) How do I download this static page add-on? I have a dedicated server, but with over 20,000 page hits an hour at peak, it is unable to support the site.

This is working great, especially considering you call it “alpha”. Thanks for a wonderful file caching solution, it’s a life saver.

Hello,

It has been a long time. I have been busy with other things and am now finally back to working on Drupal-based web sites. I will be using your module for a web site we are planning, and if all things go right then we expect huge traffic. I will report any issues I face and get involved as much as possible! Thanks again for the great module.

Also I will be needing the AJAX trick later :) but not right now.

I think it is very important to have a cache solution for logged-in users when you have a bigger community. If you could create such a solution that would be great, but it seems to be a bit complicated because every page is generated for the individual user (the privatemsg & guestbook modules, for example, may be a problem).

Great module.

I work in the field of medical education and I was always looking for a possibility to use a CMS like Drupal for collecting and editing content with a large community on a “staging” server, BUT export finished articles to plain HTML for raw speed on a dedicated production server for viewing.

Most CMSs suck… They still access their databases, and sometimes they do some caching (but you do not know what they actually cache…). It’s dreadful.

There are so many advantages of static export:


  • Cheap hosting and hardware – even a dual PIII is able to fill up a 100 Mbit connection and serve millions of requests.
  • It’s faster than any database driven website.
  • VERY IMPORTANT: better indexing with search engines like Google: plain HTML is simply the best “food” for Google

I love this module.

Will there be a version for Drupal 5.x?

Hey, cool, I finally found your funky plugin. I always wondered why nobody had implemented a static cache, only to find out that I just hadn’t found it in the Drupal repository :)

Your benchmarks look impressive and I can't wait to get my hands on a Drupal 5.x version of this.

best, christoph

I’ll second the vote for caching pages for registered users. For sites with little customization, but which limit lots of stuff (commenting, posting, voting, creating content) to registered users, it’d be just great.

I think it makes sense to enable infinite cache for all pages that are unchanged. As I am not a developer, I may not have a clue about this. Is there a reason this isn’t possible at all?

Thanks for the module; I’m looking forward to seeing a 5.x version of it. It is in great demand. I’d even barter my services (SEO, usability and copywriting consulting) for quicker development of the module, if you are interested.