The return of funky caching

So yeah, a 404 handler that gzips files that aren't being used much could save me some extant web space, instead of having so many files out on disk at a time. Phil says PHP author Rasmus Lerdorf coined the term funky caching for this in his Tips and Tricks presentation (but look here for now). I meant to do something like that at the time, but haven't yet. Instilled with hacking prowess from new projects--tbpy and a Python CGI I wrote for gzipping access logs before I download them--I do believe I will.

An answer to l.m.'s question "what happens if I want to edit old content, or I change templates, or what not?" is that, if the unarchiver is a 404 handler, any file you save with that URL automatically overrides the one in storage. After all, if a file is there, the 404 handler doesn't get called; the stored resource will never be unarchived if you have a resource by that name there already. Presumably you have a scheduled process that sticks unused files on the disk into storage, so eventually that process will run, and the old version, having never been pulled out over the new one, is overwritten.
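To make the override behaviour concrete, here is a minimal sketch of the unarchiving half, assuming the stored copies are gzipped files kept in a directory tree that mirrors the document root. The function name `unarchive_on_404` and the `docroot` and `storage` parameters are made up for illustration, and the wiring into Apache (for example an `ErrorDocument 404` pointing at a CGI script) is left out.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def unarchive_on_404(url_path: str, docroot: Path, storage: Path) -> Optional[bytes]:
    """Restore a stored resource into docroot and return its bytes.

    Returns None when nothing is stored under that path (a genuine 404).
    Hypothetical layout: storage/<path>.gz mirrors docroot/<path>.
    """
    rel = url_path.lstrip("/")
    live = (docroot / rel).resolve()
    # Refuse paths that would escape the document root, such as "../secret".
    if docroot.resolve() not in live.parents:
        return None
    stored = storage / (rel + ".gz")
    if not stored.is_file():
        return None
    live.parent.mkdir(parents=True, exist_ok=True)
    # Decompress the stored copy back to where the web server expects it.
    with gzip.open(stored, "rb") as src, open(live, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return live.read_bytes()
```

Because this only ever runs when the server found nothing at that URL, a freshly saved file is never touched by it; the stale stored copy just sits there until the archiving process replaces it.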

c.z. robertson wrote, "I figure that it actually needs to be something more than just caching as Phil Ringnalda describes it. A few weeks ago I figured out that what I was doing was essentially re-inventing Make." It doesn't need to be more like Make--the caching doesn't need to be tied to your tool. Admittedly we were (or at least I was) originally thinking that, so you could apply your template when unarchiving, but the neat part of divorcing your tools from the cache is the cache can work with all your baking tools. I don't see why I would need to limit a funky cacher to my MT archives when it would work with the POT.py-published part of my site with no additional configuration.

So I wonder if there are any other questions, or if I should move on to the implementation step. A question raised by the above is, "How do you tell a file is unused?" If Apache touches atimes when reading the files (will the filesystem handle that for Apache?), just use those. If for some reason that wouldn't work, you could use SSI to make a call out when the page is viewed to set the atime, or have another scheduled process crunch the access logs for that data. If you wanted really funky caching, just archive the files anyway if they are older (by some mactime) than a week, and pull them out when they're requested next. If that's immediately, it's immediately.
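The atime approach could look something like the following sketch of the scheduled archiving process. Everything here is an assumption rather than a finished design: the name `archive_unused`, the one-gzip-per-file layout, and a `storage` directory that lives outside `docroot`. Whether the atime is actually updated on reads depends on how the filesystem is mounted (options such as `noatime` or `relatime` change that), so check before relying on it.

```python
import gzip
import shutil
import time
from pathlib import Path
from typing import List, Optional


def archive_unused(docroot: Path, storage: Path, max_idle_seconds: float,
                   now: Optional[float] = None) -> List[str]:
    """Gzip files not accessed within max_idle_seconds into storage.

    Returns the relative paths that were archived and removed from docroot.
    Assumes storage is NOT inside docroot.
    """
    now = time.time() if now is None else now
    archived = []
    # sorted() materialises the listing first, so deleting files is safe.
    for live in sorted(docroot.rglob("*")):
        if not live.is_file():
            continue
        if now - live.stat().st_atime < max_idle_seconds:
            continue  # read recently enough; leave it on disk
        rel = live.relative_to(docroot)
        stored = storage / (rel.as_posix() + ".gz")
        stored.parent.mkdir(parents=True, exist_ok=True)
        # Overwrites any older stored copy, so edited files win.
        with open(live, "rb") as src, gzip.open(stored, "wb") as dst:
            shutil.copyfileobj(src, dst)
        live.unlink()
        archived.append(rel.as_posix())
    return archived
```

Run from cron with something like a week for `max_idle_seconds`, this is the process that eventually overwrites a stale stored copy with whatever is currently on disk.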

SSI, logs, and laziness are the ones I thought up first, of course, but they could interfere with the operation of a Blosxom if implemented naïvely. Blosxom only looks at one date, though (ctime? mtime?) so watching the atime should be a simple, effective solution. It would rather have the Blosxom nature, I think. Using the other times is a neat "trick"--I wrote a culler for the Blagg instance I set up, since I didn't want all the old news hanging around. However, I also didn't want to remove any news until it left the feed, so whenever Blagg found an old item, it had to touch the file. But I was also using Blosxom to make a page of that news, so it had to make the old items still look old to Blosxom. Blosxom uses the mtime, so I had Blagg touch the atime instead (utime scalar(time()), (stat($i_fn))[9], $i_fn;) and had the culler look at the atime.
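For anyone more at home in Python than Perl, the same atime-only touch might be written roughly like this. The function name `touch_atime_only` and its `now_ns` parameter are mine, not anything from Blagg or Blosxom.

```python
import os
import time
from typing import Optional


def touch_atime_only(path: str, now_ns: Optional[int] = None) -> None:
    """Set a file's atime to now while leaving its mtime as it was."""
    st = os.stat(path)
    if now_ns is None:
        now_ns = time.time_ns()
    # os.utime takes (atime, mtime); pass the existing mtime straight back.
    os.utime(path, ns=(now_ns, st.st_mtime_ns))
```

The file still looks old to anything that reads the mtime, while a culler that reads the atime sees it as fresh.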

(That's mainly why I don't know if Apache serving a resource touches its atime: the last time I played with mactimes was that Blagg thing, and except for the touching, the file doesn't get opened. If the filesystem touches the atime whenever the file is opened (and that's the point of the atime, as I understand it!) then the simple solution works and all the better.)

Here's a question for which I have no pat answer: how do you keep from overflowing your disk space? On the request side, you can remember the last few folks who visited, and if one particular visitor is hitting a lot of archived pages, serve them out of the archive without writing to disk. How do you handle disk overflows in general, though? What if you rebuild your entire MT site but there's not enough space, because your MT site is too large to fit all on disk anymore? I guess you would have a ratchety threshold for how old a resource has to be to be stuck in the archive and removed from disk, and the lower you are on disk space, the lower you make that threshold. Then you depend on someone invoking the 404 handler, so you can clean up... That's not a good solution.
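The ratchety threshold, at least, is easy to sketch, even if it doesn't answer the question of who triggers the cleanup. The numbers below (a week at most, an hour at least, ratcheting between 25% and 5% free space) and the names `idle_threshold` and `free_fraction` are placeholders I picked for illustration, not recommendations.

```python
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the disk holding path that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def idle_threshold(free: float,
                   max_idle: float = 7 * 86400,   # plenty of space: one week
                   min_idle: float = 3600,        # nearly full: one hour
                   low_water: float = 0.05,
                   high_water: float = 0.25) -> float:
    """How long (in seconds) a file may sit unread before being archived.

    The less free space there is, the lower the threshold drops.
    """
    if free >= high_water:
        return max_idle
    if free <= low_water:
        return min_idle
    # Linear interpolation between the two water marks.
    span = (free - low_water) / (high_water - low_water)
    return min_idle + span * (max_idle - min_idle)
```

A scheduled archiver could call `idle_threshold(free_fraction(docroot))` on each run instead of using a fixed week, which at least keeps the cleanup from depending on someone hitting the 404 handler.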

Yeah, I think I'll be speccing a wee funky caching system soon, especially if someone has suggestions for those error conditions (or just that particular one).