RSS-generated bandwidth is a problem again, so says Wired News. As Dave Winer was ditheratied (looks like it should end up archived here, but it's the last line of the Wired article anyway), any solution "require[s] developers to play nice." Doubtless there are enough aggregators still not maximizing their use of HTTP to show that's not a good assumption. (I still get referrals from my comment in that thread.) Regardless, here are a couple of solutions:
- If you use Perl, use Sean M. Burke's XML::RSS::TimingBot. If you don't, use any precreated RSS library that supports conditional HTTP GET (or, if you know what you're doing, write your own library that supports the timing tags like TimingBot).
- Subscribe to an aggregate list of new weblogs like your personal blo.gs feed, and only fetch a site's feed if it has shown up in the list since you last fetched.
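For anyone writing their own library, conditional GET mostly comes down to handing back the validators the server gave you last time. Here is a minimal sketch in Python using only the standard library. The function names are hypothetical, and it does not handle the timing tags that TimingBot also honors:

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the request headers that make a GET conditional."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch a feed, sending the validators saved from the previous fetch.

    Returns a dict; "changed" is False when the server answered 304.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return {
                "changed": True,
                "body": response.read(),
                # Save these and pass them back in on the next fetch.
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: nothing to download, keep the old validators.
            return {
                "changed": False,
                "body": None,
                "etag": etag,
                "last_modified": last_modified,
            }
        raise
```

The caller has to store the returned `etag` and `last_modified` per feed between runs. Without that, every request is a full download again.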
And the shiny new idea I care too dearly about: let web services subscribe to your weblog by providing a URI to which you will post your new entries using the weblog posting API of your choice. For that matter, Atom could grow this pretty easily:
<link rel="service.subscribe" type="application/x.atom+xml" href="http://markpasc.org/mtx/atom.cgi/subscribe" title="Subscribe to Atom API posts of markpasc.org weblog" />
(For that matter you could use TrackBack--or maybe this idea could mutate into a replacement of TrackBack with Atom API posting. Or maybe use a network of forwarding posters, so you can subscribe to posts or at least notifications from the outside aggregator of your choice, as above with lists. Or maybe maybe maybe maybe.)
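To make the push idea a little more concrete, here is a rough sketch of the publishing side in Python: given the URIs subscribers have registered, POST each new entry to all of them. Everything here is an assumption for illustration. The function names are made up, the content type is simply borrowed from the link element, and no posting API defines this exchange.

```python
import urllib.request

# Assumption: reuse the type from the link element for the POST body.
ATOM_TYPE = "application/x.atom+xml"


def build_push_requests(entry_xml, subscriber_uris):
    """Build one POST request per subscriber URI, without sending anything."""
    body = entry_xml.encode("utf-8")
    return [
        urllib.request.Request(
            uri,
            data=body,
            headers={"Content-Type": ATOM_TYPE},
            method="POST",
        )
        for uri in subscriber_uris
    ]


def push_entry(entry_xml, subscriber_uris):
    """POST the entry to every subscriber; return the URIs that failed."""
    failed = []
    for request in build_push_requests(entry_xml, subscriber_uris):
        try:
            with urllib.request.urlopen(request, timeout=10):
                pass  # Any successful response counts as delivered.
        except OSError:
            # Covers URLError, HTTPError and timeouts; retrying is up to you.
            failed.append(request.full_url)
    return failed
```

The cost is one outgoing request per subscriber per entry, so this only pays off while the subscriber list stays small.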
As I said, all of this is how to improve client behavior. Poorly behaving clients are the real problem: all it takes is one client downloading the entire feed every five minutes to undo the gains of a hundred kind aggregators. It's up to you whether and how you punish rude clients; I admit on occasion I've manually redirected bad clients to a feed with an error message. I'm loath to automate it (if only because that's what Slashdot does), but impermanent throttling, or "I know you just requested that so I'm going to give you 304s anyway," could work. Except 304s when they weren't requested are probably against the spec.
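Impermanent throttling could be as simple as remembering when each client last got the full feed. This is a sketch in Python, not something I run; the class name and the 30-minute default are made up. It only decides whether a client has come back too soon, and leaves it to the server to choose what to send a throttled client:

```python
import time


class FeedThrottle:
    """Track when each client last received a full feed response."""

    def __init__(self, min_interval_seconds=1800):
        self.min_interval = min_interval_seconds
        self._last_full = {}  # client id -> time of last full response

    def allow_full_response(self, client_id, now=None):
        """Return True if this client may get the full feed right now."""
        now = time.time() if now is None else now
        last = self._last_full.get(client_id)
        if last is not None and now - last < self.min_interval:
            return False  # Too soon; the throttle lapses on its own.
        self._last_full[client_id] = now
        return True
```

The client id could be an IP address, a User-Agent string, or both. Because nothing is recorded for a throttled request, a client that slows down to the minimum interval gets served normally again.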
Does anyone have any other solutions to share?
Comments
comment
I love how Dave assumes that (a) there is a problem, and (b) there is a technological solution, and (c) he will obviously be in the center of it. Completely sidestepping the inconvenient truth that he has the least understanding of HTTP of any aggregator author (or web server author, for that matter — Jesus Christ, the man built the only web server on Earth that doesn’t support 404). The community practically had to beat him over the head with the very concept of ETags and Last-Modified headers. Radio is the *last* aggregator on Earth that doesn’t support gzip compression. He has no clue how to build scalable web services (as evidenced by the piss-poor design of weblogs.com and constant outages of radio.weblogs.com).
Any Wired reporter who goes to him for quotes is fulfilling everything Dave hates about lazy professional journalists (except when they’re being lazy by talking to him, of course).
comment
I’m a little slow (well, a lot, really) today: is your shiny idea that I post a URI to your service.subscribe URI, saying that I want you to post new entries to my URI?
Bleah, that doesn’t describe very smoothly, does it? Sounds interesting, but I’m not sure it works for the right people. I could happily run a process that posts to however many people are reading me, and most of the people reading me have an addressable server running somewhere or other, but at first thought it seems like the people most in trouble (super-limited bandwidth, or super-high demand) aren’t likely to be able to swing it: you aren’t likely to be unable to afford a few thousand 304s a day, but be able to afford firing up a script that does a few hundred posts for every entry, and if you have tens or hundreds of thousands of subscribers, pushing to them all seems like it would be expensive. It also seems likely that as you increase your readership into the tens of thousands (or move away from techie bloggers), you would get a lot more people who don’t have a URI or a static unfirewalled IP address where they can run a server for you to post to. Maybe not, though: what do I know?
comment
Mark Pilgrim- People in glass houses shouldn’t throw stones.
comment
Phil, those are good points, yes. It may be the case that the burden of 304s is small enough that switching to real push won’t make a difference. (That’s the idea of 304s, anyway, to minimize that difference.) In that case it’s still a matter of applying any or all of the tech I elided parenthetically that as far as I know has been big talk with little action for quite a while.
The shiny idea doesn’t necessarily involve the subscription being a web service, but the more simple part that the APIs people have created for “humans” (desktop clients mostly) to push entries to weblogs can also be used by weblogs to push entries to aggregators. I’m not aware of anyone using them like that yet—and as you point out, for good reason.
(Aside: I found looking at my April logs after this post that someone really high in my referrer logs was there for continuing to request a feed I switched to 301 Moved Permanently long ago. Bad Syndirella, bad. April was the first month ever I had more than a gig of traffic. If this is really a problem and the solution depends on clients behaving well…)
comment
The history of Userland’s poor Etag support is well-documented on scripting.com http://www.google.com/search?&q=etag%20site:archive.scripting.com (couched in the language of “hey, everybody, look at this amazing new HTTP feature I just noticed in 2002, a mere 8 years after the HTTP 1.0 specification was released”). That excellent “Conditional GET for RSS Hackers” article that everybody links to was written because Dave couldn’t grasp the concept (and couldn’t be bothered to read the spec — real men don’t read specs, they write apps… shitty apps, apparently). Note the dates on that article, and then the dates when Dave “magically” discovers HTTP 304 codes.
Radio’s lack of gzip support is noted here: http://www.sauria.com/blog/articles/aggregator-gzip.html
Frontier’s lack of 404 support is noted here: http://www.unix-girl.com/blog/archives/001293.html
The sorry state of 301 support in all aggregators is noted here: http://diveintomark.org/archives/2003/04/23/inbriefinsomniacedition and http://diveintomark.org/archives/2004/03/29/direct-deposit and http://diveintomark.org/archives/2003/07/21/atomaggregatorbehaviorhttp_level
I can’t possibly make this stuff up, folks.
comment
Hmmm, maybe you could detect the bandwidth wasting aggregators and insert an “annoying, changing” explanatory post most recently in the feed explaining the problem. The annoyance would perhaps prompt users to switch aggregators, or pressure authors to fix up their aggregators.