Prrarf 0.1c: More on dates

Thanks to whoever it was who pointed at my previous table of RSS date formats for pointing at my previous table of RSS date formats. (I can't find your URL in my referrer logs, so sorry I'm not linking--I forget how I found you before.) People who cared about my "oops" might be interested in the new RSS 2.0 LiveJournal style I've made since then. It's not an S2 style because LJ's S2 is still broken, and it uses dc:date over pubDate because LJ's HTML cleaner still won't allow camel-cased XML tags.

There's a bit more chatter about, from Mark Pilgrim, Simon Willison, Brad Choate, et al., so I thought I'd enumerate what my table meant to me, as an aggregator author.

  • As I said, the best I can do is 50%. Half the folks I read just plain don't give timestamps.
  • Support both pubDate and dc:date. Actually, this comes for free with Mark Pilgrim's rssparser, as it doesn't distinguish between the two.
  • Support both RFC2822 and the W3C profile of ISO8601. (This is a good example of a "profile," if you were curious what folks meant when discussing "a profile of RSS.") I'm forced to do this as well, because, as above, rssparser ignores whether it was pubDate or dc:date.
  • I don't bother with ISO8601 without a time. What day it was posted is not useful for sorting the entries; if I scan daily, everything new is the same date anyway. The time is important.
  • Be lenient in what you accept. Particularly, accept \d{1,2} instead of \d\d, because numbers aren't always zero-padded. Also, it's easy; it sure wouldn't be worth doing for my one such channel if it were hard.
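To illustrate that last point, here's a minimal sketch (not Prrarf's code; STRICT, LENIENT, and parse_fields are names made up for the example) comparing a strict two-digit pattern with a lenient one on a date whose month and day aren't zero-padded.

```python
import re

# Hypothetical patterns, for illustration only.
STRICT = re.compile(r"(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)")
LENIENT = re.compile(
    r"(\d{4})-(\d{1,2})-(\d{1,2})T(\d{1,2}):(\d{1,2}):(\d{1,2})"
)

def parse_fields(pattern, text):
    """Return the six date fields as ints, or None if text doesn't match."""
    m = pattern.match(text)
    if m is None:
        return None
    return [int(g) for g in m.groups()]

unpadded = "2003-6-2T17:57:41"
strict_result = parse_fields(STRICT, unpadded)    # no match: "6" is one digit
lenient_result = parse_fields(LENIENT, unpadded)  # matches all six fields
```

The lenient pattern still accepts properly padded dates, so nothing is lost by loosening it.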

So here's Prrarf's date handling code. The i['real_date'] flag marks items that actually carried a date, so items with no date set will have their dates omitted in the final output. (Dateless items get default, the time the previous scan finished, so that they show after those with dates. This is my way of promoting items with dates, though I'm sure no particular feed producers care.) Also note I don't actually handle time zones. Heh.

import re
import time

# lastAggrTime (the time tuple for when the previous scan finished) is set
# elsewhere in Prrarf.
def makeItemTime(i, default=lastAggrTime):
    """
    Returns a UNIX system time in seconds from the epoch representing when
    item i was published. If i has no 'date', the given default, a Python
    time tuple, is used.
    """
    t = default
    if 'date' in i:
        # RSS 1.0 shape: 2002-10-18T16:10:15-05:00
        # The time zone groups (tzsign, tzhour, tzmin) are matched but ignored.
        m = re.match(r"(?P<year>\d{4,})-(?P<month>\d\d?)-(?P<day>\d\d?)"
                     r"T(?P<hour>\d\d?):(?P<min>\d\d?):(?P<sec>\d\d?)"
                     r"(?:(?P<tzsign>[-+])(?P<tzhour>\d\d):(?P<tzmin>\d\d))?",
                     i['date'])
        if m:
            t = list(map(int, map(m.group, ('year', 'month', 'day', 'hour', 'min', 'sec')))) + [0, 0, 0]
            i['real_date'] = 1
        else:
            # RSS 2.0 shape: Mon, 2 Jun 2003 17:57:41 EDT
            # The time and zone are optional; the zone (tz) is matched but ignored.
            m = re.match(r"(?:..., )?(?P<day>\d\d?) (?P<monthname>[^ ]+) (?P<year>\d{4,})"
                         r"(?: (?P<hour>\d\d?):(?P<min>\d\d?)(?::(?P<sec>\d\d?))? (?P<tz>.{3}))?",
                         i['date'])
            if m:
                datebits = list(map(m.group, ('year', 'monthname', 'day', 'hour', 'min', 'sec')))
                datebits[1] = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6,
                               'Jul':7, 'Aug':8, 'Sep':9, 'Oct':10, 'Nov':11, 'Dec':12}[datebits[1]]
                if datebits[3] is None:
                    # No time given at all: treat it as midnight.
                    datebits[3] = datebits[4] = datebits[5] = 0
                elif datebits[5] is None:
                    # Time given without seconds.
                    datebits[5] = 0
                t = list(map(int, datebits)) + [0, 0, 0]
                i['real_date'] = 1
    # mktime wants a tuple (or struct_time) and interprets it as local time.
    return time.mktime(tuple(t))
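Since the code above ignores time zones, here's a rough sketch of how the offset on a W3C-style date could be folded in. This isn't what Prrarf does: the W3C pattern name, the tz* group names, and the w3c_to_epoch function are assumptions made up for the example, and it treats a date with no offset as UTC, which is a guess rather than anything the spec promises.

```python
import calendar
import re

# W3C date-time shape with an optional offset or "Z" (names are assumptions).
W3C = re.compile(
    r"(?P<year>\d{4,})-(?P<month>\d\d?)-(?P<day>\d\d?)"
    r"T(?P<hour>\d\d?):(?P<min>\d\d?):(?P<sec>\d\d?)"
    r"(?:(?P<tzsign>[-+])(?P<tzhour>\d\d):(?P<tzmin>\d\d)|Z)?"
)

def w3c_to_epoch(text):
    """Return seconds since the epoch (UTC) for a W3C date-time, or None.

    A missing offset is treated as UTC, which is an assumption.
    """
    m = W3C.match(text)
    if m is None:
        return None
    names = ("year", "month", "day", "hour", "min", "sec")
    fields = tuple(int(m.group(n)) for n in names)
    # timegm reads the tuple as UTC, unlike mktime, which assumes local time.
    seconds = calendar.timegm(fields + (0, 0, 0))
    if m.group("tzsign"):
        offset = int(m.group("tzhour")) * 3600 + int(m.group("tzmin")) * 60
        if m.group("tzsign") == "-":
            offset = -offset
        # The stamp is UTC plus the offset, so subtract it to get back to UTC.
        seconds -= offset
    return seconds
```

With this, 2002-10-18T16:10:15-05:00 and 2002-10-18T21:10:15Z come out as the same instant, which is what you'd want when sorting items from feeds in different zones.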

In other news, looking for the link I never found for the beginning of this article, I ran across the link to my previous RSS aggregator wishlist. So how well am I doing with Prrarf?

  • Mark Pilgrim's ultra-liberal RSS locator. This would be part of a subscription component, which I might try to con Joshua into building, since he's thinking of reviving his Teacup RSS reader project.
  • A strict parser for various flavors of RSS, plus Mark Pilgrim's ultra-liberal RSS parser to vocally fall back on. So half there.
  • Kit's temporal sifting features, which are even nicer now that I'm using l.m. orchard's BlosxomPaginate as well.
  • Kit's searching features. Strictly this is done, but it does it the stupid way Kit did it. File access overhead is too big compared to Radio's object database overhead: doing a search shoots Apache to the top of top, which I'd prefer it didn't. But other than that, what can I do? Index the news with Perlfect Search or some such?
  • Kit-style grouping.
  • Some notion of read and unread items. I'm not sure I still want this.
  • Periodical scan as an option, if you count throwing it in cron. A "Scan Now" button on the page would be nice.
  • Irregular periodical scan.
  • P2P feed sharing.

So that's a lot in my list I haven't done. Question is, how much of this do I actually want nowadays? Not much.

Comments

comment

Build an RSS locator? Sure, I’m up for it. Do ya’ have any more details?

comment

Well, was thinking more the subscription component. You said you wanted Teacup to have one, and Prrarf has none, yet. CGI to list, add (plus with locator bookmarklet), remove feeds; an importable API with which Prrarf could enumerate the feeds. It could have a full locator component as well, but I was more thinking of a general subscription management interface.