In order to actually write some code, I mostly ignored the spec I wrote and just turned some out. So, here's Stapler.py 0.9. (How come I can never link things right the first time, and don't test?) It needs:
- Joe Gregorio's httpcache.py, included.
- The CSS stuff from the other day, included.
- Vinay Sajip's PEP 282-compliant logging module, not included.
At the moment you just configure the files in the feeds directory correctly and it'll do the rest (i.e., you run it as python stapler.py, without options). I've included two example feeds, one for a generic web comic strip that doesn't exist, and one for librarian.net. Please don't use them without changing them. The web comic doesn't exist, and Jessamyn isn't affiliated with this in the least (other than that I'm going to switch to generating the official librarian.net feed with Stapler.py when it's done).
The web comic feed is an "XML" type feed: an XML document with a driver tag and a format tag. Each tag's name attribute names a plugin, and its subtags (e.g., selector) are passed as Python "kwargs" to that plugin's function (fetch for drivers, write for formats). So it's the parameter names of the driver/format function that determine which tags you use in the XML file.
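Here's a rough sketch of that tag-to-kwargs mapping, not the actual Stapler.py code. The feed document, its url and title tags, and the stand-in fetch function are all made up for illustration; only the driver/format tags, the name attribute, and the selector subtag come from the description above.

```python
import xml.dom.minidom

# Hypothetical feed document; the <feed>, <url> and <title> tags are assumptions.
FEED_XML = """
<feed>
  <driver name="css">
    <url>http://example.com/comic/</url>
    <selector>div.strip img</selector>
  </driver>
  <format name="rss2">
    <title>Example Comic</title>
  </format>
</feed>
""".strip()


def subtags_to_kwargs(element):
    """Turn each child element into a keyword argument holding its text."""
    kwargs = {}
    for child in element.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip the whitespace between tags
        text = "".join(
            node.data for node in child.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        kwargs[child.tagName] = text.strip()
    return kwargs


def fetch(url, selector):
    """Stand-in for a driver plugin's fetch; a real one would hit the network."""
    return [{"url": url, "selector": selector}]


doc = xml.dom.minidom.parseString(FEED_XML)
driver = doc.getElementsByTagName("driver")[0]
plugin_name = driver.getAttribute("name")  # which plugin to look up
items = fetch(**subtags_to_kwargs(driver))  # subtag names must match parameters
```

A misspelled or unknown tag shows up as a TypeError from the call, which is the flip side of letting the function signature define the file format.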
The librarian feed, on the other hand, is a "Python" type feed: just Python code that gets run with the drivers and formats in callable scope. It does some conversion (the blockquote change I have Stapler.root doing with the undocumented callback feature) that you couldn't (currently) do with an XML feed, so this is actually necessary. (The XML type is really just a shortcut.)
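To make "run with the drivers and formats in callable scope" concrete, here's a minimal sketch that assumes the feed source is simply exec'd with the plugin functions as globals. The names regex_fetch and rss2_write, their parameters, and the string replacement are hypothetical placeholders, not the real plugins or the real blockquote change.

```python
# Hypothetical stand-ins for a driver plugin and a format plugin.
def regex_fetch(url, pattern):
    return [{"title": "Post", "body": "<blockquote>quoted</blockquote>"}]


def rss2_write(items, title):
    return "%s: %d item(s)" % (title, len(items))


# What a "Python" type feed file might contain: fetch, convert, write.
FEED_SOURCE = '''
items = regex_fetch(url="http://example.com/", pattern="<p>(.*?)</p>")
for item in items:
    # Placeholder conversion step that an XML feed has no way to express.
    item["body"] = item["body"].replace("blockquote", "q")
output = rss2_write(items, title="Example feed")
'''


def run_python_feed(source, plugins):
    """Execute feed source with the plugin callables visible as globals."""
    scope = dict(plugins)
    exec(source, scope)
    return scope


scope = run_python_feed(
    FEED_SOURCE,
    {"regex_fetch": regex_fetch, "rss2_write": rss2_write},
)
```

Running arbitrary code from the feeds directory is only reasonable because you wrote those files yourself.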
What's left to do:
- Formats and drivers. I only have the regex and CSS selector drivers, and a basic RSS 2.0 format (so adding more options to that is one of the things to do as well).
- Multifeed processing: in just throwing this version together, I tied sources to feeds more tightly than the spec says. This is bad because I can't make my comics feed in XML. I can, however, make a Python feed that calls multiple fetches, building one list that I then pass to a single write. Is it OK to limit it to Python feeds like this, or should I do it more like Stapler.root (make a special aggregate driver that can access the scanned items of other feeds, which also means I'll need to make those other feeds keep their scanned items past the one feed parsing somehow)?
- Command line options like logging output level control and which feeds to process. The latter is especially necessary if I'm cutting out the scan-every-N-hours configuration option on the assumption you can configure that in the cron commands you set up to run Stapler.py. (But should I do that?)
- Don't assume tags in XML documents only contain text. Maybe even pass the xml.dom.Element in? But then you constrain the Python type feeds. Maybe have separate fetch and fetchFromXML functions in each module or something?
- Figure out how the hell normal people are supposed to use the thing, assuming I intend for normal people to use the thing.
- Converter for Stapler.root feeds.
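For the multifeed item above, the Python-feed workaround would look something like this sketch. The fetch and write functions here are hypothetical stand-ins, as are the source URLs and selectors.

```python
# Hypothetical stand-ins for a driver's fetch and a format's write.
def fetch(url, selector):
    return [{"title": "Strip from %s" % url, "link": url}]


def write(items, title):
    lines = [title] + ["- %s <%s>" % (i["title"], i["link"]) for i in items]
    return "\n".join(lines)


# Made-up sources; each dict is one set of driver kwargs.
SOURCES = [
    {"url": "http://example.com/comic-a/", "selector": "img.strip"},
    {"url": "http://example.com/comic-b/", "selector": "div#comic img"},
]

# Several fetches feed one list...
items = []
for source in SOURCES:
    items.extend(fetch(**source))

# ...which goes to a single write.
feed = write(items, title="My comics")
```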
Lastly, an interesting bit: I finally figured out how to make a package "load itself"--see __init__.py in the formats or drivers directories.
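One way a package can "load itself" is sketched below. This is an assumption about the general technique, written with modern pkgutil and importlib, and may not match what the Stapler.py __init__.py files actually do. The demo_drivers package and its two modules are invented, and they are written to a temporary directory so the example runs on its own.

```python
import importlib
import os
import sys
import tempfile

# Contents of the self-loading __init__.py: import every sibling module.
INIT_SOURCE = '''
import importlib
import pkgutil

for _finder, _name, _ispkg in pkgutil.iter_modules(__path__):
    importlib.import_module(__name__ + "." + _name)
'''

# Build a throwaway package on disk (hypothetical name and modules).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_drivers")
os.mkdir(pkg_dir)
files = {
    "__init__.py": INIT_SOURCE,
    "regex.py": "def fetch(**kwargs):\n    return ['regex item']\n",
    "css.py": "def fetch(**kwargs):\n    return ['css item']\n",
}
for filename, source in files.items():
    with open(os.path.join(pkg_dir, filename), "w") as handle:
        handle.write(source)

sys.path.insert(0, root)
importlib.invalidate_caches()

# Importing the package alone pulls in every plugin module inside it.
demo_drivers = importlib.import_module("demo_drivers")
loaded = sorted(n for n in ("regex", "css") if hasattr(demo_drivers, n))
```

With this in place, adding a driver means dropping a new file into the directory, with no list of plugins to maintain.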