Slicing HTML with CSS selectors

This code extracts HTML that matches CSS3-ish selectors.

I won't say I outright stole Mark Pilgrim's HTML parser code, but how many ways are there to use sgmllib? I certainly used his code as a model though.

It doesn't support combinators (foo + bar, foo > bar, foo ~ bar), though it handles descendant selectors (foo bar) just fine. Actually, it'll parse the combinators; it just won't record them in any way, so the HTML parser ignores them.
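To make the descendant behavior concrete, here is a minimal sketch of how a parser can match "foo bar" by keeping a stack of open tags. It is not the code from this entry: sgmllib is gone from Python 3, so the sketch uses the standard html.parser module instead, and the names DescendantMatcher, VOID_TAGS and COMBINATORS are made up for illustration. It only handles bare tag names, and it assumes combinators are written with spaces around them so they can be dropped as separate tokens.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Combinator tokens are recognised but thrown away (assumption: they are
# space-separated, e.g. "div > a"), so "div > a" behaves like "div a".
COMBINATORS = {">", "+", "~"}


class DescendantMatcher(HTMLParser):
    """Collect the ancestor path of every tag matching a descendant selector."""

    def __init__(self, selector):
        super().__init__()
        self.steps = [s.lower() for s in selector.split()
                      if s not in COMBINATORS]
        self.stack = []    # names of the currently open tags
        self.matches = []  # one path (list of tag names) per matching tag

    def handle_starttag(self, tag, attrs):
        path = self.stack + [tag]
        if self._matches(path):
            self.matches.append(path)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest open tag of this name; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def _matches(self, path):
        # The last step must be the tag itself; the earlier steps must
        # appear, in order, somewhere among its ancestors.
        if not self.steps or path[-1] != self.steps[-1]:
            return False
        remaining = self.steps[:-1]
        for name in reversed(path[:-1]):
            if remaining and name == remaining[-1]:
                remaining = remaining[:-1]
        return not remaining


matcher = DescendantMatcher("div a")
matcher.feed('<div><p><a href="x">one</a></p></div><a>two</a>')
print(matcher.matches)  # [['div', 'p', 'a']]
```

The second a tag is skipped because it has no div ancestor. Telling "div > a" apart from "div a" would mean keeping the combinator next to each step instead of discarding it.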

It doesn't do pseudos (e.g., :first-child) either. That was too much engineering up front to accomplish in one thunk, and, well... this is already a step up from the application it's replacing.

The attribute comparators it supports are:

=   Attribute must exactly equal the value.
~=  Attribute must contain the value in a space-separated list (e.g., class attributes).
|=  Attribute must contain the value in a hyphen-separated list (e.g., lang attributes).
^=  Attribute must begin with the value.
$=  Attribute must end with the value.
*=  Attribute must contain, somewhere, the value.

For example, the selector img[src*="foo"] will match all img tags with the text "foo" in their src URLs.
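The six comparators boil down to a handful of string tests. The function below is a sketch of that idea rather than the code in the download; attr_matches is a hypothetical name. It follows the descriptions above literally, so |= accepts the value anywhere in the hyphen-separated list, which is looser than the CSS rule (there the attribute must equal the value or start with the value followed by a hyphen).

```python
def attr_matches(actual, op, expected):
    """Return True if attribute value `actual` satisfies `op` for `expected`.

    `actual` is None when the tag does not carry the attribute at all.
    """
    if actual is None:
        return False
    if op == "=":
        return actual == expected
    if op == "~=":
        # Space-separated list, e.g. class="entry first".
        return expected in actual.split()
    if op == "|=":
        # Hyphen-separated list, e.g. lang="en-US".
        return expected in actual.split("-")
    if op == "^=":
        return actual.startswith(expected)
    if op == "$=":
        return actual.endswith(expected)
    if op == "*=":
        return expected in actual
    raise ValueError("unknown comparator: %r" % (op,))


# img[src*="foo"] against <img src="/images/foo.png">
print(attr_matches("/images/foo.png", "*=", "foo"))  # True
```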

Yes, this is for Stapler.py. Thanks for asking.
