Tuesday, March 2, 2010

Matching HTML tags with regular expressions, or why blogroll identification is not so easy

Yeah, I just got it!
As trivial as the problem sounds, it was a hard lesson for me to learn.
Scenario:
I am working on the crawler and trying to identify blogrolls. So far that sounds easy. Yeah, it would be, if anyone were ever nice enough to put this information into a well-formed RSS 2.0 feed.
But as far as I can see, every blog can insert a blogroll into its HTML content with almost no standard. I have to admit that there are several WordPress plugins and other concepts that help the user create one.
Since the output of these generators can only be identified by tag ids or CSS classes like 'xoxo-blogroll', which can easily be changed, it is hard to extract this stuff in a general way.
So I ended up implementing an algorithm that extracts the inner HTML of a tag which contains 'blogroll' anywhere in it (so CSS classes, ids and even titles get recognized).
My first idea: yeah, just another nice example of how unreadable regular expressions can speed up your code and copyright it simultaneously.
Here are my drafts:
<div[^>]*?blogroll[^>]*?>(.*?)</div>

Here we go, the first draft, for a div tag.

<(.+?)[^>]*?blogroll[^>]*?>(.*?)</\1>

Next step: matching an arbitrary tag which contains 'blogroll' anywhere.
Note: I reuse a captured group directly inside the regular expression (a backreference), a nice feature, especially for this use case.

But as you may notice, we have to consider multiple opening and closing tags of the same kind nested inside each other. So I thought about this problem for some time and remembered my theoretical computer science studies, where my prof taught us about pushdown automata and type-3 (regular) grammars... In the end, it is proven that one cannot check my use case with a regular expression: matching arbitrarily nested tags requires counting, and a finite automaton cannot count without bound. So I have to introduce two count variables to check for the right number of opening and closing tags to get the whole content.
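
A small Python sketch makes the problem concrete. The pattern is an assumed variant of the backreference draft (with a stricter tag-name group), and the sample markup is invented for illustration. The lazy `(.*?)` stops at the first closing tag of the same name, so everything after a nested element is lost:

```python
import re

# Assumed variant of the backreference draft: group 1 is the tag name,
# group 2 the lazily matched inner HTML, \1 demands the same name when closing.
DRAFT = re.compile(
    r'<([A-Za-z][A-Za-z0-9]*)[^>]*?blogroll[^>]*?>(.*?)</\1>',
    re.DOTALL,
)

# Hypothetical sample: a blogroll div that contains a nested div.
html = (
    '<div class="blogroll">'
    '<div><a href="http://a.example">A</a></div>'
    '<a href="http://b.example">B</a>'
    '</div>'
)

match = DRAFT.search(html)
inner = match.group(2)

# The match ends at the inner </div>, so the second link is cut off.
print(inner)
```

The regular expression has no way to know that the first `</div>` belongs to the nested element rather than to the blogroll container.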

<([A-Za-z][A-Za-z0-9]*+)[^>]*blogroll[^>]*>

This recognizes the opening tag. Then I just scan on from there, matching tags with the same name as the opening tag, and count opening and closing tags until there are at least as many closing tags as opening ones.
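
The counting idea can be sketched in Python roughly as follows. This is only a sketch under assumptions, not the original implementation: the function name `extract_blogroll` is hypothetical, the possessive quantifier is replaced by a `\b` word boundary for portability, and self-closing tags, comments and script content are not handled.

```python
import re

# Opening tag whose attributes mention "blogroll"; group 1 is the tag name.
# The \b keeps the name group from giving characters back to later parts
# of the pattern (a stand-in for the possessive quantifier).
OPEN_BLOGROLL = re.compile(r'<([A-Za-z][A-Za-z0-9]*)\b[^>]*blogroll[^>]*>')


def extract_blogroll(html):
    """Return the inner HTML of the first tag mentioning 'blogroll', or None."""
    start = OPEN_BLOGROLL.search(html)
    if start is None:
        return None
    name = re.escape(start.group(1))
    # Any further opening or closing tag with the same name; group 1 is "/"
    # for a closing tag and empty for an opening tag.
    same_tag = re.compile(r'<(/?)' + name + r'\b[^>]*>', re.IGNORECASE)
    opened, closed = 1, 0  # the blogroll tag itself is already open
    for tag in same_tag.finditer(html, start.end()):
        if tag.group(1):
            closed += 1
        else:
            opened += 1
        if closed >= opened:
            # Balanced: everything between the two tags is the inner HTML.
            return html[start.end():tag.start()]
    return None  # markup is unbalanced, no matching closing tag found


# Hypothetical sample with a nested div inside the blogroll div.
sample = (
    '<div class="blogroll">'
    '<div><a href="http://a.example">A</a></div>'
    '<a href="http://b.example">B</a>'
    '</div>'
)
print(extract_blogroll(sample))
```

The two counters do the job that a regular expression alone cannot: they remember how many tags of this kind are still open.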