Tuesday, July 31, 2012

Summary of "Splog Filtering based on Writing Consistency - Liuwei et al."

CR: kinipela
Liuwei et al. describe a spam blog (splog) filtering technique based on three features of a blog: its writing interval, its writing structure, and its writing topic. They argue that most spam detection mechanisms are designed for static web pages and miss the dynamic nature of blogs.

They define the consistency of the writing interval as the inverse of the variance of the intervals between post updates. A high writing interval consistency therefore means that posts appear at nearly constant intervals.
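As a rough illustration of the inverse-variance idea, here is a minimal Python sketch. The function name `interval_consistency`, the use of plain numeric timestamps, and the small `eps` term that avoids division by zero for perfectly regular blogs are my assumptions, not details taken from the paper.

```python
from statistics import pvariance


def interval_consistency(timestamps, eps=1e-9):
    """Inverse variance of the gaps between consecutive posts.

    timestamps: post times as numbers (e.g. days or seconds).
    eps: assumed smoothing term so a variance of zero does not divide by zero.
    """
    ts = sorted(timestamps)
    # Gaps between each post and the one before it.
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three posts to measure interval variance")
    return 1.0 / (pvariance(intervals) + eps)


# A blog posting exactly once a day versus one posting irregularly.
regular = interval_consistency([0, 1, 2, 3, 4])
irregular = interval_consistency([0, 1, 5, 6, 20])
```

With these inputs the regular schedule scores far higher than the irregular one, which is the behaviour the definition describes.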

The authors also define a measure for the consistency of writing structure. Unexpectedly, there is no NLP magic behind this; the measure simply relates the variation in the number of words per post to the average number of words per post. The underlying assumption is that splogs are packed with keywords and that their posts are all roughly equally long. In contrast, normal bloggers tend to write short or long posts depending on their daily mood.
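The summary does not give the exact formula, so the following Python sketch assumes one plausible reading: the coefficient of variation, i.e. the standard deviation of words per post divided by the mean. The name `length_variation` and the whitespace tokenization are also assumptions. Under this reading, a value near zero indicates equally long posts.

```python
from statistics import mean, pstdev


def length_variation(posts):
    """Standard deviation of words per post divided by the mean (assumed formula).

    posts: list of post texts. Words are split on whitespace, an assumption
    that keeps the measure independent of any language-specific tooling.
    """
    counts = [len(post.split()) for post in posts]
    avg = mean(counts)
    if avg == 0:
        raise ValueError("posts contain no words")
    return pstdev(counts) / avg


# Posts of identical length versus posts of mixed length.
uniform = length_variation(["buy cheap pills", "best cheap pills", "cheap pills now"])
mixed = length_variation(["short note", "a much longer post about my day at the lake"])
```

Here `uniform` is zero because every post has three words, while `mixed` is clearly positive.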

The consistency at the topic level is defined as the average topical similarity of posts, where each post is compared with its preceding post. The topical similarity is the cosine similarity of the two posts' tf-idf word vectors. Under this measure, blogs with a very high topical consistency tend to be auto-generated.
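The topic measure can be sketched in a few lines of dependency-free Python. The weighting used here (raw term count times `log(N / df) + 1`), the lowercase whitespace tokenization, and all function names are assumptions chosen for illustration; the paper may use a different tf-idf variant.

```python
import math
from collections import Counter


def tfidf_vectors(posts):
    """Sparse tf-idf vectors, one dict per post (assumed weighting scheme)."""
    docs = [Counter(post.lower().split()) for post in posts]
    n = len(docs)
    # Document frequency: in how many posts each term occurs.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: count * (math.log(n / df[term]) + 1.0) for term, count in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_consistency(posts):
    """Average similarity between each post and the post before it."""
    vectors = tfidf_vectors(posts)
    if len(vectors) < 2:
        raise ValueError("need at least two posts")
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims)


repetitive = topic_consistency(["buy cheap pills", "buy cheap pills", "buy cheap pills"])
varied = topic_consistency(["went hiking today", "new pasta recipe", "thoughts on chess"])
```

Identical posts give a consistency of about 1, while posts that share no words give 0, so a blog that keeps repeating the same vocabulary lands at the high end of the scale.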

Finally, they introduce a filtering system and evaluate their feature set with three classifiers (SVM, Bayes, C4.5) on the Blog06 data set. They show that the same accuracy can be reached with a reduced number of features when using their feature set. It is also worth mentioning that their heuristic approach is language independent.
