CR: kinipela |
They define the consistency of the writing interval as the inverse variance of post update intervals. A high writing interval consistency implies a very constant update interval.
The authors also define a measure for consistency of writing structure. Unexpectedly, there is no NLP magic behind this; the measure simply relates the variation of words per post and the average number of words per post. The underlying assumption is that splogs are packed with keywords and their posts are all equally long. As contrast, normal blogger tend to deliver short and long posts depending on their daily mood.
The consistency on topic level is defined as the average topical similarity of posts. Each post gets compared with its preceding post. The topical similarity is defined as the cosine similarity of the posts' tf/idf word vectors. Thereby, blogs with a very high topical consistency tend to be auto-generated.
Finally, they introduced a filtering system and evaluated their feature set with three classification mechanisms (SVM, Bayes, C4.5) on the Blog06 data set. They showed that with a reduced feature number the same accuracy is reachable using their feature set. Further, one has to mention that their heuristic approach is language independent.
No comments:
Post a Comment