Blog of Philipp Berger: July 2012

Tuesday, July 31, 2012

Summary of "Domain-Specific Identification of Topics and Trends in the Blogosphere - Schirru et al."

The authors present a system called "Social Media Miner". This system extracts topics and the corresponding, most relevant posts.

The relevance is calculated using a link authority algorithm like PageRank. The main contribution of the paper is the topic detection and tracking mechanism.

Schirru et al. cluster blog posts using a time windowing approach. To create the clusters they use a tf/idf vector for each blog post, k-means, and non-negative matrix factorization for label extraction. To determine the number of clusters they use the residual sum of squares.
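To make the residual sum of squares (RSS) a bit more concrete, here is a minimal sketch. It assumes the posts are already turned into small numeric vectors and that a cluster assignment is given; the toy vectors, the labels, and the function names are mine and not from the paper. How exactly the authors turn the RSS into a final number of clusters is not covered in this summary; a common heuristic is to pick the point where the RSS stops dropping sharply.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rss(vectors, labels):
    """Sum of squared distances of each vector to its cluster centroid."""
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    total = 0.0
    for members in clusters.values():
        center = centroid(members)
        for vector in members:
            total += sum((x - c) ** 2 for x, c in zip(vector, center))
    return total


# Hypothetical toy data: four posts as 2-dimensional term-weight vectors.
posts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

rss_one_cluster = rss(posts, [0, 0, 0, 0])   # everything in one cluster
rss_two_clusters = rss(posts, [0, 0, 1, 1])  # two tight clusters

print(rss_one_cluster, rss_two_clusters)
```

On this toy data the RSS drops sharply when going from one cluster to two, which is the kind of signal one would look for when choosing the number of clusters.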

Nevertheless, their approach is rather simple. They cluster topics for a given period, find relevant terms (or labels), and visualize the term mentions over time as a trend graph.

Check out the paper.

Summary of "Cool Blog Identification using Topic-based Models - Sriphaew et al."

The authors show how to identify cool blogs based on three assumptions: blogs tend to have definite topics, have enough posts, and tend to have a certain level of consistency among their posts.

The level of consistency, or topical consistency, measures whether a blogger focuses on a solid interest; thus it favours blogs with definite topics, such as reviews of mobile devices. It is based on a mixture of topic probabilities of posts (LDA). The authors measure the similarity between a post and its preceding posts. Here, the similarity is the distance between the topic probability distributions, which is calculated using the Euclidean, Kullback-Leibler, or Jensen-Shannon distance.
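The three distances mentioned above can be sketched in a few lines. This is only an illustration under my own assumptions: the topic distributions are plain lists of probabilities that sum to one, and the function names are made up for this post. Note that the Kullback-Leibler divergence is not symmetric and is only defined when the second distribution is non-zero wherever the first one is.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two topic distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) using the natural logarithm.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions of two consecutive posts.
previous_post = [0.7, 0.2, 0.1]
current_post = [0.6, 0.3, 0.1]

print(euclidean(previous_post, current_post))
print(kl_divergence(previous_post, current_post))
print(jensen_shannon(previous_post, current_post))
```

Small values mean that consecutive posts cover a similar topic mix, which is what a topically consistent blog should show.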

They conduct a "user study" based on a corporate blog data set and a single annotator, who categorized 540 blogs as cool or not cool. Using an SVM implementation, the authors were able to show high precision and recall for cool blog recognition.

This is a heuristic approach and can therefore be applied to any language following the same assumptions.
So, check out the paper.

Summary of "Splog Filtering based on Writing Consistency - Liuwei et al."

Image credit: kinipela, http://www.flickr.com/photos/kinipela/202607307/

Liuwei et al. describe a spam blog (splog) filtering technique based on three features: the writing interval, the writing structure, and the writing topic of a blog. They argue that most spam detection mechanisms are designed for static web pages and miss the dynamic nature of blogs.

They define the consistency of the writing interval as the inverse variance of post update intervals. A high writing interval consistency implies a very constant update interval.
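As a minimal sketch of this definition: I assume the post timestamps are given as numbers (for example days since the first post) and I use the population variance. Returning infinity for perfectly regular intervals is my own choice to avoid a division by zero; the paper may handle that case differently.

```python
from statistics import pvariance


def interval_consistency(timestamps):
    """Inverse variance of the gaps between consecutive post timestamps."""
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not intervals:
        raise ValueError("need at least two timestamps")
    variance = pvariance(intervals)
    # Assumption: a zero variance (perfectly regular posting) maps to infinity.
    return float("inf") if variance == 0 else 1.0 / variance


# Hypothetical timestamps in days.
regular_blog = [0.0, 1.0, 2.0, 3.0]          # one post per day
irregular_blog = [0.0, 1.0, 6.0, 8.0, 16.0]  # gaps of 1, 5, 2, and 8 days

print(interval_consistency(regular_blog))
print(interval_consistency(irregular_blog))
```

The machine-like posting rhythm gets the higher score, which matches the intuition behind the feature.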

The authors also define a measure for the consistency of writing structure. Unexpectedly, there is no NLP magic behind this; the measure simply relates the variation of words per post to the average number of words per post. The underlying assumption is that splogs are packed with keywords and their posts are all equally long. In contrast, normal bloggers tend to deliver short and long posts depending on their daily mood.
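The exact formula is not given in this summary, so the following is only one plausible reading: the standard deviation of the post lengths divided by their mean (the coefficient of variation). The function name, the whitespace tokenization, and the choice of this ratio are my assumptions, not the authors' definition.

```python
from statistics import mean, pstdev


def length_variation(posts):
    """Standard deviation of words per post relative to the mean length.

    Assumption: words are split on whitespace. Values near 0 mean that
    all posts are roughly equally long.
    """
    lengths = [len(post.split()) for post in posts]
    average = mean(lengths)
    if average == 0:
        raise ValueError("posts contain no words")
    return pstdev(lengths) / average


# Hypothetical examples.
splog_like = ["buy cheap pills", "win free money", "best loan rates"]
human_like = ["a b", "a b c d e f"]

print(length_variation(splog_like))
print(length_variation(human_like))
```

Under this reading, a value close to zero points to uniformly long posts and therefore to a higher structural consistency.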

The consistency on the topic level is defined as the average topical similarity of posts. Each post is compared with its preceding post. The topical similarity is defined as the cosine similarity of the posts' tf/idf word vectors. Under this measure, blogs with a very high topical consistency tend to be auto-generated.
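Here is a small sketch of that idea in plain Python. The tokenization (lowercasing and splitting on whitespace) and the smoothed idf formula are my assumptions; the paper may weight terms differently. Vectors are stored as dictionaries from term to weight.

```python
import math
from collections import Counter


def tfidf_vectors(posts):
    """One sparse tf/idf vector (term -> weight) per post."""
    docs = [Counter(post.lower().split()) for post in posts]
    n = len(docs)
    # Document frequency: in how many posts does each term occur?
    df = Counter(term for doc in docs for term in doc)
    # Assumption: smoothed idf, so shared terms keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topical_consistency(posts):
    """Average cosine similarity between each post and its predecessor."""
    if len(posts) < 2:
        raise ValueError("need at least two posts")
    vectors = tfidf_vectors(posts)
    similarities = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(similarities) / len(similarities)


# Hypothetical examples.
splog_like = ["buy cheap pills", "buy cheap pills", "buy cheap pills"]
human_like = ["my cat slept all day", "new phone review", "holiday photos"]

print(topical_consistency(splog_like))
print(topical_consistency(human_like))
```

The repeated keyword posts score close to 1, while the posts without shared words score 0.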

Finally, they introduce a filtering system and evaluate their feature set with three classification mechanisms (SVM, Bayes, C4.5) on the Blog06 data set. They show that their feature set reaches the same accuracy with a reduced number of features. Further, it is worth mentioning that their heuristic approach is language independent.

Wednesday, July 25, 2012

Summary of "Using Blog Content Depth and Breadth to Access and Classify Blogs" - Chen et al.

Here, I sum up another interesting paper concerning a content-based ranking of blogs.

The authors present a blog-specific filtering system that measures topic concentration and variation.

They assess the quality of blogs via two main aspects: content depth and breadth. This is motivated by the sparseness of links and the highly personal character of the blogosphere.

The related work essentially consists of two areas: blog assessment and quality assessment.

Figure: D3 force layout of German blogs

Blog assessment. PageRank, HITS, and Technorati's blog authority have two issues: the sparseness of links and the time lag of the scores. Further, most blog search engines are based on simple retrieval models because they only access the limited content of feeds and have to cope with real-time constraints.

Quality assessment. According to Joseph Juran, quality is the "fitness for use" of information. Common quality assessment metrics are based on heuristics for a specific situation. In this context, researchers emphasise how blogs differ in language, structure, and the importance of timeliness. Further, blogs are more interesting, personal, and reflect the author's opinions and experiences. Thus, researchers define the quality of a blog based on the blogger's expertise, trustworthiness, information quality, and its personal nature. In addition, the credibility of commentators also counts.

In essence, the authors present a score that relates five criteria.

The first criterion is the informativeness of a blog, measured as the number of meaningful words. A meaningful word has a high tf/idf score. Secondly, the completeness of a blog indicates how many strongly related words from each mentioned topic are present.

The third criterion is the topic count per blog. Fourthly, the inter-topic distance specifies how many words of a post are shared between topics.

Finally, the topic mergence calculates the general overlap between topics.

The authors conduct a small user study to validate their scoring.

Chen, M. and Ohta, T. (2010), Using Blog Content Depth And Breadth To Access and Classify Blogs. International Journal of Business and Information Volume 5, number 1, June 2010.

Summary of "An effective coherence measure to determine topical consistency in user-generated content - Jiyin He et al."

Here, I sum up an interesting paper concerning a content-based ranking of blogs.

A blog is relevant if it focuses on a central topic. This is called topical consistency.
The authors introduce the coherence score to measure the consistency.
It is based on the intra-blog clustering structure relative to the clustering of the background collection.

One has to differentiate between short-term and long-term interest in blogs.
Further, the key features of blogs are a strong social aspect and their inherent noisiness.

Figure: Force layout of blog interlinkage using D3

The topical noise stems from random-interest blogs or diaries. This creates topical diffuseness (a loose clustering).
One has to find the blogger that is most closely associated with a specific topic.

Blogs mostly fail to maintain a central topical thrust. Nevertheless, the trend is toward ranking whole blogs in order to recommend interesting feeds to the reader.
One has to take the time and the relevance of topics into account.
Accordingly, recurring interest (time based) and focused interest (cohesiveness of the language of posts) should be measured.
The authors' coherence score captures the topical focus and tightness of subtopics in each blog. Thus, it handles the focused interest.

Lexical cohesion is an alternative to the coherence score. It measures the semantic relationships between content words.
To this end, external thesauri like WordNet are used to build lexical chains. The number of chains reflects the number of distinct topics. A so-called chain score is used to measure the significance of a lexical chain.
The lexical cohesion is sensitive to progression of topics, but blind to their hierarchical structure.

The coherence score gives the proportion of coherent document pairs relative to the background collection.
These pairs are determined by thresholding the cosine similarity of documents.
The score measures the relative tightness of the clustering for a blog and prefers structured document sets with fewer sub-clusters.
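The core of this computation can be sketched as follows. It is a simplified illustration: the document vectors are assumed to be given as dense lists, and the threshold is passed in as a plain parameter, because this summary does not describe how it is derived from the background collection. The function names and the toy vectors are mine.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence_score(vectors, threshold):
    """Fraction of document pairs whose cosine similarity reaches the threshold."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    coherent = sum(1 for a, b in pairs if cosine(a, b) >= threshold)
    return coherent / len(pairs)


# Hypothetical blog with three posts: two on one topic, one on another.
blog = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

# Assumption: 0.9 stands in for a threshold taken from the background collection.
score = coherence_score(blog, threshold=0.9)
print(score)
```

Only one of the three pairs is coherent here, so the score is one third. The loop over all document pairs is also where the quadratic cost in the number of documents comes from.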

Thus, the coherence score captures the clustering structure of data, called topical consistency.
It is independent of external resources and adapts to the fast changing environment of blogs.
Its complexity is O(average document length * number of documents^2) and it can be used beyond text data (e.g. blog structure or linkage).
It is integrated into a blog ranking to boost topically relevant and topically consistent blogs.

Jiyin He, Wouter Weerkamp, Martha Larson, Maarten de Rijke: An effective coherence measure to determine topical consistency in user-generated content. IJDAR 12(3): 185-203 (2009)