Showing posts with label Topic. Show all posts

Wednesday, July 25, 2012

Summary of "Using Blog Content Depth and Breadth to Access and Classify Blogs" - Chen et al.

Here, I sum up another interesting paper concerning a content-based ranking of blogs.

The authors present a blog-specific filtering system that measures topic concentration and variation.
They assess the quality of blogs via two main aspects: content depth and breadth. This approach is motivated by the sparseness of links and the highly personal character of the blogosphere.
The related work essentially covers two areas: blog assessment and quality assessment.
Figure: D3 force layout of German blogs
Blog assessment. PageRank, HITS, and Technorati's blog authority share two problems: links are sparse, and scores lag behind in time. Furthermore, most blog search engines rely on simple retrieval models because they only access the limited content of feeds and have to cope with real-time constraints.
Quality assessment. According to Joseph Juran, quality is the "fitness for use" of information. Common quality assessment metrics are based on heuristics for a specific situation. Researchers emphasise how blogs differ in language, structure, and the importance of timeliness. Further, blogs are more interesting and personal, and they reflect the author's opinions and experiences. Thus, researchers define the quality of a blog based on the blogger's expertise, trustworthiness, information quality, and personal nature. In addition, the credibility of commentators also counts.

In essence, the authors present a score that relates five criteria.
The first criterion is the informativeness of a blog, measured as the number of meaningful words. A meaningful word has a high tf/idf score. Secondly, the completeness of a blog indicates how many strongly related words from each mentioned topic are present.
The third criterion is the topic count per blog. Fourthly, the inter-topic distance specifies how many words of a post are shared between topics.
Finally, the topic mergence calculates the general overlap between topics.
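
The first criterion can be illustrated with a small sketch. This is not the authors' formula: it assumes that posts are already tokenised, that idf is computed over the posts of a single blog rather than over a larger corpus, and that "high" means exceeding a freely chosen threshold. The names tfidf_scores, informativeness and threshold are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(posts):
    """Return one {word: tf-idf} dict per post; each post is a list of tokens."""
    n = len(posts)
    # Document frequency: in how many posts does each word occur?
    df = Counter(word for post in posts for word in set(post))
    scores = []
    for post in posts:
        tf = Counter(post)
        scores.append({
            word: (count / len(post)) * math.log(n / df[word])
            for word, count in tf.items()
        })
    return scores


def informativeness(posts, threshold=0.1):
    """Count distinct words whose tf-idf exceeds the threshold in some post.

    The threshold is an assumption made for this sketch, not a value
    taken from the paper.
    """
    meaningful = set()
    for post_scores in tfidf_scores(posts):
        meaningful.update(w for w, s in post_scores.items() if s > threshold)
    return len(meaningful)


posts = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
print(informativeness(posts))
```

A word such as "the" that occurs in every post gets an idf of zero and therefore never counts as meaningful, which matches the intuition behind the criterion.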

Summary of "An effective coherence measure to determine topical consistency in user-generated content" - He et al.

Here, I sum up an interesting paper concerning a content-based ranking of blogs.

A blog is relevant if it focuses on a central topic. This is called topical consistency.
The authors introduce the coherence score to measure the consistency.
It is based on the intra-blog clustering structure relative to the clustering of the background collection.

One has to differentiate between short and long term interest in blogs.
Further, the key features of blogs are a strong social aspect and their inherent noisiness.
Figure: D3 force layout of blog interlinkage
The topical noise stems from random-interest blogs or diaries. This creates topical diffuseness (a loose clustering).
One has to find the blogger that is most closely associated with a specific topic.

Blogs mostly fail to maintain a central topical thrust. Nevertheless, the trend is to rank whole blogs in order to recommend interesting feeds to the reader.
One has to take the time and the relevance of topics into account.
Accordingly, both recurring interest (time based) and focused interest (cohesiveness of the language of posts) should be measured.
The authors' coherence score captures the topical focus and tightness of subtopics in each blog. Thus, it handles the focused interest.

Lexical cohesion is an alternative to the coherence score. It measures the semantic relationships between content words.
To this end, external thesauri like WordNet are used to build lexical chains. The number of chains reflects the number of distinct topics. A so-called chain score is used to measure the significance of a lexical chain.
Lexical cohesion is sensitive to the progression of topics, but blind to their hierarchical structure.

The coherence score gives the proportion of coherent document pairs relative to the background collection.
These pairs are found by thresholding the cosine similarity of documents.
The score measures the relative tightness of the clustering for a blog and prefers structured document sets with fewer sub-clusters.
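
The idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: documents are token lists, vectors are raw term frequencies, and the threshold is taken as a high quantile of the pairwise similarities in the background collection. The quantile rule and the names cosine, background_threshold and coherence_score are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pairwise_similarities(docs):
    """Cosine similarity for every unordered pair of documents."""
    vectors = [Counter(doc) for doc in docs]
    return [cosine(a, b) for a, b in combinations(vectors, 2)]


def background_threshold(background_docs, quantile=0.95):
    """Assumed rule: the similarity at the given quantile of background pairs."""
    sims = sorted(pairwise_similarities(background_docs))
    if not sims:
        raise ValueError("need at least two background documents")
    index = min(int(quantile * len(sims)), len(sims) - 1)
    return sims[index]


def coherence_score(blog_docs, threshold):
    """Proportion of document pairs whose similarity exceeds the threshold."""
    sims = pairwise_similarities(blog_docs)
    if not sims:
        return 0.0  # a single post has no pairs to compare
    return sum(s > threshold for s in sims) / len(sims)


focused = [["python", "code", "bug"],
           ["python", "code", "test"],
           ["python", "bug", "test"]]
diffuse = [["python", "code"], ["cake", "recipe"], ["football", "match"]]
print(coherence_score(focused, 0.5), coherence_score(diffuse, 0.5))
```

Because every pair of posts is compared once, the number of cosine computations grows quadratically with the number of posts in a blog.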

Thus, the coherence score captures the clustering structure of data, called topical consistency.
It is independent of external resources and adapts to the fast changing environment of blogs.
Its complexity is O(average document length * (number of documents)^2), and it can be used beyond text data (e.g. blog structure or linkage).
It is integrated into a blog ranking to boost topically relevant and topically consistent blogs.
Jiyin He, Wouter Weerkamp, Martha Larson, Maarten de Rijke: An effective coherence measure to determine topical consistency in user-generated content. IJDAR 12(3): 185-203 (2009)