Thursday, August 16, 2012

Why are PageRank & Co. inferior for blog ranking?


PageRank is one of the most frequently used algorithms for ranking traditional webpages based on the web link graph. It was introduced by Page et al. and is based on the random surfer model. A website’s PageRank is defined as the probability that a random surfer visits this website.
The random surfer traverses the web by repeatedly choosing between two options: clicking on a random link on the current page or jumping to any website at random. The second option is necessary to make sure that the random surfer also visits pages that have no incoming links and can escape from pages that have no outgoing links.
The PageRank algorithm is iterative; how many iterations it needs to converge depends on the implementation.
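
To make the random surfer model concrete, here is a minimal power-iteration sketch in Python. The function name `pagerank`, the damping factor of 0.85, the tolerance, and the toy graph are my own assumptions for illustration, not details taken from Page et al. Rank held by pages without outgoing links is spread evenly over all pages, which is one common way to handle them.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank sitting on pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Random jump: every page receives the same base share.
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Link click: each page passes its rank evenly to the pages it links to.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: "d" has no incoming links, "c" is linked by everyone.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this toy graph, "c" ends up with the highest score, and "d" only receives the random-jump share because nothing links to it.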

An algorithm very similar to PageRank is TrustRank.
In contrast to PageRank, TrustRank is initialized with a fixed set of trusted or untrusted websites. The trust then propagates through the web graph in the same way as rank does in the PageRank algorithm.
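
The seeding idea can be sketched as a biased PageRank in which the random jump only lands on the seed pages. This is a simplified illustration under my own assumptions: the name `trust_rank`, the parameters, and the toy graph are hypothetical, and the sketch only covers trusted seeds, not untrusted ones.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=100):
    """links: dict page -> list of linked pages; trusted_seeds: set of pages."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    seeds = [p for p in pages if p in trusted_seeds]
    if not seeds:
        raise ValueError("at least one trusted seed must be part of the graph")
    # Initial trust: uniform over the hand-picked seeds, zero everywhere else.
    seed_score = {p: (1.0 / len(seeds) if p in trusted_seeds else 0.0)
                  for p in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        # The jump goes back to the seeds only, not to arbitrary pages.
        new = {p: (1.0 - damping) * seed_score[p] for p in pages}
        # Trust flows along links exactly like rank does in PageRank.
        # Trust on pages without outgoing links is simply dropped here.
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * trust[p] / len(outs)
        trust = new
    return trust


# Hypothetical toy graph: two spam pages link only to each other.
graph = {"seed": ["good"], "good": ["friend"],
         "spam": ["spam2"], "spam2": ["spam"]}
trust = trust_rank(graph, {"seed"})
```

Pages that cannot be reached from a seed, such as the two spam pages here, keep a trust score of zero, while trust fades with every hop away from the seed.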

Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by Kleinberg.
It is based on the concept of Hubs and Authorities. In the traditional view of the web, Hubs are link directories and archives that only refer to Authorities, which actually offer valuable information.
HITS operates on a subgraph of the web that is related to a specific input query. Each page gets an Authority score and a Hub score. A page's Authority score increases with the Hub scores of the pages linking to it, and its Hub score increases with the Authority scores of the pages it links to.
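
The mutual reinforcement of Hub and Authority scores can be sketched as follows. The function name `hits`, the fixed number of iterations, the normalization to unit length, and the toy subgraph are my own assumptions for illustration; selecting the query-related subgraph is left out.

```python
import math


def hits(links, iterations=50):
    """links: dict page -> list of linked pages (the query-related subgraph)."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the Hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in links.get(p, []):
                auth[q] += hub[p]
        # Hub update: sum of the Authority scores of all linked pages.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy subgraph: two link collections pointing to two articles.
subgraph = {"directory": ["paper1", "paper2"], "archive": ["paper1"],
            "paper1": [], "paper2": []}
hub, auth = hits(subgraph)
```

Here "directory" becomes the strongest Hub because it links to both articles, and "paper1" becomes the strongest Authority because both link collections point to it.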


These traditional ranking algorithms are all based on the web link graph.
However, traditional webpages show a different linking behaviour than blogs. Blogs offer different types of links, e.g. trackbacks or blogroll links, with different semantics. Furthermore, the blog link graph tends to be rather sparse in comparison to the overall web.
Thus, tailor-made rankings for blogs are needed that also consider blog-specific characteristics and blogs' content.

Wednesday, August 1, 2012

Summary of "Credibility Improves Topical Blog Post Retrieval - Weerkamp et al."

The authors introduce 11 indicators of credibility to improve the effectiveness of topical blog retrieval. Their indicators operate on the blog level and on the post level. Besides some syntactic indicators, they also present the timeliness of posts, the regularity of blogs, and the consistency of blogs.

The timeliness of a post is defined as the temporal distance between a blog post and a news post on the same topic. In this paper, topics seem to be term occurrences. Nonetheless, it is very interesting to incorporate traditional media.

The posting/publishing behaviour of a blogger is called regularity. Here, the authors assume that a credible blog has a very regular posting behaviour. In contrast, related research often treats very regular posting as an indicator of splogs (spam blogs).

The topical consistency of a blog represents its topical fluctuation. The authors define consistency similarly to query clarity, which reminds me a bit of the tf/idf score. In contrast to related work, the authors do not use the natural ordering of posts.

Nevertheless, the authors show that their indicators can improve topical blog retrieval significantly (using the blog06 data set).
Take a look at the paper.