Thursday, August 16, 2012
Why are PageRank & Co. inferior for blog ranking?
PageRank is one of the most frequently used algorithms for ranking traditional webpages based on the web link graph. It was introduced by Page et al. and is based on the random surfer model. A website’s PageRank is defined as the probability that a random surfer visits this website.
The random surfer traverses the web by repeatedly choosing between two options: clicking on a random link on the current page or jumping to any website at random. The second option is necessary to make sure the random surfer also visits pages that have no incoming links and to make sure that it is possible to escape from pages that have no outgoing links.
The PageRank algorithm is iterative: the scores are recomputed repeatedly until they converge, and the number of iterations required depends on the implementation.
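The random surfer model can be sketched as a simple power iteration. This is a minimal illustration rather than a reference implementation: the function name `pagerank`, the damping factor of 0.85, the tolerance, and the tiny example graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration sketch of PageRank.

    links maps each page to the list of pages it links to.
    damping is the (assumed) probability of following a link
    instead of jumping to a random page.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution

    for _ in range(max_iter):
        # Rank sitting on pages without outgoing links is spread evenly,
        # which models the surfer escaping a dead end by a random jump.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes its rank on in equal shares along its links.
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                new[t] += damping * rank[p] / len(targets)

        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:  # stop once the scores no longer change noticeably
            break
    return rank


# Hypothetical four-page web: "d" has no incoming links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

In this toy graph, page "d" is only ever reached through the random jump, so it ends up with the smallest score, while "c" collects links from three pages and ranks highest.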
An algorithm very similar to PageRank is TrustRank.
In contrast to PageRank, TrustRank is initialized with a fixed set of trusted or untrusted websites. The trust then propagates through the web graph in the same way as rank does in the PageRank algorithm.
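The seeded propagation can be sketched as follows. This is a simplified illustration that assumes a seed set of trusted pages only; the function name `trustrank`, the damping factor, the fixed iteration count, and the example graph are assumptions made for this sketch.

```python
def trustrank(links, trusted_seeds, damping=0.85, iterations=50):
    """Sketch of trust propagation from a hand-picked seed set.

    links maps each page to the list of pages it links to.
    trusted_seeds is the (assumed) set of manually vetted pages.
    """
    pages = sorted(
        set(links) | {t for ts in links.values() for t in ts} | set(trusted_seeds)
    )
    # Unlike the uniform start of PageRank, all initial trust sits on the seeds.
    seed = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0) for p in pages
    }
    trust = dict(seed)

    for _ in range(iterations):
        # The random jump only ever leads back to a seed page.
        new = {p: (1.0 - damping) * seed[p] for p in pages}
        # Trust flows along links in equal shares, as rank does in PageRank.
        # Simplification: trust on pages without outgoing links is dropped.
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                new[t] += damping * trust[p] / len(targets)
        trust = new
    return trust


# Hypothetical graph: the two spam pages link only to each other.
graph = {"seed": ["good"], "good": ["ok"], "spam": ["spam2"], "spam2": ["spam"]}
trust = trustrank(graph, trusted_seeds={"seed"})
```

Because no path leads from the seed to the two spam pages, they receive no trust at all, while trust fades with every hop away from the seed.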
Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by Kleinberg.
It is based on the concept of Hubs and Authorities. In the traditional view of the web, Hubs are link directories and archives that only refer to Authorities, which actually offer valuable information.
HITS operates on a subgraph of the web that is related to a specific input query. Each page gets an Authority score and a Hub score. A page's Authority score increases with the Hub scores of the pages linking to it, and its Hub score increases with the Authority scores of the pages it links to.
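The mutual reinforcement of Hub and Authority scores can be sketched in a few lines. This is a minimal illustration that assumes the query-specific subgraph has already been selected; the function name `hits`, the fixed iteration count, and the example graph are assumptions made for this sketch.

```python
import math


def hits(links, iterations=50):
    """Sketch of the HITS update loop on a small link graph.

    links maps each page to the list of pages it links to.
    Returns two dicts: hub scores and authority scores.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking here.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for t in targets:
                auth[t] += hub[p]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub update: sum the authority scores of all pages linked from here.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical subgraph: two link directories pointing at two documents.
graph = {
    "dir1": ["doc1", "doc2"],
    "dir2": ["doc1", "doc2"],
    "doc1": [],
    "doc2": ["doc1"],
}
hub, auth = hits(graph)
```

Here the two directories end up with the highest Hub scores, and "doc1", which every other page links to, gets the highest Authority score. The scores are normalized in every round so that they stay bounded.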
These traditional ranking algorithms are all based on the web link graph.
However, traditional webpages show a different linking behaviour than blogs. Blogs offer different types of links, e.g. trackbacks or blogroll links, with different semantics. Furthermore, the blog link graph tends to be rather sparse in comparison to the overall web.
Thus, tailor-made rankings for blogs are needed that also consider blog-specific characteristics and blogs' content.