Thursday, August 16, 2012

Why are PageRank & Co. inferior for blog ranking?


PageRank is one of the most frequently used algorithms for ranking traditional webpages based on the web link graph. It was introduced by Page et al. and is based on the random surfer model. A website’s PageRank is defined as the probability of a random surfer visiting this website.
The random surfer traverses the web by repeatedly choosing between two options: clicking on a random link on the current page or jumping to any website at random. The second option is necessary to make sure the random surfer also visits pages that have no incoming links and to make sure that it is possible to escape from pages that have no outgoing links. 
The PageRank algorithm is iterative: the scores are recomputed until they barely change any more, so the number of iterations depends on the implementation and its stopping criterion.
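
To make the random surfer a bit more concrete, here is a minimal sketch of the iteration in Python. The damping factor of 0.85, the tolerance, and the toy graph are assumptions chosen for illustration, not values from Page et al.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank sitting on pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Random jump: every page receives the same base share.
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Link click: a page splits its rank evenly among its targets.
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * rank[page] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:  # stop once the scores have settled
            break
    return rank


toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Page "d" has no incoming links, so it only ever receives the random-jump share, while "c" collects rank from three pages.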

A very similar algorithm to PageRank is TrustRank.
In contrast to PageRank, TrustRank is initialized with a fixed set of trustworthy or untrustworthy websites. The trust then propagates through the web graph in the same way as rank does in the PageRank algorithm.
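
A rough sketch of this idea in Python, assuming a formulation in which the random jump only leads to a hand-picked set of trusted seed pages. The function name, the damping factor, and the toy graph are hypothetical; this is not necessarily the exact method of the TrustRank paper.

```python
def trustrank(links, trusted_seeds, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; trusted_seeds is a set."""
    seeds = set(trusted_seeds)
    pages = sorted(set(links) | {t for ts in links.values() for t in ts} | seeds)
    # All initial trust sits on the seed pages.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # The random jump only ever returns to a trusted seed.
        new = {p: (1.0 - damping) * seed_share[p] for p in pages}
        # Trust flows along links; trust on pages without links is dropped here.
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * trust[page] / len(targets)
        trust = new
    return trust


toy_web = {"good": ["a"], "a": ["b"], "spam": ["spam2"], "spam2": ["spam"]}
trust = trustrank(toy_web, {"good"})
```

Pages that cannot be reached from a seed, like the two spam pages linking to each other, end up with no trust at all.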

Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by Kleinberg.
It is based on the concept of Hubs and Authorities. In the traditional view of the web, Hubs are link directories and archives that only refer to Authorities, which actually offer valuable information.
HITS operates on a subgraph of the web that is related to a specific input query. Each page gets an Authority score and a Hub score. The Authority score of a page increases with the Hub scores of the pages linking to it, and its Hub score increases with the Authority scores of the pages it links to.
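
The mutual reinforcement can be sketched in a few lines of Python. The toy subgraph and the fixed number of iterations are assumptions for illustration; a real implementation would first build the query-specific subgraph.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of all pages linking here.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub update: sum of the authority scores of all pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so that the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


toy_subgraph = {"dir1": ["x", "y"], "dir2": ["x", "z"]}
hub, auth = hits(toy_subgraph)
```

Here the two directories act as Hubs, and "x", which both of them link to, becomes the strongest Authority.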


These traditional ranking algorithms are all based on the web link graph.
However, traditional webpages show a different linking behaviour than blogs. Blogs offer different types of links, e.g. trackbacks or blogroll links, with different semantics. Furthermore, the blog link graph tends to be rather sparse in comparison to the overall web.
Thus, tailor-made rankings for blogs are needed that also consider blog-specific characteristics and blogs' content.

Wednesday, August 1, 2012

Summary of "Credibility Improves Topical Blog Post Retrieval - Weerkamp et al."

The authors introduce 11 indicators of credibility to improve the effectiveness of topical blog retrieval. Their indicators operate on the blog level and on the post level. Besides some syntactic indicators, they also present the timeliness of posts, the regularity of blogs, and the consistency of blogs.

The timeliness of a post is defined as the temporal distance of a blog post to a news post of the same topic. In this paper, topics seem to be term occurrences. Nonetheless, it is very interesting to incorporate traditional media.

The posting/publishing behaviour of a blogger is called regularity. Hereby, the authors assume that a credible blog has a very regular posting behaviour. In contrast, related research often assumes this as an indicator for splogs.

The topical consistency of a blog represents its topical fluctuation. The authors define the consistency similarly to query clarity, which reminds me a bit of the tf/idf score. In contrast to related work, the authors do not use the natural ordering of posts.

Nevertheless, the authors show that their indicators can improve topical blog retrieval significantly (using the blog06 data set).
Take a look at the paper.


Tuesday, July 31, 2012

Summary of "Domain-Specific Identification of Topics and Trends in the Blogosphere - Schirru et al."


The authors present a system called "Social Media Miner". This system extracts topics and the corresponding, most relevant posts.
The relevance is calculated using a link authority algorithm like PageRank. The main contribution of the paper is the topic detection and tracking mechanism.
Schirru et al. cluster blog posts using a time windowing approach. To create the clusters they use a tf/idf vector for each blog post, k-means, and non-negative matrix factorization for label extraction. To determine the number of clusters they use the residual sum of squares.

Nevertheless, their approach is rather simple. They cluster topics for a given period, find relevant terms (or labels), and visualize the term mentions over time as a Trend Graph.

Check out the paper.

Summary of "Cool Blog Identification using Topic-based Models - Sriphaew et al."



The authors show how to identify cool blogs based on three assumptions: blogs tend to have definite topics, have enough posts, and tend to have a certain level of consistency among their posts.

The level of consistency, or topical consistency, tries to measure whether a blogger focuses on a solid interest; thus it favours blogs with definite topics, like reviews of mobile devices. It is based on a mixture of topic probabilities of posts (LDA). The authors measure the similarity of each post to its preceding posts. Here, the similarity is the distance between the topic probability distributions, which is calculated using the Euclidean, Kullback-Leibler, or Jensen-Shannon distance.
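
As a small illustration, the following Python sketch compares each post with its predecessor using the Jensen-Shannon divergence. The topic distributions are made-up numbers standing in for LDA output, and averaging the consecutive distances is my assumption about how to turn them into one score per blog.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by 1 (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def mean_consecutive_divergence(topic_dists):
    """Average distance between each post and its predecessor (needs >= 2 posts)."""
    pairs = list(zip(topic_dists, topic_dists[1:]))
    return sum(js_divergence(a, b) for a, b in pairs) / len(pairs)


# Hypothetical per-post topic distributions over two topics.
focused_blog = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]]
jumpy_blog = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
```

A low average means the blogger sticks to a topic; the jumpy blog yields a clearly larger value than the focused one.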

They conduct a "user study" based on a corporate blog data set and a single person, who categorized 540 blogs as cool or not cool. Using an SVM implementation, the authors were able to show high precision and recall for cool blog recognition.

This is a heuristic approach and can therefore be applied to any language following the same assumptions.
So, check out the paper.

Summary of "Splog Filtering based on Writing Consistency - Liuwei et al."


Image credit: kinipela, http://www.flickr.com/photos/kinipela/202607307/
Liuwei et al. describe a spam blog (splog) filtering technique based on three features: the writing interval, the writing structure, and the writing topic of a blog. They argue that most spam detection mechanisms are designed for static webpages and miss the dynamic nature of blogs.

They define the consistency of the writing interval as the inverse variance of post update intervals. A high writing interval consistency implies a very constant update interval.
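
A minimal Python sketch of this measure, assuming the post timestamps are given as plain numbers (e.g. hours). Whether the paper uses the population or the sample variance is not stated above, so the population variance and the infinite score for perfectly regular blogs are my assumptions.

```python
from statistics import pvariance


def interval_consistency(post_times):
    """Inverse variance of the gaps between consecutive posts (needs >= 2 posts)."""
    times = sorted(post_times)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    variance = pvariance(intervals)
    # Perfectly regular posting has zero variance, i.e. maximal consistency.
    return float("inf") if variance == 0 else 1.0 / variance


clockwork = interval_consistency([0, 24, 48, 72])      # exactly once a day
nearly_daily = interval_consistency([0, 24, 47, 72])   # roughly once a day
bursty = interval_consistency([0, 1, 50, 52])          # irregular bursts
```

The more machine-like the posting rhythm, the higher the score.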

The authors also define a measure for the consistency of the writing structure. Unexpectedly, there is no NLP magic behind this; the measure simply relates the variation of words per post to the average number of words per post. The underlying assumption is that splogs are packed with keywords and their posts are all equally long. In contrast, normal bloggers tend to deliver short and long posts depending on their daily mood.
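
One simple way to relate the two quantities is the coefficient of variation, sketched below in Python. The exact formula of the paper is not given above, so treat this ratio (standard deviation divided by mean) and the whitespace tokenization as assumptions.

```python
from statistics import mean, pstdev


def length_variation(posts):
    """Standard deviation of words per post divided by the mean words per post."""
    counts = [len(post.split()) for post in posts]
    return pstdev(counts) / mean(counts)


# Made-up example posts.
splog_like = ["buy cheap pills", "win free money", "best loan rates"]
human_like = ["Short one.", "Today I wrote a much longer post about my weekend trip."]
```

Posts of equal length give a value of 0, which under the assumption above points towards a splog; a human blogger with short and long posts gets a larger value.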

The consistency on topic level is defined as the average topical similarity of posts. Each post gets compared with its preceding post. The topical similarity is defined as the cosine similarity of the posts' tf/idf word vectors. Thereby, blogs with a very high topical consistency tend to be auto-generated.
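
The following Python sketch shows the idea with a hand-rolled tf/idf. The smoothed idf variant, the whitespace tokenization, and the example posts are assumptions for illustration and not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(posts):
    """One sparse tf/idf vector (dict) per post, using a smoothed idf."""
    docs = [Counter(post.lower().split()) for post in posts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topical_consistency(posts):
    """Average similarity of each post to its preceding post (needs >= 2 posts)."""
    vectors = tfidf_vectors(posts)
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims)


splog = ["cheap pills buy now"] * 3
diary = ["rainy walk today", "new phone review", "holiday cooking recipe"]
```

The repeated spam post reaches the maximum consistency, while the diary with three unrelated posts gets 0.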

Finally, they introduced a filtering system and evaluated their feature set with three classification mechanisms (SVM, Bayes, C4.5) on the Blog06 data set. They showed that their feature set reaches the same accuracy with a reduced number of features. Further, one has to mention that their heuristic approach is language independent.

Wednesday, July 25, 2012

Summary of "Using Blog Content Depth and Breadth to Access and Classify Blogs" - Chen et al.

Here, I sum up another interesting paper concerning a content-based ranking of blogs.

The authors present a blog-specific filtering system that measures topic concentration and variation.
They assess the quality of blogs via two main aspects: content depth and breadth. This is motivated by the sparseness of links and the highly personal character of the blogosphere.
The related work essentially consists of two areas: blog assessment and quality assessment.
Figure: D3 force layout of German blogs
Blog assessment: PageRank, HITS, and Technorati's blog authority have two issues: the sparseness of links and the time lag of the score. Further, most blog search engines are based on simple retrieval models because they only access the limited content of feeds and have to struggle with real-time constraints.
Quality assessment: According to Joseph Juran, quality is the "fitness for use" of information. Common quality assessment metrics are based on heuristics for a specific situation. Researchers emphasise that blogs differ in language, in structure, and in the importance of being up to date. Further, blogs are more interesting, personal, and reflect the author's opinions/experiences. Thus, researchers define the quality of a blog based on the blogger's expertise, trustworthiness, information quality, and its personal nature. In addition, the credibility of commentators also counts.

In essence, the authors present a score that relates five criteria.
The first criterion is the informativeness of a blog, measured as the number of meaningful words. A meaningful word has a high tf/idf score. Secondly, the completeness of a blog indicates how many strongly related words from each mentioned topic are present.
The third criterion is the topic count per blog. Fourthly, the inter-topic distance specifies how many words of a post are shared between topics.
Finally, the topic mergence calculates the general overlap between topics.

Summary of "An effective coherence measure to determine topical consistency in user-generated content - Jiyin He et al."

Here, I sum up an interesting paper concerning a content-based ranking of blogs.

A blog is relevant if it focuses on a central topic. This is called topical consistency.
The authors introduce the coherence score to measure the consistency.
It is based on the intra blog clustering structure relative to the clustering of the background collection.

One has to differentiate between short and long term interest in blogs.
Further, the key features of blogs are a strong social aspect and their inherent noisiness.
Figure: Force layout of blog interlinkage using D3
The topical noise springs from random interest blogs or diaries. This creates topical diffuseness (a loose clustering).
One has to find the blogger that is most closely associated with a specific topic.

Blogs mostly fail to maintain a central topical thrust. Nevertheless, the trend is to rank whole blogs in order to recommend interesting feeds to the reader.
One has to take the time and the relevance of topics into account.
Thereby, recurring interest (time based) and focused interest (cohesiveness of language of posts) should get measured.
The authors' coherence score captures the topical focus and tightness of subtopics in each blog. Thus, it handles the focused interest.

Lexical cohesion is an alternative to the coherence score. It measures the semantic relationships between content words.
Therefore, external thesauri like WordNet are used to build lexical chains. The number of chains reflects the number of distinct topics. A so-called chain score is used to measure the significance of a lexical chain.
The lexical cohesion is sensitive to progression of topics, but blind to their hierarchical structure.

The coherence score gives the proportion of coherent document pairs relative to the background collection.
These pairs are calculated by thresholding the cosine similarity of documents.
The score measures the relative tightness of the clustering for a blog and prefers structured document sets with fewer sub-clusters.
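
A simplified Python sketch of the pair counting. In the paper the threshold depends on the background collection; how exactly it is derived is not reproduced here, so it is simply passed in as a parameter, and the raw term-count vectors are my assumption as well.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence_score(documents, threshold):
    """Share of document pairs whose cosine similarity reaches the threshold."""
    vectors = [Counter(doc.lower().split()) for doc in documents]
    pairs = list(combinations(vectors, 2))  # all n * (n - 1) / 2 pairs
    coherent = sum(1 for u, v in pairs if cosine(u, v) >= threshold)
    return coherent / len(pairs)


focused = ["python code tips", "python code tricks", "python code review"]
mixed = ["python code tips", "holiday beach photos", "python code tricks"]
```

With a threshold of 0.5 every pair of the focused blog counts as coherent, but only one of the three pairs of the mixed blog does. Comparing all pairs is also where the quadratic cost in the number of documents comes from.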

Thus, the coherence score captures the clustering structure of data, called topical consistency.
It is independent of external resources and adapts to the fast changing environment of blogs.
Its complexity is O(average document length * number of documents^2) and it can be used beyond text data (e.g. blog structure or linkage).
It gets integrated into a blog ranking to boost topically relevant and topically consistent blogs.
Jiyin He, Wouter Weerkamp, Martha Larson, Maarten de Rijke: An effective coherence measure to determine topical consistency in user-generated content. IJDAR 12(3): 185-203 (2009)

Monday, June 25, 2012

Some Links from HCI Research


Acoustic radiation pressure: "Radiation pressure--the history of a mislabeled tensor" by Robert T. Beyer, a review paper covering the 100-year history of radiation pressure. A simpler explanation can be found on the German Wikipedia. Essentially, this effect occurs when acoustic waves travelling in one medium hit another target medium. If the wave oscillates faster than the target medium can stretch, then air particles get reflected back to the sound source. The effect is easier to imagine if you think about the change from water to air. Check the paper on the water-air interface experiment.
Tangibles go on market: Appmates, little racing cars for your iPad.
Think about output to your brain: Switching Neurons. The research area is called optogenetics and might be interesting for the normal nervous system as well.
Cheat Sheet for Statistics: just in case you need a refresher.
Hick's Law in Mortal Combat: a webpage that discusses the influence of choices in martial arts.

Monday, May 21, 2012

Lost in Storage - Find data zombies using Sequoia

Ever wondered where all your disk space went?

Check out SequoiaView!
It is a pretty nice visualization tool from the Eindhoven University of Technology that shows you an explorable treemap of your disk space usage. With it you can quickly identify old VM images or installers that are lying around wasting your storage.
As you can see in the image below, you just have to follow the huge blobs to find the wasted space.


Check out the homepage of the project:
http://w3.win.tue.nl/nl/onderzoek/onderzoek_informatica/visualization/sequoiaview//

Wednesday, May 16, 2012

Touch Paint

How to make your own touch pad and implement a simple drawing app with it?
This was the challenge during the HCI Research lecture by Baudisch at the Hasso-Plattner-Institute.
For me it was probably an even harder challenge because of my lack of programming skills in C/C++ and my lack of image processing knowledge. Nevertheless, after quite some time I managed it. Check out the video!

The right picture in the video is the surface of my own touch pad. I built it using the instructions of Anne (http://www.anneroudaut.fr/diy/acrylicpad.html).
Figure: Glowing tips
Figure: My Pad
To implement a drawing app, Anne already provides a simple frame. The frame is implemented in C using OpenCV. I preferred to implement the whole thing in C++, because the C++ OpenCV API is easier to understand and one gets rid of all the manual memory management.
And remember to set all the necessary path variables (include, library, execution).
Check out my Visual Studio 2010 project ZIP. (Just a prototype^^)





Thursday, March 22, 2012

Linux Shell Output Redirect

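If you would like to run a program in total silence and write all of its output to a file of your choice, use:
YOURCOMMAND &> YOURFILE
This redirects both standard output and standard error to the specified file. The &> operator is a bash feature; in a plain POSIX shell the equivalent is YOURCOMMAND > YOURFILE 2>&1
This helps with cron jobs or huge output generators like Hadoop.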

Tuesday, March 13, 2012

CouchDB Lucene: Connection refused

Yesterday I got the following exception from my CouchDB:
Traceback (most recent call last):
  File "/opt/couchdb-lucene-0.7-SNAPSHOT/tools/couchdb-external-hook.py", line 40, in main
    resp = respond(res, req, opts.key)
  File "/opt/couchdb-lucene-0.7-SNAPSHOT/tools/couchdb-external-hook.py", line 81, in respond
    res.request(method, path, headers=req_headers)
  File "/usr/lib/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib/python2.6/socket.py", line 561, in create_connection
    raise error, msg
error: [Errno 111] Connection refused
Because it had worked like a charm for six months, I got pretty depressed. To solve the issue I began to randomly search the web for this stack trace of my Lucene view, but I could not find anything useful. After reading the third tutorial about how to set up Lucene with CouchDB, I figured out that couchdb-lucene was not running anymore. After restarting the couchdb-lucene daemon, everything worked fine again.

Just execute this to start your couchdb-lucene again:

    nohup /opt/couchdb-lucene-0.7-SNAPSHOT/bin/run &

Wednesday, February 29, 2012

Real-Time measurement in C

For those of you who could not find it anywhere else:
Here is the code snippet to get time measurements accurate to the microsecond on a Linux system in pure C.
Fun fact: the struct is already there; struct timeval is declared in sys/time.h.
rt_printk is the real-time counterpart of printf. To read its output, use the command dmesg.
 
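#include <stdio.h>     /* printf, as a fallback outside a real-time kernel */
#include <sys/time.h>  /* gettimeofday, struct timeval */
#include <time.h>      /* time_t, localtime, strftime */
/* The original header names were lost; additionally include the header of
   your real-time framework that declares rt_printk. */
 
int main(void)
{
 char buffer[30];
 struct timeval tv;
 time_t curtime;
 gettimeofday(&tv, NULL);   /* seconds and microseconds since the epoch */
 curtime=tv.tv_sec;
 strftime(buffer,30,"%m-%d-%Y  %T.",localtime(&curtime));
 /* %06ld zero-pads the microseconds, so 42 us prints as .000042 */
 rt_printk("%s%06ld\n",buffer,(long)tv.tv_usec);
 return 0;
}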

Wednesday, February 15, 2012

Social Networks and Academic Research

The world is getting faster and faster, but most of the reputation in research still comes from printed journals.
Now it seems that times are changing. Social networks for researchers are emerging.
These are not like Facebook, with sharing pics and useless stuff to procrastinate. Instead, the research networks focus on publications and on answering small research questions in a collaborative manner. (See also the German article by Welt-Online.)
So check it out, it might become an advantage soon.

researchgate - Social Researcher Network (German startup)
academia - Social Researcher Network (US version)
mendeley - Collaborative Paper Platform

Wednesday, February 8, 2012

Start of an idea - Eclipse Badges

Have you ever coded day and night without getting enough benefit from it? Here comes the Eclipse Badges Extension. Okay, it is not yet developed, but the idea is cool, and too cool to die.

So here are some bullets for what you could earn badges for while coding:

  • X hours coding at once
  • Master of Shortcuts
  • Challenger of Refactorings
  • Best Refactorings in a Row
  • longest class name
  • Deepest Hierarchy
  • lines of code per minute
  • longest method
  • deepest call hierarchy
  • Quick-fix Master
  • Web of Eclipse (a lot of complex dependencies)
  • longest build time ever
  • Antitrust, Eclipse as notepad
  • Archaeologist open the oldest projects
  • Plugin Master - you can catch them all
  • most active views
  • nightly coding is appreciated
  • Your Rank as developer in the Eclipse Badge Community
All this stuff is coming soon...

Please comment for more ideas!