For a couple of weeks now, a friend of mine and I have been working on a blog crawler to preserve the blogosphere. So far we have saved about 50,000 blogs and processed around 3 million web pages. That sounds like a lot, but it isn't. We are competing with plenty of other crawlers, Google included. Big crawlers typically fetch about 300 pages per second; by contrast, ours manages just 2,500 pages per day. That is really weird.
We did distribute our crawler to gain more computation power, but here is the problem: to guarantee that all blogs stay linked in the database and to avoid duplicates, we need a central database, and that central database seems to be the bottleneck.
Every crawler client waits several minutes for the database to answer even a very simple SELECT statement.
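The post doesn't show the actual schema or query, but the pattern described (each client asking "have we seen this URL?" before inserting) can often be collapsed into a single statement by letting the database enforce uniqueness itself. A minimal sketch, assuming a hypothetical `blogs` table keyed by URL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real crawler's tables are not shown in the post.
conn.execute("CREATE TABLE blogs (url TEXT PRIMARY KEY, discovered_at TEXT)")

def record_blog(conn, url, ts):
    # INSERT OR IGNORE makes the database enforce uniqueness in one
    # round trip, replacing a separate "does this URL exist?" SELECT.
    cur = conn.execute(
        "INSERT OR IGNORE INTO blogs (url, discovered_at) VALUES (?, ?)",
        (url, ts),
    )
    return cur.rowcount == 1  # True only if the URL was new

print(record_blog(conn, "http://example.com/feed", "day 1"))  # True
print(record_blog(conn, "http://example.com/feed", "day 2"))  # False: duplicate
```

Halving the number of round trips per page won't fix a minutes-long query on its own, but it removes one of the two statements every client issues per URL.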
Now, my database knowledge is limited to the basics taught at university. Currently I am trying to speed up the query using CREATE INDEX.
On top of that, I recently heard about asynchronous distributed message queues. They seem like a good way to move the whole job handling out of the database. But I suspect it is mostly the INSERTs that consume so much time.
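If the INSERTs are the expensive part, a queue helps mostly by enabling batching: clients enqueue results, and a single writer drains the queue and commits many rows per transaction instead of one. A sketch of that idea (an in-process `queue.Queue` stands in for a real message broker, and the schema is again my assumption):

```python
import queue
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")

# Crawler clients would enqueue results instead of writing to the
# database directly; here we just pre-fill the queue.
work = queue.Queue()
for i in range(1000):
    work.put(("http://example.com/post/%d" % i, "<html>...</html>"))

def drain(conn, q, batch_size=500):
    """Flush queued pages in batches, one transaction per batch."""
    total = 0
    while not q.empty():
        batch = []
        while len(batch) < batch_size and not q.empty():
            batch.append(q.get())
        with conn:  # one transaction (and one commit) per batch, not per row
            conn.executemany(
                "INSERT OR IGNORE INTO pages (url, body) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total

print(drain(conn, work))  # 1000
```

Committing per batch rather than per row is usually the single biggest INSERT speedup on any SQL database, because each commit forces a disk sync.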
So, to sum up: buying new hardware is also a good option.