Second try at answering two of the Mercator-related questions
I had a half-guilty feeling that I may have given hand-wavy answers
to a couple of crawling-related questions today. So I went back and
looked at the Mercator paper again. Based on that, I would like to
revise my responses to two of the questions:
Thomas Hernandez (or was it Ehsan Laleka?) asked about the crawler
trap in terms of embedded session ids.
What the paper is saying is that before "cookies" became the main
approach to keeping track of user browsing habits, web servers
would dynamically rewrite the URLs on each page they served.
Suppose you ask for http://rakaposhi/cse494. On your first request,
the server establishes a unique session ID and rewrites all the URLs
in the cse494 index page so that they have the session ID embedded.
The server has a way of stripping the session-ID information (e.g.
with a cgi-bin script), but to the crawler two ID-embedded URLs that
refer to the same page look completely different. Since there is no
bound on the number of possible session IDs, in principle you have
an infinite number of aliases for the same page. The paper says that
they handle this with "content-seen" tests: the second time the
crawler comes to the same page, it knows it has seen the page before
(because it stored the page's signature), and so it stops crawling
further from that page.
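To make the content-seen idea concrete, here is a rough sketch (not
Mercator's actual code). The toy site in PAGES, the ";sid=" URL
convention, and the fetch() and crawl() helpers are all made up for
illustration, and SHA-1 simply stands in for whatever signature the
real crawler uses. It also assumes the page body is byte-identical no
matter which session ID was used to reach it.

```python
import hashlib

# Hypothetical toy site: path -> (page body, outgoing paths).
PAGES = {
    "/cse494": (b"course index", ["/cse494/notes"]),
    "/cse494/notes": (b"lecture notes", ["/cse494"]),
}

def fetch(url):
    """Stand-in for an HTTP fetch: the server strips the session id."""
    path, _, sid = url.partition(";sid=")
    body, links = PAGES[path]
    # Outgoing links carry a fresh session id, so their URLs never repeat.
    next_sid = int(sid or 0) + 1
    return body, [f"{link};sid={next_sid}" for link in links]

def crawl(seed, max_fetches=50):
    seen_signatures = set()   # signatures of page contents seen so far
    frontier = [seed]
    fetched = []
    while frontier and len(fetched) < max_fetches:
        url = frontier.pop(0)
        body, links = fetch(url)
        fetched.append(url)
        signature = hashlib.sha1(body).hexdigest()
        if signature in seen_signatures:
            continue          # content already seen: do not follow its links
        seen_signatures.add(signature)
        frontier.extend(links)
    return fetched

fetched_urls = crawl("/cse494")
```

In this toy run the crawl stops after three fetches: the third URL is
new, but its content is not, so its links are never followed. A test
on URLs alone would not help here, since every URL is different.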
The good news is that with cookies, this is less of a problem.
In answer to a question by Dennis, I seem to have suggested that
Mercator gets a performance improvement because it uses a
multi-threaded crawler with synchronous I/O (as against Google's
single-threaded crawlers with asynchronous I/O). The point the paper
makes is not one of efficiency but one of ease of software
engineering. Specifically, they say that the multi-threaded
synchronous I/O version is easier both to understand and to write
(since some of the more painful concurrency issues are delegated to
the thread scheduling facilities of the OS).
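As a rough illustration of that structure (again, not Mercator's
code), here is what a multi-threaded crawler with synchronous I/O
might look like. blocking_fetch() is a made-up stand-in that sleeps
instead of touching the network, and the example.org URLs are
placeholders. Each worker is a plain sequential loop; the OS thread
scheduler takes care of overlapping the waits.

```python
import queue
import threading
import time

def blocking_fetch(url):
    """Hypothetical synchronous fetch: blocks the calling thread, then returns."""
    time.sleep(0.01)  # pretend to wait on the network
    return f"<html>{url}</html>"

def worker(frontier, results, lock):
    # Straight-line code: get a URL, block on the fetch, store the page.
    while True:
        url = frontier.get()
        if url is None:       # sentinel: no more work for this thread
            break
        page = blocking_fetch(url)
        with lock:
            results[url] = page

def crawl(urls, num_threads=4):
    frontier = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(frontier, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        frontier.put(url)
    for _ in threads:
        frontier.put(None)    # one sentinel per worker
    for t in threads:
        t.join()
    return results

pages = crawl([f"http://example.org/page{i}" for i in range(10)])
```

The single-threaded asynchronous alternative would have to keep track
of every in-flight request by hand, which is the bookkeeping the
threads hide here.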
That is all.
Rao
[Oct 09, 2002]