
Second try at answering two of the Mercator-related questions



I had this half-guilty feeling that I may have given hand-wavy answers
to a couple of crawling-related questions today. So I went back and
looked at the Mercator paper again. Based on that, I would like to revise
my responses to two of the questions:

Thomas Hernandez (or was it Ehsan Laleka?) asked about the crawler
trap in terms of embedded session IDs.
  What the paper is saying is that before "cookies" became the main
  approach to keeping track of user browsing habits, web servers
  would dynamically modify the URLs on the page.

  The idea is this: suppose you ask for http://rakaposhi/cse494.
  On your first request, the server establishes a unique session ID and
  rewrites all the URLs in the cse494 index page so that they have the
  session ID embedded. The server has a way of stripping the
  session-ID information (e.g. with a cgi-bin script), but to
  the crawler two ID-embedded URLs that refer to the same page look
  completely different. Since there are infinitely many possible
  session IDs, in principle you have an infinite number of aliases for
  the same page.  The paper says that they handle this with
  "content-seen" tests--i.e. the second time the crawler comes to the
  same page, it knows it has seen the page (because it stored the
  page's signature), and so it stops crawling further from that page.
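
  To make the content-seen idea concrete, here is a minimal sketch in
  Python. It is not Mercator's code: the class name, the helper
  function, the example URLs with a "sid" parameter, and the use of
  SHA-1 as the page "signature" are all assumptions made for
  illustration. It also assumes the page body is byte-for-byte identical
  across the aliased URLs.

```python
import hashlib


class ContentSeenTest:
    """Remembers a fingerprint of every page body seen so far (hypothetical helper)."""

    def __init__(self):
        self._fingerprints = set()

    def seen_before(self, body: bytes) -> bool:
        # SHA-1 is only a stand-in for whatever signature a real crawler uses.
        fingerprint = hashlib.sha1(body).digest()
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


def urls_to_expand(fetched_pages, test):
    """Return the URLs whose links are worth following.

    fetched_pages is a list of (url, body) pairs; a page whose body was
    already seen under another URL is not expanded again.
    """
    expand = []
    for url, body in fetched_pages:
        if not test.seen_before(body):
            expand.append(url)
    return expand


# Two aliases of the same page (different made-up session IDs) plus one new page.
pages = [
    ("http://rakaposhi/cse494?sid=111", b"<html>cse494 index</html>"),
    ("http://rakaposhi/cse494?sid=222", b"<html>cse494 index</html>"),
    ("http://rakaposhi/cse494/notes", b"<html>lecture notes</html>"),
]
expanded = urls_to_expand(pages, ContentSeenTest())
```

  In this sketch only the first alias and the notes page are expanded;
  the second alias is recognized by its content, even though its URL
  looks new to the crawler.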
 
  The good news is that with cookies, this is less of a problem.


In answer to a question by Dennis, I seem to have suggested that
Mercator gets a performance improvement because it uses a
multi-threaded crawler with synchronous I/O (as against Google's
single-threaded crawlers with asynchronous I/O). The point the paper
makes is not one of efficiency but one of ease of software
engineering. Specifically, they say that the multi-threaded
synchronous I/O version is easier both to understand and to write
(since some of the more painful concurrency issues are delegated to
the thread-scheduling facilities of the OS).
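
To illustrate the software-engineering point, here is a rough sketch
of the multi-threaded, synchronous-I/O style in Python. It is not
Mercator's design: the fetch function is a hypothetical stand-in that
just sleeps instead of doing a real blocking HTTP request, and the
names worker, crawl, and num_threads are made up for this example.
Each thread runs a plain sequential loop, and the OS scheduler decides
which thread runs while the others are blocked.

```python
import queue
import threading
import time


def fetch(url: str) -> str:
    """Hypothetical stand-in for a blocking (synchronous) HTTP fetch."""
    time.sleep(0.01)  # pretend to wait on the network
    return f"<html>contents of {url}</html>"


def worker(frontier: queue.Queue, results: dict, lock: threading.Lock) -> None:
    # Straight-line code: take a URL, block on the fetch, record the page.
    while True:
        url = frontier.get()
        if url is None:  # sentinel: no more work for this thread
            break
        page = fetch(url)  # blocks only this thread
        with lock:
            results[url] = page


def crawl(urls, num_threads: int = 4) -> dict:
    frontier = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(frontier, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        frontier.put(url)
    for _ in threads:
        frontier.put(None)  # one sentinel per worker thread
    for t in threads:
        t.join()
    return results


crawled = crawl([f"http://rakaposhi/cse494/page{i}" for i in range(8)])
```

An asynchronous single-threaded version would have to keep track of
many partially completed fetches by hand; here that bookkeeping is
left to the threads themselves.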

That is all..

Rao
[Oct 09, 2002]