
comments on the "Link Analysis, Eigenvectors and Stability" paper




Having forwarded the paper on Link Analysis, Eigenvectors and
Stability, I had a chance to read it myself yesterday evening (did I
mention that my evenings are quite exciting...?)

Anyway, I think this paper is quite interesting, as it connects
three important ideas from the first part of the semester--PageRank,
the A/H algorithm (aka HITS) and LSI.(**)

I highly recommend it to you. You should be able to understand most
of the discussion, and at least some of the math.

The basic aim of the paper is to analyze the stability of the HITS
and PageRank algorithms with respect to small changes in link
configuration. The motivation is that any "intrinsic" measure of
importance must be stable with respect to small changes in link
structure. In other words, suppose you have crawled a million pages
and found page p1 to have authority 1; you don't want its authority
to drop to .00003 when you crawl (say) five more pages.

We all knew that the HITS algorithm has stability problems. For
example, suppose you have a total of m+n pages configured as an
m-cluster and an n-cluster, with m > n. As given, the A/H values of
all nodes in the n-cluster will be zero. But if we add just one
bridge link between the two clusters, the n-cluster A/H values will
go from zero to non-zero (specifically, as all of you who attempted
Hw 2 Qn 1 know, the ratio becomes

  (m - n) + Sqrt( (m - n)^2 + 4)
  -----------------------------
              2
)
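
If you want to see the instability concretely, here is a small numpy
sketch. I am assuming the two clusters are complete directed graphs,
which need not match the exact Hw 2 construction, so take the numbers
as illustrative rather than as the homework answer:

  import numpy as np

  def hits(adj, iters=100):
      """Power iteration for HITS authority and hub scores."""
      k = adj.shape[0]
      a, h = np.ones(k), np.ones(k)
      for _ in range(iters):
          a = adj.T @ h              # good authorities are pointed to by good hubs
          h = adj @ a                # good hubs point to good authorities
          a /= np.linalg.norm(a)
          h /= np.linalg.norm(h)
      return a, h

  def clique(k):
      """Complete directed graph on k nodes (no self-links)."""
      return np.ones((k, k)) - np.eye(k)

  m, n = 5, 4
  adj = np.zeros((m + n, m + n))
  adj[:m, :m] = clique(m)            # the m-cluster
  adj[m:, m:] = clique(n)            # the n-cluster

  print(hits(adj)[0].round(3))       # n-cluster authorities come out as 0
  adj[0, m] = 1                      # one bridge link from the m-cluster to the n-cluster
  print(hits(adj)[0].round(3))       # n-cluster authorities are now non-zero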

In other words, HITS is known to be unstable. This type of
instability is bad: if the crawler misses a small percentage of
links, the perceived authority values can change drastically.

It should also be reasonably intuitive that the PageRank algorithm,
when applied to the same set of pages, is likely to be more stable
(as the problem in Exam 1 would have indicated).

The IJCAI-01 paper I forwarded (by Ng et al.) analyzes the exact
causes of this instability and proposes some remedies to make the
HITS algorithm as stable as the PageRank algorithm.

Not entirely surprisingly, the stability of the PageRank algorithm
comes from the random surfer model it employs, wherein with a small
probability the surfer decides not to follow the outlinks, but rather
shifts to one of the pages in the universe with uniform probability.
So, the proposed changes to the HITS algorithm also involve this
randomization.
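
To give a flavor of the remedy, here is what such a randomized HITS
update might look like--each half-step mixes in a uniform jump,
PageRank-style. I am reconstructing this from memory of the paper, so
treat the normalization details as my assumptions rather than their
exact algorithm:

  import numpy as np

  def randomized_hits(adj, eps=0.15, iters=100):
      """HITS with a PageRank-style uniform jump mixed into each half-step."""
      k = adj.shape[0]
      outd = np.maximum(adj.sum(axis=1), 1)    # out-degrees
      ind = np.maximum(adj.sum(axis=0), 1)     # in-degrees
      a, h = np.ones(k) / k, np.ones(k) / k
      for _ in range(iters):
          # with probability eps jump uniformly; else follow a random out-link forward
          a = eps / k + (1 - eps) * (adj.T @ (h / outd))
          # with probability eps jump uniformly; else follow a random in-link backward
          h = eps / k + (1 - eps) * (adj @ (a / ind))
      return a, h

Since every score now gets at least eps/k from the uniform jump, the
n-cluster in the earlier example can no longer be driven all the way
to zero by the presence or absence of a single bridge link.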

The paper also relates the stability of authorities derived from
eigenvectors to the gap between the first and second eigenvalues--the
smaller this gap, the more unstable the eigenvectors are under small
perturbations.
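
You can see the eigengap effect in a tiny numpy experiment (the
matrices are mine, chosen so that the gap is the only thing that
differs between the two cases):

  import numpy as np

  big_gap  = np.diag([1.0, 0.5])                          # eigengap = 0.5
  tiny_gap = np.diag([1.0, 0.999])                        # eigengap = 0.001
  perturb  = 0.005 * np.array([[0.0, 1.0], [1.0, 0.0]])   # small symmetric noise

  for M in (big_gap, tiny_gap):
      v0 = np.linalg.eigh(M)[1][:, -1]             # top eigenvector, unperturbed
      v1 = np.linalg.eigh(M + perturb)[1][:, -1]   # top eigenvector, perturbed
      print(abs(v0 @ v1))   # ~1.0 for the big gap; well below 1 for the tiny gap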

Finally, they connect the LSI algorithm with the link analysis
algorithms. You all know that both have something to do with
eigenvectors. The difference is that in the case of LSI, we are
interested in the "subspace" spanned by these eigenvectors, while in
link analysis, we are interested in the eigenvectors themselves. The
interesting insight, apparently quite well known in the numerical
analysis community, is that subspaces tend to be stable under small
perturbations even though the eigenvectors themselves may not be
(notice that you can have different sets of (eigen)vectors spanning
the same subspace).
There is a very nice geometric illustration of this in Figure 2.
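
Extending the little experiment above: with a tiny gap inside the top
pair, the individual top eigenvector swings around under a small
perturbation, but the 2-D subspace spanned by the top two barely
moves (again, the matrices are just illustrative):

  import numpy as np

  M = np.diag([1.0, 0.999, 0.2])          # tiny gap inside the top pair
  E = np.zeros((3, 3))
  E[0, 1] = E[1, 0] = 0.005               # mixes the two close eigenvectors
  E[0, 2] = E[2, 0] = 0.005               # weak coupling to the rest

  V0 = np.linalg.eigh(M)[1][:, -2:]       # top-2 eigenvectors, unperturbed
  V1 = np.linalg.eigh(M + E)[1][:, -2:]   # top-2 eigenvectors, perturbed

  print(abs(V0[:, -1] @ V1[:, -1]))       # individual top eigenvector: rotated a lot
  P0, P1 = V0 @ V0.T, V1 @ V1.T           # projectors onto the two subspaces
  print(np.linalg.norm(P0 - P1))          # the subspace itself: nearly unchanged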

An altogether fun paper to read.


Rao
[May  1, 2001]

(**) Reminds me of something attributed to Euler. Euler apparently
said that he considered the following equality as proof that God
exists:

  e^{i*pi} = -1

as it puts together four of the most important constants, the
transcendentals e and pi, the imaginary number i, and the basis of
the natural numbers, 1, all in one equation.