
Thinking cap questions II: Distance measures; high-dimensional spaces etc.



First, a couple of fun links, followed by more thinking-cap questions:

0. A really good paper that discusses vexing search issues in an important web-search domain
       http://www.jmlg.org/papers/wankendon05.pdf

1. The Lincoln play by Woody Allen that I mentioned (about "how long should a man's legs be") is here
http://rakaposhi.eas.asu.edu/lincoln-query-woody-allen.pdf

2. Here is a bag-of-words analysis of Bush's State of the Union addresses through 2007
http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html

3. Link to the click-bias study (which showed that people tend to assume Google knows better than they
themselves do what is relevant). This is a good paper to skim to get an idea of what is involved in careful user studies
(contrast their user studies with the user studies in "0" above, which were much more profitable, as in footnote 1 of that paper).

http://www.cs.cornell.edu/People/tj/publications/joachims_etal_05a.pdf


==========Thinking cap questions=============


0. A comment on the "Reduced Shakespeare Company" is that the more you know the original, the funnier it gets. Keeping that in mind, what is a sedate but academic summary of the point of the paper in "0" above?
  (For example, why do they say that they found the df measure, instead of idf, to be more suitable for their purposes?)

1. What is the highest value the normalized Euclidean distance measure can take? (Recall that it is just the distance between two unit vectors in n dimensions.)


2. What is the effect of alpha/beta/gamma values in the Rocchio relevance feedback method on precision and recall?


3. Does it make sense to do stop-word elimination in the context of vector similarity ranking with tf/idf weights?
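
For those who want to experiment with questions 1 and 2 before answering them, here is a minimal Python sketch. It is only one possible formulation: the function names are made up for illustration, the default alpha/beta/gamma values are arbitrary placeholders rather than recommended settings, and this version of Rocchio does not clip negative weights to zero (some formulations do).

```python
import math

def normalize(v):
    """Scale v to unit length (Euclidean norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def normalized_euclidean(u, v):
    """Euclidean distance between the unit-length versions of u and v."""
    u, v = normalize(u), normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query: alpha*query + beta*centroid(relevant) - gamma*centroid(nonrelevant).

    The defaults are illustrative assumptions only; negative weights are kept as-is.
    """
    dim = len(query)
    rel_c = centroid(relevant, dim)
    non_c = centroid(nonrelevant, dim)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]
```

Try normalized_euclidean on pairs of term-weight vectors (which, with tf/idf weights, have no negative components) and see how large the value gets; then vary alpha, beta and gamma in rocchio and look at how far the modified query moves from the original one.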


=====================