(Yikes--here is the complete version--blog comments welcome) Another way to view the approximation of R(d|Q,U,{d1...dk}) in traditional IR

One possibly frustrating part of today's class is the impedance mismatch between what the class focuses on (how traditional IR is done) and what you already know: you may not know much about traditional IR, but you do know about the ways in which it has been extended and used in search engines.

Here is a slightly different way to see how we get the traditional IR model from the "true" relevance model.

We agreed that relevance R is a function of D, Q, U, and the already shown docs {d1...dk}.

The next question is how we represent (or choose not to represent) the various components:

document D:
--Can be represented in terms of "meaning" (too hard ;-)
--Can be represented in terms of just the words it has
     --Can be seen as a "set" of keywords, a "bag" of keywords, or "shingles" of keywords
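
To make the three word-level representations concrete, here is a minimal sketch in Python. The function names (tokenize, as_set, as_bag, as_shingles), the whitespace tokenizer, and the shingle width k=2 are all assumptions made for illustration; real systems typically also do stemming, stop-word removal, and so on.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stemming or stop words.
    return text.lower().split()

def as_set(text):
    # "Set" of keywords: only presence or absence of each word matters.
    return set(tokenize(text))

def as_bag(text):
    # "Bag" of keywords: word counts are kept, word order is dropped.
    return Counter(tokenize(text))

def as_shingles(text, k=2):
    # "Shingles": the set of all runs of k consecutive words,
    # which keeps a little local word order.
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "the cat sat on the mat"
print(as_set(doc))
print(as_bag(doc))
print(as_shingles(doc))
```

Note how the set forgets that "the" occurs twice, the bag remembers the count but not the position, and the shingles keep adjacent pairs such as ("the", "cat").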


User U
-- can be represented in terms of the "properties" of the user
     --interests of the user
     --features (e.g. age, salary, domicile etc) of the user
-- can be "ignored"

Query Q
-- can be represented in terms of the full context of the query
-- can be represented in terms of the keywords (i.e., a mini-document)
       and "context" properties (e.g. location of the query, current context of the query)

Traditional IR, as we shall see, uses a keyword representation for D and Q, and largely ignores U.

Given this representation, it makes perfect sense to think of relevance as the "similarity" between D and Q
(and to assess residual relevance in terms of dissimilarity between D and {d1...dk}).
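
As a rough illustration of this view, the sketch below treats D and Q as bags of keywords, uses cosine similarity as the "similarity", and discounts a document by how similar it is to the already shown docs. Cosine is just one common choice of similarity measure, and the particular way of combining the two quantities in residual_relevance (a hypothetical name) is an assumption made for the example, not the definition used in class.

```python
import math
from collections import Counter

def bag(text):
    # Bag-of-keywords representation (assumption: whitespace tokenization).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bags viewed as word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def residual_relevance(doc, query, shown):
    # Relevance as similarity between D and Q ...
    sim_q = cosine(bag(doc), bag(query))
    # ... discounted by similarity to the most similar already-shown doc.
    # Assumption: this multiplicative discount is purely illustrative.
    redundancy = max((cosine(bag(doc), bag(s)) for s in shown), default=0.0)
    return sim_q * (1.0 - redundancy)

query = "cheap flights to paris"
d = "cheap flights to paris from phoenix"
print(residual_relevance(d, query, shown=[]))
print(residual_relevance(d, query, shown=["paris flights cheap from phoenix to"]))
```

With nothing shown yet, the score is just the similarity between D and Q. Once a document with the same bag of words has been shown, the residual score drops to essentially zero, even though the word order differs.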

Rao

ps: Regarding the "accuracy"/"trustworthiness" aspect, note that traditional IR assumes that the query is being run on a curated corpus where all documents are presumed trustworthy, so this doesn't need to be modeled. Also, while there is a need to model it on the web, my own preference is to separate this from relevance. This is because trust/accuracy is a badge that the user has no way of verifying, while relevance is something they can judge locally.