Consider the 10-document, 6-keyword running example I used in class
about books on databases and statistics. Start with the term-occurrence
matrix (the one that says d1 has 24 occurrences of t1, 21 of t2,
etc.).

Assume that the query is Q={database,index} (as in the example in the
class). 

This question essentially asks you to verify that the answers I showed
in class are correct (so you know what your answers are supposed
to be). You need to show _all_ the work that led you to the answer.

1. Assuming that we are using the raw term frequencies as the basis
   for document vector representation, compute the similarity between
   the query Q and the documents d5 and d2.  
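
A minimal sketch of the computation in part 1, assuming that cosine
similarity is the measure intended. The document vector below is a
hypothetical placeholder, NOT a row of the class matrix; substitute the
actual rows for d5 and d2. The query vector <1,0,1,0,0,0> is the one
given in part 3.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

# Query Q = {database, index} over the 6 keywords.
query = [1, 0, 1, 0, 0, 0]

# Hypothetical raw term counts, for illustration only.
example_doc = [3, 0, 4, 0, 0, 0]

print(cosine_similarity(query, example_doc))
```

For the placeholder vector the dot product is 7 and the norms are
sqrt(2) and 5, so the result is 7 / (5 * sqrt(2)).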

1.b. Do part 1 again, this time viewing the documents as bags (rather
than vectors) of words, and using Jaccard similarity.
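
A sketch of Jaccard similarity on bags, assuming the common multiset
convention: intersection takes the minimum count of each word and union
takes the maximum. Other conventions exist (for example, a bag union
that adds counts), so check which one the class slides use. The
document counts and the term name "sql" are hypothetical.

```python
from collections import Counter

def bag_jaccard(a, b):
    """Jaccard similarity of two bags: sum of min counts / sum of max counts."""
    words = set(a) | set(b)
    # Counter returns 0 for a word it does not contain.
    intersection = sum(min(a[w], b[w]) for w in words)
    union = sum(max(a[w], b[w]) for w in words)
    return intersection / union if union else 0.0

# The query as a bag: one occurrence each of its two keywords.
query = Counter({"database": 1, "index": 1})

# Hypothetical document bag, for illustration only.
example_doc = Counter({"database": 3, "index": 4, "sql": 2})

print(bag_jaccard(query, example_doc))
```

Here the intersection size is 1 + 1 + 0 = 2 and the union size is
3 + 4 + 2 = 9, giving 2/9.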

2. Assume that we want to use the tf/idf weighting scheme to
   represent the document vectors. Give the vector representation of
   d1 and d2. (Note: to get the answers I have, you should *not*
   normalize the term frequency weight--so the tf of term 1 in doc 1
   is 24 and not 24/24; also take the _natural_ logarithm in the IDF
   weight. In practice, it is a good idea to normalize tf, and it
   doesn't much matter which base of logarithm you use.)
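
A sketch of the weighting described in part 2, assuming the IDF of a
term is ln(N / df), where N is the number of documents and df is the
number of documents containing the term. That formula is an
assumption; use the one from the slides if it differs. As the note
says, tf is left unnormalized. The count matrix is a hypothetical
4-document, 3-term toy, not the class matrix.

```python
import math

def tfidf_vectors(counts):
    """Weight raw counts by idf = ln(N / df); tf is not normalized."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # A term that occurs nowhere gets idf 0 (a convention chosen here).
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

# Hypothetical raw counts: rows are documents, columns are terms.
toy_counts = [
    [2, 0, 1],
    [0, 3, 1],
    [4, 0, 1],
    [0, 0, 1],
]

for vector in tfidf_vectors(toy_counts):
    print(vector)
```

In the toy matrix the third term occurs in every document, so its IDF
is ln(4/4) = 0 and its weight vanishes in every vector.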
 
  
3. Give the similarity between the query Q and documents d5 and d2 if
   we use the tf/idf representation from 2 for the documents (but the
   query still uses the standard representation, i.e. <1,0,1,0,0,0>).


4. We continue to use the tf/idf representation. For this part you can
   use the tf/idf representations of the documents d4, d6 and d8 as given
   in the slides in class (you don't have to verify them). Suppose
   you showed d4 and d6 to the user and the user clicked on d6 but not
   d4.  Show how the Rocchio relevance feedback method would change the
   query vector based on the user feedback. Compute the similarity
   between this modified query Q' and the document d8. Compare this value 
   to the similarity between Q and d8, and explain if the answer makes 
   sense. 
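
A sketch of the Rocchio update for part 4, assuming the usual form
Q' = alpha*Q + beta*(centroid of relevant docs) - gamma*(centroid of
non-relevant docs). The default weights alpha = beta = gamma = 1 are an
assumption; use whatever values were given in class. The vectors below
are hypothetical 3-term placeholders, not the tf/idf vectors of d4 and
d6.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Move the query toward relevant documents and away from non-relevant ones."""
    dim = len(query)

    def centroid(docs):
        # Component-wise mean; an empty set contributes the zero vector.
        if not docs:
            return [0.0] * dim
        return [sum(d[j] for d in docs) / len(docs) for j in range(dim)]

    pos = centroid(relevant)
    neg = centroid(nonrelevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

# Hypothetical vectors, for illustration only.
query = [1.0, 0.0, 1.0]
clicked = [[2.0, 1.0, 0.0]]   # plays the role of d6 (relevant)
skipped = [[0.0, 3.0, 1.0]]   # plays the role of d4 (non-relevant)

print(rocchio(query, clicked, skipped))
```

This sketch keeps negative components in Q' as they are. Some
formulations clip them to zero, which would change the subsequent
similarity with d8.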


5. Suppose, after part 3, you made the following change to the
   document d5: you opened d5, selected all the text, cut it,
   pasted it twice, and saved it. This effectively doubles everything
   in d5.  Explain which of the following change, and how:
   1. The raw term frequency representation of d5
   2. The similarity between Q and d5 (as measured in terms of the raw term
      frequency vector)
   3. The tf/idf representation of d5
   4. The similarity between Q and d5 (as measured in terms of the
      tf/idf vector representation).