Consider the 10-document, 6-keyword running example I used in class about books on databases and statistics. Start with the term-occurrence matrix (the one that says d1 has 24 occurrences of t1, 21 of t2, etc.). Assume that the query is Q={database,index} (as in the example in class). This question essentially asks you to verify that the answers I showed in class are correct, so you know what your answers are supposed to be. You need to show _all_ the work that led you to the answer.

1. Assuming that we are using the raw term frequencies as the basis for the document vector representation, compute the similarity between the query Q and the documents d5 and d2.

1.b. Do part 1 again, this time viewing the documents as bags (rather than vectors) of words, and using Jaccard similarity.

2. Assume that we want to use the tf/idf weighting scheme to represent the document vectors. Give the vector representations of d1 and d2. (Note: to get the answers I have, you should *not* normalize the term frequency weight, so the tf of term 1 in doc 1 is 24 and not 24/24; also take the _natural_ logarithm in the IDF weight. In practice, it is a good idea to normalize tf, and it doesn't much matter which base of logarithm you use.)

3. Give the similarity between the query Q and the documents d5 and d2 if we use the tf/idf representation from part 2 for the documents (but the query still uses the standard representation, i.e. <1,0,1,0,0,0>).

4. We continue to use the tf/idf representation. For this part you can use the tf/idf representations of the documents d4, d6 and d8 as given in the class slides (you don't have to verify them). Suppose you showed d4 and d6 to the user, and the user clicked on d6 but not d4. Show how the Rocchio relevance feedback method would change the query vector based on this feedback. Compute the similarity between the modified query Q' and the document d8. Compare this value to the similarity between Q and d8, and explain whether the answer makes sense.

5. Suppose that, after part 3, you made the following change to the document d5: you open d5, select all the text, cut it, paste it twice, and save it. This effectively doubles everything in d5. Explain which of the following change, and how:
   1. The raw term frequency representation of d5
   2. The similarity between Q and d5 (as measured in terms of the raw term frequency vector)
   3. The tf/idf representation of d5
   4. The similarity between Q and d5 (as measured in terms of the tf/idf vector representation)
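
If you want to check your hand computations by machine, the quantities used in parts 1 through 5 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the method from the slides: it assumes "similarity" means cosine similarity, that bag Jaccard is sum-of-minimums over sum-of-maximums, that IDF is ln(N/df), and that Rocchio uses alpha = beta = gamma = 1 (the class may use different weights). The function names and the 3-term toy matrix are hypothetical and are NOT the class data; substitute the real 10x6 matrix yourself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def bag_jaccard(u, v):
    """Jaccard similarity on bags (assumed definition: sum of mins / sum of maxes)."""
    inter = sum(min(a, b) for a, b in zip(u, v))
    union = sum(max(a, b) for a, b in zip(u, v))
    return inter / union if union else 0.0

def idf_weights(docs):
    """IDF per term, assumed to be ln(N / df) with the natural logarithm."""
    n_docs = len(docs)
    weights = []
    for t in range(len(docs[0])):
        df = sum(1 for d in docs if d[t] > 0)  # number of documents containing term t
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights

def tfidf(doc, idf):
    """tf/idf vector using the raw (unnormalized) term frequency."""
    return [tf * w for tf, w in zip(doc, idf)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update: alpha*Q + beta*mean(relevant) - gamma*mean(nonrelevant).

    The default weights are an assumption; negative weights are not clipped.
    """
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    rel = centroid(relevant)
    non = centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]

# Hypothetical toy data (3 documents, 3 terms), NOT the matrix from class.
toy_docs = [
    [2, 0, 1],
    [0, 3, 1],
    [4, 0, 0],
]
toy_query = [1, 0, 1]

idf = idf_weights(toy_docs)
toy_tfidf = [tfidf(d, idf) for d in toy_docs]

print("raw cosine(Q, doc0):   ", cosine(toy_query, toy_docs[0]))
print("bag Jaccard(Q, doc0):  ", bag_jaccard(toy_query, toy_docs[0]))
print("tf/idf cosine(Q, doc0):", cosine(toy_query, toy_tfidf[0]))

# Feedback: pretend doc1 was clicked (relevant) and doc0 was not.
new_query = rocchio(toy_query, [toy_tfidf[1]], [toy_tfidf[0]])
print("Rocchio-modified query:", new_query)
```

To use it for this assignment, replace `toy_docs` with the term-occurrence matrix from class and `toy_query` with <1,0,1,0,0,0>, then compare the printed values against your hand-worked answers. You still need to show the work by hand.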