Question 1. K-means on documents
The question asks you to use bag similarity (also called Jaccard
similarity) instead of vector similarity. The similarity measure is
defined as the ratio of the cardinality of the intersection of two bags
to the cardinality of their union. The thing to note here is that the
intersection of two bags (containing multiple instances of items, say e1
and e2) is a bag that contains as many e1 as the minimum of the e1 counts
in the two bags and as many e2 as the minimum of the e2 counts.
Likewise, the union contains as many of each item as the maximum of its
counts in the two bags.
So if
B1 = 2 e1, 5 e2
B2 = 4 e1, 2 e2
then
B1 .intersection. B2 = 2 e1, 2 e2
B1 .union. B2 = 4 e1, 5 e2
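The cardinality of a bag is, of course, the number of items in that bag,
counting repeated items. In the example, the intersection has 2 + 2 = 4
items and the union has 4 + 5 = 9 items, so the similarity of B1 and B2
is 4/9.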
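
As a minimal sketch of the computation (not part of the original question), the bag operations can be written with Python's collections.Counter, whose & and | operators take the per-item minimum and maximum of counts. The function name bag_jaccard and the choice to return 0.0 for two empty bags are assumptions made for illustration.

```python
from collections import Counter


def bag_jaccard(bag_a, bag_b):
    """Bag (multiset) Jaccard similarity: |A intersect B| / |A union B|.

    Accepts Counters or any iterable of items (e.g. a list of words).
    """
    a, b = Counter(bag_a), Counter(bag_b)
    intersection = a & b   # per-item minimum of counts
    union = a | b          # per-item maximum of counts
    union_size = sum(union.values())
    if union_size == 0:
        return 0.0         # assumed convention for two empty bags
    return sum(intersection.values()) / union_size


# The example from the text: B1 = 2 e1, 5 e2 and B2 = 4 e1, 2 e2
b1 = Counter({"e1": 2, "e2": 5})
b2 = Counter({"e1": 4, "e2": 2})

print(b1 & b2)               # intersection: 2 e1, 2 e2
print(b1 | b2)               # union: 4 e1, 5 e2
print(bag_jaccard(b1, b2))   # 4/9, about 0.444
```

For documents, each bag would be the list of words in the document, so bag_jaccard(doc1.split(), doc2.split()) gives the similarity between two documents under this measure.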