
Euclidean, Cosine and Jaccard similarity measures



Hi all,

In the course slides there is an example showing that Euclidean distance 
is less effective at separating the documents (slide #58). Please note that 
this happens partly because the regular Euclidean distance does not 
normalize the vectors, while cosine distance divides by the norms of the 
vectors and thus removes the influence of their magnitudes. 

Actually, if we normalize the vectors before we calculate the Euclidean 
distance, it performs almost as well as cosine distance. 
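
A minimal sketch of why this happens, using made-up term-count vectors 
(not the 10 documents from the slides; all names below are hypothetical). 
For unit-length vectors, the squared Euclidean distance equals 
2 - 2 * cosine similarity, so both measures rank documents the same way.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

def euclidean(a, b):
    """Plain Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical term counts: doc_long is doc_short repeated three times,
# doc_other is about different terms.
doc_short = [1, 2, 0, 1]
doc_long = [3, 6, 0, 3]
doc_other = [0, 1, 3, 0]

# Without normalization, the same-topic but longer document looks
# farther away than the unrelated one.
raw_long = euclidean(doc_short, doc_long)
raw_other = euclidean(doc_short, doc_other)
print("raw:", raw_long, raw_other)

# After normalization, the same-topic document is at distance ~0.
unit_long = euclidean(normalize(doc_short), normalize(doc_long))
unit_other = euclidean(normalize(doc_short), normalize(doc_other))
print("normalized:", unit_long, unit_other)

# Identity for unit vectors: d^2 = 2 - 2 * cos
cos_other = cosine_similarity(doc_short, doc_other)
print("d^2:", unit_other ** 2, "2 - 2cos:", 2 - 2 * cos_other)
```

The remaining differences between the two measures come only from the 
square root, which changes the spacing of the values but not their order.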

On the other hand, Jaccard similarity performs reasonably well, normalized or 
not (by its definition, the lengths of the vectors are already indirectly 
taken into account...), although in my testing cosine similarity seems to do 
a little better.
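
A small sketch of the Jaccard case. The exact variant used in the test is 
not stated here, so this assumes the extended (Tanimoto) form for 
real-valued vectors, a.b / (|a|^2 + |b|^2 - a.b); the vectors are made up 
for illustration. The squared lengths in the denominator are what take the 
vector lengths into account.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def tanimoto(a, b):
    """Extended Jaccard (Tanimoto) similarity for real-valued vectors.

    Assumed variant: a.b / (|a|^2 + |b|^2 - a.b).
    """
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

# Hypothetical term counts: doc_long is doc_short repeated three times,
# doc_other is about different terms.
doc_short = [1, 2, 0, 1]
doc_long = [3, 6, 0, 3]
doc_other = [0, 1, 3, 0]

# Unnormalized: the same-topic document still scores higher than the
# unrelated one, although the length difference lowers its score.
j_long = tanimoto(doc_short, doc_long)
j_other = tanimoto(doc_short, doc_other)
print("raw:", j_long, j_other)

# Normalized: the same-topic document scores ~1.
j_long_unit = tanimoto(normalize(doc_short), normalize(doc_long))
j_other_unit = tanimoto(normalize(doc_short), normalize(doc_other))
print("normalized:", j_long_unit, j_other_unit)
```

In this toy case the ranking is the same with or without normalization, 
which matches the observation above; the set-based Jaccard on term 
presence would be another reasonable reading.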

Please see the attached picture for a comparison of the measures discussed 
above. The example uses the same 10 documents as the course slides.

Cheers,
Jianchun

[Attachment: image/pjpeg]