
Something to think about: Similarity/Distance Metrics



We saw several similarity metrics for documents these last two classes. A question is whether any old
function can be considered a similarity function, or whether there are some minimum desiderata.

You should keep in mind that similarity is really just the inverse of "distance"--the more distant two
objects are from each other, the less similar they are.

This brings us to desirable properties of distance (and thus similarity) metrics.

The typical required properties of a distance metric are given below (from the Wikipedia entry on metrics).

You want to check whether each of the metrics we have seen until now--Jaccard, Euclidean, cosine-theta--satisfies all these properties.
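
As a rough sketch of how you might play with them, here is one way to compute the three in Python (the toy documents and the tiny vocabulary below are made up purely for illustration):

import math

def jaccard_sim(a, b):
    # Jaccard similarity between two sets of words
    return len(a & b) / len(a | b)

def euclidean_dist(u, v):
    # straight-line distance between two term-count vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_sim(u, v):
    # cosine of the angle between two term-count vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# toy documents over the vocabulary [the, cat, dog]
d1, d2 = [2, 1, 0], [1, 0, 1]
print(jaccard_sim({"the", "cat"}, {"the", "dog"}))   # 1/3
print(euclidean_dist(d1, d2))                        # sqrt(3), about 1.73
print(cosine_sim(d1, d2))                            # 2/sqrt(10), about 0.63

Remember that Jaccard and cosine are similarities, so to check them against the distance properties below you would first flip them into distances (e.g., 1 minus the similarity).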

In the literature, people do consider "distance" functions that do not satisfy all these properties. Typically, the first one to be sacrificed is
the triangle inequality property.  Sometimes even the symmetry property is given up. An example of an asymmetric distance is
the KL-divergence between two probability distributions (see http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence ).
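
To see the asymmetry concretely, here is a small sketch (the two distributions are just made-up examples):

import math

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))   # about 0.368
print(kl_divergence(q, p))   # about 0.511 -- not equal, so symmetry fails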

A metric on a set X is a function (called the distance function or simply distance)

d : X × X → R

(where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:

  1. d(x, y) ≥ 0     (non-negativity)
  2. d(x, y) = 0   if and only if   x = y     (identity of indiscernibles. Note that conditions 1 and 2 together produce positive definiteness)
  3. d(x, y) = d(y, x)     (symmetry)
  4. d(x, z) ≤ d(x, y) + d(y, z)     (subadditivity / triangle inequality).

Rao