
Core-less apples and other curiosities of high dimensions




(I seem to be on a roll this morning...)

K-means style approaches are called Euclidean-distance based
clustering algorithms--in that they depend on the pairwise Euclidean
distances between the data points.
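
For concreteness, here is a minimal k-means sketch in Python/numpy.
It is just an illustration (the toy data, k, and iteration count are
made up), not anybody's production clustering code:

    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        # Lloyd's iteration: assign each point to its nearest centroid by
        # Euclidean distance, then move each centroid to the mean of its points.
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = points[labels == j]
                if len(members):
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

    # toy 2-D data: two well-separated blobs
    data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
    labels, centroids = kmeans(data, k=2)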

One of the biggest problems in clustering documents using Euclidean
distances (as the K-means algorithm does) is that there are typically
far too many dimensions (keywords) describing these documents. Most
clustering ideas that look reasonable in low dimensions become quite
useless in high dimensions.

To understand this, we need to understand what goes awry in high
dimensions. Here are some non-intuitive facts:

1. Data in high-dimensional spaces tend to be extremely sparse. If you
   have 20 million docs, you might think that is a whole lot of data
   points to do clustering with. But in a 20- or 30-dimensional space,
   20 million points is really nothing. With very high probability they
   will all be roughly equidistant from each other, which means there
   will be no natural distance-based clusters.
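
A quick way to see this "everything is roughly equidistant" effect
empirically (a sketch; the point count and dimensions are arbitrary)
is to draw random points and watch the relative spread of their
pairwise distances shrink as the dimension grows:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 20, 100, 1000):
        pts = rng.random((500, d))              # 500 random points in the unit d-cube
        sq = (pts ** 2).sum(axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * pts @ pts.T, 0.0)
        dists = np.sqrt(d2)[np.triu_indices(500, k=1)]   # distinct pairs only
        # std/mean of the pairwise distances shrinks as d grows, i.e. the
        # points look more and more equidistant from each other
        print(f"d={d:5d}  std/mean of pairwise distance = {dists.std() / dists.mean():.3f}")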

An interesting way of illustrating this is to think of an
n-dimensional apple (a sphere, say).

Suppose we have an n-dimensional sphere of radius r. Suppose we remove
a small peel of epsilon thickness from this sphere (apple). What do
you think is the ratio of the remaining volume of the sphere to the
original volume of the sphere?

You would think it will be very close to 1, right?

Think again. As the number of dimensions, n, tends to infinity, this
ratio tends to zero. Another way of putting it is that a
high-dimensional apple is all peel and no core ;-) 

(You can prove this fairly easily. We know that a 2-D sphere has
volume (area) proportional to r^2 and a 3-D sphere has volume
proportional to r^3, so an n-D sphere has volume proportional to r^n.
If we remove a peel of thickness e, the remaining fraction of the
volume is the ratio

[(r-e)/r]^n

Since (r-e)/r is smaller than 1, raising it to higher and higher
powers drives it to zero.)

For r=1 and e=0.01, the ratio comes to roughly

97% for 3-d
82% for 20-d
37% for 100-d
0.004% for 1000-d

Note that 1000-d is not at all far-fetched, since in information
retrieval we might have several thousand keywords...
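
The numbers above are easy to reproduce with a couple of lines of
Python (r=1 and e=0.01, as in the text):

    r, e = 1.0, 0.01
    for n in (3, 20, 100, 1000):
        # fraction of the sphere's volume left after removing the peel
        print(n, ((r - e) / r) ** n)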

A somewhat non-obvious corollary of the above is the following:

2. Two random vectors in high-dimensional space are, with high
   probability, nearly orthogonal to each other (i.e., their
   cosine-similarity value will be close to zero). This too means that
   there will be few natural clusters to find in general data.
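
This too is easy to check empirically. A sketch, with random Gaussian
vectors standing in for document vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (3, 100, 10000):
        a = rng.standard_normal((200, d))
        b = rng.standard_normal((200, d))
        cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
        # the typical |cosine| falls off like 1/sqrt(d): near-orthogonal in high-d
        print(f"d={d:6d}  mean |cosine| over 200 random pairs = {np.abs(cos).mean():.4f}")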


There are various ways of getting around this curse of dimensionality,
of course (this is a hot research area). One idea is to first run LSI
on the data, project the data onto a few of the most important
dimensions, and cluster in that lower-dimensional space. This,
approximately, is what Manjara tries to do.
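
A rough sketch of that pipeline (not Manjara's actual code -- the
doc-by-keyword matrix here is random, and scikit-learn merely stands
in for a real LSI and clustering implementation):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD   # LSI-style projection
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((2000, 5000))          # toy doc-by-keyword matrix (random stand-in)

    # step 1: project the documents down to a handful of "semantic" dimensions
    docs_lowdim = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

    # step 2: plain Euclidean k-means behaves far better in 50-d than in 5000-d
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(docs_lowdim)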

Rao