
Re: an example in today's lecture




cwen> Prof Rao,
cwen> 
cwen> You talked about an example today involving three doc's: one 
cwen> containing "driver", the second one containing "driving" and the third 
cwen> one "abdominal pain". I didn't quite get the main points regarding this 
cwen> example. Could you please clarify for me?
cwen> 
cwen> Thanks
cwen> 
cwen> Catherine


The point is that although "driver" and "driving" are different terms with
different meanings, the meaning differential between them is
overshadowed by the difference either of them has with the third
keyword, "abdominal pain".

Suppose you have some documents made up of just these words.

By acting as if "driver" and "driving" are the same, you will wind up
changing the "actual meaning" of the documents that contain those
words.

But, inasmuch as you are only interested in answering queries of the
user (or, equivalently, in classifying documents in your corpus into
clusters), this loss of meaning may not matter--especially if your
queries are broadly classified into automotive vs. gastro-intestinal.
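To make this concrete, here is a toy sketch (the documents, the crude
stemmer, and the queries are all made up for illustration): even after
"driver" and "driving" are collapsed into one term--changing the
documents' "actual meaning"--an automotive query still retrieves exactly
the automotive documents, and a gastro-intestinal query the other one.

```python
def stem(term):
    # Deliberately crude illustrative stemmer: conflate the two
    # automotive terms into a single pseudo-term "drive".
    return "drive" if term in ("driver", "driving") else term

# The three documents from the lecture example.
docs = {
    "d1": ["driver"],
    "d2": ["driving"],
    "d3": ["abdominal", "pain"],
}

def matches(query_terms, doc_terms):
    # A document matches if it shares at least one stemmed term
    # with the query.
    q = {stem(t) for t in query_terms}
    d = {stem(t) for t in doc_terms}
    return bool(q & d)

auto_hits = [name for name, terms in docs.items() if matches(["driving"], terms)]
gi_hits = [name for name, terms in docs.items() if matches(["abdominal"], terms)]
print(auto_hits)  # ['d1', 'd2']
print(gi_hits)    # ['d3']
```

The conflation is lossy (you can no longer tell d1 from d2), but for
queries at the automotive-vs-gastro-intestinal granularity nothing is lost.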

---Continuing to elaborate... (and going into deeper issues)

The basic distinction I am making is that between Regression
(compression) and Classification. The objective in
regression/compression is to find a representation for the data that
is very faithful to the data, but hopefully takes less space. When you 
fit a bunch of b-splines to approximate a curve, you are doing
"regression". The objective of classification is to classify the data
into "classes". If the number of classes is much less than the number
of data points (as is usually the case), then you may find a way of
representing the data that is 100% accurate with respect to
classification even though it is pretty bad with respect to regression 
(reconstruction of the data).

Classification is related to regression in that one way of classifying
data is to first find a compressed representation of the data
(e.g. find features that closely describe the data). However,
ultimately classification will never be asked to reconstruct the original
data, and thus it can use more aggressive compression techniques.  As a
classic example consider the following scenario. We have a set of
color photographs.

1. If we want to store a compressed representation of them that
doesn't change their content too much, we use the discrete cosine
transform (JPEG), the Fourier transform, etc.--taking care to ensure that
we just give up some high-frequency components in the transform domain.

2. If instead we just need to classify them into (say) two
classes--those that are bright and those that are dark--a single
number, such as the average intensity of the picture, will be
enough. Notice that with the average intensity value we can do 100%
correct classification, although it is a lousy regression
(reconstruction) technique--since the best reconstruction you can make
from it is a uniform dark or light blob.
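A minimal sketch of scenario 2 (the pixel values and the threshold of 128
are made up): the one-number summary classifies bright vs. dark perfectly,
while "reconstructing" from it yields nothing but a flat blob.

```python
# Hypothetical grayscale pixel values for two tiny "photographs".
bright_photo = [200, 220, 180, 240]
dark_photo = [30, 10, 50, 20]

def average(pixels):
    return sum(pixels) / len(pixels)

def classify(pixels, threshold=128):
    # Classification: a single number suffices.
    return "bright" if average(pixels) > threshold else "dark"

def reconstruct(pixels):
    # "Regression" from the same single number: every pixel
    # becomes the mean, i.e. a uniform blob.
    m = average(pixels)
    return [m] * len(pixels)

print(classify(bright_photo))   # bright
print(classify(dark_photo))     # dark
print(reconstruct(dark_photo))  # [27.5, 27.5, 27.5, 27.5]
```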

----Coming to LSI

The LSI scheme can be seen as a systematic compression scheme, which
allows you to reduce the dimensionality of the t-d matrix by getting
rid of useless and noisy dimensions.

If that is the only way you see LSI, then you cannot justify getting
rid of singular values unless they are *significantly* smaller than
the remaining ones. 

In the example I showed in class, the singular values do not really
fall off drastically. This is why, if you keep just the 2 highest
singular values, the reconstructed matrix is pretty far from the
original one.  Fortunately, however, the fact that the reconstruction is
far from the original doesn't necessarily mean that query retrieval using
the reduced (highly lossy) space is going to be equally bad (as
illustrated by the photograph example above).
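Here is a sketch of that phenomenon with a made-up term-document matrix
(not the one from class; I added a fourth document in which "driver" and
"driving" co-occur, since LSI needs some co-occurrence to bring the two
terms together). The singular values do not fall off sharply, and the
rank-2 reconstruction is measurably far from the original--yet in the
reduced space the "driver" document and the "driving" document line up,
while both stay orthogonal to the "abdominal pain" document.

```python
import numpy as np

# Toy t-d matrix.  Rows: driver, driving, abdominal, pain.
# Columns: d1 (driver+driving), d2 (driver), d3 (driving),
#          d4 (abdominal pain).
A = np.array([
    [1., 1., 0., 0.],   # driver
    [1., 0., 1., 0.],   # driving
    [0., 0., 0., 1.],   # abdominal
    [0., 0., 0., 1.],   # pain
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 3))   # largest two are ~1.732 and ~1.414: no sharp drop

# Keep only the 2 highest singular values.
A2 = U[:, :2] * s[:2] @ Vt[:2, :]
print(np.linalg.norm(A - A2))   # 1.0 -- reconstruction is far from A

# Document coordinates in the 2-d LSI space (one row per document).
docs = Vt[:2, :].T * s[:2]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cos(docs[1], docs[2]), 3))   # 1.0: driver-doc ~ driving-doc
print(round(cos(docs[1], docs[3]), 3))   # 0.0: driver-doc vs abdominal pain
```

In the original matrix d2 and d3 share no terms (cosine 0), so the
reduced space is "wrong" as a reconstruction but arguably *better* for
retrieval at the automotive-vs-gastro-intestinal granularity.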


Hope this helps  ;-)
Rao
----
My student asked me for the time, and I explained to her how to make
a watch...
  -Chris in the morning; Northern Exposure...