(re-sent) [Thinking Cap] on Latent Semantic Indexing
- To: Rao Kambhampati <rao@asu.edu>
- Subject: (re-sent) [Thinking Cap] on Latent Semantic Indexing
- From: Subbarao Kambhampati <rao@asu.edu>
- Date: Tue, 16 Feb 2010 21:07:43 -0700
- Sender: subbarao2z2@gmail.com
[Sorry--the previous version got sent prematurely. Here is the correct one]
0. We have 100 documents that are all described as vectors in the space of 1000 keywords. What is the largest number of non-zero singular values this document-term matrix can have?
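For question 0, the bound can be checked numerically. Here is a minimal sketch using NumPy; the matrix below is random stand-in data, not any actual document collection:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 documents as rows, 1000 keywords as columns
A = rng.standard_normal((100, 1000))

# An m x n matrix has at most min(m, n) singular values,
# so at most min(100, 1000) = 100 of them can be non-zero.
s = np.linalg.svd(A, compute_uv=False)
num_nonzero = int(np.sum(s > 1e-10))
```

A random dense matrix like this one attains the bound; real document-term matrices are often lower rank.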
1. Suppose we have two documents d1 and d2 whose cosine similarity in the original space is 0.6. What is their cosine similarity in the factor space (i.e., the df*ff representation) when
1.1. we decide to retain *all* dimensions?
1.2. we decide to retain just one dimension?
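Cases 1.1 and 1.2 are easy to experiment with on toy data. In this sketch, "df" is taken to be the document-factor matrix U and "ff" the diagonal matrix of singular values S from A = U S Vᵀ; the 5x8 matrix is arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))   # 5 toy documents over 8 terms

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
df_ff = U @ np.diag(s)            # the df*ff representation of the documents

# 1.1 Retain all dimensions: inner products (hence cosines) are unchanged,
# since A A^T = (U S)(U S)^T because V has orthonormal columns.
cos_orig = cosine(A[0], A[1])
cos_full = cosine(df_ff[0], df_ff[1])

# 1.2 Retain one dimension: every document collapses to a single scalar,
# so the cosine between any two (non-zero) documents is +1 or -1.
one_d = df_ff[:, :1]
cos_one = cosine(one_d[0], one_d[1])
```

The same behavior holds whatever the original cosine (0.6 or otherwise): full retention preserves it exactly, and one dimension forces it to ±1.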
2. We considered the "malicious oracle" model where the true documents were distorted by (a) introducing fake terms corresponding to linear combinations of real terms (b) adding noise to these fake terms (so they are not exact linear combinations) and (c) removing the original "true" terms. To what extent is LSI analysis able to recover the original data thus corrupted? Speak specifically of how LSI handles a, b and c parts.
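The malicious-oracle model in question 2 can be simulated end to end. The sizes, the mixing matrix M, and the noise level below are all arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
true_docs = rng.standard_normal((50, 5))   # 50 docs over 5 "true" terms

# (a) Fake terms: linear combinations of the true terms.
M = rng.standard_normal((5, 40))
fake = true_docs @ M
# (b) Add a little noise so the combinations are not exact.
fake += 0.01 * rng.standard_normal(fake.shape)
# (c) The original "true" terms are removed; only `fake` is observed.

s = np.linalg.svd(fake, compute_uv=False)
# The top 5 singular values carry the (rotated) true structure;
# the trailing ones carry only the added noise, so truncating the
# SVD at rank 5 filters (b) while recovering the subspace from (a).
signal, noise = s[:5], s[5:]
```

Note what this does and does not show: LSI recovers the true 5-dimensional subspace up to a rotation, but it cannot undo (c), i.e., it cannot name the original term axes that were removed.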
3. We have documents (data) that are described in the space of 3 keywords. It turns out that the specific documents we have all fall on a 2-D plane. What does LSI on this data tell you?
3.1. In the previous case, suppose the data we have instead forms a 2-D paraboloidal surface. What does LSI do for this case?
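The contrast in question 3 shows up directly in the singular values. A sketch, with an arbitrarily chosen plane (z = 2x - y) and paraboloid (z = x^2 + y^2) as the two surfaces:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Documents lying on a 2-D plane through the origin: z = 2x - y.
planar = np.column_stack([x, y, 2 * x - y])
s_plane = np.linalg.svd(planar, compute_uv=False)

# Documents lying on a paraboloid: z = x**2 + y**2 (a curved surface).
parab = np.column_stack([x, y, x**2 + y**2])
s_parab = np.linalg.svd(parab, compute_uv=False)

# The planar data is exactly rank 2: the third singular value vanishes,
# so LSI finds the plane.  The paraboloidal data is still "2-D" in a
# non-linear sense, but LSI is linear, so all 3 singular values stay
# non-zero and no dimension can be dropped without loss.
```

This is the usual caveat about LSI: it detects linear subspace structure, not curved low-dimensional manifolds.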
rao