Thinking Cap qns on Correlation analysis..
- To: "Rao Kambhampati" <rao@asu.edu>
- Subject: Thinking Cap qns on Correlation analysis..
- From: "Subbarao Kambhampati" <rao@asu.edu>
- Date: Tue, 9 Sep 2008 17:57:43 -0700
- Sender: subbarao2z2@gmail.com
0. (Think for yourself; we will discuss this next class.) What is the "rank" of a matrix? What happens to the rank of a matrix if you pad it with rows that are linear combinations of its existing rows, and with columns that are linear combinations of its existing columns? What could this possibly have to do with LSI and dimensionality reduction?
1. Give a scenario where two words may be close in the context of a query, but not close over the whole document corpus (and vice versa).
What does this tell us about global vs. local thesauri construction?
2. We mentioned that LSI is a special case of LDA. In particular, the most important dimension, as unearthed by LSI, may not be the most important
dimension for separating relevant vs. irrelevant docs w.r.t. a query. Given this, can we give any justification for LSI analysis (as opposed to LDA)?
3. I mentioned that LSI does not capture "nonlinear" dependencies, and mentioned "circular" fish data, i.e., a bunch of artificial fish which, when plotted in the length/width space, will look like a circle (or, if you prefer, a very nice cubic curve, a sine wave, etc.).
3.a. Since we don't quite know up front whether or not the data under consideration has nonlinear dependencies among its dimensions, what is the "risk" we are taking in applying LSI?
3.b. [Tricky] Consider the following offbeat idea. A nonlinear dependency may still be viewed as a "linear" dependency between nonlinear dimensions. This suggests the following highly quixotic idea for dimensionality reduction:
First *explode* the dimensionality by adding all possible higher-order combinations of base dimensions (if you start with x and y dimensions, consider
making up new terms such as x^2, y^2, x*y, etc.).
Now, do LSI in this (vastly higher dimensional) space; pick the top k most important dimensions.
Suppose I do the above for only up to nth order terms (i.e., terms of type x^k y^(j-k) with j less than or equal to n+1). Can you see this idea actually doing
dimensionality reduction, at least conceptually? What do you see as the main computational hurdle?
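
If you want to poke at questions 0 and 3.b before class, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and not something from the lecture: the helper names (matrix_rank, circle_point, explode, num_monomials), the toy 2x2 matrix, and the choice of 11 points on the unit circle as stand-in "fish" data. It uses exact rational arithmetic and plain Gaussian elimination instead of an SVD, so it only reports rank. The link to LSI is that the rank of a matrix equals the number of its nonzero singular values.

```python
from fractions import Fraction
from math import comb


def matrix_rank(rows):
    """Rank of a matrix, by Gaussian elimination over exact rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a nonzero pivot in this column, at or below row `rank`.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Question 0: pad a matrix with linear combinations of its rows and columns.
base = [[1, 2], [3, 4]]
# New row = row0 + 2 * row1.
padded_rows = base + [[a + 2 * b for a, b in zip(base[0], base[1])]]
# New column = 3 * col0 - col1, appended to every row.
padded = [row + [3 * row[0] - row[1]] for row in padded_rows]
rank_base = matrix_rank(base)
rank_padded = matrix_rank(padded)


# Question 3.b: "circular fish" with exact rational coordinates.
def circle_point(t):
    """A rational point (x, y) on the unit circle x^2 + y^2 = 1."""
    t = Fraction(t)
    d = 1 + t * t
    return (1 - t * t) / d, 2 * t / d


def explode(x, y, n):
    """All monomials x^k y^(j-k) with total degree j <= n (constant included)."""
    return [x ** k * y ** (j - k) for j in range(n + 1) for k in range(j + 1)]


def num_monomials(d, n):
    """Number of monomials of total degree <= n in d base dimensions."""
    return comb(n + d, d)


points = [circle_point(t) for t in range(-5, 6)]
linear = [explode(x, y, 1) for x, y in points]     # columns: 1, y, x
quadratic = [explode(x, y, 2) for x, y in points]  # adds y^2, x*y, x^2
rank_linear = matrix_rank(linear)
rank_quadratic = matrix_rank(quadratic)

print("rank before / after padding:", rank_base, rank_padded)
print("linear features:", len(linear[0]), "columns, rank", rank_linear)
print("quadratic features:", len(quadratic[0]), "columns, rank", rank_quadratic)
print("exploded size for 10000 base dimensions, n = 2:", num_monomials(10000, 2))
```

On this toy data the padded matrix keeps rank 2, the linear features 1, y, x have full rank 3, and the six quadratic features have rank 5: the relation x^2 + y^2 = 1 only shows up as a linear dependency once the squared terms exist as columns. The last line gives a feel for how quickly the exploded space grows with the number of base dimensions.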