
A clarification re: starting with Term-Doc vs. Doc-Term matrix...



Folks:
  It seems that I created some confusion by using two different
conventions for the term-document matrix in the examples I used:

The convention used in the Deerwester paper as well as Berry's paper
is that M is a matrix with 

--Terms as rows  and
--Docs as columns

making it a Term-Document matrix


The database--regression example that I discussed in today's class
(as well as earlier) uses the convention of 

--Docs as rows
and 
--Terms as columns

which makes it a Doc-Term matrix

Of course, these two matrices are just transposes of each other, and
taking the SVD of one rather than the other essentially interchanges
the U and V matrices (the singular values stay the same).
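
As a quick numerical check of the transpose relationship, here is a
minimal NumPy sketch. The 3-document, 4-term matrix is made up for
illustration (it is not one of the class examples), and the variable
names are assumptions.

```python
import numpy as np

# Hypothetical counts: 3 documents (rows) x 4 terms (columns).
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
])
term_doc = doc_term.T  # same data, Term-Document convention

# Thin SVD of each convention; NumPy returns V transposed (Vt).
U_dt, s_dt, Vt_dt = np.linalg.svd(doc_term, full_matrices=False)
U_td, s_td, Vt_td = np.linalg.svd(term_doc, full_matrices=False)

# The singular values are the same in both conventions.
same_singular_values = np.allclose(s_dt, s_td)

# Interchanging U and V of the doc-term SVD rebuilds the term-doc matrix.
rebuilt_term_doc = Vt_dt.T @ np.diag(s_dt) @ U_dt.T
swap_reconstructs = np.allclose(rebuilt_term_doc, term_doc)

print(same_singular_values, swap_reconstructs)
```

Individual columns of U and V may differ in sign between the two calls,
which is why the check compares reconstructions rather than the raw
factors.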

The conventions do, however, introduce slight differences in the way
you interpret the SVD.

Since the semantics of U, S, V change depending on whether you are
considering the D-T or the T-D matrix, it is good to think in terms of
three matrices:

doc-factor matrix: which decomposes documents into the factor directions
  
term-factor matrix: which decomposes the terms into factor directions

factor-factor matrix: which gives the scaling factors for the factor
                       dimensions

 Once you know these:
 
 The coordinates of the new axis vectors (w.r.t. the old term dimensions)
are given by the term-factor matrix.

 The coordinates of the documents w.r.t. the new axes are given by:
    doc-factor X factor-factor


If you start with the doc-term matrix, as in the database example, then
SVD gives (with ' denoting transpose)

     doc-term = doc-factor X factor-factor X (term-factor)'


If you start with the term-doc matrix, as in the medline example, it gives

   term-doc = term-factor X factor-factor X (doc-factor)'
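
The two decompositions above can be sketched in NumPy as follows. The
helper `named_factors` and the small matrix are hypothetical, chosen
only to show which of U and V plays the doc-factor and term-factor
role under each convention.

```python
import numpy as np

def named_factors(matrix, rows_are_docs):
    """Return (doc_factor, factor_factor, term_factor) for either convention.

    Hypothetical helper: rows_are_docs=True means a doc-term matrix,
    rows_are_docs=False means a term-doc matrix.
    """
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    factor_factor = np.diag(s)  # scaling factors on the diagonal
    if rows_are_docs:
        return U, factor_factor, Vt.T   # U is doc-factor, V is term-factor
    return Vt.T, factor_factor, U       # V is doc-factor, U is term-factor

# Made-up data: 3 documents (rows) x 4 terms (columns).
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
])

df1, ff1, tf1 = named_factors(doc_term, rows_are_docs=True)
df2, ff2, tf2 = named_factors(doc_term.T, rows_are_docs=False)

# doc-term = doc-factor X factor-factor X (term-factor)'
ok_doc_term = np.allclose(df1 @ ff1 @ tf1.T, doc_term)
# term-doc = term-factor X factor-factor X (doc-factor)'
ok_term_doc = np.allclose(tf2 @ ff2 @ df2.T, doc_term.T)

# Document coordinates w.r.t. the new axes: doc-factor X factor-factor.
doc_coords = df1 @ ff1
# Equivalently, project the original rows onto the term-factor axes.
ok_projection = np.allclose(doc_term @ tf1, doc_coords)

print(ok_doc_term, ok_term_doc, ok_projection)
```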


Once you know which matrices are which, you know which gives the term
directions and which gives the document coordinates, and how to convert
the query into the new axes. 
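
One possible reading of the query conversion, offered as a sketch
rather than the definitive recipe: treat the query as a row of term
weights and multiply it by the term-factor matrix, which puts it in the
same coordinates as doc-factor X factor-factor. The data, the choice of
k = 2 factors, and the helper names `to_new_axes` and `cosine` are all
assumptions; some LSI write-ups additionally divide the result by the
singular values.

```python
import numpy as np

# Made-up data: 3 documents (rows) x 4 terms (columns).
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
])

U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
doc_factor, factor_factor, term_factor = U, np.diag(s), Vt.T

k = 2  # number of factors kept (arbitrary, for illustration)
doc_coords = (doc_factor @ factor_factor)[:, :k]

def to_new_axes(query_terms):
    """Project a term-weight vector onto the first k term-factor axes."""
    return query_terms @ term_factor[:, :k]

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query with the same term weights as document 2 should land exactly
# on that document's coordinates.
query = doc_term[2].copy()
query_coords = to_new_axes(query)

similarities = [cosine(query_coords, d) for d in doc_coords]
best_doc = int(np.argmax(similarities))
print(best_doc, similarities)
```

Starting from the term-doc matrix instead changes only which SVD output
is labeled term-factor; the projection step itself is the same.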

Sorry for the confusion.

regards
Rao
[Sep 11, 2002]