A clarification re: starting with Term-Doc vs. Doc-Term matrix...
Folks:
It seems that I created some confusion by using two different
conventions for the term-document matrix in the examples I used:
The convention used in the Deerwester paper as well as Berry's paper
is that M is a matrix with
--Terms as rows and
--Docs as columns
making it a Term-Document matrix
The database--regression example that I discussed in today's class
(as well as earlier) uses the convention of
--Docs as rows
and
--Terms as columns
which makes it a Doc-Term matrix
Of course, these two matrices are just transposes of each
other, and their SVDs differ only in that the U and V matrices
are interchanged.
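A minimal sketch of this transpose relationship, assuming NumPy and a made-up toy matrix (the counts below are purely illustrative, not from either paper). It checks that if term-doc = U S V', then doc-term = V S U', and that the singular values are the same either way:

```python
import numpy as np

# Hypothetical toy counts: 4 terms as rows, 3 docs as columns.
term_doc = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])
doc_term = term_doc.T

# Reduced SVD of the term-doc matrix: term_doc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Transposing the product swaps the outer factors: doc_term = V @ diag(s) @ U'
rebuilt_doc_term = Vt.T @ np.diag(s) @ U.T

# Singular values computed directly from the doc-term orientation.
s_from_doc_term = np.linalg.svd(doc_term, compute_uv=False)

print(np.allclose(rebuilt_doc_term, doc_term))
print(np.allclose(s, s_from_doc_term))
```

Note that an SVD routine run separately on the two orientations may flip the sign of individual factor columns, so the sketch compares reconstructions rather than U and V entry by entry.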
---The conventions do, however, introduce slight differences in the way
you interpret the SVD.
Since the semantics of U, S, V change depending on whether you are
considering the D-T or the T-D matrix, it is good to think in terms of three
matrices:
doc-factor matrix: which decomposes documents into the factor directions
term-factor matrix: which decomposes the terms into factor directions
factor-factor matrix: which gives the scaling factors for the factor
dimensions
Once you know these:
The coordinates of the new axis vectors (w.r.t. the old term dimensions)
are given by the term-factor matrix.
The coordinates of the documents w.r.t. the new axes are given by:
doc-factor X factor-factor
If you start with the doc-term matrix, as in the database example, then SVD gives
doc-term = doc-factor X factor-factor X (term-factor)'
If you start with the term-doc matrix, as in the medline example, then SVD gives
term-doc = term-factor X factor-factor X (doc-factor)'
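To make the naming concrete, here is a small sketch, again assuming NumPy and a hypothetical toy matrix. It labels the SVD outputs as doc-factor, factor-factor and term-factor for each starting orientation, then computes the document coordinates as doc-factor X factor-factor:

```python
import numpy as np

# Hypothetical toy counts: 3 docs as rows, 4 terms as columns.
doc_term = np.array([
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
])
term_doc = doc_term.T

# Starting from doc-term: U is doc-factor, V is term-factor.
U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
doc_factor, factor_factor, term_factor = U, np.diag(s), Vt.T

# Starting from term-doc: the roles swap, U is term-factor, V is doc-factor.
U2, s2, Vt2 = np.linalg.svd(term_doc, full_matrices=False)
term_factor2, factor_factor2, doc_factor2 = U2, np.diag(s2), Vt2.T

# Document coordinates w.r.t. the new axes: doc-factor X factor-factor.
doc_coords = doc_factor @ factor_factor
doc_coords2 = doc_factor2 @ factor_factor2

# Both namings reproduce their own starting matrix.
print(np.allclose(doc_factor @ factor_factor @ term_factor.T, doc_term))
print(np.allclose(term_factor2 @ factor_factor2 @ doc_factor2.T, term_doc))

# Projecting the documents onto the term-factor axes gives the same coordinates.
print(np.allclose(doc_term @ term_factor, doc_coords))
```

The two routes give the same document coordinates up to a possible sign flip per factor column, which is why it helps to track the matrices by role rather than by the letters U and V.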
Once you know which matrices are which, you know which are the term
directions and which are the document coordinates, and how to convert
the query into the new axes.
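The message does not spell out the query conversion, so the following is only one plausible reading. It assumes the query is treated as a pseudo-document (a row of term weights) and projected onto the columns of the term-factor matrix, which puts it in the same coordinates as doc-factor X factor-factor. The helper names `to_factor_space` and `cosine` and the toy data are hypothetical:

```python
import numpy as np

# Hypothetical toy counts: 3 docs as rows, 4 terms as columns.
doc_term = np.array([
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 1.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
doc_factor, factor_factor, term_factor = U, np.diag(s), Vt.T
doc_coords = doc_factor @ factor_factor


def to_factor_space(query_terms):
    """Project a term-weight vector onto the new axes (assumed convention)."""
    return query_terms @ term_factor


def cosine(a, b):
    """Cosine similarity between two coordinate vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# A query that uses exactly the terms of document 0.
query = np.array([1.0, 0.0, 3.0, 0.0])
query_coords = to_factor_space(query)

scores = [cosine(query_coords, d) for d in doc_coords]
best_doc = int(np.argmax(scores))
print(best_doc, scores)
```

A query identical to a document lands on that document's coordinates under this convention. The fold-in formula in the Deerwester paper additionally scales by the inverse of the factor-factor matrix, which places the query among the rows of the doc-factor matrix instead.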
Sorry for the confusion.
regards
Rao
[Sep 11, 2002]