
Re: scalar clustering and co-currence




elvis> I'm wondering about the equation
elvis> 
elvis> n*(t1/n*t2/n) = m  or close to m...
elvis> 
elvis> I see THAT it works but I don't see exactly WHY.
elvis> what is this relationship t1/n*t2/n ?
elvis> To clarify in english (sorry but it's my main lang. ; ) )
elvis> why is it that the percentage of docs that contain t1 * the percentage that
elvis> contain t2 is approx equal to the percentage that contain both??? The
elvis> product confuses me..... grrr
elvis> 

elvis> 
elvis> 

Think in terms of probabilities.
Let P(T1) be the probability that a random document contains t1.
Clearly P(T1) is approximately t1/n (if the set of n documents is
sufficiently large).

Similarly P(T2) = t2/n

We want to compute the probability 

P(T1 & T2)

If T1 and T2 are independent--that is they appear independently of
each other and have no correlation, then

P(T1 & T2) = P(T1) * P(T2)
           = t1/n * t2/n

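Thus, when T1 and T2 are independent, the fraction of docs
containing both is just the product of the fractions of docs
containing each. Multiplying that fraction by n gives the expected
number of docs containing both, n*(t1/n*t2/n), which is the m in
your equation.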
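
To see the product rule at work, here is a small Python sketch. It is only an illustration under assumptions that are mine, not from the thread: the function names (expected_both, simulate) and the probabilities 0.2 and 0.3 are made up, and each document is assumed to contain each term independently.

```python
import random

def expected_both(t1_count, t2_count, n):
    """Expected number of docs containing both terms, assuming independence."""
    return n * (t1_count / n) * (t2_count / n)

def simulate(n, p1, p2, seed=0):
    """Generate n hypothetical docs where t1 and t2 occur independently.

    Returns the counts (t1, t2, both) observed in the simulated collection.
    """
    rng = random.Random(seed)
    t1 = t2 = both = 0
    for _ in range(n):
        has_t1 = rng.random() < p1
        has_t2 = rng.random() < p2  # drawn independently of has_t1
        t1 += has_t1
        t2 += has_t2
        both += has_t1 and has_t2
    return t1, t2, both

if __name__ == "__main__":
    n = 100_000
    t1, t2, both = simulate(n, p1=0.2, p2=0.3)
    print("observed m:", both)
    print("predicted m:", expected_both(t1, t2, n))
```

With independent terms the observed and predicted values should land close to each other, and the gap shrinks (relative to m) as n grows.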

If they are not independent, then the more general
rule is needed

P(T1 & T2) = P(T1|T2) * P(T2)   = P(T2|T1) * P(T1)

This general rule takes the correlations into account.

Specifically, P(T1|T2) = P(T1) if T1 is independent of T2

                       < P(T1) if appearance of T2 reduces the
                                probability of appearance of T1
                                (negatively correlated)

                       > P(T1) if appearance of T2 increases the
                                probability of appearance of T1
                                (positively correlated)
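
The three cases can be checked directly from document counts. The following is a rough Python sketch, not a definitive test: the function name correlation_sign, the tolerance, and the example counts are all hypothetical, and P(T1|T2) is estimated as (docs with both) / (docs with t2).

```python
import math

def correlation_sign(t1_count, t2_count, both_count, n, rel_tol=1e-9):
    """Compare the estimate of P(T1|T2) with P(T1).

    Returns "independent", "negative", or "positive".
    Assumes t2_count > 0 so that the conditional estimate is defined.
    """
    p_t1 = t1_count / n
    p_t1_given_t2 = both_count / t2_count
    if math.isclose(p_t1_given_t2, p_t1, rel_tol=rel_tol):
        return "independent"
    return "negative" if p_t1_given_t2 < p_t1 else "positive"

if __name__ == "__main__":
    # Hypothetical counts: 1000 docs, 200 contain t1, 300 contain t2.
    print(correlation_sign(200, 300, 60, 1000))   # P(T1|T2) = 0.2 = P(T1)
    print(correlation_sign(200, 300, 10, 1000))   # P(T1|T2) < P(T1)
    print(correlation_sign(200, 300, 150, 1000))  # P(T1|T2) > P(T1)
```

On real data the counts are noisy, so exact equality is rare; a looser tolerance or a significance test would be the more practical choice.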

Rao