
probabilistic programming resource(s)+ Collapsed Gibbs cliffs notes..



Here is a site that tracks recent progress on probabilistic programming (it also links to the NIPS workshop site I showed in class):

http://probabilistic-programming.org/wiki/Home

The site lists some initial frameworks that are already available. There isn't, as yet, any single agreed-upon general framework. Different ones make
different tradeoffs in terms of the models they support and the priors they allow. See, for example, the framework called Church at
http://projects.csail.mit.edu/church/wiki/Church

Like I said, this is still a very active area, and part of the issue is that inference over graphical models is not always tractable; these tractability
issues span different levels.

The easiest case is when the posterior is completely analytically expressible (i.e., it can be expressed in closed form in terms of known elementary functions).
For example, Bayesian inference/learning in the coin-toss example had the nice property
that if you start with a Beta prior, then the posterior is just another Beta whose hyper-parameters are incremented appropriately by the samples seen.
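To make that concrete, here is a tiny Python sketch of the conjugate Beta-Bernoulli update (the function name and the numbers are made up purely for illustration):

# Beta-Bernoulli conjugacy: a Beta(a, b) prior plus Bernoulli coin tosses
# yields a Beta posterior whose hyper-parameters are just incremented counts.
def beta_posterior(a, b, tosses):
    heads = sum(tosses)
    tails = len(tosses) - heads
    return a + heads, b + tails   # posterior is Beta(a + heads, b + tails)

# e.g., a uniform Beta(1, 1) prior and 7 heads in 10 tosses:
a_post, b_post = beta_posterior(1, 1, [1, 1, 0, 1, 1, 1, 0, 0, 1, 1])
# a_post, b_post == 8, 4; the posterior mean estimate of P(heads) is 8/12 ~ 0.67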

The next difficulty arises when the posterior doesn't have a simple closed-form solution (in the literature these are called "intractable").
That is the case with LDA. However, the posterior can be approximated via
sampling methods such as MCMC. To do this, however, the Gibbs sampling distribution P(node | its Markov blanket) should have an analytical (closed-form) solution. This is the
case for LDA, since with collapsed Gibbs sampling we had a way of computing P(zi=k | rest of the z's, and the words).
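For reference, that closed-form conditional looks roughly like this (in the usual smoothed-LDA notation, which may differ slightly from Heinrich's; the counts n exclude the token currently being resampled):

p(z_i = k \mid \vec z_{\neg i}, \vec w) \;\propto\; (n_{d,k}^{\neg i} + \alpha) \cdot \frac{n_{k,w_i}^{\neg i} + \beta}{n_k^{\neg i} + V\beta}

where n_{d,k} counts topic-k assignments in document d, n_{k,w_i} counts assignments of word w_i to topic k over the whole corpus, n_k is the total number of tokens assigned to topic k, and V is the vocabulary size.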

The next level of difficulty comes when not only is the posterior intractable, but the Gibbs sampling distribution is also intractable (i.e., has no closed-form solution). This happens, for example, if you have Bayes networks with more complicated structure than LDA, where nodes have multiple parents and the parents are hidden. A case in point is deep belief networks.
[It may be surprising to hear that LDA is "simple". There are two orthogonal notions of complexity of a graphical model: how BIG it is in terms of the number of nodes in the network, and how complex its topology is (in terms of the number of parents any given node has). LDA is big in terms of the number of nodes, but it is pretty simple in terms of topology--the maximum number of parents any node has is two: words depend on the topic and the topic distribution. Even this is not really two parents, but one parent indexed by the other parent. So if you see statements in the literature that LDA is a simple graphical model, you know they are talking about the topology.]

=======

If you are interested in understanding the derivation of collapsed Gibbs sampling, the following limited Cliffs Notes might help:

Hints on the collapsed Gibbs sampling derivation in the Heinrich paper on Parameter Estimation for Text Analysis
(URL: http://faculty.cs.byu.edu/~ringger/CS601R/papers/Heinrich-GibbsLDA.pdf)

The collapsed Gibbs sampler for LDA is in section 5.5. 

In Equations 66-68 and 70-73, he first integrates out the two continuous variables, getting p(z|alpha) and p(w|z,beta).

In doing so, he uses the properties of Dirichlet integrals. The resulting distribution p(w|z) is in terms of Delta terms, which themselves are defined in terms of Gamma-function terms (as given in Equation 40).
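Roughly, in his notation (check Equations 40 and 73 in the paper for the exact form), the Delta function and the resulting marginalized joint are

\Delta(\vec\alpha) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\left(\sum_k \alpha_k\right)}, \qquad
p(\vec w, \vec z \mid \vec\alpha, \vec\beta) = \prod_{k=1}^{K} \frac{\Delta(\vec n_k + \vec\beta)}{\Delta(\vec\beta)} \prod_{m=1}^{M} \frac{\Delta(\vec n_m + \vec\alpha)}{\Delta(\vec\alpha)}

where \vec n_k collects the per-word counts for topic k and \vec n_m the per-topic counts for document m.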

Note that when you marginalize out theta (which intercedes between alpha and z), you make the z's, which were conditionally independent given theta, correlated. So it is important to note that the graphical model corresponding to the marginalized distribution doesn't look like the simple LDA plate model--you in essence have to put edges between the z's.
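A quick way to see this coupling: after integrating out theta, the z's within a document follow a Polya-urn scheme. For instance, with a symmetric Dirichlet(alpha) prior over K topics (K > 1),

p(z_2 = k \mid z_1 = k) = \frac{\alpha + 1}{K\alpha + 1} \;>\; \frac{1}{K} = p(z_2 = k)

so knowing one z shifts your belief about the others.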

Next, in Equations 74-78, he computes the Gibbs sampler distribution for each zi in terms of the other z's and the words. Here he doesn't assume any conditional independences between the z's, but rather uses Bayes rule directly. Equation 74 gives him the required distribution in terms of the marginalized joint over w,z that he computed in Equation 73.
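Schematically, that Bayes-rule step is just

p(z_i \mid \vec z_{\neg i}, \vec w) = \frac{p(\vec z, \vec w)}{p(\vec z_{\neg i}, \vec w)} \propto p(\vec z, \vec w)

viewed as a function of z_i alone (the denominator is a constant once z_{\neg i} and w are fixed), so the numerator is exactly the marginalized joint from Equation 73.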

Equation 74 is then simplified using Gamma-function properties, and you find out that the distribution can be computed just in terms of counts. This is not always possible for all models; part of the real reason for LDA's popularity is that collapsed Gibbs works out so cleanly for it.
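If you want to see how "just in terms of counts" plays out, here is a minimal Python sketch (variable names are mine, and it assumes the current token's old assignment has already been subtracted from the counts) of resampling one token's topic:

import random

def sample_topic(d, w, K, V, alpha, beta, n_dk, n_kw, n_k):
    # n_dk[d][k]: topic-k tokens in document d; n_kw[k][w]: word w assigned to topic k;
    # n_k[k]: total tokens assigned to topic k (all counts exclude the current token)
    weights = [(n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
               for k in range(K)]
    r = random.random() * sum(weights)
    for k, wt in enumerate(weights):
        r -= wt
        if r <= 0:
            return k
    return K - 1   # guard against floating-point round-off

A full sweep just loops this over every token, decrementing the counts before the draw and incrementing them again afterward.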


Finally, given the posterior p(z|w), he recovers p(theta|z) in Equations 81-82.
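The resulting point estimates (again roughly, in the standard notation; see Equations 81-82 for the exact form) are just smoothed relative counts:

\hat\theta_{d,k} = \frac{n_{d,k} + \alpha}{\sum_{k'} n_{d,k'} + K\alpha}, \qquad
\hat\varphi_{k,w} = \frac{n_{k,w} + \beta}{\sum_{w'} n_{k,w'} + V\beta}

i.e., the per-document topic proportions and the per-topic word distributions read off from the final count tables.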

Rao