
\documentclass{article}
\usepackage{./fleqn}
\usepackage{epsf}
\usepackage{../Tex-style/aima2e-slides}
\usepackage[active]{../Tex-style/srcltx}

\begin{document}

\begin{huge}
\titleslide{Statistical learning}{Chapter 20 of ``AI: A Modern Approach'', Sections 1--3\\Presented by: Jicheng Zhao}

\sf

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\blob Learn probabilistic theories of the world from experience

\blob We focus on the learning of Bayesian networks

\blob More specifically, input \emph{data} (or \emph{evidence}), learn probabilistic theories of the world (or \emph{hypotheses})

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\blob Bayesian learning $\Leftarrow$

\blob Approximate Bayesian learning\nl
-- Maximum {\em a posteriori} learning (MAP)\nl
-- Maximum likelihood learning (ML)

\blob Parameter learning with complete data\nl
-- ML parameter learning with complete data in \emph{discrete} models\nl
-- ML parameter learning with complete data in \emph{continuous} models (linear regression)\nl
-- Naive Bayes models\nl
-- Bayesian parameter learning

\blob Learning Bayes net structure with complete data

\note{ {\em (If time allows)} }

\blob Learning with hidden variables or incomplete data (EM algorithm)

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View learning as Bayesian updating of a probability distribution\\
over the \defn{hypothesis space}

\mat{$H$} is the hypothesis variable, values \mat{$h_1,h_2,\ldots$}, prior \mat{$\pv(H)$}

\mat{$j$}th observation \mat{$d_j$} gives the outcome of random variable \mat{$D_j$}\\
training data \mat{$\d \eq d_1,\ldots,d_{\Ncount}$}

Given the data so far, each hypothesis has a posterior probability:
\mat{$P(h_i|\d) = \alpha P(\d|h_i) P(h_i)$}%
where \mat{$P(\d|h_i)$} is called the \defn{likelihood}

Predictions use a likelihood-weighted average over all hypotheses:
\mat{$\pv(X|\d) = \mysum_i\ \pv(X|\d,h_i) P(h_i|\d) = \mysum_i\ \pv(X|h_i) P(h_i|\d)$}%
No need to pick one best-guess hypothesis!

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Suppose there are five kinds of bags of candies:\al
10\% are \mat{$h_1$}: 100\% cherry candies\al
20\% are \mat{$h_2$}: 75\% cherry candies + 25\% lime candies\al
40\% are \mat{$h_3$}: 50\% cherry candies + 50\% lime candies\al
20\% are \mat{$h_4$}: 25\% cherry candies + 75\% lime candies\al
10\% are \mat{$h_5$}: 100\% lime candies

\vspace*{0.2in}

\epsfxsize=0.6\textwidth
\fig{\file{figures}{candy-kinds.ps}}

Then we observe candies drawn from some bag: \epsfxsize=0.3\textwidth \epsffile{\file{figures}{candy-obs.ps}}

What kind of bag is it? What flavour will the next candy be?
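The Bayesian update and prediction for this candy example can be sketched in a few lines of Python (an illustrative sketch; the priors and per-bag lime probabilities are the ones given above):

```python
# Bayesian updating for the five candy-bag hypotheses from the slide.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h_1) .. P(h_5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_i)

def update(posterior, observation):
    """One Bayesian step: P(h_i|d) = alpha * P(d|h_i) * P(h_i)."""
    likelihood = [p if observation == "lime" else 1 - p for p in p_lime]
    unnorm = [lk * h for lk, h in zip(likelihood, posterior)]
    alpha = 1 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(posterior):
    """P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * h for p, h in zip(p_lime, posterior))

post = priors
for _ in range(10):                      # observe 10 limes in a row
    post = update(post, "lime")
# h_5 (the all-lime bag) now dominates the posterior,
# and the predicted probability of another lime is close to 1.
```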

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\vspace*{0.2in}

\epsfxsize=0.9\textwidth
\fig{\file{graphs}{limes-bayes-post.eps}}

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\vspace*{0.2in}

\epsfxsize=0.9\textwidth
\fig{\file{graphs}{limes-bayes-pred.eps}}

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{enumerate}
\item {\em The true hypothesis \defn{eventually} dominates the Bayesian prediction}, provided that
the true hypothesis has nonzero prior probability
\item The Bayesian prediction is {\em optimal}, whether the data set is small or large
\end{enumerate}
On the other hand
\begin{enumerate}

\item The hypothesis space is usually very large or infinite\\
summing over the hypothesis space is often intractable\\
(e.g., 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

\end{enumerate}

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\defn{Maximum a posteriori} (MAP) learning: choose \mat{$h_{{\rm MAP}}$} maximizing \mat{$P(h_i|\d)$}
instead of calculating \mat{$P(h_i|\d)$} for all hypotheses \mat{$h_i$}

I.e., maximize \mat{$P(\d|h_i)P(h_i)$} or \mat{$\log P(\d|h_i) + \log P(h_i)$}

Overfitting in MAP and Bayesian learning
\begin{itemize}
\item Overfitting occurs when the hypothesis space is too expressive, so that many hypotheses fit the data set well
\item Use {\em prior} to penalize complexity
\end{itemize}

Log terms can be viewed as (negative of)\al
\note{bits to encode data given hypothesis} + \note{bits to encode hypothesis} \\
This is the basic idea of \defn{minimum description length} (MDL) learning

For deterministic hypotheses (simplest), \mat{$P(\d|h_i)$} is 1 if consistent, 0 otherwise\nl
$\implies$ MAP = simplest hypothesis that is consistent with the data % (cf.~science)

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For large data sets, the prior becomes irrelevant

\defn{Maximum likelihood} (ML) learning: choose \mat{$h_{{\rm ML}}$} maximizing \mat{$P(\d|h_i)$}

(which is reasonable if all hypotheses are of the same complexity)

ML is the ``standard'' (non-Bayesian) statistical learning method
\begin{enumerate}
\item Researchers distrust the subjective nature of hypothesis priors
\item Hypotheses are of the same complexity
\item Hypothesis priors matter less when the data set is large
\item The hypothesis space is huge
\end{enumerate}

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\blob Bayesian learning

\blob Approximate Bayesian learning\nl
-- Maximum {\em a posteriori} learning (MAP)\nl
-- Maximum likelihood learning (ML)

\blob Parameter learning with complete data $\Leftarrow$\nl
-- ML parameter learning with complete data in \emph{discrete} models\nl
-- ML parameter learning with complete data in \emph{continuous} models (linear regression)\nl
-- Naive Bayes models\nl
-- Bayesian parameter learning

\blob Learning Bayes net structure with complete data

\note{ {\em (If time allows)} }

\blob Learning with hidden variables or incomplete data (EM algorithm)

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{ML parameter learning in Bayes nets}

Bag from a new manufacturer; fraction \mat{$\theta$} of cherry candies?\hfill\epsfxsize=1.25in\raisebox{-1.25in}[0pt][0pt]{\epsffile{\file{figures}{ml-network1.ps}}} \\
Any \mat{$\theta$} is possible: continuum of hypotheses \mat{$h_{\theta}$}\\
\mat{$\theta$} is a \defn{parameter} for this simple (\note{binomial}) family of models

We assume all hypotheses are equally likely {\em a priori}\nl $\Rightarrow$  ML approach

Suppose we unwrap \mat{$\Ncount$} candies, \mat{$c$} cherries and \mat{$\ell\eq \Ncount-c$} limes\\
These are \defn{i.i.d.} (independent, identically distributed) observations, so
\mat{$P(\data |h_{\theta}) = \prod_{j\eq 1}^{\Ncount} P(\datum_j|h_{\theta}) = \theta^c\cdot (1-\theta)^{\ell}$}%
Maximize this w.r.t.~\mat{$\theta$}---which is easier for the \defn{log-likelihood}:
\mat{\begin{eqnarray*}
L(\data |h_{\theta}) &=& \log P(\data |h_{\theta}) = \sum_{j\eq 1}^{\Ncount} \log P(\datum_j|h_{\theta}) =
c\log\theta +  \ell\log(1-\theta) \\
\frac{dL(\data |h_{\theta})}{d\theta} &=& \frac{c}{\theta} -
\frac{\ell}{1-\theta} = 0
\quad\Rightarrow\quad \theta = \frac{c}{c+\ell} = \frac{c}{\Ncount}
\end{eqnarray*}}%
Seems sensible, but causes problems with 0 counts!

{\em This means that if the data set is small enough that some events have not yet been observed, the ML hypothesis assigns zero probability to those events.} A standard trick for dealing with this is to initialize the count for each event to 1 instead of 0.

ML approach:
\begin{enumerate}
\item Write down an expression for the likelihood of the data as a function of the parameter(s);
\item Write down the derivative of the log likelihood with respect to each parameter;
\item Find the parameter values such that the derivatives are zero.
\end{enumerate}
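For this binomial model the three steps reduce to $\theta = c/\Ncount$; a minimal Python check (the data counts are made up for illustration) confirms the closed form against a brute-force maximization of the log-likelihood:

```python
import math

# ML estimate for the binomial candy model: theta = c / N.
def ml_theta(c, N):
    return c / N

# Log-likelihood L(theta) = c*log(theta) + (N - c)*log(1 - theta).
def log_likelihood(theta, c, N):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

c, N = 3, 10
# Brute-force search over a fine grid of theta values.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, N))
# best agrees (numerically) with the closed-form answer c/N = 0.3
```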

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Red/green wrapper depends probabilistically on flavor: \hspace*{0.75in}\epsfxsize=2.25in\raisebox{-2.5in}[0pt][0pt]{\epsffile{\file{figures}{ml-network2.ps}}}

Likelihood  for, e.g., cherry candy in green wrapper:
\mat{\begin{eqnarray*}
\lefteqn{P(\J{F}\eq \J{cherry},\J{W}\eq \J{green}|h_{\theta,\theta_1,\theta_2})}\\
& = & P(\J{F}\eq \J{cherry}|h_{\theta,\theta_1,\theta_2})P(\J{W}\eq
\J{green}|\J{F}\eq \J{cherry},h_{\theta,\theta_1,\theta_2}) \\
& = & \theta \cdot (1-\theta_1)
\end{eqnarray*}}%
\mat{$\Ncount$} candies, \mat{$r_c$} red-wrapped cherry candies, etc.:
\mat{\begin{eqnarray*}
P(\data|h_{\theta,\theta_1,\theta_2}) &=&
\theta^c (1-\theta)^{\ell} \cdot \theta_1^{r_c}(1-\theta_1)^{g_c}
\cdot \theta_2^{r_{\ell}}(1-\theta_2)^{g_{\ell}}
\end{eqnarray*}}%
\mat{\begin{eqnarray*}
L &=&  [c\log \theta + \ell\log (1-\theta) ]  \\
&+&   [r_c\log \theta_1 + g_c\log(1-\theta_1) ]  \\
&+&   [r_{\ell}\log \theta_2 + g_{\ell}\log(1-\theta_2) ]
\end{eqnarray*}}%

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Derivatives of \mat{$L$} contain only the relevant parameter:
\mat{$\begin{array}{rclcl} \displaystyle \frac{\partial L}{\partial\theta} &=& \displaystyle \frac{c}{\theta} - \displaystyle\frac{\ell}{1-\theta} = 0 & \qquad \implies & \theta = \displaystyle \frac{c}{c+\ell} \end{array}$}%
\mat{$\begin{array}{rclcl} \displaystyle \frac{\partial L}{\partial\theta_1} &=& \displaystyle\frac{r_c}{\theta_1} - \displaystyle\frac{g_c}{1-\theta_1} = 0 & \qquad \implies & \theta_1 = \displaystyle\frac{r_c}{r_c+g_c} \end{array}$}%
\mat{$\begin{array}{rclcl} \displaystyle \frac{\partial L}{\partial\theta_2} &=& \displaystyle\frac{r_{\ell}}{\theta_2} - \displaystyle\frac{g_{\ell}}{1-\theta_2} = 0 & \qquad \implies & \theta_2 = \displaystyle\frac{r_{\ell}}{r_{\ell}+g_{\ell}} \end{array}$}%
With \defn{complete data}, \emph{parameters can be learned separately}

\emph{The parameter values for a variable depend only on the observations of that variable and its parents}
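With complete data this decomposition means each parameter is a simple ratio of counts; a small sketch (the data set below is hypothetical):

```python
# Each CPT parameter is estimated separately from its own counts.
# Hypothetical complete data: (flavor, wrapper) pairs.
data = ([("cherry", "red")] * 60 + [("cherry", "green")] * 15 +
        [("lime", "red")] * 5 + [("lime", "green")] * 20)

c  = sum(1 for f, w in data if f == "cherry")            # cherry count
rc = sum(1 for f, w in data if (f, w) == ("cherry", "red"))
rl = sum(1 for f, w in data if (f, w) == ("lime", "red"))
l  = len(data) - c                                       # lime count

theta  = c / len(data)    # P(F = cherry)          = c / (c + l)
theta1 = rc / c           # P(W = red | F = cherry) = r_c / (r_c + g_c)
theta2 = rl / l           # P(W = red | F = lime)   = r_l / (r_l + g_l)
```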

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Hypothesis: learning the parameters ($\mu$ and $\sigma$) of a Gaussian density function on a single variable

Data: data generated from this distribution: $x_1, \ldots, x_N$.
\mat{$P(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$}

The log likelihood is
\mat{$L = \sum_{j=1}^{N} \log \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x_j-\mu)^2}{2 \sigma^2}} = N(-\log \sqrt{2 \pi} -\log \sigma ) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2 \sigma^2}$}

Setting the derivatives to zero
\begin{eqnarray}
&\Rightarrow& \mu = \frac{\sum_{j} x_j}{N} \nonumber\\
&\Rightarrow&  \sigma = \sqrt{\frac{\sum_j {(x_j -\mu)^2}}{N}}
\end{eqnarray}
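The resulting estimators are just the sample mean and the ($1/N$, i.e.\ biased) standard deviation; a minimal sketch with made-up data:

```python
import math

# ML estimates for a single-variable Gaussian: the sample mean and
# the 1/N (biased) standard deviation, as derived above.
def gaussian_ml(xs):
    N = len(xs)
    mu = sum(xs) / N
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / N)
    return mu, sigma

mu, sigma = gaussian_ml([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mu = 5.0 and sigma = 2.0 for this data set
```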

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\twofig{\file{graphs}{regression-model.eps}}{\file{graphs}{regression-sample.eps}}

Maximizing \mat{$\displaystyle P(y|x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(y-(\theta_1 x+\theta_2))^2}{2\sigma^2}}$} w.r.t. \mat{$\theta_1$}, \mat{$\theta_2$}

= minimizing \mat{$\displaystyle E = \sum_{j\eq 1}^{\Ncount} (y_j-(\theta_1 x_j+\theta_2))^2$}

That is, minimizing the sum of squared errors gives the ML solution\\
for a linear fit \emph{assuming Gaussian noise of fixed variance}
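The sum-of-squared-errors minimization has the usual closed form; a sketch (the tiny data set is invented to make the fit exact):

```python
# Closed-form least-squares fit y ~ theta1*x + theta2, which is the
# ML solution under Gaussian noise of fixed variance.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    theta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
    theta2 = my - theta1 * mx
    return theta1, theta2

t1, t2 = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])  # data on the line y = 2x + 1
```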

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{itemize}
\item Observation: Attributes of one example
\item Hypothesis: The class this example belongs to
\end{itemize}
All attributes are conditionally independent of each other, given the class.
\mat{$P(C |x_1, \cdots, x_n) = \alpha P(C) \prod_{i} P(x_i|C).$}%

Prediction: choose the most likely class

\begin{itemize}
\item Simple: $2n+1$ parameters, no need to search for $h_{ML}$
\item Works surprisingly well in a wide range of applications
\item Can deal with noisy data and can give probabilistic predictions
\end{itemize}
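A minimal naive Bayes classifier over Boolean attributes, as a sketch (the spam/ham data and all names are invented for illustration; parameters are the ML count ratios):

```python
from collections import defaultdict

def train(examples):
    """examples: list of (attributes_tuple, class_label)."""
    classes = defaultdict(int)            # class -> count
    cond = defaultdict(int)               # (class, attr_index, value) -> count
    for attrs, c in examples:
        classes[c] += 1
        for i, v in enumerate(attrs):
            cond[(c, i, v)] += 1
    return classes, cond, len(examples)

def classify(attrs, classes, cond, n):
    # Score each class by P(C) * prod_i P(x_i | C), then take the max.
    def score(c):
        p = classes[c] / n
        for i, v in enumerate(attrs):
            p *= cond[(c, i, v)] / classes[c]
        return p
    return max(classes, key=score)

examples = [((1, 1), "spam"), ((1, 0), "spam"),
            ((0, 0), "ham"), ((0, 1), "ham")]
model = train(examples)
label = classify((1, 1), *model)          # most likely class for (1, 1)
```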

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{itemize}
\item Assume the prior is a \defn{beta distribution}: \mat{$beta[a,b] (\theta) = \alpha \theta^{a-1} (1- \theta)^{b-1}$}

\note{{\em $\alpha$ is the normalization constant }}

\item Full Bayesian learning
\end{itemize}

\twofig{\file{graphs}{beta1.eps}}{\file{graphs}{beta2.eps}}

Beta distribution:
\begin{itemize}
\item Mean value of the distribution is: \mat{$\frac{a}{a+b}$}
\item larger values of \mat{$a$} suggest a belief that is closer to 1 than to 0
\item larger values of \mat{$a+b$} make the distribution more peaked (greater certainty about $\Theta$)
\item \emph{if the prior of $\Theta$ is a beta distribution, after a data point is observed, the posterior distribution of $\Theta$ is also a beta distribution}
\end{itemize}
\mat{
\begin{eqnarray}
P(\theta | D_1 = cherry) &=& \alpha P(D_1 = cherry | \theta) P(\theta)\nonumber\\
& = & \alpha' \theta \cdot beta[a,b] (\theta) = \alpha' \theta \cdot \theta^{a-1} (1 - \theta)^{b-1} \nonumber\\
& = & \alpha' \theta^a (1 - \theta)^{b-1} = beta[a+1, b](\theta)
\end{eqnarray}
}
The distribution converges to a narrow peak around the true value of $\Theta$ as data comes in.

For large data set, Bayesian learning converges to give the same results as ML learning.
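Because the beta family is conjugate, the whole update is just count bookkeeping; a sketch (observations invented for illustration):

```python
# Conjugate update for the beta prior: a cherry observation increments a,
# a lime observation increments b; the mean a/(a+b) is the point estimate.
def beta_update(a, b, observation):
    return (a + 1, b) if observation == "cherry" else (a, b + 1)

a, b = 1, 1                      # uniform prior beta[1, 1]
for obs in ["cherry"] * 3 + ["lime"]:
    a, b = beta_update(a, b, obs)
mean = a / (a + b)               # posterior mean after 3 cherries, 1 lime
```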

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Parameter learning in Bayes nets (contd.)}
\vspace*{0.2in}

\epsfxsize=0.5\textwidth
\fig{\file{graphs}{bayesian.eps}}

\mat{$P(\Theta, \Theta_1, \Theta_2)$}\\
Usually, we assume \emph{parameter independence}: each parameter has its own beta distribution.

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\blob Bayesian learning

\blob Approximate Bayesian learning\nl
-- Maximum {\em a posteriori} learning (MAP)\nl
-- Maximum likelihood learning (ML)

\blob Parameter learning with complete data\nl
-- ML parameter learning with complete data in \emph{discrete} models\nl
-- ML parameter learning with complete data in \emph{continuous} models (linear regression)\nl
-- Naive Bayes models\nl
-- Bayesian parameter learning

\blob Learning Bayes net structure with complete data $\Leftarrow$

\note{ {\em (If time allows)} }

\blob Learning with hidden variables or incomplete data (EM algorithm)

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\blob Bayesian learning

\blob Approximate Bayesian learning\nl
-- Maximum {\em a posteriori} learning (MAP)\nl
-- Maximum likelihood learning (ML)

\blob Parameter learning with complete data\nl
-- ML parameter learning with complete data in \emph{discrete} models\nl
-- ML parameter learning with complete data in \emph{continuous} models (linear regression)\nl
-- Naive Bayes models\nl
-- Bayesian parameter learning

\blob Learning Bayes net structure with complete data

\note{ {\em (If time allows)} }

\blob Learning with hidden variables or incomplete data (EM algorithm) $\Leftarrow$

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Learning with hidden variables or incomplete data}

%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Full Bayesian learning gives best possible predictions but is intractable

MAP learning balances complexity with accuracy on training data

Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data\\
\note{{\em requires substantial insight and sometimes new models}}

2. Write down the likelihood of the data as a function of the parameters\\
\note{{\em may require summing over hidden variables, i.e., inference}}

3. Write down the derivative of the log likelihood w.r.t. each parameter

4. Find the parameter values such that the derivatives are zero\\
\note{{\em may be hard/impossible; modern optimization techniques help}}

\end{huge}
\end{document}