-----------------------------------------------------------------------
     Sequential Decision Problems and Markov Decision Processes

     Notes by:   Leanne Vander Meer Dunn
     Revised by: Rao [Dec 5, 1995]

     Sources:    Artificial Intelligence - A Modern Approach,
                 Stuart Russell and Peter Norvig, Chapters 16, 17, 20
                 Planning and Learning Class Lecture,
                 November 29 and December 4, 1995
-----------------------------------------------------------------------

This paper describes how to solve a sequential decision problem.
However, before we can solve it, we need to know what it is. So first,
some background material.

Definitions
-----------

A sequential decision problem is one where the agent has a set of world
states it can inhabit, a set of actions it can take in any world state,
and a specific reward for being in any state.

A policy is a mapping from states to actions. Given a state, the policy
tells the agent what action to perform. The objective of a sequential
decision problem (SDP) is to find the optimal policy--i.e., the policy
which gives the maximum accumulated reward for the agent.

A Markov decision problem (MDP) is the problem of calculating an
optimal policy in an accessible (observable), stochastic environment
with a transition model that satisfies the Markov property (i.e., the
transitions depend only on the current state, and not on the states
that the agent visited on its way to this state).

Motivation
----------

In previous class discussions, we have talked about ways to relax the
accessibility assumption, the instantaneous action assumption, and the
deterministic action assumption. However, the "goals" of the planner
have remained the same. We still have achievement goals and maintenance
goals. But what if you need to represent a more expressive goal? For
instance, you might want a plan that takes the smallest possible number
of actions to reach a given state. Or you might want a plan that
reaches its goal without ever passing through a "bad state". To support
such goals, we first need a methodology for checking whether a plan
"satisfies" them. We briefly talked about value functions in our
discussions concerning dynamical systems. Now it's time to put them to
work.

Value Functions
---------------

As previously discussed, a dynamical system takes its current state and
an action as input, and produces another state. This new state can then
be fed back into the next step of the dynamical system. In so doing,
the dynamical system exhibits a state sequence (also known as a
history, or a behavior).

A value function assigns a real-valued number to a state sequence:

   V(<s1, s2, ..., sn>) = r,   where r is a real number

A goal-of-achievement value function might look like this:

                          | 1  if G is true in sn
   V(<s1, s2, ..., sn>) = |
                          | 0  if G is not true in sn

A more general value function can depend on all the states in the state
sequence (instead of just on the goal).

Given a dynamical system D and a value function V, you can assess the
performance of the system by essentially letting it run, considering
the state sequences it produces, computing their values, and averaging
the values over a bunch of runs of D. We know that the plan/policy of
the dynamical system changes the state sequences that the system will
exhibit. Thus, the objective of planning becomes coming up with a
plan/policy under which the assessed value of the dynamical system will
be maximal. One brute-force way of finding the optimal plan would thus
be to enumerate all the different plans, compare them in terms of the
value of the dynamical system under each plan, and take the plan that
gives rise to the best value.
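To make this assessment procedure concrete, here is a minimal Python
sketch (the helper names simulate_step and value_fn, and the horizon
and run counts, are our own assumptions, not anything given in the
notes). It simply runs the dynamical system several times under a
fixed policy and averages the values of the resulting state sequences:

   def assess_policy(initial_state, policy, simulate_step, value_fn,
                     horizon=20, num_runs=100):
       """Estimate how good a policy is by averaging the value function
       over several simulated runs of the dynamical system."""
       total = 0.0
       for _ in range(num_runs):
           state, history = initial_state, [initial_state]
           for _ in range(horizon):
               action = policy[state]                # policy: state -> action
               state = simulate_step(state, action)  # sample the next state
               history.append(state)
           total += value_fn(history)                # value of this sequence
       return total / num_runs

The brute-force scheme above would just call assess_policy once per
candidate policy and keep the policy with the best estimate.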
So where do we get a value function? We need a way to convert the goals
into a value function. We can do this indirectly in terms of "reward"
functions. A reward function assigns a real-valued number to a state
(rather than to a state sequence). Here's an example of a reward for
maintaining an attribute g in state s:

          | -1  if g does not hold in s
   R(s) = |
          |  0  if g holds in s

In the above example, if you violate the requirement that you maintain
a particular attribute, you get a negative reward. Once we have reward
functions, we can compute value functions in the obvious way:

   V(<s1, s2, ..., sn>) = R(s1) + R(s2) + ... + R(sn)

To handle infinitely long sequences of states, and also to discount
future promise with respect to past achievement, we typically consider
a discounted value function:

   V(<s1, s2, ..., sn>) = SUM(i=1..n) [ r^i * R(si) ]    (where 0 <= r < 1)

-------
REWARD FUNCTIONS RESTRICT OUR ATTENTION TO A SPECIFIC CLASS OF VALUE
FUNCTIONS

There are some important assumptions inherent in reward functions.
Specifically, since the reward for each state is defined independently
of the other states, value functions defined in terms of reward
functions have the "SEPARABILITY" property--that is, the value of a
state sequence <s1, s2, ..., sn> is the sum of the value of <s1> and
the value of <s2, ..., sn>. NOT ALL VALUE FUNCTIONS SATISFY THE
SEPARABILITY PROPERTY. As an example, suppose your value function is
such that:

   V(<s1, s2>) = a     (with a != 0)
   V(<s1>)     = 0
   V(<s2>)     = 0

Here the value of the whole sequence is not the sum of the values of
its pieces.
---------------

Reward functions also allow us to combine goals in a natural way.
Here's an example where you want to achieve either g1 or g2 as early as
possible. What is the reward for state s?

   R(s) = MAX[ R(s,g1), R(s,g2) ]

---------------

Markov Decision Problems
------------------------

A dynamical system specified in terms of a transition model (i.e., a
model which gives, for each state s and action a, the distribution of
states that will result if a were executed in s) and a reward function
is called a Markov decision process. The objective in an MDP is to find
the optimal policy for the dynamical system.

An MDP requires, as input: 1) a set of states, 2) a set of actions, and
3) a reward function. The reward function tells us what reward to
expect when in a given state. The actions are represented in terms of
transition probabilities from one state to another state. For example,
when in state s1, action a1 might have probability 0.9 of taking you to
state s2.

Example:

I'll explain how to solve a Markov decision problem through a series of
examples, all using the same environment. Introducing grid world:

        +-------+-------+-------+
        |       |       |       |
      2 |       |       |  +1   |
        |       |       |       |
        +-------+-------+-------+
        |       |       |       |
      1 |       |       |  -1   |
        |       |       |       |
        +-------+-------+-------+
        |       |       |       |
      0 |       |       |       |
        |       |       |       |
        +-------+-------+-------+
            0       1       2

Each square is a state. The squares with +1 and -1 are terminal states.
You want to reach the +1 state and stay away from the -1 state. So,
somehow, you need to know what to do in each of the other states. In
this environment, you have four actions: up, down, left, right, where
each action moves the agent one position in the specified direction.
However, we want to learn about handling uncertainty, so let's make the
outcomes non-deterministic. Let's say that there is an 80% chance that
the operator will move the agent in the specified direction, and a 20%
chance that the operator will move the agent at a right angle to the
specified direction. For instance, if the agent executes the action up,
then the agent has an 80% chance of moving up, a 10% chance of moving
left, and a 10% chance of moving right.
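Before moving on, here is one way the grid world's transition
probabilities might be written down in code (a sketch; the (x, y) state
encoding, with x the column and y the row, and the helper names are our
own choices):

   MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
   SLIPS = {'up': ['left', 'right'], 'down': ['left', 'right'],
            'left': ['up', 'down'],  'right': ['up', 'down']}

   def transition(state, action, width=3, height=3):
       """Return {next_state: probability} for executing `action` in
       `state`: 0.8 in the intended direction and 0.1 for each slip at
       a right angle.  Moving into a wall leaves the agent in place."""
       def move(s, d):
           dx, dy = MOVES[d]
           nx, ny = s[0] + dx, s[1] + dy
           return (nx, ny) if 0 <= nx < width and 0 <= ny < height else s

       outcomes = {}
       for direction, prob in [(action, 0.8)] + [(d, 0.1) for d in SLIPS[action]]:
           nxt = move(state, direction)
           outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
       return outcomes

For example, transition((0, 0), 'up') gives
{(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}, since a leftward slip from the
bottom-left corner runs into the wall.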
These probabilities describe a transition model.

In a problem like this, the utility of a state really depends on where
you can get to from that state. Therefore, the utility of the state
depends on the utilities of a sequence of states (an environment
history).

A policy is a complete mapping from states to actions. You can
calculate a policy from a transition model (the probabilities) and a
utility function. Once you have a policy, the agent simply does what
the policy says for a given state.

In an accessible environment, the agent's observations at each step
will identify the current state. In an inaccessible environment, the
agent's observations do not provide enough information to identify the
current state or the associated transition probabilities.

As stated above, a Markov decision problem (MDP) is the problem of
calculating an optimal policy in an accessible, stochastic environment
with a known transition model. The Markov property is said to hold if
the transition probabilities from any given state depend only on the
state and not on the previous history. A partially observable Markov
decision problem (POMDP) is the problem of calculating an optimal
policy in an inaccessible environment. These problems generally don't
admit the same kinds of solutions as MDPs, and this paper will not
cover them.

We are interested in MDPs, so the environment is accessible and the
transition model is known. Given an MDP and a reward function, we can
calculate an optimal policy. Specifically, an optimal policy directs
the agent towards the next possible state that has the highest promise,
i.e., that will lead to state sequences whose expected value will be
the highest. The promise of a state is called the utility of the state.
To understand the utility of a state, and how it fits into an MDP, we
need a little more background.

-----------Digression onto utility theory-----------

Decision Theory
---------------

When making decisions, you not only need to know what is a good effect
and what is a bad effect, but also how likely each is to happen.
Decision theory describes how to combine utility theory with
probability so that an agent can select actions that will maximize its
expected performance. Probability theory describes what an agent should
believe on the basis of evidence. Utility theory describes what an
agent wants. So, decision theory lets an agent figure out what to do
based on probability and utility. This leads to a discussion of utility
theory.

Utility Function
----------------

An agent's preferences between world states are represented by a
utility function, which assigns a number to each state to represent the
desirability of the state.

An action can also have a utility. You combine the state utilities with
the possible outcomes of the action to get an expected utility for the
action. The expected utility is defined as:

   EU(A) = SUM(i) [ P(O(A,i)) * U(O(A,i)) ]

This is the expected utility of action A given the agent's current
situation. Following is a detailed explanation of this equation.

U(S) is the utility of state S (the utility is a number). For now, just
assume that you have these values. In the above equation, the utility
refers to the state that the agent will be in after the action.

O(A,i) is the i-th possible outcome of action A. This refers to a
state. For example, if the agent moves a step forward, the agent might
stay in the new position, or the agent might slide backwards to its
original position (maybe if it was on a slope). So there are two
possible outcome states in this case.
Continuing with the example, the action move from position P0 to
position P1 could result in the agent being in either position P0 or in
position P1.

   O(move,1) = P1   --> Agent stays in new position
   O(move,2) = P0   --> Agent slides backwards

Remember that U(S) is the desirability of being in state S (according
to the agent). For this example, let's say:

   U(O(move,1)) = U(P1) = 0.88
   U(O(move,2)) = U(P0) = 0.76

P(O(A,i)) is the probability of action A resulting in outcome i. So,
given the above example, there might be an 80% chance that the agent
will stay in the new state, and a 20% chance that the agent will slide
backwards to the old state.

   P(O(move,1)) = P(P1) = 0.80
   P(O(move,2)) = P(P0) = 0.20

And finally, putting it all together, the expected utility is:

   EU(A)    = SUM(i) [ P(O(A,i)) * U(O(A,i)) ]
   EU(move) = P(O(move,1)) * U(O(move,1)) + P(O(move,2)) * U(O(move,2))
            = 0.80 * 0.88 + 0.20 * 0.76
            = 0.856

So now you know the expected utility of attempting to move from state
P0 to state P1.

At this point, we still don't know how to get a utility function. We do
know how to use it to calculate the expected utility of an action.
Following are a few more points that will help us with utility
functions.

If you have a performance measure for the agent, then you want a
utility function that reflects the desired results of the performance
measure. This will give the agent the optimal behavior (according to
the performance measure).

The scale of a utility function doesn't really matter. For instance, a
new utility function U'(S) could be defined as:

   U'(S) = (some constant) + (some positive constant) * U(S)

To an agent, U(S) and U'(S) mean pretty much the same thing. The
relationship between the individual utility values is what matters. For
instance, the action move-right might have utility 1.0, and the action
move-left might have utility 0.5. If you multiplied them both by two,
so that U(move-right) = 2.0 and U(move-left) = 1.0, the action
move-right would still be preferable to the action move-left because it
has the higher utility value in both cases.

In multi-attribute utility theory, the outcomes depend on two or more
attributes:

   U(x1, ..., xn) = f(f1(x1), ..., fn(xn))

This discussion emphasizes single-attribute utility functions, which
are what we need for policy generation. Therefore, this paper does not
go into an example of a multi-attribute utility function. Suffice it to
say, they do exist.

A utility function is separable if and only if we can find a function f
such that

   U(<s0, s1, ..., sn>) = f(s0, U(<s1, ..., sn>))

For instance,

   U(<s0, s1, ..., sn>) = R(s0) + U(<s1, ..., sn>)

Now that we have some background in decision theory and utility theory,
it's time to tackle the main event.
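The expected-utility calculation is easy to render as code. Here is a
small sketch of the formula above, applied to the hypothetical move
example (the numbers are just the ones assumed in the digression):

   def expected_utility(outcomes):
       """EU(A) = SUM(i) [ P(O(A,i)) * U(O(A,i)) ], where `outcomes` is
       a list of (probability, utility) pairs, one per possible outcome."""
       return sum(p * u for p, u in outcomes)

   # The hypothetical move action: stay in P1 with probability 0.80
   # (utility 0.88), or slide back to P0 with probability 0.20 (0.76).
   move_outcomes = [(0.80, 0.88), (0.20, 0.76)]
   print(expected_utility(move_outcomes))    # roughly 0.856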
-----------------------------------
Computing optimal policies for Markov Decision Processes:
----------------------------------------------------------

You might, theoretically, be able to enumerate all of the possible
policies for a state space, and then pick the one with the maximum
expected value. However, this can readily blow up, since the number of
policies is exponential in the size of the state space. Instead, the
methods of policy generation take advantage of the optimal substructure
principle. This means that the optimal policy will also be locally
optimal for each individual state.

Suppose you are in state s. Which action should you do? We want to do
the action such that the state sequences that the dynamical system
exhibits starting from s, under this policy, will have the maximum
value.

Let us define the utility U of a state s to be the maximal expected
value of state sequences starting from state s. If we know U(s) for all
states, then we can use it to decide the optimal policy. The optimal
policy will be to do action a from state si if the quantity

   SUM(j) [ M(a,si,sj) * U(sj) ]

is maximal over the actions a.

So, how do we find the utility values for all states? Here is what we
know:

1. Utility values for terminal states (i.e., states from which no
   actions are possible) are known a priori (they are just equal to the
   value of the reward function at that state).

2. The utility value of a state si is related to the utility values of
   its neighboring states, in terms of the transition probabilities, as
   follows:

      U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ]

   The utility of a state is the reward of the state plus the expected
   utility of the best action available in that state (which stands in
   for the best history, or state sequence, obtainable from there). For
   the state i, U(i) is the optimal value. In writing this equation, we
   have used the fact that the optimal policy will involve doing the
   optimal thing at each state.

To develop a history, you iteratively apply the utility equation so
that the rewards of the terminal states have the opportunity to
propagate out through all the other states (essentially forming a
history from each state to each terminal state).

One way to solve for the utility values is to use dynamic programming.
Dynamic programming is an efficient way to solve a set of equations
such as these. It gives exact utility values right away, but it can be
costly in large state spaces. We will not discuss the details of
dynamic programming.

Two other methods of calculating optimal policies in MDPs are value
iteration and policy iteration. Both methods essentially try to modify
the utilities of neighboring states so that they satisfy the utility
recurrence equations. If this local modification process is repeated at
each state in the state space, for enough iterations, then the
utilities of the individual states will finally converge to their
correct values. Following is a detailed description of each method.

Value Iteration
---------------

The basic idea is to calculate a utility for each state (given a
transition model and a reward function), and then to use these
utilities to create a policy. The utility of a state needs to depend on
where you can get to from that state. Therefore, the utility of the
state depends on the utilities of a sequence of states (an environment
history). The utility of a state is its reward value for being in that
state, plus the maximum expected utility of the state sequences
possible from that state:

   U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ]

If there is only one possible action a from state i, then

   U(i) = R(i) + SUM(j) [ M(a,i,j) * U(j) ]

Following is a breakdown of what this equation means.

1) M(a,i,j) is the probability of reaching state j when executing
   action a in state i.

   Example: In the grid world, the probability of reaching state (0,1)
   from state (0,0) is 0.8 when executing action up. The probability of
   reaching state (1,1) from state (0,0) is 0 for all actions (because
   each action moves the agent by at most one square).

2) M(a,i,j) * U(j) is the expected utility contribution of state j when
   executing action a in state i.

   Example: If the utility of state (0,1) is 0.75, then the expected
   utility contribution of state (0,1) when executing action up from
   state (0,0) is 0.8 * 0.75.

3) SUM(j) [ M(a,i,j) * U(j) ] is the expected utility of executing
   action a in state i.
   This is the sum, over the possible successor states j, of the
   expected utility contributions when executing action a in state i.

   Example: Suppose that, part way through the computation, the current
   utility estimates give the following contributions when executing
   action up in state (0,0) of the grid world:

      EU(up,(0,0) -> (0,1)) = 0.68
      EU(up,(0,0) -> (1,0)) = 0.13
      EU(up,(0,0) -> (0,0)) = 0.15
      EU(up,(0,0) -> (*,*)) = 0        (all unreachable states)

   Then the expected utility of action up in state (0,0) is

      EU(up) = 0.68 + 0.13 + 0.15 = 0.96

4) max(a) SUM(j) [ M(a,i,j) * U(j) ] is the expected utility of the
   best action in state i.

   Example: Following is a list of the expected utilities of the
   actions when executed from state (0,0), continuing with the
   estimates above:

      EU(up)    = 0.96
      EU(right) = 0.82
      EU(left)  = 0.92
      EU(down)  = 0.66

   Then the maximum expected utility in state (0,0) is 0.96 (the
   expected utility of action up).

5) U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ] is the utility of
   state i.

   Example: If state (0,0) has reward -0.04, then its utility is
   -0.04 + 0.96 = 0.92.

The value iteration function takes a transition model M and a reward
function R, and returns a utility function U. Following is the
value-iteration algorithm, and then a detailed example.

   utility-function VALUE-ITERATION (transition-model M,
                                     reward-function R)
   {
      utility-function U = R
      utility-function U-copy
      do {
         U-copy = U
         for each non-terminal state i do
            U[i] = R[i] + max(a) SUM(j) [ M(a,i,j) * U-copy[j] ]
      } while (U is-not-close-to U-copy)
      return U
   }

Notice that by using a copy of the utility function to do the
calculations, the new utility function values always depend on the
values of the previous iteration. This means that, during an iteration,
the values that you calculate will not interfere with each other.

Example of Value Iteration:

This example uses the grid world from above, where there is an 80%
chance that the operator will move the agent in the specified direction
and a 20% chance that the operator will move the agent at a right angle
to the specified direction. The reward for each non-terminal state is
R = -0.04. As specified by the algorithm, the initial utility function
is equal to the reward function.

   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 | -0.04 |
   |       |       |       |
   +-------+-------+-------+

Next we show the utility function after one iteration.

   +-------+-------+-------+
   |       |       |       |
   | -0.08 | +0.75 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.08 | -0.08 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.08 | -0.08 | -0.08 |
   |       |       |       |
   +-------+-------+-------+

Notice that state (1,2) is the only state that has a positive utility
value. It is currently the only state that has a known path to the
positive terminal state. As the iterations continue, the other states
will benefit from the high utility value of the positive terminal
state.
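If you want to check that single sweep mechanically, one backup of
every non-terminal state over the initial utilities reproduces the
figure above. This is a sketch that assumes the transition(state,
action) helper from the earlier sketch is in scope; the table names are
our own:

   ACTIONS   = ['up', 'down', 'left', 'right']
   TERMINALS = {(2, 2): 1.0, (2, 1): -1.0}      # the +1 and -1 squares
   STATES    = [(x, y) for x in range(3) for y in range(3)]
   R = -0.04

   U_copy = {s: TERMINALS.get(s, R) for s in STATES}    # initial U = R
   U = dict(U_copy)
   for s in STATES:
       if s in TERMINALS:
           continue
       U[s] = R + max(sum(p * U_copy[t] for t, p in transition(s, a).items())
                      for a in ACTIONS)

   print(U[(1, 2)])   # roughly 0.752; every other non-terminal state is -0.08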
Following is the trace of how the algorithm gets the utility value for
state (1,2) in that first iteration:

   M(up,(1,2),(1,2))    * U-copy(1,2) = 0.80 * -0.04 = -0.032
   M(up,(1,2),(0,2))    * U-copy(0,2) = 0.10 * -0.04 = -0.004
   M(up,(1,2),(2,2))    * U-copy(2,2) = 0.10 *  1.0  =  0.10

   M(down,(1,2),(1,1))  * U-copy(1,1) = 0.80 * -0.04 = -0.032
   M(down,(1,2),(0,2))  * U-copy(0,2) = 0.10 * -0.04 = -0.004
   M(down,(1,2),(2,2))  * U-copy(2,2) = 0.10 *  1.0  =  0.10

   M(left,(1,2),(0,2))  * U-copy(0,2) = 0.80 * -0.04 = -0.032
   M(left,(1,2),(1,2))  * U-copy(1,2) = 0.10 * -0.04 = -0.004
   M(left,(1,2),(1,1))  * U-copy(1,1) = 0.10 * -0.04 = -0.004

   M(right,(1,2),(2,2)) * U-copy(2,2) = 0.80 *  1.0  =  0.80
   M(right,(1,2),(1,2)) * U-copy(1,2) = 0.10 * -0.04 = -0.004
   M(right,(1,2),(1,1)) * U-copy(1,1) = 0.10 * -0.04 = -0.004

   up    --> SUM = -0.032 + -0.004 +  0.10  =  0.064
   down  --> SUM = -0.032 + -0.004 +  0.10  =  0.064
   left  --> SUM = -0.032 + -0.004 + -0.004 = -0.04
   right --> SUM =  0.80  + -0.004 + -0.004 =  0.792

Since action right has the maximum expected utility, we use the value
of action right in the final calculation of the utility for state
(1,2):

   U(1,2) = R(1,2) + EU(right) = -0.04 + 0.792 = 0.752

(rounded to 0.75 in the figure)

Here is the utility function after a second iteration.

   +-------+-------+-------+
   |       |       |       |
   | +0.55 | +0.83 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.12 | +0.45 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.12 | -0.12 | -0.12 |
   |       |       |       |
   +-------+-------+-------+

Following is the final utility function, after many iterations. Even if
you continue to perform the value iteration algorithm on this utility
function, the values will not change. They have converged to their
final values.

   +-------+-------+-------+
   |       |       |       |
   | +0.87 | +0.93 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | +0.82 | +0.78 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | +0.76 | +0.72 | +0.49 |
   |       |       |       |
   +-------+-------+-------+

Once you have the utility function, you can compute the policy:

   Policy(i) = arg max(a) SUM(j) [ M(a,i,j) * U(j) ]

This equation determines the action for a single state. The equation
has many similarities to the equation for a state's utility, so we'll
go straight to an example for state (0,0):

   M(up,(0,0),(0,1))    * U(0,1) = 0.8 * 0.82 = 0.656
   M(up,(0,0),(1,0))    * U(1,0) = 0.1 * 0.72 = 0.072
   M(up,(0,0),(0,0))    * U(0,0) = 0.1 * 0.76 = 0.076

   M(down,(0,0),(0,0))  * U(0,0) = 0.8 * 0.76 = 0.608
   M(down,(0,0),(1,0))  * U(1,0) = 0.1 * 0.72 = 0.072
   M(down,(0,0),(0,0))  * U(0,0) = 0.1 * 0.76 = 0.076

   M(left,(0,0),(0,0))  * U(0,0) = 0.8 * 0.76 = 0.608
   M(left,(0,0),(0,1))  * U(0,1) = 0.1 * 0.82 = 0.082
   M(left,(0,0),(0,0))  * U(0,0) = 0.1 * 0.76 = 0.076

   M(right,(0,0),(1,0)) * U(1,0) = 0.8 * 0.72 = 0.576
   M(right,(0,0),(0,1)) * U(0,1) = 0.1 * 0.82 = 0.082
   M(right,(0,0),(0,0)) * U(0,0) = 0.1 * 0.76 = 0.076

   up    --> SUM = 0.656 + 0.072 + 0.076 = 0.804
   down  --> SUM = 0.608 + 0.072 + 0.076 = 0.756
   left  --> SUM = 0.608 + 0.082 + 0.076 = 0.766
   right --> SUM = 0.576 + 0.082 + 0.076 = 0.734

Action up has the highest expected utility, so the policy uses action
up for state (0,0). Below is the completed policy.

   +-------+-------+-------+
   |       |       |       |
   | right | right |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   | left  |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   | left  | left  |
   |       |       |       |
   +-------+-------+-------+

The terminal states do not have associated actions, either because
there is no way to get out of the state, or because you don't want the
agent to leave the state. (In the second case, there might exist an
action that would get the agent out of the state, but you prevent it
from being used in the policy.)
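The full example can be reproduced with a short script. Below is a
minimal sketch of the VALUE-ITERATION loop plus the policy-extraction
step, reusing transition(), ACTIONS, TERMINALS, and STATES from the
earlier sketches (the convergence threshold epsilon is our own choice):

   def value_iteration(reward=-0.04, epsilon=1e-4):
       U = {s: TERMINALS.get(s, reward) for s in STATES}   # U starts out as R
       while True:
           U_copy = dict(U)
           for s in STATES:
               if s in TERMINALS:
                   continue
               U[s] = reward + max(
                   sum(p * U_copy[t] for t, p in transition(s, a).items())
                   for a in ACTIONS)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               return U

   def extract_policy(U):
       # Policy(i) = arg max(a) SUM(j) [ M(a,i,j) * U(j) ]
       return {s: max(ACTIONS,
                      key=lambda a: sum(p * U[t]
                                        for t, p in transition(s, a).items()))
               for s in STATES if s not in TERMINALS}

   U = value_iteration()
   print(extract_policy(U)[(0, 0)])   # 'up', as in the policy table above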
Notice that the action for the middle square is left instead of up.
While going left leads to a longer path, it is a safer path, because
there is no chance of accidentally ending up in the negative terminal
state.

What happens when the cost of a step is higher? This corresponds to a
larger negative reward. Below is the policy resulting from using a
reward of R = -0.1.

   +-------+-------+-------+
   |       |       |       |
   | right | right |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   |  up   |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   |  up   | left  |
   |       |       |       |
   +-------+-------+-------+

Notice that the action for the middle square is now up instead of left.
The policy decided to take the risk of accidentally landing in the
negative terminal state.

There is still one part of the algorithm that hasn't been explained:
when does the iteration stop? In the ideal case, the utility function
will converge to static values. In this case, you can compare U to
U-copy to see whether the utility function has changed. But what do you
do if the utility function does not converge? This can happen if the
state space is very large, or if a cycle develops in the path
(essentially, the length of the action sequence is unbounded). One
possible solution is to use discounting. The basic idea is that the
value of a reward received further in the future is reduced toward
nothing, so eventually the utility function will converge. If you just
want to make sure that you don't calculate forever, you can also use an
iteration limit (stop after n iterations).
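As a concrete illustration of those two stopping strategies, the
value-iteration sketch above could be modified to use a discount factor
together with an iteration cap (gamma, epsilon, and max_iters here are
assumed parameters, not values from the notes):

   def discounted_value_iteration(reward=-0.04, gamma=0.9,
                                  epsilon=1e-4, max_iters=1000):
       U = {s: TERMINALS.get(s, reward) for s in STATES}
       for _ in range(max_iters):                   # iteration limit
           U_copy = dict(U)
           for s in STATES:
               if s in TERMINALS:
                   continue
               U[s] = reward + gamma * max(         # discount future utility
                   sum(p * U_copy[t] for t, p in transition(s, a).items())
                   for a in ACTIONS)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               break                                # U is close to U-copy
       return U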
Policy Iteration
----------------

The basic idea is to start with some policy, then repeatedly calculate
the utility function for that policy, then calculate a new policy from
those utilities, and so on. This takes advantage of the fact that the
policy often converges long before the utility function converges. In
particular, small changes in the utility values of the individual
states may not have a significant effect on the policy.

The idea of policy iteration is to:

-- Start with the policy from the previous iteration (initially, it is
   set to some default locally optimal action for each state).

-- Compute the utility values of all states given this policy. This can
   be done by solving the equations:

      U(si) = R(si) + SUM(j) [ M(P(si),si,sj) * U(sj) ]

   where P(si) is the action to be taken in state si given the current
   policy.

-- Now, turn around and check whether, under these utility values, the
   current policy makes the locally optimal decision for each state.

The reason this makes sense is that the utility values for a policy are
globally determined by the whole policy, and thus can even out any
local inconsistencies in the policy.

Following is the policy-iteration algorithm.

   policy POLICY-ITERATION (transition-model M, reward-function R)
   {
      utility-function U = R
      policy P

      // Initialize the policy according to the utility function.
      for each non-terminal state i do
         P[i] = arg max(a) SUM(j) [ M(a,i,j) * U[j] ]

      do {
         policy-unchanged = true

         // Calculate the utility function according to the policy.
         U = VALUE-DETERMINATION(P, U, M, R)

         // If there is an action that has a higher expected utility
         // than the action currently in the policy, then put the new
         // action in the policy.
         for each non-terminal state i do {
            v1 = max(a) SUM(j) [ M(a,i,j) * U[j] ]
            v2 =        SUM(j) [ M(P[i],i,j) * U[j] ]
            if v1 > v2 {
               P[i] = arg max(a) SUM(j) [ M(a,i,j) * U[j] ]
               policy-unchanged = false
            }
         }
      } while (not policy-unchanged)
      return P
   }

The function VALUE-DETERMINATION can be implemented just like the
VALUE-ITERATION function, except that the utility values are calculated
according to the following equation:

   U[i] = R[i] + SUM(j) [ M(P[i],i,j) * U-copy[j] ]

Note that instead of max(a), this equation uses the action from the
policy. The VALUE-DETERMINATION step can also be solved as a system of
linear equations. Since the equations used in policy iteration are very
similar to those used in value iteration, we won't go through an
example. (A short code sketch of the policy-iteration loop appears just
before the conclusion.)

Dealing with Large State Spaces
-------------------------------

If you don't want to compute a complete policy for a large state space,
it is possible to compute a partial policy using envelope expansion.
The basic idea is that you start out with a subset of the full state
space, and consider any state that is not in that subset to belong to a
single aggregate state called "other". We then compute the optimal
policy for this restricted MDP. Next, we can enlarge the MDP by
bringing some of the states lumped into the "other" state back into the
MDP. The states that we bring in can be selected based on the
likelihood that the current optimal policy will push the agent into
them. This is essentially a way of expanding the policy to include
likely nearby states, and is similar to the technique we discussed in
the context of replanning. This technique is used by Dean et al. for
autonomous robot navigation. They construct the initial restricted MDP
by using something like BURIDAN, which computes a single plan from the
goal state to the initial state.

OTHER ISSUES:
-------------

Modeling real-world problems like autonomous navigation in terms of
MDPs raises more issues. For example, the agent may have faulty sensors
and thus may not know exactly what state it is in. Then we have to
handle _partially_ observable Markov decision processes. Furthermore,
the agent often doesn't have a complete model of its environment--i.e.,
the transition model for all actions and all states. This necessitates
that the agent _learn_ the transition model through experience. The
area of reinforcement learning addresses this issue.

Relation between Sequential Decision Problems and Buridan:
----------------------------------------------------------

Buridan does probabilistic planning, and thus its problem can also be
represented as a sequential decision problem in terms of MDPs. However,
it turns out that the Buridan approach is much more efficient than the
MDP approach where both are applicable. The Buridan approach is
applicable in scenarios where we only have goals of achievement, and
are attempting to increase the probability of satisfaction of the
goals. Under these assumptions, the reward function is zero for
everything other than the goal state, and the best action to take in a
given state is a function only of that state and its relation to the
goal state. Thus, a backward-chaining approach, as used in Buridan,
will be enough. In general sequential decision problems, on the other
hand, the best action to take in a state depends not only on the "goal
state" but on all the other states in the state space, since all of
them may have non-zero reward values. In such cases, backward chaining
is not enough, and we essentially have to do the dynamic programming
and solve for the utilities of _all_ states. So, the work done in
solving an MDP is always proportional to the size of the state space
(while the work done in Buridan may be much smaller than the order of
the size of the state space).
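Since no worked example of policy iteration is given above, here is a
rough code sketch of the POLICY-ITERATION loop with a simple iterative
VALUE-DETERMINATION for the same grid world, reusing transition(),
ACTIONS, TERMINALS, and STATES from the earlier sketches (the
convergence threshold, and the assumption that every policy we evaluate
eventually reaches a terminal state, are ours):

   def q_value(s, a, U):
       return sum(p * U[t] for t, p in transition(s, a).items())

   def value_determination(P, U, reward=-0.04, epsilon=1e-6):
       # Like value iteration, but with the policy's action instead of max(a).
       # Assumes P eventually reaches a terminal state (true in this grid world).
       while True:
           U_copy = dict(U)
           for s in STATES:
               if s not in TERMINALS:
                   U[s] = reward + q_value(s, P[s], U_copy)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               return U

   def policy_iteration(reward=-0.04):
       U = {s: TERMINALS.get(s, reward) for s in STATES}
       P = {s: max(ACTIONS, key=lambda a: q_value(s, a, U))
            for s in STATES if s not in TERMINALS}
       while True:
           policy_unchanged = True
           U = value_determination(P, U, reward)
           for s in P:
               best = max(ACTIONS, key=lambda a: q_value(s, a, U))
               if q_value(s, best, U) > q_value(s, P[s], U):
                   P[s] = best
                   policy_unchanged = False
           if policy_unchanged:
               return P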
Conclusion
----------

In review, we've looked at Markov decision problems, including utility
functions and optimal policies. The definition of an MDP includes an
accessible, stochastic environment with a known transition model. One
remaining question is: how do you get a transition model? Since the
environment is accessible, one possibility is to use training
sequences. This entails running the agent through a bunch of test
sequences in the environment, and recording how often the agent moves
from one particular state to another. You can build a transition model
from the gathered statistics. It might also be possible to observe the
reward for each state. So, from the agent's experience you can get a
reward function and a transition model. From these you can construct
the utility function, which lets you evaluate the agent's actions. And
finally, from all of this you can calculate a policy, which solves the
Markov decision problem.

--------------------------------- END ----------------------------------