Some more on MDPs...
Here are some more factoids on MDPs...
Terminology:
The "Value" function is also called "Cost-to-go" function in some
contexts.
The "horizon" is also called "stage" (finite horizon MDP is called
finite stage MDP)
=========
Doing value iteration for finite-horizon MDPs
If you have a finite-horizon (stage) MDP with K stages, then you will
have essentially K+1 different value functions:
U_0 (which is the value you will get if you can't do any more actions)
U_1 (which is the value you will get if you can do at most one action)
...
U_K (which is the value you will get if you can do at most K actions)
Now, it is clear that the Bellman equation is written in terms of U_m
and U_{m-1}:
U_m(i) = R(i) + max_over_a SUM_over_j M^a_i,j U_{m-1}(j)    (Eqn I)
where M^a_i,j is the probability of going from state i to state j when
action a is taken.
We know U_0 for all states (this is just the immediate reward of the
state).
So, using U_0 we can compute U_1; using U_1 we can compute U_2; and so
on, until we finally get U_K.
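Here is a minimal Python sketch of this backward recursion (not part of
the original message; R and M are hypothetical inputs: R[i] is the
immediate reward of state i, and M[a, i, j] is the probability of going
from state i to state j under action a). It just applies Eqn I K times,
starting from U_0 = R.

import numpy as np

def finite_horizon_values(R, M, K):
    """Return the list [U_0, U_1, ..., U_K] of value functions."""
    R = np.asarray(R, dtype=float)
    U = [R.copy()]                      # U_0(i) = R(i): no actions left
    for m in range(1, K + 1):
        q = M @ U[m - 1]                # q[a, i] = SUM_j M^a_i,j * U_{m-1}(j)
        U.append(R + q.max(axis=0))     # Eqn I: take the max over actions
    return U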
Once we have all these value functions, we can compute the optimal
policy for each stage (using the MEU principle).
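For completeness, here is a sketch of that MEU step for one stage m,
using the same hypothetical M as above: in each state, pick the action
with the highest expected value of U_{m-1}.

def stage_policy(M, U_prev):
    """MEU policy for stage m, given U_prev = U_{m-1}."""
    q = M @ U_prev                      # expected next-stage value of each action
    return q.argmax(axis=0)             # best action index for each state

# e.g., policies = [stage_policy(M, U[m - 1]) for m in range(1, K + 1)]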
=============
Clearly, the infinite-horizon MDP can be seen as a special case of the
finite-horizon one, where you have only one stage-independent optimal
value function. The U_m and U_{m-1} in Eqn I will both be U*.
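To illustrate this last point, here is a sketch of value iteration for
the infinite-horizon case: the same backup is repeated until the value
function stops changing, so U_m and U_{m-1} collapse into a single U*.
Note that a discount factor gamma < 1 (not present in Eqn I) is assumed
here so that the iteration converges.

def infinite_horizon_values(R, M, gamma=0.95, tol=1e-8):
    """Iterate the backup to a fixed point U* (assumes gamma < 1)."""
    R = np.asarray(R, dtype=float)
    U = R.copy()
    while True:
        U_new = R + gamma * (M @ U).max(axis=0)
        if np.abs(U_new - U).max() < tol:
            return U_new                # the stage-independent U*
        U = U_new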
========
Rao
[Sep 23, 2003]