
Some more on MDPs...



Here are some more factoids on MDPs...

Terminology:
 
 The "Value" function is also called "Cost-to-go" function in some
contexts. 

 The "horizon" is also called "stage" (finite horizon MDP is called
finite stage MDP)


=========
Doing value iteration for finite-horizon MDPs

If you have a finite-horizon (finite-stage) MDP with K stages, then
you will have essentially K+1 different value functions:

U_0 (which is the value you will get if you can't do any more actions)

U_1 (which is the value you will get if you can do at most one action)

...
U_K (which is the value you will get if you can do at most K actions)


Now, it is clear that the Bellman equation is written in terms of U_m
and U_m-1:


U_m(i) = R(i) + max_over_a SUM_over_j M^a_i,j U_m-1(j)   (Eqn I)

We know U_0 for all states (this is just the immediate reward of the
state).


So, using U_0 we can compute U_1; using U_1 we can compute U_2; and
so on, until we finally reach U_K.
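
Here is a minimal Python sketch of this forward sweep (the 3-state
reward vector R, the two transition matrices in M, and K=4 are
made-up placeholder data, purely for illustration):

# Finite-horizon value iteration: compute U_0 ... U_K via Eqn I.
# R[i]       : immediate reward of state i
# M[a][i][j] : probability of moving from state i to j under action a
R = [0.0, 1.0, 10.0]
M = {
    "stay": [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]],
    "move": [[0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0],
             [0.0, 0.0, 1.0]],
}
K = 4
n = len(R)

U = [R[:]]                    # U_0(i) = R(i): no actions remain
for m in range(1, K + 1):
    prev = U[-1]              # U_m-1
    U.append([R[i] + max(sum(M[a][i][j] * prev[j] for j in range(n))
                         for a in M)
              for i in range(n)])
# U[m] now holds U_m, for m = 0..K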


Once we have all these value functions, we can compute an optimal
policy for each stage (using the MEU principle).
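
Continuing the sketch above (same placeholder R, M, U, and K), the
stage-m policy just picks, at each state, the action that achieves
the max in Eqn I:

def policy_at_stage(m):
    """MEU action for each state when m actions remain (m >= 1)."""
    prev = U[m - 1]           # uses U_m-1, as in Eqn I
    return [max(M, key=lambda a: sum(M[a][i][j] * prev[j]
                                     for j in range(n)))
            for i in range(n)]

pi_K = policy_at_stage(K)     # policy for the very first decision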

=============

Clearly, the infinite-horizon MDP can be seen as a special case of
the finite-horizon one, where you have just one stage-independent
optimal value function. The U_m and U_m-1 in Eqn I will both be U*.
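
A sketch of that fixed-point computation, again on the placeholder
R, M, n from above (a discount factor gamma is assumed here purely
so the repeated backup converges; Eqn I itself has no discounting):

gamma = 0.9                   # assumed discount factor
U_star = R[:]                 # start from the immediate rewards
for _ in range(1000):
    # One Bellman backup: U <- R + gamma * max_a (M^a U)
    new = [R[i] + gamma * max(sum(M[a][i][j] * U_star[j]
                                  for j in range(n))
                              for a in M)
           for i in range(n)]
    done = max(abs(new[i] - U_star[i]) for i in range(n)) < 1e-6
    U_star = new
    if done:                  # stop once the values no longer change
        break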

========

Rao
[Sep 23, 2003]