Planning in Dynamic, Stochastic, etc. Domains.
Class Notes: 04/26/2000
Note Taker: Ullas
Discussion: Decision Theoretic Planning Paper Section 3.2
In the previous class, we looked at MDPs: the value function, Value Iteration,
and Policy Iteration. These methods are fine when the MDPs being considered are
small, i.e. have a small number of states. When we talk of structured
representations, choosing the best model largely comes down to case studies.
In the paper, the authors use decision trees as the structured representation,
but anything that can serve as a function approximator can be used instead.
Finding structure in data is what Machine Learning is best at. In this paper,
decision trees are treated as a function-approximation scheme; neural networks
are another example of trying to find patterns in data.
Value iteration and policy iteration, as presented so far, operate on
extensional (explicit, state-by-state) representations. The authors talk
about using intensional representations, since the size of the problem
description can be reduced, and then using them to find the value function.
Section 5 mostly talks about how to operate on particular syntactic
representations of the MDP. Decision trees and neural nets are discussed
as part of the discussion on MDPs because we want to use the compressed
representation.
The concept of Reinforcement Learning is closely tied to MDPs.
The space of value functions is infinite, while the set of policies is
finite; in fact, more than one value function maps to the same policy.
So we converge faster if we use policy iteration.
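As a concrete sketch of this many-to-one mapping (the two-state MDP, its action names, and all numbers below are made-up assumptions, not from the paper), two different value functions can induce the same greedy policy:

```python
# Hypothetical two-state MDP (all names and numbers are illustrative).
GAMMA = 0.9
# transitions[s][a] = list of (next_state, probability)
transitions = {
    0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
reward = {0: 0.0, 1: 1.0}  # reward for entering a state

def greedy_policy(V):
    """In each state, pick the action maximizing expected discounted value."""
    policy = {}
    for s, actions in transitions.items():
        policy[s] = max(actions, key=lambda a: sum(
            p * (reward[s2] + GAMMA * V[s2]) for s2, p in actions[a]))
    return policy

# Two different value functions inducing the same greedy policy:
print(greedy_policy({0: 0.0, 1: 1.0}))   # {0: 'go', 1: 'stay'}
print(greedy_policy({0: 0.3, 1: 1.4}))   # {0: 'go', 1: 'stay'}
```

Since whole regions of value-function space collapse onto a single policy, iterating in policy space can settle sooner than waiting for the values themselves to converge.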
Conceptually, value propagates by neighbourhood propagation.
Henceforth, unless stated otherwise, the infinite-horizon model of the MDP
is considered.
Convergence of Value Iteration
* An approximately optimal value function yields an arbitrarily good policy:
      |V* - V'|max <= E  ==>  |V* - Vp'|max <= 2 E Y / (1 - Y)
  where V* is the optimal value function,
        V' is the approximate value function,
        Vp' is the value function of the policy extracted (greedily) from V',
        Y is the discount factor.
* The error |V* - V'|max decreases by a factor of Y on every iteration.
* Value iteration converges to optimal in polynomial time (for a fixed Y).
* Relaxations (assignments to V) can be done
  * asynchronously
  * in parallel
As stated above, if V is the set of all value functions and P is the set of
all policies, then the value functions partition into subsets, each of
which maps to a single policy in P. Conversely, every policy in P maps to
exactly one value function in V: the value of following that policy.
Hence, in the equation above, take the policy to which V' maps and find its
value Vp'; the difference between Vp' and V* is then bounded by
2 E Y / (1 - Y).
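A minimal synchronous value-iteration sketch on a toy two-state MDP (all names and numbers are my assumptions for illustration) shows the max-norm change shrinking by a factor of Y = GAMMA on every sweep:

```python
# Toy deterministic two-state MDP: state 1 carries the reward.
GAMMA = 0.5
R = {0: 0.0, 1: 1.0}                      # reward for being in a state
T = {0: {"stay": 0, "go": 1},             # deterministic next state per action
     1: {"stay": 1, "go": 0}}

V = {0: 0.0, 1: 0.0}
deltas = []                               # max-norm change per sweep
for sweep in range(60):
    newV = {s: max(R[s] + GAMMA * V[T[s][a]] for a in T[s]) for s in T}
    deltas.append(max(abs(newV[s] - V[s]) for s in V))
    V = newV
    if deltas[-1] < 1e-12:
        break
# Fixed point for this chain: V[1] = 1/(1-GAMMA) = 2, V[0] = GAMMA*V[1] = 1
```

Printing `deltas` shows successive ratios equal to GAMMA, matching the contraction claim above.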
All the claims still hold when we use asynchronous updates. Synchronous
updates mean all the values of V1 are updated to V2 before the V3
calculation starts.
Asynchronous updates are better: conceptually, the wave of updates spreads
faster because the wave is not waiting for all the V2 values to be computed
in the current sweep. In this method the order of updates matters. An
update is known as a 'relaxation'. The order of updates does not matter for
synchronous updates, because everybody has to finish iteration 1 before
iteration 2.
Hence in the asynchronous case, when we change values, the assumption is
that some states have already come very near their actual values, so their
neighbours can start using those fresh values. This is similar to DVO
(dynamic variable ordering) in CSPs.
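A sketch of such asynchronous ("in-place") updates on a hypothetical three-state chain, sweeping backwards from the rewarding state so that fresh values are used immediately (the MDP and sweep order are illustrative assumptions):

```python
# Toy three-state chain: the reward sits at state 2.
GAMMA = 0.5
R = [0.0, 0.0, 1.0]                       # reward for being in each state
succ = [[1], [0, 2], [2]]                 # deterministic action outcomes

V = [0.0, 0.0, 0.0]
for sweep in range(60):
    delta = 0.0
    for s in (2, 1, 0):                   # update order matters here
        new = max(R[s] + GAMMA * V[s2] for s2 in succ[s])
        delta = max(delta, abs(new - V[s]))
        V[s] = new                        # in place: neighbours see it at once
    if delta < 1e-12:
        break
```

Because V[2] is relaxed first, V[1] and V[0] already use the fresh value within the same sweep; a synchronous sweep would have delayed that information by one iteration.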
But this idea of faster convergence helps if we only want to get close to
V*; it is not effective if we must reach V* exactly. The same is the case
in CSPs: if we want all the solutions of the CSP, then DVO does not matter.
Prioritized Sweeping: look at the states with the largest (potential)
changes and update them first.
If we loop an infinite number of times, then all the states will be
updated, provided they have non-zero probabilities.
In the asynchronous approach to updating values with prioritized sweeping,
how will the lower-valued states ever be reached?
This has to be written into the algorithm: generally, the larger-valued
states are updated with high priority, but every once in a while the
lower-valued states are updated with some small probability.
This sounds like doing a random restart in hill climbing just to escape
local minima.
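One possible sketch of this scheme (the toy chain, the queue discipline, and the occasional random pick are my assumptions about how the idea might be coded, not the paper's algorithm): states are keyed in a priority queue by the size of their pending Bellman change, and with small probability a random queue entry is processed instead of the top one:

```python
import heapq
import random

GAMMA = 0.5
R = [0.0, 0.0, 1.0]
succ = [[1], [0, 2], [2]]                 # successors of each state
pred = [[1], [0], [1, 2]]                 # predecessors, for priority backups

V = [0.0, 0.0, 0.0]

def backup(s):
    """One-step Bellman backup value for state s."""
    return max(R[s] + GAMMA * V[s2] for s2 in succ[s])

random.seed(0)                            # fixed seed so the run is repeatable
heap = [(-abs(backup(s) - V[s]), s) for s in range(3)]
heapq.heapify(heap)
while heap:
    if random.random() < 0.1:             # occasionally pick a non-top entry
        _, s = heap.pop(random.randrange(len(heap)))
        heapq.heapify(heap)
    else:
        _, s = heapq.heappop(heap)        # usually: largest pending change
    V[s] = backup(s)                      # relax the chosen state
    for p in pred[s]:                     # predecessors may now have new error
        err = abs(backup(p) - V[p])
        if err > 1e-12:
            heapq.heappush(heap, (-err, p))
```

The queue drains once every state's pending change is below the threshold, at which point V has converged to the fixed point of the toy chain.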
Normally, policy updates are cheaper than optimal value updates.
The value-function and policy-function methods are in fact learning
algorithms. They differ in how eager they are, i.e. how much we want to
learn in every iteration.
These are all local relaxations; Rao calls it the 'spreadsheet metaphor':
looking at your neighbour and changing your value, or rather looking at
your neighbour to verify that your value is correct. For example, if cell
B's value is -1 and its immediate neighbour's value is 1, then cell B will
know that it cannot be -1, so it will update its value and move toward its
actual value.
The word 'relaxation' is another word for 'consistency enforcement' in
CSPs.
If we don't have a terminal or sink state, then nobody has a real value;
everybody just tries to be consistent with their neighbours, and it
becomes local consistency enforcement.
But unlike local search we do CONVERGE here.
For an N-variable CSP, after N-level consistency we have reached global
consistency. The general idea, then, is that we are relaxing over the
values of one's neighbours.
How do we do the bookkeeping?
Bookkeeping is made simpler because, we hope, actions will not have
uniform probability over all the states. Normally, with Policy Iteration,
we decrease the number of iterations but increase the cost per iteration.
Dr. Rao: "Think of policy iteration as some sort of mutex propagation. If
we do mutex propagation we can do faster search, but essentially mutexes
are not required for search."
Extending Classical Planning to an MDP: Deterministic and Fully Observable.
Once we have stochastic actions, all states have probabilities; but when
the actions are deterministic, the unreachable states will have probability
0. Thus if the MDP is deterministic, we can stop after |S| iterations,
where S is the set of states. But non-deterministic MDPs have complexity
polynomial in |S|.
Reachability analysis will help in identifying the set of states that can
be reached, but full reachability analysis is as hard as full planning.
Graphplan, however, does approximate reachability. Hence one idea: run
Graphplan until it levels off, then check whether the states are good or
bad; whatever states are left after the mutexes are applied are considered
still reachable, and the rest unreachable.
Now consider a domain that is somewhat non-deterministic. In that case,
take Graphplan and use its reachability analysis to give probabilistic
reachability. This still gives a smaller MDP, but we can still reach
states where the policy won't know what to do!
Why Use MDPs for Classical Planning?
Generally, one would not use an MDP for classical planning, since we need
only linear plans. But the reasons would be:
1. The mutex information in Graphplan is computed for all states, while it
has to be done only for the relevant states. So some of that work is local
consistency enforcement, and the work of DP is the same local consistency
enforcement. Hence the use of MDPs for classical planning problems is
reasonable.
2. A CSP can be made globally consistent, and then I can find a consistent
assignment. But this is wasteful if we just need 'an assignment'; the idea
can be relaxed to enforcing only k-consistency. Similarly, we can take
classical planning problems, put approximations into the MDP, and do DP.
3. The structured, i.e. intensional, representations possible in an MDP
could be used to get a smaller representation.
Initial State:
MDPs don't take an initial state, but suppose we have one; then it does
help. How? MDPs are goal directed; if we know the initial state, we can
make the MDP forward directed. The tree generated is like a min-max tree;
the authors call it the expectimax tree. The size of the tree is based on
how far we look ahead, i.e. the horizon. After the tree is made, the work
left is to determine the value of Sinit, the root node of the tree, in a
bottom-up fashion.
The action choice that maximizes the value of Sinit is taken.
At every state node, the maximizing value over the successor actions is
taken; at every action node, we take the expected value over its successor
states.
This kind of online algorithm is called RTDP (Real Time Dynamic
Programming).
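A sketch of evaluating such an expectimax tree bottom-up to a fixed horizon (the toy MDP, its state and action names, and the reward-on-entry convention are all assumptions for illustration): state nodes maximize over actions, action nodes take expectations over outcomes:

```python
GAMMA = 0.9
# T[s][a] = list of (next_state, probability); R = reward on entering a state
T = {
    "init": {"safe": [("mid", 1.0)],
             "risky": [("goal", 0.5), ("init", 0.5)]},
    "mid":  {"go": [("goal", 1.0)]},
    "goal": {"stay": [("goal", 1.0)]},
}
R = {"init": 0.0, "mid": 0.0, "goal": 1.0}

def expectimax(s, horizon):
    """Bottom-up value of state s looking `horizon` steps ahead."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a, outcomes in T[s].items():
        # Action node: expected value over stochastic outcomes of a
        val = sum(p * (R[s2] + GAMMA * expectimax(s2, horizon - 1))
                  for s2, p in outcomes)
        best = max(best, val)             # state node: maximize over actions
    return best

print(expectimax("init", 3))
```

Growing the horizon grows the tree exponentially, which is why an online method like RTDP only expands the part of the tree reachable from the current state.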