Planning in Dynamic, Stochastic, etc. Domains
Class Notes: 04/26/2000
Note Taker: Ullas
Discussion: Decision Theoretic Planning Paper, Section 3.2

In the previous class we looked at MDPs, the value function, and the Value Iteration and Policy Iteration algorithms. These methods are fine when the MDPs being considered are small, i.e. have a small number of states. When we talk of structured representations, it boils down to doing case studies to determine the best model. In the paper the authors use decision trees as the structured representation, but anything that can serve as a function approximator can be used for the same purpose. Finding structure in data is what Machine Learning is best at; in this paper decision trees are presented as a function approximation scheme, and neural networks are another example of trying to find patterns in data.

The value iteration and policy iteration algorithms for MDPs work on extensional representations. The authors talk about using intensional representations, since the size of the problem description can then be reduced, and using them to find the value function. Section 5 mostly talks about how to operate on a particular syntactic representation of the MDP. Decision trees and neural nets come up in the discussion of MDPs because we want to use a compressed representation.

The concept of Reinforcement Learning is closely tied to MDPs. There are far more value functions than policies: in general, more than one value function maps to the same policy. So we converge faster if we use policy iteration. The value propagates, conceptually, by neighbourhood propagation. Henceforth, unless stated otherwise, the infinite horizon model of the MDP is assumed.

Convergence of Value Iteration
* An approximately optimal value function yields an arbitrarily good policy:
  ||V* - V'||max <= e  ==>  ||V* - Vpi'||max <= 2*e*gamma / (1 - gamma)
  where V* is the optimal value function, V' is the approximate value function, Vpi' is the value function of the policy that is greedy with respect to V', and gamma is the discount factor.
* The error ||V* - V'||max decreases by a factor of gamma on every iteration.
* Value iteration converges to the optimal policy in polynomial time (for a fixed discount factor).
* Relaxations (assignments to V) can be done
  * asynchronously
  * in parallel

As stated above, if V is the set of all value functions and P is the set of all policies, then the value functions fall into subsets, each of which maps to a single policy in P. Conversely, every policy in P maps to exactly one value function in V, namely the value function of that policy. Hence, in the bound above, for V' take the policy to which it maps and find Vpi'; the difference between Vpi' and V* is then bounded by 2*e*gamma/(1 - gamma).

All of these claims still hold when we use asynchronous updates. Synchronous updates mean that all the values of V1 are updated to V2 before the V3 calculation begins. Asynchronous updates are better because, conceptually, the wave of updates spreads faster; the wave does not wait for the V2 values that have not yet been updated in the current sweep. In this method the order of updating matters. Updating is also known as 'relaxation'. The order of updating does not matter for synchronous updates, because everybody has to finish iteration 1 before iteration 2 starts. In the asynchronous case, when we change the values, the assumption is that someone has come very close to their actual value, so their neighbours can start using the new values right away. This is similar to DVO (dynamic variable ordering) in CSPs. But this idea of convergence is good if we only want to get close to V*.
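To make the relaxation idea concrete, here is a minimal Python sketch of value iteration with in-place (asynchronous) updates. The two-state MDP, the rewards, and the tolerance are illustrative assumptions, not anything from the paper; the stopping rule is the standard one that guarantees the greedy policy is eps-optimal.

# A minimal sketch of value iteration with in-place (asynchronous) relaxation.
# The tiny two-state MDP, rewards, and tolerance below are purely illustrative.

GAMMA = 0.9   # discount factor (gamma in the bound above)

STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]

# R[s][a]: immediate reward; P[s][a]: list of (next_state, probability).
R = {"s0": {"a": 0.0, "b": 1.0}, "s1": {"a": 2.0, "b": 0.0}}
P = {
    "s0": {"a": [("s1", 1.0)], "b": [("s0", 0.5), ("s1", 0.5)]},
    "s1": {"a": [("s1", 1.0)], "b": [("s0", 1.0)]},
}

def q_value(V, s, a):
    """One-step lookahead: expected value of taking action a in state s."""
    return R[s][a] + GAMMA * sum(p * V[t] for t, p in P[s][a])

def value_iteration(eps=1e-4):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        # In-place sweep: states later in the sweep already see the updated
        # values of earlier states (asynchronous / Gauss-Seidel relaxation).
        for s in STATES:
            new_v = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        # Standard stopping rule: guarantees the greedy policy is eps-optimal.
        if delta < eps * (1 - GAMMA) / (2 * GAMMA):
            return V

V = value_iteration()
greedy_policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(V, greedy_policy)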
But it is not effective if we want to reach V* exactly. The same is the case in CSPs: if we want all the solutions of the CSP, then DVO does not matter.

Prioritized Sweeping
Prioritized sweeping means looking at the higher-valued (higher-priority) states and updating them first. If we keep looping, then in the limit all the states will be updated, provided they have non-zero probabilities. In the asynchronous approach to updating values with prioritized sweeping, how will the lower-priority values ever be reached? This has to be written into the algorithm: generally the higher-valued states are updated with high priority, but every once in a while the lower-valued states are updated, with a smaller probability. Sounds to me like doing a random restart in hill climbing just to escape a local minimum.

Normally policy updates are cheaper than optimal value updates. The value function and policy computations are in fact learning algorithms, and different learning algorithms differ in how eager they are, i.e. how much we want to learn in every iteration. These are all local relaxations. Rao calls it the 'Spreadsheet Metaphor': looking at your neighbour and changing your value, rather than looking at your neighbour to verify that your value is correct. For example, if cell B's value is -1 and its immediate neighbour's value is 1, then cell B will know that it cannot be -1, so it will update its value and move towards its actual value.

The word 'relaxation' is another word for 'consistency enforcement' in CSPs. If we don't have a terminal or sink state, then nobody has a real value; everybody just tries to be consistent with their neighbours, and it becomes local consistency enforcement. But unlike local search, we do CONVERGE here. For a CSP with N variables, after enforcing N-level consistency we have reached global consistency. The general idea, then, is that we are relaxing over the values of our neighbours.

How do we do the bookkeeping? The bookkeeping is made simpler because we expect that actions will not have non-zero transition probability to all states, i.e. the transitions are sparse.

Normally policy iteration decreases the number of iterations while increasing the cost per iteration. Dr. Rao: "Think of policy iteration as some sort of mutex propagation. If we do mutex propagation we can do faster search, but essentially mutexes are not required for search."

Extending Classical Planning as an MDP: Deterministic and Fully Observable
Once we have stochastic actions, essentially all states get non-zero probabilities. But when the actions are deterministic, the unreachable states have probability 0. Thus if the MDP is deterministic, the iteration can stop after |S| sweeps, where S is the set of states; if the MDP is non-deterministic, the complexity is polynomial in |S|. Reachability analysis helps in identifying the set of states that can actually be reached, but full reachability analysis is as hard as full planning. Graphplan, however, does approximate reachability. Hence one idea is: run Graphplan until the planning graph levels off, then check whether the states are good or bad; whatever states survive the mutexes are considered possibly reachable, and the rest are treated as unreachable. The domain can even be somewhat non-deterministic: in that case, take Graphplan and use its reachability analysis to give probabilistic reachability. This still gives a smaller MDP, but we can still end up in states where the policy won't know what to do!
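To make the reachability idea concrete, here is a rough sketch under assumed names (the action model `successors`, the reward function, and the tiny example domain are hypothetical, not from the paper): first compute the states forward-reachable from the initial state, then run value iteration only over that smaller set. This is just one way to exploit reachability, not the paper's algorithm.

# Sketch: approximate the set of reachable states from the initial state, then
# run value iteration only over that (smaller) set of states.

from collections import deque

GAMMA = 0.9

def reachable_states(s_init, actions, successors):
    """Forward reachability: every state reachable with non-zero probability."""
    seen, frontier = {s_init}, deque([s_init])
    while frontier:
        s = frontier.popleft()
        for a in actions:
            for t, p in successors(s, a):
                if p > 0 and t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen

def restricted_value_iteration(s_init, actions, successors, reward, sweeps=200):
    """Value iteration over only the reachable subset of the state space."""
    reach = reachable_states(s_init, actions, successors)
    V = {s: 0.0 for s in reach}
    for _ in range(sweeps):
        for s in reach:   # the reachable set is closed under successors
            V[s] = max(
                reward(s, a) + GAMMA * sum(p * V[t] for t, p in successors(s, a))
                for a in actions
            )
    return V

# Tiny illustrative model: state "s_far" is never reachable from "s0".
ACTIONS = ["a"]
def succ(s, a):
    return {"s0": [("s1", 1.0)], "s1": [("s1", 1.0)], "s_far": [("s_far", 1.0)]}[s]
def rew(s, a):
    return 1.0 if s == "s1" else 0.0

V = restricted_value_iteration("s0", ACTIONS, succ, rew)
print(sorted(V))   # only the reachable states 's0' and 's1' are in the value table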
Why Use MDPs for Classical Planning?
Generally one would not use MDPs for classical planning, since we only need linear plans. But there are reasons to consider it:
1. The mutex information in Graphplan is computed for all states, while it really only needs to be computed for the relevant states, so some of that work is local consistency enforcement. The work done by dynamic programming is the same kind of local consistency enforcement. Hence using MDP techniques on classical planning problems is reasonable.
2. A CSP can be made globally consistent, and then a consistent assignment can be found without backtracking. But this is wasteful if we just need 'an assignment'; the idea can be relaxed to enforcing only k-consistency. Similarly, we can take classical planning problems, put approximations into the MDP, and do DP.
3. The structured, i.e. intensional, representations possible in MDPs could be used to get a smaller problem representation.

Initial State: MDP algorithms do not take the initial state into account. But suppose we do have an initial state; how does that help? MDP solution methods are goal-directed. If we know the initial state, we can make the MDP forward-directed: from Sinit we build a tree that looks like a min-max tree, which the authors call an expectimax tree. The size of the tree depends on how far we look ahead, i.e. the horizon. Once the tree is built, the remaining work is to determine the value of Sinit, the root node, in a bottom-up fashion, and the action choice that maximises the value of Sinit is taken. At every state node, the maximum over the values of the successor action nodes is taken; at every action node, the expected value over its successor states is computed, as in the value iteration backup. The name for this kind of online algorithm is RTDP (Real Time Dynamic Programming).
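As a rough illustration of the expectimax evaluation described above (not the paper's exact construction), here is a small Python sketch: a depth-limited recursion that maximises over actions at state nodes and takes expectations over outcomes at action nodes. The action model, reward function, and horizon are hypothetical.

# Depth-limited expectimax evaluation rooted at the initial state.
# successors(s, a) -> list of (next_state, probability); reward(s, a) -> float.
# Both, and the horizon H, are illustrative assumptions.

GAMMA = 0.9

def expectimax_value(s, depth, actions, successors, reward):
    """Bottom-up value of state s looking `depth` steps ahead."""
    if depth <= 0:
        return 0.0                      # leaf: no lookahead left (a heuristic could go here)
    best = float("-inf")
    for a in actions:                   # state node: maximise over actions
        q = reward(s, a) + GAMMA * sum( # action node: expectation over outcomes
            p * expectimax_value(t, depth - 1, actions, successors, reward)
            for t, p in successors(s, a)
        )
        best = max(best, q)
    return best

def best_action(s_init, horizon, actions, successors, reward):
    """Pick the root action that maximises the expectimax value of s_init."""
    return max(
        actions,
        key=lambda a: reward(s_init, a) + GAMMA * sum(
            p * expectimax_value(t, horizon - 1, actions, successors, reward)
            for t, p in successors(s_init, a)
        ),
    )

# Tiny illustrative model: from s0, action 'risky' may reach a rewarding state.
ACTIONS = ["safe", "risky"]
def succ(s, a):
    if a == "safe":
        return [(s, 1.0)]
    return [("win", 0.5), ("lose", 0.5)] if s == "s0" else [(s, 1.0)]
def rew(s, a):
    return 10.0 if s == "win" else 0.0

print(best_action("s0", horizon=3, actions=ACTIONS, successors=succ, reward=rew))

In RTDP proper, the values computed along simulated trajectories would be cached in a value table and reused across decisions, rather than recomputed from scratch at every step as in this sketch.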