Planning Class Notes 04/12/00
Notes taken by XuanLong Nguyen (xuanlong@asu.edu)

In this class we mainly discuss the JAIR paper by Boutilier, Dean and Hanks, "Decision-Theoretic Planning: Structural Assumptions and Computational Leverage". Today's class covers the material up to section 2.

We begin with Eric's question about the structure of a policy. In general, a policy is a mapping from the set of *observable histories* to the set of actions. Roughly speaking, an optimal policy is one whose expected value function is optimized (maximized or minimized). Given certain assumptions about observability, the Markov assumption, and further assumptions such as time-separability and additivity of the value function, the structure of the optimal policy of an MDP can be made more compact.

For example, for a fully observable MDP (FOMDP) with a time-separable value function, a policy is a mapping from the Cartesian product of the set of *states* S and the set of stages T to actions, P: S x T -> A [since full observability means that the observable histories coincide with the state histories, and the Markov assumption means that the policy depends only on the current state and current stage, not on the whole state history]. In addition, if the policy is executed over an infinite horizon (as opposed to a finite horizon) with a discounted value function, the optimal policy was shown (by Howard, 1960) to have the form P: S -> A, i.e., the policy is stationary, independent of the current stage t.

For an NOMDP (non-observable MDP), the policy becomes P: T -> A, since the agent no longer has any knowledge of the world; the policy is just a sequence of actions. It is probably this characteristic of NOMDPs that leads the authors to put classical AI planning under the NOMDP brand.

For a POMDP, the policy structure is more complicated, both in representational complexity and in computation. A POMDP can be viewed as a FOMDP whose state space is the set of belief states, each of which is a probability distribution over the set of states S = {S1,...,Sn}; this state space is continuous (and therefore infinite). Under the Markov assumption and a time-separable, discounted value function, an optimal policy for a POMDP is a mapping from the Cartesian product of the set of belief states and the set of stages to the set of actions. The problem with this representation is that the set of belief states is infinite and continuous. For this reason, exact algorithms exploit a finite representation of the value function due to Sondik, who showed that the optimal value function V(belief_state, t) that results when the next t stages are considered (the so-called optimal t-horizon value function) is piecewise linear and convex, and can be represented by a *finite* set of real vectors. For an infinite-horizon problem, the optimal value function can be *approximately* represented by taking t sufficiently large. Once the value function has been approximated, we can find, for any given belief state, an action that optimizes it.
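To make the belief-state view and Sondik's vector representation concrete, here is a minimal sketch in Python. The tiny two-state/two-action/two-observation model, its numbers, and the two example vectors are invented purely for illustration; only the Bayesian belief update and the form V(b) = max over vectors of (alpha . b) follow the discussion above.

import numpy as np

# Tiny 2-state / 2-action / 2-observation POMDP, invented for illustration.
# T[a][s][s'] = P(s' | s, a); O[a][s'][o] = P(o | s', a).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.4, 0.6]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.6, 0.4],
               [0.1, 0.9]]])

def belief_update(b, a, o):
    """Bayes update of the belief state after doing action a and observing o."""
    predicted = b @ T[a]                      # P(s' | b, a)
    unnormalized = predicted * O[a][:, o]     # P(s', o | b, a)
    return unnormalized / unnormalized.sum()  # new belief P(s' | b, a, o)

# Sondik: the optimal t-horizon value function is piecewise linear and convex,
# i.e. the upper surface of a finite set of linear functions (one per vector).
# These two vectors are made-up placeholders, not the output of any algorithm.
alpha_vectors = np.array([[1.0, 0.0],
                          [0.2, 0.9]])

def value(b):
    """V(b) = max over the finite set of vectors of (alpha . b)."""
    return np.max(alpha_vectors @ b)

b0 = np.array([0.5, 0.5])          # initial belief: uniform over the two states
b1 = belief_update(b0, a=0, o=1)
print(b1, value(b1))

The point of the sketch is only that the belief state, not the hidden state, is what the policy conditions on, and that the value function over the continuous belief simplex can still be stored as a finite set of vectors.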
[Further added by Long: In practice, the above approximation method is limited to very small POMDP problems, because the set of vectors representing the value function can grow exponentially. A current approach for *goal-directed* POMDPs is to use real-time dynamic programming to avoid updating the value function for all belief states, concentrating on the more relevant ones (pursued by Koenig and Geffner). Thus we hope to obtain the optimal value function for only a subset of belief states, and in effect to specify the policy only for that subset of "relevant" belief states. This partial policy is enough to ensure that we can reach the goal belief state from the initial belief state. A very different approach in the literature is to approximate the POMDP's optimal policy by a finite-state controller, and in effect to search for an approximately optimal policy in the set of finite-state automata (the Brown group, E. Hansen, etc.). In this case the policy is a mapping from observation histories to actions; the observation histories are read off the arcs of the policy graph. Note that an optimal policy in the form of a finite controller may not exist: in the worst case, finding an optimal policy for a POMDP that maximizes the expected total reward over a finite horizon is exponentially hard in both the number of states and the number of stages, and the same problem over an infinite horizon was recently shown to be undecidable (AAAI-99 paper by S. Hanks and his student).]

Now, after this digression into the complicated details, Rao gets back to the more basic material presented in section 2.

*Normative model*: the MDP is called a normative model for planning under uncertainty and incomplete information. As opposed to the layer-by-layer approach, which extends classical planning to deal with uncertainty or stochastic actions, the MDP is a very general framework that nicely puts together a variety of concepts (see the paper):
- it incorporates exogenous events into the stochastic effects of actions
- it defines the policy as the output of the MDP, and the criteria for an optimal policy
- it distinguishes system histories vs. observational histories

The nice thing about the normative model is that we have a clear, general definition of the problem, which includes a model of the world (i.e., the stochastic effects of actions), a model of the agent's knowledge (captured in the observational histories), and a well-defined objective (what is the structure of the policy you want to find, what is the criterion for optimality, etc.). When we specify these characteristics further, we get a subclass of the problem. For example, classical planning can also be modeled within the MDP framework. In this case the effects of actions are made deterministic (probability 1), and the observational histories become identical to the system histories because of the determinism and complete information. We use a discounted value function over an infinite horizon. To capture the satisficing goal of planning, the reward for every state other than the goal state is 0, and the goal state is made a sink state; the objective is to find a policy that optimizes (maximizes or minimizes) the value function of the initial state (a small sketch of this encoding appears below). [In fact, the MDP framework also allows us to find a policy that implicitly represents plans for every single state.]

While the MDP framework is very expressive in handling stochastic dynamics and uncertainty, as well as in specifying a variety of objective functions and optimization criteria [although it is not clear how it deals with actions that have durations and temporal constraints], the main problem with MDPs is that much of the work on them in OR has dealt with an explicit representation of the world states. This approach is prohibitively expensive, especially from the AI point of view. In AI, a state is represented by a set of features.
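Here is the small sketch of the classical-planning-as-MDP encoding promised above: deterministic actions, zero reward everywhere except on entering the absorbing goal state, and a discounted infinite horizon. The four-state corridor, the action names, and the discount factor are invented purely for illustration.

import numpy as np

# Classical planning encoded as a discounted FOMDP, as described above.
states = [0, 1, 2, 3]          # state 3 is the goal, state 0 the initial state
actions = ["left", "right"]
GOAL, GAMMA = 3, 0.9

def step(s, a):
    """Deterministic transitions: each (s, a) has exactly one successor (prob. 1)."""
    if s == GOAL:                                    # sink: the goal absorbs everything
        return GOAL
    return max(s - 1, 0) if a == "left" else min(s + 1, GOAL)

def reward(s, a, s_next):
    """Reward 0 everywhere, except for the transition that enters the goal."""
    return 1.0 if s != GOAL and s_next == GOAL else 0.0

# Value iteration over the explicitly enumerated state space.
V = np.zeros(len(states))
for _ in range(100):
    V = np.array([max(reward(s, a, step(s, a)) + GAMMA * V[step(s, a)]
                      for a in actions)
                  for s in states])

# The greedy stationary policy P: S -> A; followed from the initial state, it is
# exactly the sequential plan a classical planner would return.
policy = {s: max(actions,
                 key=lambda a: reward(s, a, step(s, a)) + GAMMA * V[step(s, a)])
          for s in states if s != GOAL}
print(V, policy)

Note that the state space is enumerated explicitly here; this is exactly the representational choice discussed next, which blows up (n = 2^f) when each state is an assignment to f features.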
Even though the FOMDP problem is P-complete, it is P-complete in the number of states, which is exponential in the number of features representing a state. Thus, if states are feature-based, not only is it computationally intractable to find a desirable policy, it is also intractable just to represent the FOMDP problem: representing the stochastic dynamics of the world takes O(n^2) ~ O(2^(2f)) space, where n is the number of states and f is the number of features, and the policy is likewise exponential in f. Current approaches in AI that employ the MDP framework therefore use factored representations. A huge amount of work has gone into cutting down the state space through abstraction and aggregation - which is basically what OR people do manually, wiping out irrelevant features before plugging the problem into their MDP framework - and into doing value iteration, policy iteration, and so on within this factored MDP framework.

One point that keeps bugging Rao is why the authors categorize classical planning as NOMDP (non-observable) rather than as FOMDP. The main rationale for this categorization is that an NOMDP typically returns a sequential plan, while a FOMDP returns a contingent plan (a policy) based on observation histories. Classical planning could also be cast as a FOMDP, because given a plan, the executing agent knows exactly what the world state is after each plan step (the actions are deterministic and the initial state is known).

A related question: which is more difficult in terms of complexity, FOMDP or NOMDP? The first intuition is that NOMDP should be harder than FOMDP (just as life is harder for a blind person than for a sighted one). However, this may not be the case: which is harder depends on the objective you set for yourself, and if you are blind you expect less of yourself than if you are not. In an NOMDP the objective is to find a sequential plan, while in a FOMDP the objective is to find a full policy. The solution space for NOMDP is thus a whole level smaller than that of FOMDP (action sequences versus mappings from states and stages to actions), so it can be argued that NOMDP problems are actually easier than FOMDP problems.

-- XuanLong Nguyen, CSE, Arizona State Univ.