Planning Class Notes 04/12/00
Notes taken by XuanLong Nguyen (xuanlong@asu.edu)

In this class we mainly discuss the JAIR paper by Boutilier, Dean and Hanks, "Decision-Theoretic Planning: Structural Assumptions and Computational Leverage". Today's class covers the material up to section 2.

We begin with Eric's question about the structure of a policy. In general, a policy is a mapping from the set of *observable histories* to the set of actions. Roughly speaking, an optimal policy is one whose expected value function is optimized (maximized or minimized). Given certain assumptions about observability, the Markov assumption, and further assumptions such as time-separability and additivity of the value function, the structure of the optimal policy of an MDP can be made more compact.

For example, for a fully observable MDP (FOMDP) with a time-separable value function, a policy is a mapping from the Cartesian product of the set of *states* S and the set of stages T to actions, P: S x T -> A [since full observability means that the observable histories coincide with the state histories, and the Markov assumption means that the policy depends only on the current state and current stage, not on the whole state history]. In addition, if the policy is executed over an infinite horizon (as opposed to a finite horizon) with a discounted value function, the optimal policy was shown (by Howard, 1960) to have the form P: S -> A, i.e., the policy is stationary, independent of the current stage t.

For an NOMDP (non-observable MDP), the policy becomes P: T -> A, since the agent no longer has any knowledge of the world; the policy is just a sequence of actions. It is probably this characteristic of NOMDPs that leads the authors to put classical AI planning under the NOMDP brand.

For a POMDP, the policy structure is more complicated, both in representational complexity and in computation. A POMDP can be viewed as a FOMDP whose state space is the set of belief states, each of which is a probability distribution over the set of states S = {S1,...,Sn}; this state space is continuous (and therefore infinite). Under the Markov assumption and a time-separable, discounted value function, an optimal policy for a POMDP is a mapping from the Cartesian product of the set of belief states and the set of stages to the set of actions. The problem with this representation is that the set of belief states is infinite and continuous. For this reason, exact algorithms exploit a finite representation of the value function due to Sondik, who showed that the optimal value function V(belief_state, t) that results when the next t stages are considered (the so-called optimal t-horizon value function) is piecewise linear and convex, and can be represented by a *finite* set of real vectors. For an infinite-horizon problem, the optimal value function can be *approximately* represented by taking t sufficiently large. Once the value function has been approximated, we can find, for any given belief state, an action that optimizes it.
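To make the belief-state view and Sondik's vector representation concrete, here is a minimal sketch in Python. The tiny two-state/two-action/two-observation model, its numbers, and the two example vectors are invented purely for illustration; only the Bayesian belief update and the form V(b) = max over vectors of (alpha . b) follow the discussion above.

import numpy as np

# Tiny 2-state / 2-action / 2-observation POMDP, invented for illustration.
# T[a][s][s'] = P(s' | s, a); O[a][s'][o] = P(o | s', a).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.4, 0.6]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.6, 0.4],
               [0.1, 0.9]]])

def belief_update(b, a, o):
    """Bayes update of the belief state after doing action a and observing o."""
    predicted = b @ T[a]                      # P(s' | b, a)
    unnormalized = predicted * O[a][:, o]     # P(s', o | b, a)
    return unnormalized / unnormalized.sum()  # new belief P(s' | b, a, o)

# Sondik: the optimal t-horizon value function is piecewise linear and convex,
# i.e. the upper surface of a finite set of linear functions (one per vector).
# These two vectors are made-up placeholders, not the output of any algorithm.
alpha_vectors = np.array([[1.0, 0.0],
                          [0.2, 0.9]])

def value(b):
    """V(b) = max over the finite set of vectors of (alpha . b)."""
    return np.max(alpha_vectors @ b)

b0 = np.array([0.5, 0.5])          # initial belief: uniform over the two states
b1 = belief_update(b0, a=0, o=1)
print(b1, value(b1))

The point of the sketch is only that the belief state, not the hidden state, is what the policy conditions on, and that the value function over the continuous belief simplex can still be stored as a finite set of vectors.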
[Further added by Long: In practice, the above approximation method is limited to very small POMDP problems, because the set of vectors representing the value function can grow exponentially. A current approach for *goal-directed* POMDPs is to use real-time dynamic programming to avoid updating the value function for all belief states, concentrating on the more relevant ones (pursued by Koenig and Geffner). Thus we hope to obtain the optimal value function for only a subset of belief states, and in effect to specify the policy only for that subset of "relevant" belief states. This partial policy is enough to ensure that we can reach the goal belief state from the initial belief state. A very different approach in the literature is to approximate the POMDP's optimal policy by a finite-state controller, and in effect to search for an approximately optimal policy in the set of finite-state automata (the Brown group, E. Hansen, etc.). In this case the policy is a mapping from observation histories to actions; the observation histories are read off the arcs of the policy graph. Note that an optimal policy in the form of a finite controller may not exist: in the worst case, finding an optimal policy for a POMDP that maximizes the expected total reward over a finite horizon is exponentially hard in both the number of states and the number of stages, and the same problem over an infinite horizon was recently shown to be undecidable (AAAI-99 paper by S. Hanks and his student).]

Now, after this digression into the complicated details, Rao gets back to the more basic material presented in section 2.

*Normative model*: the MDP is called a normative model for planning under uncertainty and incomplete information. As opposed to the layer-by-layer approach, which extends classical planning to deal with uncertainty or stochastic actions, the MDP is a very general framework that nicely puts together a variety of concepts (see the paper):
- it incorporates exogenous events into the stochastic effects of actions
- it defines the policy as the output of the MDP, and the criteria for an optimal policy
- it distinguishes system histories vs. observational histories

The nice thing about the normative model is that we have a clear, general definition of the problem, which includes a model of the world (i.e., the stochastic effects of actions), a model of the agent's knowledge (captured in the observational histories), and a well-defined objective (what is the structure of the policy you want to find, what is the criterion for optimality, etc.). When we specify these characteristics further, we get a subclass of the problem. For example, classical planning can also be modeled within the MDP framework. In this case the effects of actions are made deterministic (probability 1), and the observational histories become identical to the system histories because of the determinism and complete information. We use a discounted value function over an infinite horizon. To capture the satisficing goal of planning, the reward for every state other than the goal state is 0, and the goal state is made a sink state; the objective is to find a policy that optimizes (maximizes or minimizes) the value function of the initial state (a small sketch of this encoding appears below). [In fact, the MDP framework also allows us to find a policy that implicitly represents plans for every single state.]

While the MDP framework is very expressive in handling stochastic dynamics and uncertainty, as well as in specifying a variety of objective functions and optimization criteria [although it is not clear how it deals with actions that have durations and temporal constraints], the main problem with MDPs is that much of the work on them in OR has dealt with an explicit representation of the world states. This approach is prohibitively expensive, especially from the AI point of view. In AI, a state is represented by a set of features.
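Here is the small sketch of the classical-planning-as-MDP encoding promised above: deterministic actions, zero reward everywhere except on entering the absorbing goal state, and a discounted infinite horizon. The four-state corridor, the action names, and the discount factor are invented purely for illustration.

import numpy as np

# Classical planning encoded as a discounted FOMDP, as described above.
states = [0, 1, 2, 3]          # state 3 is the goal, state 0 the initial state
actions = ["left", "right"]
GOAL, GAMMA = 3, 0.9

def step(s, a):
    """Deterministic transitions: each (s, a) has exactly one successor (prob. 1)."""
    if s == GOAL:                                    # sink: the goal absorbs everything
        return GOAL
    return max(s - 1, 0) if a == "left" else min(s + 1, GOAL)

def reward(s, a, s_next):
    """Reward 0 everywhere, except for the transition that enters the goal."""
    return 1.0 if s != GOAL and s_next == GOAL else 0.0

# Value iteration over the explicitly enumerated state space.
V = np.zeros(len(states))
for _ in range(100):
    V = np.array([max(reward(s, a, step(s, a)) + GAMMA * V[step(s, a)]
                      for a in actions)
                  for s in states])

# The greedy stationary policy P: S -> A; followed from the initial state, it is
# exactly the sequential plan a classical planner would return.
policy = {s: max(actions,
                 key=lambda a: reward(s, a, step(s, a)) + GAMMA * V[step(s, a)])
          for s in states if s != GOAL}
print(V, policy)

Note that the state space is enumerated explicitly here; this is exactly the representational choice discussed next, which blows up (n = 2^f) when each state is an assignment to f features.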
Even though the FOMDP problem is P-complete, it is P-complete in the number of states, which is exponential in the number of features representing a state. Thus, if states are feature-based, not only is it computationally intractable to find a desirable policy, it is also intractable just to represent the FOMDP problem: representing the stochastic dynamics of the world takes O(n^2) ~ O(2^(2f)) space, where n is the number of states and f is the number of features, and the policy is likewise exponential in f. Current approaches in AI that employ the MDP framework therefore use factored representations. A huge amount of work has gone into cutting down the state space through abstraction and aggregation - which is basically what OR people do manually, wiping out irrelevant features before plugging the problem into their MDP framework - and into doing value iteration, policy iteration, and so on within this factored MDP framework.

One point that keeps bugging Rao is why the authors categorize classical planning as NOMDP (non-observable) rather than as FOMDP. The main rationale for this categorization is that an NOMDP typically returns a sequential plan, while a FOMDP returns a contingent plan (a policy) based on observation histories. Classical planning could also be cast as a FOMDP, because given a plan, the executing agent knows exactly what the world state is after each plan step (the actions are deterministic and the initial state is known).

A related question: which is more difficult in terms of complexity, FOMDP or NOMDP? The first intuition is that NOMDP should be harder than FOMDP (just as life is harder for a blind person than for a sighted one). However, this may not be the case: which is harder depends on the objective you set for yourself, and if you are blind you expect less of yourself than if you are not. In an NOMDP the objective is to find a sequential plan, while in a FOMDP the objective is to find a full policy. The solution space for NOMDP is thus a whole level smaller than that of FOMDP (action sequences versus mappings from states and stages to actions), so it can be argued that NOMDP problems are actually easier than FOMDP problems.

-- XuanLong Nguyen, CSE, Arizona State Univ.