[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Thinking Cap] MDPs etc...



Consider the following:
 
0. Consider the  MDP applet demo we saw in the class today. The default discount factor was set to 0.9, and the reward states were made "absorbing". What do you think will happen if you make the discount factor to be 1.  Confirm your intuition with the applet. Now, do this part again but without any absorbing states. Again, try to guess what will happen, and confirm your intuition.

1. What happens to MDPs if the underlying dynamics are "deterministic"? Can we still talk about value and policy iteration? Do we still have non-stationary policies for finite horizon deterministic MDPs?
 
 
2. We talked about how infinite horizon case is easier than finite horizon case, and said, in passing, that here is one case where "infinite" is easier to handle. Consider the following two cases:
 
2.1. CSPs where the variables are "real valued", and constraints between variables are expressed as linear inequalities. Clearly, the number of potential assignments for the CSP variables is infinite. What do you think will be the complexity of finding a satisfying assignment of the variables? (recall that discrete variable CSP is NP-complete) If the complexity is "low" by AI standards, what can you do to increase it? (hint: consider modifying constraints).

2.2. Planning problem where (some of) the variables are real valued. Actions have effects that can increment and decrement the variables by arbitrary amounts. What do you think will be the complexity of planning in this case? (recall that discrete variable planning is PSPACE-complete, i.e., it is among the hardest problems that can be solved with polynomial space).


3. The MDPs we considered until now are called "Fully observable"--in that during execution, the agent can tell which state it is in (thus the policy needs to only map states to actions).  What happens if the domain is only "partially observable".
Note that this means that the agent may not know which unique state it is in, but knows the  "probability distribution" over the possible states it could be in. When the agent does an action, it effect of the action is to modify the distribution into a new distribution over states (with some states being more likely and others less.

Notice that the "distribution" is fully observable although the underlying states aren't.
So, one idea will be to consider the distributions as the states. What happens to the complexity of MDPs? How is this connected to MDPs over "continuous" state spaces?

[Notice that the above still works for the special case fully observable domain. The initial distribution will be a delta function--with probablity being 1.0 for a single state, and 0.0 for all others. Doing an action converts into another delta function--with the 1.0 for a different state].

That is all for now...


Rao