
Last chance to don the [Thinking Cap] on MDPs & learning..



****************MDPs

1. What happens to MDPs if the underlying dynamics are "deterministic"? Can we still talk about value and policy iteration? Do we still have non-stationary policies for finite horizon deterministic MDPs?
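
To ground this, here is a minimal value-iteration sketch over a tiny made-up deterministic MDP (the states, actions, and rewards below are hypothetical, purely for illustration). Notice how, with deterministic dynamics, the expectation over next states collapses to a single successor:

# Deterministic MDP: V(s) = max_a [ R(s,a) + gamma * V(T(s,a)) ]
gamma = 0.9
T = {'s0': {'a': 's1', 'b': 's0'},      # T[s][a] -> unique next state
     's1': {'a': 's1', 'b': 's0'}}
R = {'s0': {'a': 0.0, 'b': 1.0},        # R[s][a] -> immediate reward
     's1': {'a': 2.0, 'b': 0.0}}

V = {s: 0.0 for s in T}
for _ in range(100):                    # iterate until (approximately) converged
    V = {s: max(R[s][a] + gamma * V[T[s][a]] for a in T[s]) for s in T}

pi = {s: max(T[s], key=lambda a: R[s][a] + gamma * V[T[s][a]]) for s in T}
print(V, pi)
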
2. We talked about how the infinite horizon case is easier than the finite horizon case, and said, in passing, that this is one case where "infinite" is easier to handle. Consider the following two cases:

2.1. CSPs where the variables are "real valued", and constraints between variables are expressed as linear inequalities. Clearly, the number of potential assignments for the CSP variables is infinite. What do you think will be the complexity of finding a satisfying assignment of the variables? (recall that discrete variable CSP is NP-complete) If the complexity is "low" by AI standards, what can you do to increase it? (hint: consider modifying constraints).
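
As a concrete handle on 2.1, here is a minimal feasibility check via linear programming (a sketch assuming scipy is available; the particular variables and constraints are made up):

from scipy.optimize import linprog

# Feasibility of linear inequalities over real-valued x, y (both >= 0):
#   x + y <= 4   and   x - y <= 1
# The objective is irrelevant for pure feasibility, so we minimize 0.
c = [0, 0]
A_ub = [[1, 1], [1, -1]]
b_ub = [4, 1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("feasible:", res.success, "witness:", res.x)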

2.2. Planning problem where (some of) the variables are real valued. Actions have effects that can increment and decrement the variables by arbitrary amounts. What do you think will be the complexity of planning in this case? (recall that discrete variable planning is PSPACE-complete, i.e., it is among the hardest problems that can be solved with polynomial space).
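
To be concrete about the setup in 2.2, a state and a numeric action effect might look like this (the fluent and action names are hypothetical):

# A real-valued fluent, and an action effect that increments/decrements
# it by an arbitrary amount (contrast with finite-domain fluents).
state = {'fuel': 7.5}

def apply_effect(state, var, delta):
    new_state = dict(state)
    new_state[var] += delta          # delta can be any real number
    return new_state

print(apply_effect(state, 'fuel', -2.5))   # e.g., a 'fly' action burning fuel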


3. The MDPs we considered until now are called "fully observable"--in that during execution, the agent can tell which state it is in (thus the policy needs to only map states to actions). What happens if the domain is only "partially observable"?
Note that this means that the agent may not know which unique state it is in, but knows the "probability distribution" over the possible states it could be in. When the agent does an action, the effect of the action is to modify the distribution into a new distribution over states (with some states becoming more likely and others less so).

Notice that the "distribution" is fully observable although the underlying states aren't.
So, one idea would be to consider the distributions themselves as the states. What happens to the complexity of MDPs? How is this connected to MDPs over "continuous" state spaces?

[Notice that the above still works for the special case of a fully observable domain. The initial distribution will be a delta function--with probability 1.0 for a single state, and 0.0 for all others. Doing an action converts it into another delta function--with the 1.0 on a different state].
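
Here is a minimal sketch of the "distributions as states" idea--a belief update over a made-up two-state domain (all the numbers are hypothetical). After doing action a and getting observation o, the new belief is b'(s') ~ O(o|s',a) * sum_s T(s'|s,a) * b(s), renormalized:

S = ['s0', 's1']
T = {('s0', 'a'): {'s0': 0.2, 's1': 0.8},   # T[(s, action)][s'] = P(s'|s,a)
     ('s1', 'a'): {'s0': 0.5, 's1': 0.5}}
O = {('s0', 'a'): 0.9, ('s1', 'a'): 0.3}    # P(o | s', a), for one fixed o

def belief_update(b, a, obs_prob):
    pred = {s2: sum(b[s] * T[(s, a)][s2] for s in S) for s2 in S}
    unnorm = {s2: obs_prob[(s2, a)] * pred[s2] for s2 in S}
    z = sum(unnorm.values())
    return {s2: p / z for s2, p in unnorm.items()}

b0 = {'s0': 1.0, 's1': 0.0}   # a delta-function belief (fully observed start)
print(belief_update(b0, 'a', O))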



*****************Learning


Qn 0. [George Costanza qn] Consider two learners that are trying to solve the same classification problem with two classes (+ and -). L1 seems to be averaging about 50% accuracy on the test cases while L2 seems to be averaging 25% accuracy. Which learner is good? Why is this called the George Costanza question? ;-)
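
A toy numeric check, with synthetic labels purely for illustration:

import random

random.seed(0)
truth = [random.choice(['+', '-']) for _ in range(1000)]
flip = {'+': '-', '-': '+'}
# A learner that is reliably wrong ~75% of the time (accuracy ~25%)
L2_preds = [t if random.random() < 0.25 else flip[t] for t in truth]

acc = lambda preds: sum(p == t for p, t in zip(preds, truth)) / len(truth)
print("L2 accuracy:        ", acc(L2_preds))
print("negated L2 accuracy:", acc([flip[p] for p in L2_preds]))
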
Qn 1. Consider a scenario where the training set examples have been labeled by a slightly drunk teacher--and thus they sometimes have wrong labels (e.g., +ve examples wrongly labeled negative, etc.). Of course, for the learning to be doable, the percentage of these mislabeled instances should be quite small. We have two learners, L1 and L2. L1 seems to be 100% correct on the *training* examples. L2 seems to be 90% correct on the training examples. Which learner is likely to do well on test cases?
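
If you want to play with this, here is a small sketch (assuming scikit-learn and numpy are available; the data is synthetic) comparing a learner that fits the noisy training labels perfectly against one that tolerates some training error:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # the true concept
y_noisy = y.copy()
flip = rng.random(400) < 0.10                 # drunk teacher: ~10% flipped labels
y_noisy[flip] = 1 - y_noisy[flip]

X_test = rng.uniform(-1, 1, size=(2000, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

L1 = DecisionTreeClassifier().fit(X, y_noisy)             # 100% on training
L2 = DecisionTreeClassifier(max_depth=3).fit(X, y_noisy)  # ~90% on training
print("L1 train:", L1.score(X, y_noisy), "test:", L1.score(X_test, y_test))
print("L2 train:", L2.score(X, y_noisy), "test:", L2.score(X_test, y_test))
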
Qn 2. Compression involves using the pattern in the data to reduce the storage requirements of the data. One way of doing this would be to find the rule underlying the data, and keep the rule and throw the data out. Viewed this way, compression and learning seem one and the same. After all, learning too seems to take the training examples, find a hypothesis ("pattern"/"rule") consistent with the examples, and use that hypothesis instead of the training examples. What, if any, differences do you see between Compression and Learning?
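
One standard way to make the comparison precise is the minimum description length (MDL) view, which scores a hypothesis h on data D by the total number of bits

    h* = argmin_h [ L(h) + L(D | h) ]

i.e., the bits needed to describe the hypothesis plus the bits needed to describe the data given the hypothesis (the "exceptions"). Learning-as-compression picks the h minimizing this sum; whether that sum is all there is to learning is, of course, the question.
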
Qn 3. We said that most human learning happens in the context of prior knowledge. Can we view prior knowledge as a form of bias?
In particular, can you say that our prior knowledge helps us focus on certain hypotheses as against other ones in explaining the data?



that is all for now.
Rao