
CSE571: Comments on MDP homework



Hi all:
Here is some grading information for homework 1, followed by some comments that I have.
Please let me know if you have any questions regarding your grade (I will be in my lab, 557BA, tomorrow for office hours from 1-2pm).
Thanks a lot for your efforts, and I hope you've learned a lot through this homework.
Tuan

===
Statistics
Max: 195 (this is also the maximum you can get)
Min: 33
Median: 136
Average: 127
Stdev: 42.7

Points for questions

Part 1:
1) Perform VI for K=2 steps on the 4x3 environment: 2 * 10pts
Note that you need to show me that you actually did it yourself to get full credit.
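
If you want to double-check your hand computation, below is a minimal sketch of K synchronous Bellman backups in Python. The representation is purely illustrative, not a required format: a list of states, an actions(s) function that returns an empty list for terminal states, a transition function T(s, a) returning (probability, next_state) pairs, and a state reward R(s).

def value_iteration_k_steps(states, actions, T, R, gamma, K):
    """Run K synchronous Bellman backups starting from U_0 = 0."""
    U = {s: 0.0 for s in states}
    for _ in range(K):
        U_next = {}
        for s in states:
            if not actions(s):              # terminal state: utility is its reward
                U_next[s] = R(s)
            else:
                # U_{k+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_k(s')
                U_next[s] = R(s) + gamma * max(
                    sum(p * U[s2] for (p, s2) in T(s, a)) for a in actions(s)
                )
        U = U_next
    return U

With U_0 = 0, the first backup simply gives U_1(s) = R(s), and the second gives U_2(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s'|s,a) U_1(s'); those are the two steps you were asked to show.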

2)

17.4: 3 * 10pts (the first is for correctly showing the Bellman equations for the two cases, and the other two are for showing the transformation).

I saw different high-level ideas for the last two parts, and gave partial credit depending on how close I think your idea is to the solution.

17.8: 4 * 10pts (10pts for each reward value)

Similar to Part 1.1, you need to show me some of the computation steps, not necessarily all of them (so that I know you understand the equation), not simply the final numbers.

17.9: 2 * 10pts (the first is for correctly deriving the utilities as functions of \gamma, and the second is for computing the value of \gamma at which the optimal action "switches" between Up and Down).

17.10: 5pts + 2 * 10pts (5pts for your qualitative reasoning about the solution policy, and 10pts each for performing policy iteration from 2 different initial policies).
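
For 17.10, the improvement half of policy iteration can be sketched as follows (same illustrative representation and names as the value-iteration sketch above, not a required format); the evaluation half is discussed in the comments on 17.10 below.

def policy_improvement(states, actions, T, U, pi):
    """Return (new_pi, changed): the greedy policy w.r.t. U and whether it differs from pi."""
    new_pi, changed = {}, False
    for s in states:
        if not actions(s):                  # terminal states have no action
            continue
        # Greedy action: argmax_a sum_{s'} P(s'|s,a) U(s')   (R(s) does not depend on a)
        best = max(actions(s), key=lambda a: sum(p * U[s2] for (p, s2) in T(s, a)))
        new_pi[s] = best
        if pi.get(s) != best:
            changed = True
    return new_pi, changed

Policy iteration alternates evaluating the current policy with this improvement step, starting from your chosen initial policy, and stops as soon as the improvement step changes nothing.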

Part 2:

- 20pts: For your coding effort, i.e., the code itself.

- 10pts: You need to show that you tested your code on 17.8 above.

- 2 * 15pts: One each for the experimental results and the observations in the 4x3 environment. Note that you must have some quantitative results (preferably in graphs, but I also gave partial credit for plain numbers). For comparing the three algorithms VI, PI and modified PI, it is not enough to just mention their complexity.
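
For the quantitative part, one simple option (a sketch using the same illustrative representation as above, not a required design) is to record the max-norm change per sweep and the number of sweeps until convergence for each algorithm, and then plot or tabulate those numbers.

def value_iteration_with_history(states, actions, T, R, gamma, eps=1e-6):
    """Run VI to convergence; return the utilities and the max-norm change per sweep."""
    U = {s: 0.0 for s in states}
    deltas = []                             # one entry per sweep; plot sweep index vs. delta
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            if not actions(s):              # terminal state
                U_next[s] = R(s)
            else:
                U_next[s] = R(s) + gamma * max(
                    sum(p * U[s2] for (p, s2) in T(s, a)) for a in actions(s)
                )
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        deltas.append(delta)
        if delta < eps:
            return U, deltas                # len(deltas) = number of sweeps

The analogous counts for PI and modified PI (number of improvement steps, number of evaluation sweeps or linear solves) give you the kind of numbers this comparison asks for.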

Some comments

17.4:
- Some students did not write the Bellman equation correctly for MDPs with rewards of the form R(s,a) and R(s,a,s').

For instance, "U(s) = R(s,a) + max {...}" or "U(s) = R(s,a,s') + max { ... }" ==> These are clearly wrong, since it is not clear which "a" and "s'" are being used here.

Another mistake:

U(s) = \max_{a} \max_{s'} R(s,a,s') + \gamma \max_{a} \sum_{s'} P(s'|s,a) U(s') ==> the "argmax" for the two different max operators can be different!
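
For reference, the standard forms keep a single \max_{a} governing both the immediate reward and the expected future utility:

U(s) = \max_{a} [ R(s,a) + \gamma \sum_{s'} P(s'|s,a) U(s') ]

U(s) = \max_{a} \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma U(s') ]

In the R(s,a,s') case the reward has to stay inside the expectation over s', so it cannot be pulled out into a separate max.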

17.8:
- Some students did not write the Bellman equation correctly here either.

17.10:
- Some students did not do the policy evaluation correctly. Note that you don't need to iterate here: the simple idea is that, given a policy, the corresponding utilities of the states must satisfy a set of linear equations (quite similar to the Bellman equations but without the max, since you know the action taken at each state).
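
As a concrete illustration (a minimal sketch, assuming the same illustrative MDP representation as in the earlier sketches, with the policy pi given as a dict from non-terminal states to actions), the evaluation step is a single linear solve:

import numpy as np

def policy_evaluation(states, pi, T, R, gamma):
    """Solve U(s) = R(s) + gamma * sum_{s'} P(s'|s,pi(s)) U(s') exactly, without iterating."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R(s) for s in states], dtype=float)
    for s in states:
        a = pi.get(s)                        # terminal states are not in pi: their row stays U(s) = R(s)
        if a is None:
            continue
        for p, s2 in T(s, a):
            A[idx[s], idx[s2]] -= gamma * p  # row for s: U(s) - gamma * sum_{s'} P(s'|s,a) U(s') = R(s)
    U = np.linalg.solve(A, b)
    return {s: U[idx[s]] for s in states}

Each non-terminal state contributes one equation of the form U(s) - \gamma \sum_{s'} P(s'|s,\pi(s)) U(s') = R(s), and each terminal state contributes U(s) = R(s).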