
CSE571: Comments on MDP homework



Hi all:
Here is some grading information for homework 1, followed by some comments that I have.
Please let me know if you have any questions regarding your grade (I will be in my lab, 557BA, tomorrow for office hours from 1-2pm).
Thanks a lot for your efforts, and I hope you've learned a lot through this homework.
Tuan

===
Statistics
Max: 195 (this is also the maximum you can get)
Min: 33
Median: 136
Average: 127
Stdev: 42.7

Points for questions

Part 1:
1) Perform VI for K=2 steps on the 4x3 environment: 2 * 10pts
Note that you need to show me that you actually did it yourself to get full credit.
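
If you want to double-check your hand computation, below is a minimal sketch of K synchronous Bellman backups in Python. The representation is purely illustrative, not a required format: a list of states, an actions(s) function that returns an empty list for terminal states, a transition function T(s, a) returning (probability, next_state) pairs, and a state reward R(s).

def value_iteration_k_steps(states, actions, T, R, gamma, K):
    """Run K synchronous Bellman backups starting from U_0 = 0."""
    U = {s: 0.0 for s in states}
    for _ in range(K):
        U_next = {}
        for s in states:
            if not actions(s):              # terminal state: utility is its reward
                U_next[s] = R(s)
            else:
                # U_{k+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_k(s')
                U_next[s] = R(s) + gamma * max(
                    sum(p * U[s2] for (p, s2) in T(s, a)) for a in actions(s)
                )
        U = U_next
    return U

With U_0 = 0, the first backup simply gives U_1(s) = R(s), and the second gives U_2(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s'|s,a) U_1(s'); those are the two steps you were asked to show.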

2)

17.4: 3 * 10pts (the first is for correctly showing the Bellman equations for the two cases, and the other two are for showing the transformation).

I saw different high-level ideas for the last two parts, and gave partial credit depending on how close I think your idea is to the solution.

17.8: 4 * 10pts (10pts for each reward value)

Similar to Part 1.1, you need to show me some of the computation steps, not necessarily all of them (so that I know you understand the equation), not simply the final numbers.

17.9: 2 * 10pts (the first is for correctly deriving the utilities as functions of \gamma, and the second is for computing the value of \gamma at which the optimal action "switches" between Up and Down).

17.10: 5pts + 2 * 10pts (5pts for your qualitative reasoning about the solution policy, and 10pts each for performing policy iteration from 2 different initial policies).
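
For 17.10, the improvement half of policy iteration can be sketched as follows (same illustrative representation and names as the value-iteration sketch above, not a required format); the evaluation half is discussed in the comments on 17.10 below.

def policy_improvement(states, actions, T, U, pi):
    """Return (new_pi, changed): the greedy policy w.r.t. U and whether it differs from pi."""
    new_pi, changed = {}, False
    for s in states:
        if not actions(s):                  # terminal states have no action
            continue
        # Greedy action: argmax_a sum_{s'} P(s'|s,a) U(s')   (R(s) does not depend on a)
        best = max(actions(s), key=lambda a: sum(p * U[s2] for (p, s2) in T(s, a)))
        new_pi[s] = best
        if pi.get(s) != best:
            changed = True
    return new_pi, changed

Policy iteration alternates evaluating the current policy with this improvement step, starting from your chosen initial policy, and stops as soon as the improvement step changes nothing.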

Part 2:

- 20pts: For your coding effort, i.e., the code itself.

- 10pts: You need to show that you tested your code on 17.8 above.

- 2 * 15pts: One each for the experimental results and the observations in the 4x3 environment. Note that you must have some quantitative results (preferably in graphs, but I also gave partial credit for plain numbers). For comparing the three algorithms VI, PI and modified PI, it is not enough to just mention their complexity.
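
For the quantitative part, one simple option (a sketch using the same illustrative representation as above, not a required design) is to record the max-norm change per sweep and the number of sweeps until convergence for each algorithm, and then plot or tabulate those numbers.

def value_iteration_with_history(states, actions, T, R, gamma, eps=1e-6):
    """Run VI to convergence; return the utilities and the max-norm change per sweep."""
    U = {s: 0.0 for s in states}
    deltas = []                             # one entry per sweep; plot sweep index vs. delta
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            if not actions(s):              # terminal state
                U_next[s] = R(s)
            else:
                U_next[s] = R(s) + gamma * max(
                    sum(p * U[s2] for (p, s2) in T(s, a)) for a in actions(s)
                )
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        deltas.append(delta)
        if delta < eps:
            return U, deltas                # len(deltas) = number of sweeps

The analogous counts for PI and modified PI (number of improvement steps, number of evaluation sweeps or linear solves) give you the kind of numbers this comparison asks for.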

Some comments

17.4:
- Some students did not write the Bellman equation correctly for MDPs with rewards of the form R(s,a) and R(s,a,s').

For instance, "U(s) = R(s,a) + max {...}" or "U(s) = R(s,a,s') + max { ... }" ==> These are clearly wrong, since it is not clear which "a" and "s'" are being used here.

Another mistake:

U(s) = \max_{a} \max_{s'} R(s,a,s') + \gamma \max_{a} \sum_{s'} P(s'|s,a) U(s') ==> the "argmax" for the two different max operators can be different!
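
For reference, the standard forms keep a single \max_{a} governing both the immediate reward and the expected future utility:

U(s) = \max_{a} [ R(s,a) + \gamma \sum_{s'} P(s'|s,a) U(s') ]

U(s) = \max_{a} \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma U(s') ]

In the R(s,a,s') case the reward has to stay inside the expectation over s', so it cannot be pulled out into a separate max.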

17.8:
- Some students did not write the Bellman equation correctly here either.

17.10:
- Some students did not do the policy evaluation correctly. Note that you don't need to iterate here: the simple idea is that, given a policy, the corresponding utilities of the states must satisfy a set of linear equations (quite similar to the Bellman equations but without the max, since you know the action taken at each state).
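
As a concrete illustration (a minimal sketch, assuming the same illustrative MDP representation as in the earlier sketches, with the policy pi given as a dict from non-terminal states to actions), the evaluation step is a single linear solve:

import numpy as np

def policy_evaluation(states, pi, T, R, gamma):
    """Solve U(s) = R(s) + gamma * sum_{s'} P(s'|s,pi(s)) U(s') exactly, without iterating."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R(s) for s in states], dtype=float)
    for s in states:
        a = pi.get(s)                        # terminal states are not in pi: their row stays U(s) = R(s)
        if a is None:
            continue
        for p, s2 in T(s, a):
            A[idx[s], idx[s2]] -= gamma * p  # row for s: U(s) - gamma * sum_{s'} P(s'|s,a) U(s') = R(s)
    U = np.linalg.solve(A, b)
    return {s: U[idx[s]] for s in states}

Each non-terminal state contributes one equation of the form U(s) - \gamma \sum_{s'} P(s'|s,\pi(s)) U(s') = R(s), and each terminal state contributes U(s) = R(s).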