Here is some information on the Reinforcement Learning project: Q1 simply asked for the implementation, so I didn't grade it. Q2 and Q3 are worth 20 points each, and the optional Q4 is worth 10 points (its score will be kept and used separately later).
For Q2, since you may use different values for the learning rate and discount factor (and some students vary N_e as well), and implementations may also differ in how they randomly restart in a new state, it is reasonable to see different answers for the number of trials the agent needs to converge. So I didn't grade this against a fixed criterion, but relied on:
- the shape of the graph of how utility values change over trials: the utilities of states (1,3), (2,3) and (3,3) should go up quickly after 200 trials; a bit lower should be those of (1,1), (1,2) and the 4 remaining states; the utilities of (4,1) and (3,2) can keep changing even after 1000 trials, and thus the optimal actions in these states are quite unstable (one explanation is that the Q-values of the actions at those states are quite close)
- the optimal actions returned
If your report is too far from this, for instance if the utilities of many states "converge" at negative values, or the optimal actions at easy states such as (2,1) and (3,1) are wrong, I subtracted some points from this part (more if the best actions for the states (1,3), (2,3) and (3,3) are wrong). I didn't subtract anything if you were only wrong on the two states (4,1) and (3,2) (please let me know if I did).
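For reference, here is a minimal sketch of the tabular Q-learning update I have in mind, with the learning rate and discount factor as the parameters mentioned above. The names and values are illustrative only and not tied to any particular submission:

    from collections import defaultdict

    ALPHA = 0.1      # learning rate; varies across submissions
    GAMMA = 0.9      # discount factor; varies across submissions
    ACTIONS = ['up', 'down', 'left', 'right']

    Q = defaultdict(float)   # Q[(state, action)]

    def q_update(s, a, r, s_next):
        # One backup after observing the transition (s, a, r, s_next).
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    def utility(s):
        # The utility plotted over trials: max over actions of the Q-values.
        return max(Q[(s, a)] for a in ACTIONS)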
For Q3, the shape of your graph should be as I described, and from it, it should be clear that ADP is faster than Q-learning (in terms of utility convergence). Some students reported the opposite result but still commented that ADP is faster; in those cases I had to assume that the Q-learning implementation had issues...
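The gap is easiest to see from how each agent updates its utilities: Q-learning nudges one Q-value per step, while ADP refits a transition model from the observed counts and re-solves for all utilities after each trial. A rough sketch of the ADP side, again with illustrative names rather than anyone's actual code:

    from collections import defaultdict

    GAMMA = 0.9
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    rewards = {}                                     # observed reward of each state

    def record(s, a, r, s_next):
        # Update the learned model with one observed transition.
        counts[(s, a)][s_next] += 1
        rewards[s] = r

    def solve_utilities(states, actions, sweeps=50):
        # Value iteration on the learned model, typically re-run after each trial.
        U = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                best = 0.0
                for a in actions:
                    n = sum(counts[(s, a)].values())
                    if n == 0:
                        continue
                    best = max(best, sum(c / n * U[s2] for s2, c in counts[(s, a)].items()))
                U[s] = rewards.get(s, 0.0) + GAMMA * best
        return U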
For Q4, I saw various suggestions for different exploration functions. If you explained the intuition behind yours and had experimental results consistent with it, you got full credit. I gave partial credit to those who didn't test their new functions.
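For comparison, the textbook's exploration function simply returns an optimistic utility until a state-action pair has been tried N_e times (the R_plus and N_e values below are just examples):

    R_PLUS = 2.0   # optimistic estimate of the best possible reward
    N_E = 5        # try each (state, action) pair at least this many times

    def exploration_f(u, n):
        # Optimistic utility until the pair has been visited N_E times.
        return R_PLUS if n < N_E else u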
Here are some statistics (for Q2 + Q3):
Max: 40
Min: 15
Avg: 32.7
Stdev: 8.4
Please let me know if you have any questions.
Tuan