
CSE-571: Reinforcement Learning project



[Due date: 10/10/2012]

In this project, we consider the 4x3 environment that you're familiar with, described in Figure 17.1, page 646 of the AIMA textbook (3rd edition). To recall: at each non-terminal state, the agent can move Up, Down, Left or Right; the "intended" outcome occurs with probability 0.8, and each of the two moves at right angles to the intended direction occurs with probability 0.1. The rewards for the two terminal states (4,3) and (4,2) are +1 and -1, respectively; the reward is -0.04 for all non-terminal states.
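For concreteness, here is a minimal Python sketch of these dynamics. The coordinate convention, the helper names (move, step), and the choice of returning the reward of the state being entered are my own choices, not something the assignment prescribes.

import random

WALLS = {(2, 2)}                                  # the single blocked square
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}          # terminal rewards
STEP_REWARD = -0.04                               # reward at every non-terminal state
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
# For each intended action, the two moves at right angles to it.
PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                 'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

def move(state, action):
    """Apply one action deterministically; bumping into a wall or the edge stays put."""
    x, y = state
    dx, dy = ACTIONS[action]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def step(state, action):
    """Sample the stochastic outcome: intended move with 0.8, each perpendicular with 0.1."""
    r = random.random()
    if r < 0.8:
        actual = action
    elif r < 0.9:
        actual = PERPENDICULAR[action][0]
    else:
        actual = PERPENDICULAR[action][1]
    nxt = move(state, actual)
    reward = TERMINALS.get(nxt, STEP_REWARD)
    return nxt, reward, nxt in TERMINALS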

1) Implement the Q-learning algorithm with the exploration scheme described in Equation (21.5), page 842 in the book, using R+ = 2 and N_e = 5.
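A rough sketch of that exploration function and the Q-learning update is below, building on the environment helpers above. The fixed learning rate ALPHA and the choice GAMMA = 1 are my own; you are free to use, for example, a decaying learning rate instead.

from collections import defaultdict

R_PLUS, N_E = 2.0, 5        # optimistic reward estimate and visit threshold from the assignment
GAMMA, ALPHA = 1.0, 0.1     # discount and a simple fixed learning rate (my choice, not prescribed)

Q = defaultdict(float)      # Q[(state, action)]
N = defaultdict(int)        # visit counts N[(state, action)]

def f(u, n):
    """Exploration function: be optimistic about rarely tried actions."""
    return R_PLUS if n < N_E else u

def choose_action(s):
    """Pick the action maximizing the exploration-adjusted Q-value at state s."""
    return max(ACTIONS, key=lambda a: f(Q[(s, a)], N[(s, a)]))

def q_update(s, a, r, s_next, done):
    """One temporal-difference backup of the Q-table."""
    N[(s, a)] += 1
    target = r if done else r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])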

2) How many trials does your agent need to converge to the optimal policy? Show the optimal action at each state.
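To report the learned policy, a small helper along these lines may be useful (again just a sketch built on the Q-table above):

def greedy_policy():
    """Return the greedy action at each non-terminal, non-wall state."""
    policy = {}
    for x in range(1, 5):
        for y in range(1, 4):
            s = (x, y)
            if s in WALLS or s in TERMINALS:
                continue
            policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])
    return policy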

3) Show a graph of how the utility estimates change with the number of trials (for all states). Using your graph, compare Q-learning with active ADP equipped with the same exploration scheme mentioned above, in terms of how quickly the utility estimates converge (note that the performance of active ADP in this setting has been discussed in Figure 21.7).
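One way to record the data for such a graph, assuming the sketches above and taking U(s) = max_a Q(s, a), might look like the following; the trial count of 500 and the use of matplotlib are arbitrary choices on my part.

import matplotlib.pyplot as plt

def utility(s):
    """Utility estimate derived from the Q-table: U(s) = max_a Q(s, a)."""
    return max(Q[(s, a)] for a in ACTIONS)

def run_trials(num_trials, start=(1, 1)):
    """Run complete trials from the start state and record utilities after each one."""
    history = defaultdict(list)            # state -> list of utility estimates
    states = [(x, y) for x in range(1, 5) for y in range(1, 4)
              if (x, y) not in WALLS and (x, y) not in TERMINALS]
    for _ in range(num_trials):
        s, done = start, False
        while not done:
            a = choose_action(s)
            s_next, r, done = step(s, a)
            q_update(s, a, r, s_next, done)
            s = s_next
        for st in states:
            history[st].append(utility(st))
    return history

history = run_trials(500)
for st, values in history.items():
    plt.plot(values, label=str(st))
plt.xlabel('Number of trials')
plt.ylabel('Utility estimate max_a Q(s, a)')
plt.legend(fontsize='small')
plt.show()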

4) [Optional] The result of Q-learning depends on the exploration scheme. Suggest another way to explore this environment such that your agent is discouraged from exploring certain parts of the environment (possibly only after some number of time steps); note that this requirement is not as vague as it might sound. Explain why you chose this scheme. Run your Q-learning code with the new scheme. How does Q-learning compare under the two exploration schemes?
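Purely as an illustration of the kind of scheme meant here, and not the intended answer, one could withdraw the optimistic bonus for a chosen set of states after some number of total steps. The region DISCOURAGED and the cutoff T_CUTOFF below are hypothetical parameters; choose_action would then pass the current state and a running step counter to this function instead of f.

DISCOURAGED = {(1, 3), (2, 3)}   # hypothetical region the agent should stop exploring
T_CUTOFF = 2000                  # hypothetical number of total steps before the bonus is withdrawn

def f_regional(s, u, n, t):
    """Like f above, but no optimism for DISCOURAGED states once t exceeds T_CUTOFF."""
    if t > T_CUTOFF and s in DISCOURAGED:
        return u
    return R_PLUS if n < N_E else u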

If you have questions, please email me, post on the forum, or come to my TA office hours (Tuesday, 1-2pm).
Tuan