-----------------------------------------------------------------------
     Sequential Decision Problems and Markov Decision Processes

     Notes by:   Leanne Vander Meer Dunn
     Revised by: Rao [Dec 5, 1995]

     Sources:    Artificial Intelligence - A Modern Approach,
                 Stuart Russell and Peter Norvig, Chapters 16, 17, 20
                 Planning and Learning Class Lecture,
                 November 29 and December 4, 1995
-----------------------------------------------------------------------

This paper describes how to solve a sequential decision problem.
However, before we can solve it, we need to know what it is. So first,
some background material.

Definitions
-----------

A sequential decision problem is one where the agent has a set of world
states it can inhabit, a set of actions it can take in any world state,
and a specific reward for being in any state.

A policy is a mapping from states to actions. Given a state, the policy
tells the agent what action to perform. The objective of a sequential
decision problem (SDP) is to find the optimal policy--i.e., the policy
which gives the maximum accumulated reward for the agent.

A Markov decision problem (MDP) is the problem of calculating an
optimal policy in an accessible (observable), stochastic environment
with a transition model that satisfies the Markov property (i.e., the
transitions depend only on the current state, and not on the states
that the agent visited on its way to this state).

Motivation
----------

In previous class discussions, we have talked about ways to relax the
accessibility assumption, the instantaneous action assumption, and the
deterministic action assumption. However, the "goals" of the planner
have remained the same. We still have achievement goals and maintenance
goals. But what if you need to represent a more expressive goal? For
instance, you might want a plan that takes the smallest possible number
of actions to reach a given state. Or you might want a plan that
reaches its goal without ever passing through a "bad state". To support
such goals, we first need a methodology for checking whether a plan
"satisfies" them. We briefly talked about value functions in our
discussions concerning dynamical systems. Now it's time to put them to
work.

Value Functions
---------------

As previously discussed, a dynamical system takes its current state and
an action as input, and produces another state. This new state can then
be fed back into the next step of the dynamical system. In so doing,
the dynamical system exhibits a state sequence (also known as a
history, or a behavior).

A value function assigns a real-valued number to a state sequence:

   V(<s1, s2, ..., sn>) = r,   where r is a real number

A goal-of-achievement value function might look like this:

                          | 1  if G is true in sn
   V(<s1, s2, ..., sn>) = |
                          | 0  if G is not true in sn

A more general value function can depend on all the states in the state
sequence (instead of just on the goal).

Given a dynamical system D and a value function V, you can assess the
performance of the system by essentially letting it run, considering
the state sequences it produces, computing their values, and averaging
the values over a bunch of runs of D. We know that the plan/policy of
the dynamical system changes the state sequences that the system will
exhibit. Thus, the objective of planning becomes coming up with a
plan/policy under which the assessed value of the dynamical system will
be maximal. One brute-force way of finding the optimal plan would thus
be to enumerate all the different plans, compare them in terms of the
value of the dynamical system under each plan, and take the plan that
gives rise to the best value.
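To make this assessment procedure concrete, here is a minimal Python
sketch (the helper names simulate_step and value_fn, and the horizon
and run counts, are our own assumptions, not anything given in the
notes). It simply runs the dynamical system several times under a
fixed policy and averages the values of the resulting state sequences:

   def assess_policy(initial_state, policy, simulate_step, value_fn,
                     horizon=20, num_runs=100):
       """Estimate how good a policy is by averaging the value function
       over several simulated runs of the dynamical system."""
       total = 0.0
       for _ in range(num_runs):
           state, history = initial_state, [initial_state]
           for _ in range(horizon):
               action = policy[state]                # policy: state -> action
               state = simulate_step(state, action)  # sample the next state
               history.append(state)
           total += value_fn(history)                # value of this sequence
       return total / num_runs

The brute-force scheme above would just call assess_policy once per
candidate policy and keep the policy with the best estimate.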
So where do we get a value function? We need a way to convert the goals
into a value function. We can do this indirectly in terms of "reward"
functions. A reward function assigns a real-valued number to a state
(rather than to a state sequence). Here's an example of a reward for
maintaining an attribute g in state s:

          | -1  if g does not hold in s
   R(s) = |
          |  0  if g holds in s

In the above example, if you violate the requirement that you maintain
a particular attribute, you get a negative reward. Once we have reward
functions, we can compute value functions in the obvious way:

   V(<s1, s2, ..., sn>) = R(s1) + R(s2) + ... + R(sn)

To handle infinitely long sequences of states, and also to discount
future promise with respect to past achievement, we typically consider
a discounted value function:

   V(<s1, s2, ..., sn>) = SUM(i=1..n) [ r^i * R(si) ]    (where 0 <= r < 1)

-------
REWARD FUNCTIONS RESTRICT OUR ATTENTION TO A SPECIFIC CLASS OF VALUE
FUNCTIONS

There are some important assumptions inherent in reward functions.
Specifically, since the reward for each state is defined independently
of the other states, value functions defined in terms of reward
functions have the "SEPARABILITY" property--that is, the value of a
state sequence <s1, s2, ..., sn> is the sum of the value of <s1> and
the value of <s2, ..., sn>. NOT ALL VALUE FUNCTIONS SATISFY THE
SEPARABILITY PROPERTY. As an example, suppose your value function is
such that:

   V(<s1, s2>) = a     (with a != 0)
   V(<s1>)     = 0
   V(<s2>)     = 0

Here the value of the whole sequence is not the sum of the values of
its pieces.
---------------

Reward functions also allow us to combine goals in a natural way.
Here's an example where you want to achieve either g1 or g2 as early as
possible. What is the reward for state s?

   R(s) = MAX[ R(s,g1), R(s,g2) ]

---------------

Markov Decision Problems
------------------------

A dynamical system specified in terms of a transition model (i.e., a
model which gives, for each state s and action a, the distribution of
states that will result if a were executed in s) and a reward function
is called a Markov decision process. The objective in an MDP is to find
the optimal policy for the dynamical system.

An MDP requires, as input: 1) a set of states, 2) a set of actions, and
3) a reward function. The reward function tells us what reward to
expect when in a given state. The actions are represented in terms of
transition probabilities from one state to another state. For example,
when in state s1, action a1 might have probability 0.9 of taking you to
state s2.

Example:

I'll explain how to solve a Markov decision problem through a series of
examples, all using the same environment. Introducing grid world:

        +-------+-------+-------+
        |       |       |       |
      2 |       |       |  +1   |
        |       |       |       |
        +-------+-------+-------+
        |       |       |       |
      1 |       |       |  -1   |
        |       |       |       |
        +-------+-------+-------+
        |       |       |       |
      0 |       |       |       |
        |       |       |       |
        +-------+-------+-------+
            0       1       2

Each square is a state. The squares with +1 and -1 are terminal states.
You want to reach the +1 state and stay away from the -1 state. So,
somehow, you need to know what to do in each of the other states. In
this environment, you have four actions: up, down, left, right, where
each action moves the agent one position in the specified direction.
However, we want to learn about handling uncertainty, so let's make the
outcomes non-deterministic. Let's say that there is an 80% chance that
the operator will move the agent in the specified direction, and a 20%
chance that the operator will move the agent at a right angle to the
specified direction. For instance, if the agent executes the action up,
then the agent has an 80% chance of moving up, a 10% chance of moving
left, and a 10% chance of moving right.
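Before moving on, here is one way the grid world's transition
probabilities might be written down in code (a sketch; the (x, y) state
encoding, with x the column and y the row, and the helper names are our
own choices):

   MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
   SLIPS = {'up': ['left', 'right'], 'down': ['left', 'right'],
            'left': ['up', 'down'],  'right': ['up', 'down']}

   def transition(state, action, width=3, height=3):
       """Return {next_state: probability} for executing `action` in
       `state`: 0.8 in the intended direction and 0.1 for each slip at
       a right angle.  Moving into a wall leaves the agent in place."""
       def move(s, d):
           dx, dy = MOVES[d]
           nx, ny = s[0] + dx, s[1] + dy
           return (nx, ny) if 0 <= nx < width and 0 <= ny < height else s

       outcomes = {}
       for direction, prob in [(action, 0.8)] + [(d, 0.1) for d in SLIPS[action]]:
           nxt = move(state, direction)
           outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
       return outcomes

For example, transition((0, 0), 'up') gives
{(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}, since a leftward slip from the
bottom-left corner runs into the wall.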
These probabilities describe a transition model.

In a problem like this, the utility of a state really depends on where
you can get to from that state. Therefore, the utility of the state
depends on the utilities of a sequence of states (an environment
history).

A policy is a complete mapping from states to actions. You can
calculate a policy from a transition model (the probabilities) and a
utility function. Once you have a policy, the agent simply does what
the policy says for a given state.

In an accessible environment, the agent's observations at each step
will identify the current state. In an inaccessible environment, the
agent's observations do not provide enough information to identify the
current state or the associated transition probabilities.

As stated above, a Markov decision problem (MDP) is the problem of
calculating an optimal policy in an accessible, stochastic environment
with a known transition model. The Markov property is said to hold if
the transition probabilities from any given state depend only on the
state and not on the previous history. A partially observable Markov
decision problem (POMDP) is the problem of calculating an optimal
policy in an inaccessible environment. These problems generally don't
admit the same kinds of solutions as MDPs, and this paper will not
cover them.

We are interested in MDPs, so the environment is accessible and the
transition model is known. Given an MDP and a reward function, we can
calculate an optimal policy. Specifically, an optimal policy directs
the agent towards the next possible state that has the highest promise,
i.e., that will lead to state sequences whose expected value will be
the highest. The promise of a state is called the utility of the state.
To understand the utility of a state, and how it fits into an MDP, we
need a little more background.

-----------Digression onto utility theory-----------

Decision Theory
---------------

When making decisions, you not only need to know what is a good effect
and what is a bad effect, but also how likely each is to happen.
Decision theory describes how to combine utility theory with
probability so that an agent can select actions that will maximize its
expected performance. Probability theory describes what an agent should
believe on the basis of evidence. Utility theory describes what an
agent wants. So, decision theory lets an agent figure out what to do
based on probability and utility. This leads to a discussion of utility
theory.

Utility Function
----------------

An agent's preferences between world states are represented by a
utility function, which assigns a number to each state to represent the
desirability of the state.

An action can also have a utility. You combine the state utilities with
the possible outcomes of the action to get an expected utility for the
action. The expected utility is defined as:

   EU(A) = SUM(i) [ P(O(A,i)) * U(O(A,i)) ]

This is the expected utility of action A given the agent's current
situation. Following is a detailed explanation of this equation.

U(S) is the utility of state S (the utility is a number). For now, just
assume that you have these values. In the above equation, the utility
refers to the state that the agent will be in after the action.

O(A,i) is the i-th possible outcome of action A. This refers to a
state. For example, if the agent moves a step forward, the agent might
stay in the new position, or the agent might slide backwards to its
original position (maybe if it was on a slope). So there are two
possible outcome states in this case.
Continuing with the example, the action move from position P0 to
position P1 could result in the agent being in either position P0 or in
position P1.

   O(move,1) = P1   --> Agent stays in new position
   O(move,2) = P0   --> Agent slides backwards

Remember that U(S) is the desirability of being in state S (according
to the agent). For this example, let's say:

   U(O(move,1)) = U(P1) = 0.88
   U(O(move,2)) = U(P0) = 0.76

P(O(A,i)) is the probability of action A resulting in outcome i. So,
given the above example, there might be an 80% chance that the agent
will stay in the new state, and a 20% chance that the agent will slide
backwards to the old state.

   P(O(move,1)) = P(P1) = 0.80
   P(O(move,2)) = P(P0) = 0.20

And finally, putting it all together, the expected utility is:

   EU(A)    = SUM(i) [ P(O(A,i)) * U(O(A,i)) ]
   EU(move) = P(O(move,1)) * U(O(move,1)) + P(O(move,2)) * U(O(move,2))
            = 0.80 * 0.88 + 0.20 * 0.76
            = 0.856

So now you know the expected utility of attempting to move from state
P0 to state P1.

At this point, we still don't know how to get a utility function. We do
know how to use it to calculate the expected utility of an action.
Following are a few more points that will help us with utility
functions.

If you have a performance measure for the agent, then you want a
utility function that reflects the desired results of the performance
measure. This will give the agent the optimal behavior (according to
the performance measure).

The scale of a utility function doesn't really matter. For instance, a
new utility function U'(S) could be defined as:

   U'(S) = (some constant) + (some positive constant) * U(S)

To an agent, U(S) and U'(S) mean pretty much the same thing. The
relationship between the individual utility values is what matters. For
instance, the action move-right might have utility 1.0, and the action
move-left might have utility 0.5. If you multiplied them both by two,
so that U(move-right) = 2.0 and U(move-left) = 1.0, the action
move-right would still be preferable to the action move-left because it
has the higher utility value in both cases.

In multi-attribute utility theory, the outcomes depend on two or more
attributes:

   U(x1, ..., xn) = f(f1(x1), ..., fn(xn))

This discussion emphasizes single-attribute utility functions, which
are what we need for policy generation. Therefore, this paper does not
go into an example of a multi-attribute utility function. Suffice it to
say, they do exist.

A utility function is separable if and only if we can find a function f
such that

   U(<s0, s1, ..., sn>) = f(s0, U(<s1, ..., sn>))

For instance,

   U(<s0, s1, ..., sn>) = R(s0) + U(<s1, ..., sn>)

Now that we have some background in decision theory and utility theory,
it's time to tackle the main event.
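The expected-utility calculation is easy to render as code. Here is a
small sketch of the formula above, applied to the hypothetical move
example (the numbers are just the ones assumed in the digression):

   def expected_utility(outcomes):
       """EU(A) = SUM(i) [ P(O(A,i)) * U(O(A,i)) ], where `outcomes` is
       a list of (probability, utility) pairs, one per possible outcome."""
       return sum(p * u for p, u in outcomes)

   # The hypothetical move action: stay in P1 with probability 0.80
   # (utility 0.88), or slide back to P0 with probability 0.20 (0.76).
   move_outcomes = [(0.80, 0.88), (0.20, 0.76)]
   print(expected_utility(move_outcomes))    # roughly 0.856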
-----------------------------------
Computing optimal policies for Markov Decision Processes:
----------------------------------------------------------

You might, theoretically, be able to enumerate all of the possible
policies for a state space, and then pick the one with the maximum
expected value. However, this can readily blow up, since the number of
policies is exponential in the size of the state space. Instead, the
methods of policy generation take advantage of the optimal substructure
principle. This means that the optimal policy will also be locally
optimal for each individual state.

Suppose you are in state s. Which action should you do? We want to do
the action such that the state sequences that the dynamical system
exhibits starting from s, under this policy, will have the maximum
value.

Let us define the utility U of a state s to be the maximal expected
value of state sequences starting from state s. If we know U(s) for all
states, then we can use it to decide the optimal policy. The optimal
policy will be to do action a from state si if the quantity

   SUM(j) [ M(a,si,sj) * U(sj) ]

is maximal over the actions a.

So, how do we find the utility values for all states? Here is what we
know:

1. Utility values for terminal states (i.e., states from which no
   actions are possible) are known a priori (they are just equal to the
   value of the reward function at that state).

2. The utility value of a state si is related to the utility values of
   its neighboring states, in terms of the transition probabilities, as
   follows:

      U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ]

   The utility of a state is the reward of the state plus the expected
   utility of the best action available in that state (which stands in
   for the best history, or state sequence, obtainable from there). For
   the state i, U(i) is the optimal value. In writing this equation, we
   have used the fact that the optimal policy will involve doing the
   optimal thing at each state.

To develop a history, you iteratively apply the utility equation so
that the rewards of the terminal states have the opportunity to
propagate out through all the other states (essentially forming a
history from each state to each terminal state).

One way to solve for the utility values is to use dynamic programming.
Dynamic programming is an efficient way to solve a set of equations
such as these. It gives exact utility values right away, but it can be
costly in large state spaces. We will not discuss the details of
dynamic programming.

Two other methods of calculating optimal policies in MDPs are value
iteration and policy iteration. Both methods essentially try to modify
the utilities of neighboring states so that they satisfy the utility
recurrence equations. If this local modification process is repeated at
each state in the state space, for enough iterations, then the
utilities of the individual states will finally converge to their
correct values. Following is a detailed description of each method.

Value Iteration
---------------

The basic idea is to calculate a utility for each state (given a
transition model and a reward function), and then to use these
utilities to create a policy. The utility of a state needs to depend on
where you can get to from that state. Therefore, the utility of the
state depends on the utilities of a sequence of states (an environment
history). The utility of a state is its reward value for being in that
state, plus the maximum expected utility of the state sequences
possible from that state:

   U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ]

If there is only one possible action a from state i, then

   U(i) = R(i) + SUM(j) [ M(a,i,j) * U(j) ]

Following is a breakdown of what this equation means.

1) M(a,i,j) is the probability of reaching state j when executing
   action a in state i.

   Example: In the grid world, the probability of reaching state (0,1)
   from state (0,0) is 0.8 when executing action up. The probability of
   reaching state (1,1) from state (0,0) is 0 for all actions (because
   each action moves the agent by at most one square).

2) M(a,i,j) * U(j) is the expected utility contribution of state j when
   executing action a in state i.

   Example: If the utility of state (0,1) is 0.75, then the expected
   utility contribution of state (0,1) when executing action up from
   state (0,0) is 0.8 * 0.75.

3) SUM(j) [ M(a,i,j) * U(j) ] is the expected utility of executing
   action a in state i.
   This is the sum, over the possible successor states j, of the
   expected utility contributions when executing action a in state i.

   Example: Suppose that, part way through the computation, the current
   utility estimates give the following contributions when executing
   action up in state (0,0) of the grid world:

      EU(up,(0,0) -> (0,1)) = 0.68
      EU(up,(0,0) -> (1,0)) = 0.13
      EU(up,(0,0) -> (0,0)) = 0.15
      EU(up,(0,0) -> (*,*)) = 0        (all unreachable states)

   Then the expected utility of action up in state (0,0) is

      EU(up) = 0.68 + 0.13 + 0.15 = 0.96

4) max(a) SUM(j) [ M(a,i,j) * U(j) ] is the expected utility of the
   best action in state i.

   Example: Following is a list of the expected utilities of the
   actions when executed from state (0,0), continuing with the
   estimates above:

      EU(up)    = 0.96
      EU(right) = 0.82
      EU(left)  = 0.92
      EU(down)  = 0.66

   Then the maximum expected utility in state (0,0) is 0.96 (the
   expected utility of action up).

5) U(i) = R(i) + max(a) SUM(j) [ M(a,i,j) * U(j) ] is the utility of
   state i.

   Example: If state (0,0) has reward -0.04, then its utility is
   -0.04 + 0.96 = 0.92.

The value iteration function takes a transition model M and a reward
function R, and returns a utility function U. Following is the
value-iteration algorithm, and then a detailed example.

   utility-function VALUE-ITERATION (transition-model M,
                                     reward-function R)
   {
      utility-function U = R
      utility-function U-copy
      do {
         U-copy = U
         for each non-terminal state i do
            U[i] = R[i] + max(a) SUM(j) [ M(a,i,j) * U-copy[j] ]
      } while (U is-not-close-to U-copy)
      return U
   }

Notice that by using a copy of the utility function to do the
calculations, the new utility function values always depend on the
values of the previous iteration. This means that, during an iteration,
the values that you calculate will not interfere with each other.

Example of Value Iteration:

This example uses the grid world from above, where there is an 80%
chance that the operator will move the agent in the specified direction
and a 20% chance that the operator will move the agent at a right angle
to the specified direction. The reward for each non-terminal state is
R = -0.04. As specified by the algorithm, the initial utility function
is equal to the reward function.

   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.04 | -0.04 | -0.04 |
   |       |       |       |
   +-------+-------+-------+

Next we show the utility function after one iteration.

   +-------+-------+-------+
   |       |       |       |
   | -0.08 | +0.75 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.08 | -0.08 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.08 | -0.08 | -0.08 |
   |       |       |       |
   +-------+-------+-------+

Notice that state (1,2) is the only state that has a positive utility
value. It is currently the only state that has a known path to the
positive terminal state. As the iterations continue, the other states
will benefit from the high utility value of the positive terminal
state.
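If you want to check that single sweep mechanically, one backup of
every non-terminal state over the initial utilities reproduces the
figure above. This is a sketch that assumes the transition(state,
action) helper from the earlier sketch is in scope; the table names are
our own:

   ACTIONS   = ['up', 'down', 'left', 'right']
   TERMINALS = {(2, 2): 1.0, (2, 1): -1.0}      # the +1 and -1 squares
   STATES    = [(x, y) for x in range(3) for y in range(3)]
   R = -0.04

   U_copy = {s: TERMINALS.get(s, R) for s in STATES}    # initial U = R
   U = dict(U_copy)
   for s in STATES:
       if s in TERMINALS:
           continue
       U[s] = R + max(sum(p * U_copy[t] for t, p in transition(s, a).items())
                      for a in ACTIONS)

   print(U[(1, 2)])   # roughly 0.752; every other non-terminal state is -0.08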
Following is the trace of how the algorithm gets the utility value for
state (1,2) in that first iteration:

   M(up,(1,2),(1,2))    * U-copy(1,2) = 0.80 * -0.04 = -0.032
   M(up,(1,2),(0,2))    * U-copy(0,2) = 0.10 * -0.04 = -0.004
   M(up,(1,2),(2,2))    * U-copy(2,2) = 0.10 *  1.0  =  0.10

   M(down,(1,2),(1,1))  * U-copy(1,1) = 0.80 * -0.04 = -0.032
   M(down,(1,2),(0,2))  * U-copy(0,2) = 0.10 * -0.04 = -0.004
   M(down,(1,2),(2,2))  * U-copy(2,2) = 0.10 *  1.0  =  0.10

   M(left,(1,2),(0,2))  * U-copy(0,2) = 0.80 * -0.04 = -0.032
   M(left,(1,2),(1,2))  * U-copy(1,2) = 0.10 * -0.04 = -0.004
   M(left,(1,2),(1,1))  * U-copy(1,1) = 0.10 * -0.04 = -0.004

   M(right,(1,2),(2,2)) * U-copy(2,2) = 0.80 *  1.0  =  0.80
   M(right,(1,2),(1,2)) * U-copy(1,2) = 0.10 * -0.04 = -0.004
   M(right,(1,2),(1,1)) * U-copy(1,1) = 0.10 * -0.04 = -0.004

   up    --> SUM = -0.032 + -0.004 +  0.10  =  0.064
   down  --> SUM = -0.032 + -0.004 +  0.10  =  0.064
   left  --> SUM = -0.032 + -0.004 + -0.004 = -0.04
   right --> SUM =  0.80  + -0.004 + -0.004 =  0.792

Since action right has the maximum expected utility, we use the value
of action right in the final calculation of the utility for state
(1,2):

   U(1,2) = R(1,2) + EU(right) = -0.04 + 0.792 = 0.752

(rounded to 0.75 in the figure)

Here is the utility function after a second iteration.

   +-------+-------+-------+
   |       |       |       |
   | +0.55 | +0.83 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.12 | +0.45 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | -0.12 | -0.12 | -0.12 |
   |       |       |       |
   +-------+-------+-------+

Following is the final utility function, after many iterations. Even if
you continue to perform the value iteration algorithm on this utility
function, the values will not change. They have converged to their
final values.

   +-------+-------+-------+
   |       |       |       |
   | +0.87 | +0.93 |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | +0.82 | +0.78 |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   | +0.76 | +0.72 | +0.49 |
   |       |       |       |
   +-------+-------+-------+

Once you have the utility function, you can compute the policy:

   Policy(i) = arg max(a) SUM(j) [ M(a,i,j) * U(j) ]

This equation determines the action for a single state. The equation
has many similarities to the equation for a state's utility, so we'll
go straight to an example for state (0,0):

   M(up,(0,0),(0,1))    * U(0,1) = 0.8 * 0.82 = 0.656
   M(up,(0,0),(1,0))    * U(1,0) = 0.1 * 0.72 = 0.072
   M(up,(0,0),(0,0))    * U(0,0) = 0.1 * 0.76 = 0.076

   M(down,(0,0),(0,0))  * U(0,0) = 0.8 * 0.76 = 0.608
   M(down,(0,0),(1,0))  * U(1,0) = 0.1 * 0.72 = 0.072
   M(down,(0,0),(0,0))  * U(0,0) = 0.1 * 0.76 = 0.076

   M(left,(0,0),(0,0))  * U(0,0) = 0.8 * 0.76 = 0.608
   M(left,(0,0),(0,1))  * U(0,1) = 0.1 * 0.82 = 0.082
   M(left,(0,0),(0,0))  * U(0,0) = 0.1 * 0.76 = 0.076

   M(right,(0,0),(1,0)) * U(1,0) = 0.8 * 0.72 = 0.576
   M(right,(0,0),(0,1)) * U(0,1) = 0.1 * 0.82 = 0.082
   M(right,(0,0),(0,0)) * U(0,0) = 0.1 * 0.76 = 0.076

   up    --> SUM = 0.656 + 0.072 + 0.076 = 0.804
   down  --> SUM = 0.608 + 0.072 + 0.076 = 0.756
   left  --> SUM = 0.608 + 0.082 + 0.076 = 0.766
   right --> SUM = 0.576 + 0.082 + 0.076 = 0.734

Action up has the highest expected utility, so the policy uses action
up for state (0,0). Below is the completed policy.

   +-------+-------+-------+
   |       |       |       |
   | right | right |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   | left  |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   | left  | left  |
   |       |       |       |
   +-------+-------+-------+

The terminal states do not have associated actions, either because
there is no way to get out of the state, or because you don't want the
agent to leave the state. (In the second case, there might exist an
action that would get the agent out of the state, but you prevent it
from being used in the policy.)
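The full example can be reproduced with a short script. Below is a
minimal sketch of the VALUE-ITERATION loop plus the policy-extraction
step, reusing transition(), ACTIONS, TERMINALS, and STATES from the
earlier sketches (the convergence threshold epsilon is our own choice):

   def value_iteration(reward=-0.04, epsilon=1e-4):
       U = {s: TERMINALS.get(s, reward) for s in STATES}   # U starts out as R
       while True:
           U_copy = dict(U)
           for s in STATES:
               if s in TERMINALS:
                   continue
               U[s] = reward + max(
                   sum(p * U_copy[t] for t, p in transition(s, a).items())
                   for a in ACTIONS)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               return U

   def extract_policy(U):
       # Policy(i) = arg max(a) SUM(j) [ M(a,i,j) * U(j) ]
       return {s: max(ACTIONS,
                      key=lambda a: sum(p * U[t]
                                        for t, p in transition(s, a).items()))
               for s in STATES if s not in TERMINALS}

   U = value_iteration()
   print(extract_policy(U)[(0, 0)])   # 'up', as in the policy table above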
Notice that the action for the middle square is left instead of up.
While going left leads to a longer path, it is a safer path, because
there is no chance of accidentally ending up in the negative terminal
state.

What happens when the cost of a step is higher? This corresponds to a
larger negative reward. Below is the policy resulting from using a
reward of R = -0.1.

   +-------+-------+-------+
   |       |       |       |
   | right | right |  +1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   |  up   |  -1   |
   |       |       |       |
   +-------+-------+-------+
   |       |       |       |
   |  up   |  up   | left  |
   |       |       |       |
   +-------+-------+-------+

Notice that the action for the middle square is now up instead of left.
The policy decided to take the risk of accidentally landing in the
negative terminal state.

There is still one part of the algorithm that hasn't been explained:
when does the iteration stop? In the ideal case, the utility function
will converge to static values. In this case, you can compare U to
U-copy to see whether the utility function has changed. But what do you
do if the utility function does not converge? This can happen if the
state space is very large, or if a cycle develops in the path
(essentially, the length of the action sequence is unbounded). One
possible solution is to use discounting. The basic idea is that the
value of a reward received further in the future is reduced toward
nothing, so eventually the utility function will converge. If you just
want to make sure that you don't calculate forever, you can also use an
iteration limit (stop after n iterations).
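As a concrete illustration of those two stopping strategies, the
value-iteration sketch above could be modified to use a discount factor
together with an iteration cap (gamma, epsilon, and max_iters here are
assumed parameters, not values from the notes):

   def discounted_value_iteration(reward=-0.04, gamma=0.9,
                                  epsilon=1e-4, max_iters=1000):
       U = {s: TERMINALS.get(s, reward) for s in STATES}
       for _ in range(max_iters):                   # iteration limit
           U_copy = dict(U)
           for s in STATES:
               if s in TERMINALS:
                   continue
               U[s] = reward + gamma * max(         # discount future utility
                   sum(p * U_copy[t] for t, p in transition(s, a).items())
                   for a in ACTIONS)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               break                                # U is close to U-copy
       return U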
Policy Iteration
----------------

The basic idea is to start with some policy, then repeatedly calculate
the utility function for that policy, then calculate a new policy from
those utilities, and so on. This takes advantage of the fact that the
policy often converges long before the utility function converges. In
particular, small changes in the utility values of the individual
states may not have a significant effect on the policy.

The idea of policy iteration is to:

-- Start with the policy from the previous iteration (initially, it is
   set to some default locally optimal action for each state).

-- Compute the utility values of all states given this policy. This can
   be done by solving the equations:

      U(si) = R(si) + SUM(j) [ M(P(si),si,sj) * U(sj) ]

   where P(si) is the action to be taken in state si given the current
   policy.

-- Now, turn around and check whether, under these utility values, the
   current policy makes the locally optimal decision for each state.

The reason this makes sense is that the utility values for a policy are
globally determined by the whole policy, and thus can even out any
local inconsistencies in the policy.

Following is the policy-iteration algorithm.

   policy POLICY-ITERATION (transition-model M, reward-function R)
   {
      utility-function U = R
      policy P

      // Initialize the policy according to the utility function.
      for each non-terminal state i do
         P[i] = arg max(a) SUM(j) [ M(a,i,j) * U[j] ]

      do {
         policy-unchanged = true

         // Calculate the utility function according to the policy.
         U = VALUE-DETERMINATION(P, U, M, R)

         // If there is an action that has a higher expected utility
         // than the action currently in the policy, then put the new
         // action in the policy.
         for each non-terminal state i do {
            v1 = max(a) SUM(j) [ M(a,i,j) * U[j] ]
            v2 =        SUM(j) [ M(P[i],i,j) * U[j] ]
            if v1 > v2 {
               P[i] = arg max(a) SUM(j) [ M(a,i,j) * U[j] ]
               policy-unchanged = false
            }
         }
      } while (not policy-unchanged)
      return P
   }

The function VALUE-DETERMINATION can be implemented just like the
VALUE-ITERATION function, except that the utility values are calculated
according to the following equation:

   U[i] = R[i] + SUM(j) [ M(P[i],i,j) * U-copy[j] ]

Note that instead of max(a), this equation uses the action from the
policy. The VALUE-DETERMINATION step can also be solved as a system of
linear equations. Since the equations used in policy iteration are very
similar to those used in value iteration, we won't go through an
example. (A short code sketch of the policy-iteration loop appears just
before the conclusion.)

Dealing with Large State Spaces
-------------------------------

If you don't want to compute a complete policy for a large state space,
it is possible to compute a partial policy using envelope expansion.
The basic idea is that you start out with a subset of the full state
space, and consider any state that is not in that subset to belong to a
single aggregate state called "other". We then compute the optimal
policy for this restricted MDP. Next, we can enlarge the MDP by
bringing some of the states lumped into the "other" state back into the
MDP. The states that we bring in can be selected based on the
likelihood that the current optimal policy will push the agent into
them. This is essentially a way of expanding the policy to include
likely nearby states, and is similar to the technique we discussed in
the context of replanning. This technique is used by Dean et al. for
autonomous robot navigation. They construct the initial restricted MDP
by using something like BURIDAN, which computes a single plan from the
goal state to the initial state.

OTHER ISSUES:
-------------

Modeling real-world problems like autonomous navigation in terms of
MDPs raises more issues. For example, the agent may have faulty sensors
and thus may not know exactly what state it is in. Then we have to
handle _partially_ observable Markov decision processes. Furthermore,
the agent often doesn't have a complete model of its environment--i.e.,
the transition model for all actions and all states. This necessitates
that the agent _learn_ the transition model through experience. The
area of reinforcement learning addresses this issue.

Relation between Sequential Decision Problems and Buridan:
----------------------------------------------------------

Buridan does probabilistic planning, and thus its problem can also be
represented as a sequential decision problem in terms of MDPs. However,
it turns out that the Buridan approach is much more efficient than the
MDP approach where both are applicable. The Buridan approach is
applicable in scenarios where we only have goals of achievement, and
are attempting to increase the probability of satisfaction of the
goals. Under these assumptions, the reward function is zero for
everything other than the goal state, and the best action to take in a
given state is a function only of that state and its relation to the
goal state. Thus, a backward-chaining approach, as used in Buridan,
will be enough. In general sequential decision problems, on the other
hand, the best action to take in a state depends not only on the "goal
state" but on all the other states in the state space, since all of
them may have non-zero reward values. In such cases, backward chaining
is not enough, and we essentially have to do the dynamic programming
and solve for the utilities of _all_ states. So, the work done in
solving an MDP is always proportional to the size of the state space
(while the work done in Buridan may be much smaller than the order of
the size of the state space).
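Since no worked example of policy iteration is given above, here is a
rough code sketch of the POLICY-ITERATION loop with a simple iterative
VALUE-DETERMINATION for the same grid world, reusing transition(),
ACTIONS, TERMINALS, and STATES from the earlier sketches (the
convergence threshold, and the assumption that every policy we evaluate
eventually reaches a terminal state, are ours):

   def q_value(s, a, U):
       return sum(p * U[t] for t, p in transition(s, a).items())

   def value_determination(P, U, reward=-0.04, epsilon=1e-6):
       # Like value iteration, but with the policy's action instead of max(a).
       # Assumes P eventually reaches a terminal state (true in this grid world).
       while True:
           U_copy = dict(U)
           for s in STATES:
               if s not in TERMINALS:
                   U[s] = reward + q_value(s, P[s], U_copy)
           if max(abs(U[s] - U_copy[s]) for s in STATES) < epsilon:
               return U

   def policy_iteration(reward=-0.04):
       U = {s: TERMINALS.get(s, reward) for s in STATES}
       P = {s: max(ACTIONS, key=lambda a: q_value(s, a, U))
            for s in STATES if s not in TERMINALS}
       while True:
           policy_unchanged = True
           U = value_determination(P, U, reward)
           for s in P:
               best = max(ACTIONS, key=lambda a: q_value(s, a, U))
               if q_value(s, best, U) > q_value(s, P[s], U):
                   P[s] = best
                   policy_unchanged = False
           if policy_unchanged:
               return P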
Conclusion
----------

In review, we've looked at Markov decision problems, including utility
functions and optimal policies. The definition of an MDP includes an
accessible, stochastic environment with a known transition model. One
remaining question is: how do you get a transition model? Since the
environment is accessible, one possibility is to use training
sequences. This entails running the agent through a bunch of test
sequences in the environment, and recording how often the agent moves
from one particular state to another. You can build a transition model
from the gathered statistics. It might also be possible to observe the
reward for each state. So, from the agent's experience you can get a
reward function and a transition model. From these you can construct
the utility function, which lets you evaluate the agent's actions. And
finally, from all of this you can calculate a policy, which solves the
Markov decision problem.

--------------------------------- END ----------------------------------