Summary of MDPs (until Now) Finite-horizon MDPs Non-stationary policy Value iteration Compute V0 ..Vk .. VT the value functions for k stages to go Vk is computed in terms of Vk-1 Policy Pk is MEU of Vk Indefinite horizon MDPs Infinite-horizon MDPs Stationary policy Value iteration

Converges because of contraction property of Bellman operator Policy iteration --Stochastic Shortest Path problems (with initial state given) Proper policies --Can exploit start state Ideas for Efficient Algorithms.. Use search Useheuristic factored representations (and reachability

Factored representations for Actions, Reward Functions, information) Values and Policies LAO*, RTDPmanipulating factored representations during Directly Bellman update Usethe execution and/or Simulation Actual Execution Reinforcement learning (Main motivation for RL is to learn the model) Simulation simulate the given model to sample possible futures

Policy rollout, hindsight optimization etc. Heuristic Search vs. Dynamic Programming (Value/Policy Iteration) Heuristic VI and PI approaches search (e.g. use A*/AO*) Dynamic explores Programming only the part

Update of the state space that actually reachable the initial state Set the value of aisstate in terms of thefrom maximum expected value within Even achievable the reachable

by doing space, actionsheuristic from thatsearch state.can avoid visiting of the for states. They domany the update every state in the state space on the quality of the heuristic used.. Depending Wasteful if we know the initial

state(s) that the agent is Butstarting what isfrom the heuristic? An admissible heuristic is a lowerbound on the cost to reach goal from any given state It is a lowerbound on J*! Real Time Dynamic Programming [Barto, Bradtke, Singh95] Trial: simulate greedy policy starting from start state; perform Bellman backup on visited states RTDP: repeat Trials until cost function converges RTDP was originally introduced for Reinforcement Learning For RL, instead of simulate you execute

You also have to do exploration in addition to exploitation with probability p, follow the greedy policy with 1-p pick a random action Stochastic Shortest Path MDP 0 RTDP Trial Qn+1(s0,a) agreedy = a2 Min a1 Jn+1(s0) s0

a2 Jn Jn ? Jn ? Jn a3 ? Jn Jn Jn Goal Greedy On-Policy RTDP without

execution Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you Comments Properties if all states are visited infinitely often then Jn J* Only relevant states will be considered A state is relevant if the optimal policy could visit it. Notice emphasis on optimal policyjust because a rough neighborhood surrounds National Mall doesnt mean that you will need to know what to do in that neighborhood Advantages Anytime: more probable states explored quickly

Disadvantages complete convergence is slow! no termination condition Do we care about complete convergence? Think Cpt. Sullenberger Labeled RTDP [Bonet&Geffner03] Initialise J0 with an admissible co st s Jn monotonically increases

Converged means heuristicbellman residual is less than e Q Label a state as solved hi gh Q s best action hi gh

co st s if the Jn for that state has converged G ) J(s) wont change! s ? G t both s and t get solved together Backpropagate solved labeling

Stop trials when they reach any solved state Terminate when s0 is solved Probabilistic Planning --The competition (IPPC) --The Action language.. (PPDDL) Factored Representations: Actions Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too! Write a Bayes Network relating the value of fluents at the state before and after the action Bayes networks representing fluents at different time points are called Dynamic Bayes Networks We look at 2TBN (2-time-slice dynamic bayes nets) Go further by using STRIPS assumption

Fluents not affected by the action are not represented explicitly in the model Called Probabilistic STRIPS Operator (PSO) model Action CLK Not ergodic How to compete? Policy Computation Off-line policy generation Exe c First compute the whole policy Get the initial state

Compute the optimal policy given the initial state and the goals Then just execute the policy Loop Do action recommended by the policy Get the next state Until reaching goal state Pros: Can anticipate all problems; Cons: May take too much time to start executing Selec t e x

Selec t e x Selec t e x Online action selection Loop Selec t

e x Compute the best action for the current state execute it get the new state Pros: Provides fast first response Cons: May paint itself into a corner.. 1 IPPC & Post-Mortem.. st IPPC Competitors Most IPPC competitors used different approaches for offline

policy generation. One group implemented a simple online replanning approach in addition to offline policy generation Determinize the probabilistic problem Most-likely vs. All-outcomes Loop Get the state S; Call a classical planner (e.g. FF) with [S,G] as the problem Execute the first action of the plan Umpteen reasons why such an approach should do quite badly.. Results and Postmortem To everyones surprise, the replanning approach wound up winning the competition.

Lots of hand-wringing ensued.. May be we should require that the planners really really use probabilities? May be the domains should somehow be made probabilistically interesting? Current understanding: No reason to believe that off-line policy computation must dominate online action selection The replanning approach is just a degenerate case of hind-sight optimization FF-Replan Simple replanner Determinizes the probabilistic problem Solves for a plan in the determinized problem

a3 a2 a1 S a2 a5 a4 a3 a4 G G All Outcome Replanning (FFRA) ICAPS-07

Probability1 Effect 1 Action 1 Effect 1 Effect 2 Action 2 Effect 2

Action Probability2 27 Reducing calls to FF.. We can reduce calls to FF by memoizing successes If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0a0s1a1s2a2s3ansG Then in addition to sending a1 to the simulator, we can memoize {siai} as the partial policy. Whenever a new state is given by the simulator, we can see if it is already in the partial policy Additionally, FF-replan can consider every state in the partial policy table as a goal state (in that if it reaches them, it knows how to get to goal state..) Hindsight Optimization for

Anticipatory Planning/Scheduling Consider a deterministic planning (scheduling) domain, where the goals arrive probabilistically Using up resources and/or doing greedy actions may preclude you from exploiting the later opportunities How do you select actions to perform? Answer: If you have a distribution of the goal arrival, then Sample goals upto a certain horizon using this distribution Now, we have a deterministic planning problem with known goals Solve it; do the first action from it. Can improve accuracy with multiple samples FF-Hop uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (it is the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes) Probabilistic Planning (goal-oriented)

Left Outcomes are more likely Action Maximize Goal Achievement I A1 Probabilistic Outcome A2 Time 1 A1

A2 A1 A2 A1 A2 A1 A2 Time 2 Action State Dead End

Goal State 30 Problems of FF-Replan and better alternative sampling FF-Replans Static Determinizations dont respect probabilities. We need Probabilistic and Dynamic Determinization Sample Future Outcomes and Determinization in Hindsight Each Future Sample Becomes a Known-Future Deterministic Problem 33 Hindsight Optimization (Online Computation of VHS ) Pick action a with highest Q(s,a,H) where

Q(s,a,H) = R(s,a) + ST(s,a,s)V*(s,H-1) Compute V* by sampling H-horizon future FH for M = [S,A,T,R] Mapping of state, action and time (h

But this is still too hard to compute.. Lets swap max and expectation VHS(s,H) = EFH [max R(s,FH,)] max R(s,FH-1,) is approximated by FF plan VHS overestimates V* Why? Intuitively, because VHS can assume that it can use different policies in different futures; while V* needs to pick one policy that works best (in expectation) in all futures. But then, VFFRa overestimates VHS Viewed in terms of J*, VHS is a more informed admissible heuristic..

34 Implementation FF-Hindsight Constructs a set of futures Solves the planning problem using the H-horizon futures using FF Sums the rewards of each of the plans Chooses action with largest Qhs value Left Outcomes are more likely Probabilistic Planning (goal-oriented) Action Maximize Goal Achievement

I A1 Probabilistic Outcome A2 Time 1 A1 A2 A1 A2 A1

A2 A1 A2 Time 2 Action State Dead End Goal State 38 Improvement Ideas Reuse Generated futures that are still relevant Scoring for action branches at each step If expected outcomes occur, keep the plan

Future generation Not just probabilistic Somewhat even distribution of the space Adaptation Dynamic width and horizon for sampling Actively detect and avoid unrecoverable failures on top of sampling Left Outcomes are more likely Hindsight Sample 1 Action Maximize Goal Achievement

I A1 Probabilistic Outcome A2 Time 1 A1 A2 A1 A2 A1

A2 A1 A2 Time 2 Action State A1: 1 A2: 0 Dead End Goal State 40 Exploiting Determinism Longest

prefix for each plan is identified and executed without running ZSL, OSL or FF! Plans generated for chosen action, a* S1 S1 S1 a*

a* a* G G G Handling unlikely outcomes: All-outcome Determinization Assign each possible outcome an action Solve for a plan Combine the plan with the plans from the HOP solutions Relaxations for Stochastic Planning Determinizations can also be used as a basis for heuristics to initialize the V for value

iteration [mGPT; GOTH etc] Heuristics come from relaxation We can relax along two separate dimensions: Relax ve interactions Consider +ve interactions alone using relaxed planning graphs Relax uncertainty Consider determinizations Or a combination of both! Solving Determinizations If we relax ve interactions Then compute relaxed plan Admissible if optimal relaxed plan is computed Inadmissible otherwise If we keep ve interactions

Then use a deterministic planner (e.g. FF/LPG) Inadmissible unless the underlying planner is optimal Negative Interactions Increasing consideration Dimensions of Relaxation 3 4 1 2

Uncertainty 1 Relaxed Plan Heuristic 2 McLUG 3 FF/LPG 4 Limited width stochastic Reducing Uncertainty Bound the number of stochastic outcomes Stochastic width Dimensions of Relaxation -ve interactions Uncertainty None

None Some Relaxed Plan McLUG FF/LPG Limited width Stoch Planning Some Full

Full Expressiveness v. Cost Limited width stochastic planning F F Node Expansions v. Heuristic Computation Cost McLUG Nodes Expanded FFReplan Computation Cost

h= 0 FFR FF Reducing Heuristic Computation Cost by exploiting factored representations The heuristics computed for a state might give us an idea about the heuristic value of other similar states Similarity is possible to determine in terms of the state structure Exploit overlapping structure of heuristics for different states E.g. SAG idea for McLUG E.g. Triangle tables idea for plans (c.f. Kolobov)

A Plan is a Terrible Thing to Waste Suppose we have a plan s0a0s1a1s2a2s3ansG We realized that this tells us not just the estimated value of s0, but also of s1,s2sn So we dont need to compute the heuristic for them again Is that all? If we have states and actions in factored representation, then we can explain exactly what aspects of si are relevant for the plans success. The explanation is a proof of correctness of the plan Can be based on regression (if the plan is a sequence) or causal proof (if the plan is a partially ordered one. The explanation will typically be just a subset of the literals making up the state That means actually, the plan suffix from si may actually be relevant in many more states that are consistent with that explanation

Triangle Table Memoization Use triangle tables / memoization C B A A B C If the above problem is solved, then we dont need to call FF again for the below: B A A B

Explanation-based Generalization (of Successes and Failures) Suppose we have a plan P that solves a problem [S, G]. We can first find out what aspects of S does this plan actually depend on Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof Now you can memoize this plan for just that subset of S Factored Representations: Reward, Value and Policy Functions Reward functions can be represented in factored form too. Possible representations include Decision trees (made up of fluents) ADDs (Algebraic decision diagrams)

Value functions are like reward functions (so they too can be represented similarly) Bellman update can then be done directly using factored representations.. SPUDDs use of ADDs Direct manipulation of ADDs in SPUDD