Announcements
No AI seminar this week. Extra office hours today 3-4pm (DL580).

Last time: Monte Carlo methods for RL
Get a whole bunch of sample action sequences (episodes); average the rewards achieved to get values for each state/action.
Generalized Policy Iteration:
1) Evaluate the current policy by sampling experiences/episodes
2) Use updated values to improve the policy
3) Wash, rinse, repeat

Temporal-Difference (TD) Learning
Combines the sample-based learning of Monte Carlo with the lookahead efficiency of dynamic programming.
Basic idea: use lookahead with current estimates to learn from every experience!

Start with evaluation of a fixed policy first; then use GPI to learn new policies.

TD learning: What we need
s: previous state
a: action we took to get here
R(s): observed reward
s': current state
A(s'): actions we can take next

TD evaluation
Update the current utility estimate for state s every time we take an action from it:

U_{t+1}(s) = U_t(s) + α [ R(s) + γ U_t(s') − U_t(s) ]

The bracketed term is the TD error at this timestep. The learning rate α controls how much we learn from each sample.

Here s' is the outcome of taking action a from s.

Equivalently, the update is a running interpolation:

U_{t+1}(s) = (1 − α) U_t(s) + α [ R(s) + γ U_t(s') ]

If we extend the lookahead to the end of the sequence, we get Monte Carlo estimation.
In the limit, this will converge to the true expected utility of s, and faster than Monte Carlo!
The finite lookahead allows dealing with infinite experience sequences!

Effect of learning rate
The running interpolation update gives an exponential moving average:
Makes recent samples more important; forgets about the past (distant past values were wrong anyway).
Decreasing the learning rate makes the averages converge over time.
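The TD evaluation update above can be sketched in Python. The episode format and function name here are illustrative assumptions, not from the slides:

```python
def td_evaluate(episodes, gamma=1.0, alpha=0.5):
    """TD(0) policy evaluation: update U(s) after every observed transition.

    Each episode is a list of (state, reward, next_state) transitions
    generated by following the fixed policy being evaluated.
    """
    U = {}  # utility estimates, default 0
    for episode in episodes:
        for s, reward, s_next in episode:
            u_s = U.get(s, 0.0)
            u_next = U.get(s_next, 0.0)
            # Running interpolation form: U(s) <- (1-a)U(s) + a[R(s) + g U(s')]
            U[s] = (1 - alpha) * u_s + alpha * (reward + gamma * u_next)
    return U
```

Note that utilities of unvisited states default to 0; in practice you might initialize them optimistically instead.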

Example: Temporal Difference learning
Observed transitions: B, east, C, −2; then C, east, D, −2.
Assume γ = 1, α = 1/2, with initial utilities A = 0, B = 0, C = 0, D = 8, E = −1.

U_{t+1}(s) = (1 − α) U_t(s) + α [ R(s) + γ U_t(s') ]

After B, east, C, −2: U(B) = (1/2)(0) + (1/2)(−2 + 1·0) = −1
After C, east, D, −2: U(C) = (1/2)(0) + (1/2)(−2 + 1·8) = 3

State utilities vs Q values
We define the utility of a state as the maximum Q value over its actions:

U(s) = max_a Q(s, a)

Under a deterministic policy π, a = π(s), so this becomes U(s) = Q(s, π(s)).

Exploration with ε-greedy policies
To improve on the current policy π, we need to know Q values for all (s, a) pairs,

not just what π says. Instead of deterministically following π, introduce some random exploration! Let ε be our exploration factor. Then, the action we choose in state s is:

π(s) = argmax_a Q(s, a)   with probability 1 − ε
π(s) = a random action from A(s)   with probability ε

This is called an ε-greedy policy.

Sarsa: learning policies with TD
Like before, but now estimate Q values for state-action pairs instead of state utilities:

Q_{t+1}(s, a) ← Q_t(s, a) + α [ R(s) + γ Q_t(s', a') − Q_t(s, a) ]

Uses all of s, a, R, s', a'.
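The ε-greedy action choice described above can be sketched as a small helper (the names are illustrative, not from the slides):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick argmax_a Q(state, a) with probability 1 - epsilon,
    otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # exploit: best known action (ties broken by list order)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With ε = 0 this reduces to pure greedy action selection.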

Hence the name, Sarsa: both a and a' are chosen by the current (ε-greedy) policy.
Now we can use the updated Q values to extract a new policy!

Example: Sarsa
States A, B, C, D, E. Decisions/experiences: B, east, C, −2; then C, east, D, −2; then C, east.
Assume γ = 1, α = 1/2.

Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [ R(s) + γ Q_t(s', a') ]

[Figure: grid of Q values for states A through E, updated after each experience]
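The Sarsa loop can be sketched as follows. The environment interface `env_step(s, a)`, returning (reward, next_state, done), is an assumption for illustration; it is not defined on the slides:

```python
import random

def sarsa(env_step, start_state, actions, episodes=100,
          gamma=1.0, alpha=0.5, epsilon=0.1):
    """Sketch of Sarsa: on-policy TD learning of Q values."""
    Q = {}

    def choose(s):
        # e-greedy choice using the current Q estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    for _ in range(episodes):
        s, a = start_state, choose(start_state)
        done = False
        while not done:
            reward, s_next, done = env_step(s, a)
            a_next = choose(s_next)  # a' chosen by the same e-greedy policy
            q = Q.get((s, a), 0.0)
            q_next = 0.0 if done else Q.get((s_next, a_next), 0.0)
            # Q(s,a) <- Q(s,a) + a[R + g Q(s',a') - Q(s,a)]
            Q[(s, a)] = q + alpha * (reward + gamma * q_next - q)
            s, a = s_next, a_next
    return Q
```

Because a' is the action the policy will actually take next, the estimates track the ε-greedy policy itself.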

On-policy vs off-policy learning
In Sarsa, we choose a specific next action a' for updating Q(s, a). This is called on-policy learning, because we're using the current policy to help decide an action for learning from. (Like policy iteration last time.)
We can also look at all possible next actions A(s'), and use the best one to update Q(s, a). This is off-policy learning, because the action we use for updating is separate from the action we actually take. (Similar to value iteration, but active.)

Q-learning: off-policy TD
To update Q(s, a), now use the estimated value for the best next action:

Q_{t+1}(s, a) ← Q_t(s, a) + α [ R(s) + γ max_{a'} Q_t(s', a') − Q_t(s, a) ]

Here a is the action we actually took to get here, while all available actions a' in A(s') are considered. After the update, use the current (ε-greedy) policy to choose the actual a' to take.
This will estimate optimal Q values regardless of the quality of the policy being followed!

Example: Q learning
States A, B, C, D, E. Decisions/experiences: B, east, C, −2

then C, east, D, −2; then C, east.
Assume γ = 1, α = 1/2.

Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [ R(s) + γ max_{a'} Q_t(s', a') ]

[Figure: grid of Q values for states A through E, updated after each experience]

Properties of TD learning
Both Sarsa and Q learning improve the Q value estimates at each step.
Note this also improves the policy, which is dynamically recalculated at every step!
With a couple of assumptions, they will converge to the optimal policy/values given infinite experiences:
The learning rate decreases over time (we know what we're doing, eventually).
The exploration factor ε must be high enough (though not too high!).
Note: in the limit, it doesn't matter how you select actions!
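The Q-learning update from above can be sketched similarly to Sarsa; only the lookahead differs. Again, `env_step(s, a)` returning (reward, next_state, done) is an assumed interface:

```python
import random

def q_learning(env_step, start_state, actions, episodes=100,
               gamma=1.0, alpha=0.5, epsilon=0.1):
    """Sketch of off-policy TD (Q-learning)."""
    Q = {}
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # behave with the e-greedy policy...
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q.get((s, x), 0.0))
            reward, s_next, done = env_step(s, a)
            # ...but update with the best next action (off-policy)
            best_next = 0.0 if done else max(
                Q.get((s_next, x), 0.0) for x in actions)
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (reward + gamma * best_next - q)
            s = s_next
    return Q
```

The max over a' in the update is what makes the learned values track the optimal policy rather than the behavior policy.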

Which is better?
What are the advantages of on-policy vs off-policy learning?
On-policy is a bit more realistic (especially with a large action space).
Off-policy sounds more optimal (but isn't in the limit).
Mostly: ???

Making it (practically) better
Better exploration with exploration functions
Measuring mistakes with regret
Better generalization with feature functions

How to explore?

Several schemes for forcing exploration.
Simplest: random actions (ε-greedy). Every time step, flip a coin: with (small) probability ε, act randomly; with (large) probability 1 − ε, act on the current policy.
Problems with random actions? You do eventually explore the space, but keep thrashing around once learning is done.
One solution: lower ε over time. Another solution: exploration functions.

Exploration functions
When to explore? Random actions explore a fixed amount. Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring.
An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g.

f(u, n) = u + min(k/n, 1)

Modified Q update:

Q_{t+1}(s, a) ← Q_t(s, a) + α [ R(s) + γ max_{a'} f(Q_t(s', a'), N(s', a')) − Q_t(s, a) ]

Note: this propagates the bonus back to states that lead to unknown states as well!

Regret
Even if you learn the optimal policy, you still make mistakes along the way! Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards.
Minimizing regret goes beyond learning to be optimal; it requires optimally learning to be optimal.
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.
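The exploration-function update described above might be sketched as follows. The visit-count table N, the constant k, and the bonus for never-tried actions are assumptions chosen to be consistent with the slide:

```python
def explore_f(u, n, k=2.0):
    """Optimistic utility: the bonus shrinks as visit count n grows."""
    if n == 0:
        return u + 1.0  # never tried: full bonus (assumed convention)
    return u + min(k / n, 1.0)

def q_update_with_exploration(Q, N, s, a, reward, s_next, actions,
                              gamma=1.0, alpha=0.5):
    """One Q-learning step where the lookahead uses optimistic values."""
    N[(s, a)] = N.get((s, a), 0) + 1
    best_next = max(explore_f(Q.get((s_next, x), 0.0),
                              N.get((s_next, x), 0)) for x in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (reward + gamma * best_next - q)
```

Since the bonus enters through the lookahead term, states that merely lead toward under-explored states also get inflated values, which is exactly the propagation effect noted above.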

Approximate Q-learning with feature functions

Generalizing across states
Basic Q-learning keeps a table of all Q-values. In realistic situations, we cannot possibly learn about every single state!
Too many states to visit them all in training; too many states to hold the Q-tables in memory.
Instead, we want to generalize: learn about some small number of training states from experience, and generalize that experience to new, similar situations.
This is a fundamental idea in machine learning, and we'll see it over and over again.

Example: Pacman
Let's say we discover through experience

that this state is bad. In naïve Q-learning, we know nothing about a nearly identical state, or even a clearly related one!

Back in time: Search vs Logic
Logic: reason about the world to identify good actions.
Space of attributive states; successor function defined over attribute values.
Successors are effectively partial states (what has changed?), guided by knowledge about the world.
Example attribute state: Pos: [1,1], Dot1: True, Dot2: True

Successors: Pos: [1,2], Dot4: False; Pos: [2,1], Dot8: False

Feature-based representations
Describe a state using a vector of features (attributes). Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
Example features:
Distance to closest ghost

Distance to closest dot
Number of ghosts
1 / (distance to dot)^2
Is Pacman in a tunnel? (0/1)
etc.
Is it the exact state on this slide?
Can also describe a q-state (s, a) with features (e.g. "action moves closer to food").

Linear Value Functions
Using a feature representation, we can write a q function (or value function) for any state using a few weights:

U(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)

Advantage: our experience is summed up in a few powerful numbers.
Disadvantage: some states may be similar in features, but we want them to have very different values!
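A linear Q-function over features can be sketched as follows, together with the online weight adjustment that approximate Q-learning uses to learn the w_i from experience. The function names and feature representation are illustrative assumptions:

```python
def linear_q(weights, features, s, a):
    """Q(s, a) = w1*f1(s, a) + ... + wn*fn(s, a).
    `features` is a list of functions f_i(s, a) -> float."""
    return sum(w * f(s, a) for w, f in zip(weights, features))

def update_weights(weights, features, s, a, reward, s_next, actions,
                   gamma=1.0, alpha=0.5):
    """One approximate Q-learning step: nudge each w_i toward the
    one-step lookahead target, in proportion to how active f_i was."""
    best_next = max(linear_q(weights, features, s_next, x) for x in actions)
    difference = (reward + gamma * best_next) - linear_q(weights, features, s, a)
    # w_i <- w_i + alpha * difference * f_i(s, a)
    return [w + alpha * difference * f(s, a)
            for w, f in zip(weights, features)]
```

One update touches every weight whose feature was nonzero in (s, a), which is how experience in one state generalizes to all states sharing its features.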

Approximate Q-Learning
Q-learning with linear Q-functions. Let the TD error be

difference = [ R(s) + γ max_{a'} Q(s', a') ] − Q(s, a)

Exact Q's: Q(s, a) ← Q(s, a) + α · difference
Approximate Q's: w_i ← w_i + α · difference · f_i(s, a)

Intuitive interpretation: adjust the weights of active features. E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features.
Formal justification: online least squares.

Example: Q-Pacman

Policy Search

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate U / Q best.
Q-learning's priority: get Q-values close (modeling).
Action selection's priority: get the ordering of Q-values right (prediction).
We'll see this distinction between modeling and prediction again later in the course.
Solution: learn policies that maximize rewards, not the values that predict them.
Policy search: start with an OK solution (e.g. Q-learning), then fine-tune by hill climbing on feature weights.

Policy Search
Simplest policy search:
Start with an initial linear value function or Q-function.
Nudge each feature weight up and down and see if your policy is better than before.
Problems: How do we tell the policy got better? Need to run many sample episodes! If there are a lot of features, this can be impractical.

Better methods exploit lookahead structure, sample wisely, and change multiple parameters.

Next Time
First half review
Genetic algorithms