CSE 3521: Intro to Artificial Intelligence

Announcements: No AI seminar this week. Extra office hours today 3-4pm (DL580).

Last time: Monte Carlo methods for RL. Get a whole bunch of sample action sequences (episodes), then average out the rewards achieved to get values for each state/action.

Generalized Policy Iteration: 1) Evaluate the current policy by sampling experiences/episodes. 2) Use the updated values to improve the policy. 3) Wash, rinse, repeat.

Temporal-Difference (TD) Learning: Combine the sample-based learning of Monte Carlo methods with the lookahead efficiency of dynamic programming. Basic idea: use lookahead with current estimates to learn from every experience!

Start with evaluation of a fixed policy first; then use GPI to learn new policies.

TD learning: what we need
- s: the previous state
- a: the action we took to get here
- R(s): the observed reward
- s': the current state (the outcome of taking action a)
- A(s'): the actions we can take next

TD evaluation: update the current utility estimate for state s every time we take an action from it:

$U_{t+1}(s) = U_t(s) + \alpha \left[ R(s) + \gamma\, U_t(s') - U_t(s) \right]$

The bracketed quantity is the TD error at this timestep, and the learning rate $\alpha$ controls how much we learn from each sample. Written as a running interpolation, the same update is

$U_{t+1}(s) = (1 - \alpha)\, U_t(s) + \alpha \left[ R(s) + \gamma\, U_t(s') \right]$

If we extend the lookahead to the end of the sequence, we get Monte Carlo estimation. In the limit, TD evaluation will converge to the true expected utility of s, and faster than Monte Carlo! The finite lookahead also lets us deal with infinite experience sequences.

Effect of the learning rate: $\alpha$ gives an exponential moving average. The running interpolation update makes recent samples more important and forgets about the past (distant past values were wrong anyway). Decreasing the learning rate over time makes the averages converge.
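As a concrete illustration, here is a minimal Python sketch of TD(0) evaluation of a fixed policy. The environment interface (env.reset(), env.step(s, a) returning (reward, next_state, done)) and the reward-on-exit convention R(s) are assumptions made for this sketch, not something specified in the slides.

```python
from collections import defaultdict

def td_evaluate(env, policy, episodes=1000, alpha=0.5, gamma=1.0):
    """TD(0) evaluation of a fixed policy.

    Assumes a hypothetical interface: env.reset() returns a start state and
    env.step(s, a) returns (reward, next_state, done).
    """
    U = defaultdict(float)            # utility estimates, default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                          # follow the fixed policy
            reward, s_next, done = env.step(s, a)
            td_error = reward + gamma * U[s_next] - U[s]
            U[s] += alpha * td_error               # U(s) <- U(s) + alpha * TD error
            s = s_next
    return U
```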

Example: Temporal Difference learning

States: a gridworld with states A, B, C, D, E. Observed transitions: B, east, C, -2 and then C, east, D, -2.

Assume $\gamma = 1$, $\alpha = 1/2$, and apply $U_{t+1}(s) = (1 - \alpha)\, U_t(s) + \alpha \left[ R(s) + \gamma\, U_t(s') \right]$. Starting from U(D) = 8 and all other utilities 0, the first transition gives $U(B) = \tfrac{1}{2}(0) + \tfrac{1}{2}(-2 + 0) = -1$ and the second gives $U(C) = \tfrac{1}{2}(0) + \tfrac{1}{2}(-2 + 8) = 3$.

State utilities vs Q values: we define the utility of a state as the maximum Q value over its actions, $U(s) = \max_a Q(s, a)$. Under a deterministic policy, $a = \pi(s)$, so this becomes $U^{\pi}(s) = Q(s, \pi(s))$.

Exploration with ε-greedy policies: to improve on the current policy $\pi$, we need to know Q values for all (s, a) pairs,

not just what $\pi$ says. Instead of deterministically following $\pi$, introduce some random exploration! Let $\epsilon$ be our exploration factor. Then the action we choose in state s is

$\pi(s) = \begin{cases} \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon \\ \text{a random action from } A(s) & \text{with probability } \epsilon \end{cases}$

This is called an ε-greedy policy.

Sarsa: learning policies with TD. Like before, but now estimate Q values for state-action pairs instead of state utilities:

$Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha \left[ R(s) + \gamma\, Q_t(s', a') - Q_t(s, a) \right]$

The update uses all of s, a, R(s), s', a' (hence "Sarsa"), and both a and a' are chosen by the current (ε-greedy) policy. Now we can use the updated Q values to extract a new policy $\pi$!
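Below is a minimal sketch of Sarsa with ε-greedy action selection, under the same hypothetical environment interface as the TD sketch above; actions(s) (the available actions in a state) and the fixed epsilon are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, actions, s, epsilon):
    """Pick argmax_a Q(s, a) with probability 1 - epsilon, else a random action."""
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, actions, s, epsilon)
        done = False
        while not done:
            reward, s_next, done = env.step(s, a)
            # a' is chosen by the same epsilon-greedy policy: on-policy learning
            a_next = epsilon_greedy(Q, actions, s_next, epsilon)
            target = reward + gamma * (0.0 if done else Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```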

Example: Sarsa

States: the same gridworld with states A, B, C, D, E. Decisions/experiences: B, east, C, -2; C, east, D, -2; C, east.

Assume $\gamma = 1$, $\alpha = 1/2$, and apply $Q_{t+1}(s, a) = (1 - \alpha)\, Q_t(s, a) + \alpha \left[ R(s) + \gamma\, Q_t(s', a') \right]$.

(Gridworld diagram of the Q-value estimates before and after each update not reproduced here.)

On-policy vs off-policy learning: In Sarsa, we choose a specific next action a' for updating Q(s, a). This is called on-policy learning, because we're using the current policy to help decide an action to learn from (like policy iteration last time). We can instead look at all possible next actions A(s') and use the best one to update Q(s, a). This is off-policy learning, because the action we use for updating is separate from the action we actually take (similar to value iteration, but active).

Q-learning: off-policy TD. To update Q(s, a), now use the estimated value of the best next action:

$Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha \left[ R(s) + \gamma \max_{a' \in A(s')} Q_t(s', a') - Q_t(s, a) \right]$

Here a is the action we actually took to get here, and all available actions a' are considered in the max. After the update, use the current (ε-greedy) policy to choose the actual a' to take. This will estimate the optimal Q values regardless of the quality of the policy being followed!
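For comparison, here is a sketch of the off-policy Q-learning loop: only the update target changes (a max over all next actions instead of the action the ε-greedy policy actually picks). It reuses the hypothetical epsilon_greedy helper and environment interface from the Sarsa sketch above.

```python
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    # epsilon_greedy is the helper defined in the Sarsa sketch above
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # behave with the current epsilon-greedy policy...
            a = epsilon_greedy(Q, actions, s, epsilon)
            reward, s_next, done = env.step(s, a)
            # ...but update toward the best available next action (off-policy)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```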

Example: Q-learning

States: the same gridworld with states A, B, C, D, E. Decisions/experiences: B, east, C, -2; C, east, D, -2; C, east.

Assume $\gamma = 1$, $\alpha = 1/2$, and apply $Q_{t+1}(s, a) = (1 - \alpha)\, Q_t(s, a) + \alpha \left[ R(s) + \gamma \max_{a' \in A(s')} Q_t(s', a') \right]$.

(Gridworld diagram of the Q-value estimates before and after each update not reproduced here.)

Properties of TD learning: Both Sarsa and Q-learning improve the Q value estimates at each step. Note that this also improves the policy, which is dynamically recalculated at every step! With a couple of assumptions, they will converge to the optimal policy/values given infinite experiences: the learning rate must decrease over time (we know what we're doing, eventually), and the exploration factor $\epsilon$ must be high enough (though not too high!). Note: in the limit, it doesn't matter how you select actions!
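One common way to meet these conditions in practice (an illustrative choice, not prescribed by the slides) is a per-pair learning rate that decays with the visit count and an exploration factor that decays per episode:

```python
from collections import defaultdict

visits = defaultdict(int)   # N(s, a): how many times we've updated this pair

def learning_rate(s, a):
    """Decaying per-pair learning rate: alpha = 1 / (1 + N(s, a))."""
    visits[(s, a)] += 1
    return 1.0 / (1.0 + visits[(s, a)])

def decayed_epsilon(episode, start=1.0, floor=0.05, decay=0.999):
    """Exploration factor that starts high and shrinks toward a floor."""
    return max(floor, start * (decay ** episode))
```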

Which is better? What are the advantages of on-policy vs off-policy learning? On-policy is a bit more realistic (especially with a large action space); off-policy sounds more optimal (but isn't in the limit). Mostly: ???

Making it (practically) better: better exploration with exploration functions, measuring mistakes with regret, and better generalization with feature functions.

How to explore?

Several schemes for forcing exploration. Simplest: random actions (ε-greedy). Every time step, flip a coin: with (small) probability $\epsilon$, act randomly; with (large) probability $1-\epsilon$, act on the current policy. Problems with random actions? You do eventually explore the space, but you keep thrashing around once learning is done. One solution: lower $\epsilon$ over time. Another solution: exploration functions.

Exploration functions: when to explore? Random actions explore a fixed amount. Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring. An exploration function takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. $f(u, n) = u + \min(k/n, 1)$.

Modified Q update:

$Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha \left[ R(s) + \gamma \max_{a' \in A(s')} f\!\left(Q_t(s', a'),\, N(s', a')\right) - Q_t(s, a) \right]$
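Here is a sketch of this modified update, assuming the exploration function form given above and a table N of per-pair visit counts; the constant k and the handling of unvisited pairs are illustrative choices.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-value estimates
N = defaultdict(int)     # visit counts per (state, action)

def f(u, n, k=2.0):
    """Optimistic utility: the estimate plus a bonus that shrinks with visits."""
    bonus = 1.0 if n == 0 else min(k / n, 1.0)   # min(k/n, 1), taking n=0 as the cap
    return u + bonus

def explored_q_update(s, a, reward, s_next, next_actions, alpha=0.5, gamma=1.0):
    """Q-learning update whose max is taken over optimistic values f(Q, N)."""
    N[(s, a)] += 1
    best_optimistic = max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in next_actions)
    Q[(s, a)] += alpha * (reward + gamma * best_optimistic - Q[(s, a)])
```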

Note: this propagates the bonus back to the states that lead to unknown states as well!

Regret: Even if you learn the optimal policy, you still make mistakes along the way! Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and the optimal (expected) rewards. Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal. Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret.

Approximate Q-learning with feature functions

Generalizing across states: Basic Q-learning keeps a table of all Q-values. In realistic situations, we cannot possibly learn about every single state! There are too many states to visit them all in training, and too many states to hold the Q-tables in memory. Instead, we want to generalize: learn about some small number of training states from experience, and generalize that experience to new, similar situations. This is a fundamental idea in machine learning, and we'll see it over and over again.

Example: Pacman. Let's say we discover through experience that some state is bad. In naive Q-learning, we know nothing about a similar state, or even a nearly identical one!

Back in time: Search vs Logic. Logic reasons about the world to identify good actions, over a space of attributive states: the successor function is defined over attribute values, so successors are effectively partial states (what has changed?), guided by knowledge about the world. For example, from Pos: [1,1], Dot1: True, Dot2: True, a successor might be Pos: [1,2], Dot4: False, or Pos: [2,1], Dot8: False.

Feature-based representations: Describe a state using a vector of features (attributes). Features are functions from states to real numbers (often 0/1) that capture important properties of the state. Example features: distance to the closest ghost, distance to the closest dot, number of ghosts, 1 / (distance to the closest dot)^2, is Pacman in a tunnel? (0/1), etc. Is it the exact state on this slide? We can also describe a q-state (s, a) with features (e.g. the action moves closer to food).

Linear Value Functions: Using a feature representation, we can write a Q function (or value function) for any state using a few weights:

$U(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s)$
$Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \cdots + w_n f_n(s, a)$

Advantage: our experience is summed up in a few powerful numbers. Disadvantage: some states may be similar in features, but we want them to have very different values!

Approximate Q-Learning: Q-learning with linear Q-functions. Compute the TD error for the sampled transition, then apply it either to the exact Q table or to the feature weights:

Exact Q's: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$

Approximate Q's: $w_i \leftarrow w_i + \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] f_i(s, a)$ for every weight $w_i$

Intuitive interpretation: adjust the weights of the active features. E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features. Formal justification: online least squares.

Example: Q-Pacman.
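Below is a sketch of the approximate update, assuming a hypothetical featurize(s, a) helper that returns a dict of named feature values; the learned parameters live in a single weight dictionary rather than a Q table.

```python
from collections import defaultdict

weights = defaultdict(float)   # one weight per feature name

def q_value(featurize, s, a):
    """Linear Q: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(weights[name] * value for name, value in featurize(s, a).items())

def approx_q_update(featurize, s, a, reward, s_next, next_actions,
                    alpha=0.05, gamma=1.0):
    """Adjust the weights of the active features by the TD error."""
    best_next = max((q_value(featurize, s_next, a2) for a2 in next_actions), default=0.0)
    difference = (reward + gamma * best_next) - q_value(featurize, s, a)
    for name, value in featurize(s, a).items():
        weights[name] += alpha * difference * value
```

If something unexpectedly bad happens, difference is negative, so every feature that was "on" (nonzero) has its weight pushed down, which is the disprefer-similar-states behavior described above.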

Policy Search

Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate U / Q best. Q-learning's priority: get the Q-values close (modeling). Action selection's priority: get the ordering of the Q-values right (prediction). We'll see this distinction between modeling and prediction again later in the course. Solution: learn policies that maximize rewards, not the values that predict them. Policy search: start with an ok solution (e.g. Q-learning), then fine-tune by hill climbing on the feature weights.

Simplest policy search: start with an initial linear value function or Q-function, then nudge each feature weight up and down and see if your policy is better than before. Problems: How do we tell the policy got better? We need to run many sample episodes! And if there are a lot of features, this can be impractical.
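A minimal sketch of this nudge-one-weight hill climbing follows; evaluate_policy(weights), which would run many sample episodes and return the average reward of the induced policy, is a hypothetical helper and is exactly the expensive step noted above.

```python
def hill_climb_weights(weights, evaluate_policy, step=0.1, iterations=100):
    """Greedy coordinate search over feature weights.

    evaluate_policy(weights) is assumed to run many sample episodes and
    return the average reward of the resulting (greedy) policy.
    """
    best_score = evaluate_policy(weights)
    for _ in range(iterations):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate_policy(candidate)   # many sample episodes!
                if score > best_score:
                    weights, best_score = candidate, score
    return weights
```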

Better methods exploit lookahead structure, sample wisely, and change multiple parameters at a time.

Next time: First half review; Genetic algorithms.
