Course Notes
CECS 477 : Neural Networks
Lecture of February 16, 2000
By Jim Ries, JimR@acm.org
Reinforcement Learning
Markov Decision Process
t=0,1,2,…
s ∈ {1, 2, …}
given state i, choose action u ∈ U(i)
i → U(i)
i → cost, our task is to minimize total cost
or
i → reward, our task is to maximize total reward
given i, u → j (new state) with P_{ij}(u)
where P_{ij}(u) is the probability of going from state i to j given action u
action policy:
μ: i → u
J_μ(i) = value estimate of policy μ starting from state i
infinite horizon: J_μ(i) = E_μ[ Σ_{t=0}^{∞} γ^{t} c_{t} | s_{0}=i ]
c_{t} = cost in time step t
s_{0} = start state
γ = discount factor (costs or rewards further in the future matter less)
0 < γ ≤ 1 (the weight γ^{t} shrinks as t grows, so later costs count less)
The finite-horizon case is the same, but the sum runs only up to t = M (the horizon), so the discount factor can be 1 without the sum diverging.
Sometimes, we want average cost, e.g.,
E_μ[ Σ_{t=0}^{M} γ^{t} c_{t} | s_{0}=i ] / M
(or the lim as M → ∞ of the above in the infinite-horizon case)
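As a tiny illustration, a Python sketch of the discounted and average cost quantities above (the costs list and γ value are made up for the example):

    def discounted_return(costs, gamma=0.9):
        """Sum of gamma^t * c_t over a finite trajectory of per-step costs."""
        return sum((gamma ** t) * c for t, c in enumerate(costs))

    def average_cost(costs, gamma=1.0):
        """Average cost over an M-step trajectory (gamma = 1 is allowed here)."""
        return discounted_return(costs, gamma) / len(costs)

    # Example: costs observed over a 4-step episode starting from s_0 = i
    print(discounted_return([1.0, 2.0, 0.0, 3.0], gamma=0.9))  # 1 + 0.9*2 + 0 + 0.9^3*3
    print(average_cost([1.0, 2.0, 0.0, 3.0]))                  # 6 / 4 = 1.5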
We use dynamic programming to find optimal policy. Reinforcement learning is a variant of dynamic programming.
value iteration
policy iteration
offline - traditional dynamic programming
online - reinforcement learning
Bellman equation
J_{k+1}(i) = min_{u∈U(i)} Q_{k}(i,u), where Q_{k}(i,u) = c_{i}(u) + γ Σ_{j∈S} P_{ij}(u) J_{k}(j)
The above equation without the k subscripts is the Bellman equation.
greedy policy
"one-step backup" - uses next state to update estimate of current state.
Policy Iteration
J(i) = c_{i}(u) + γ Σ_{j∈S} P_{ij}(u) J(j)
where u is the action chosen by the current policy: u = μ(i)
update policy, go back, …
algorithm:
init current policy
while current policy ≠ next policy
learn value function of policy
update policy (greedy actions based on current learned value function)
endwhile
Process will converge to optimal policy
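A rough sketch of this loop, reusing the same invented MDP (P, c, γ) as in the value-iteration sketch above; solving a linear system is one standard way to do the "learn value function of policy" step:

    import numpy as np

    # Same hypothetical MDP as in the value-iteration sketch
    P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],
                  [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]]])
    c = np.array([[1.0, 2.0], [0.5, 0.3], [0.0, 1.0]])
    gamma = 0.9

    def evaluate_policy(policy):
        """Solve J = c_mu + gamma * P_mu J exactly for the fixed policy mu."""
        n = len(policy)
        P_mu = np.array([P[policy[i], i] for i in range(n)])   # P_ij(mu(i))
        c_mu = np.array([c[i, policy[i]] for i in range(n)])   # c_i(mu(i))
        return np.linalg.solve(np.eye(n) - gamma * P_mu, c_mu)

    policy = np.zeros(3, dtype=int)                 # init current policy
    while True:
        J = evaluate_policy(policy)                 # learn value function of policy
        Q = c + gamma * np.einsum('uij,j->iu', P, J)
        next_policy = Q.argmin(axis=1)              # greedy actions w.r.t. learned J
        if np.array_equal(next_policy, policy):     # policy unchanged -> converged
            break
        policy = next_policy

    print(policy, J)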
Claim 1: Only policies that are greedy with regard to their own value function are optimal policies.
Q^{J*}(i, μ*(i)) = min_{u∈U(i)} Q^{J*}(i,u)
μ* is the optimal policy
J* is the value function of μ*
Claim 2: A necessary and sufficient condition for a value function to be the value function of the optimal policy is:
J(i) = min_{u∈U(i)} [ c_{i}(u) + γ Σ_{j∈S} P_{ij}(u) J(j) ]   (Bellman equation)
Thus, a solution to Bellman's equation leads to an optimal policy.
Synchronous update: J_{k+1}(i) = min_{u∈U(i)} Q_{k}(i,u) for all i (update all states in parallel)
Asynchronous update: update each state iteratively.
Both converge.
Real-time Dynamic Programming (an instance of on-line)
Random exploration
Prob(i,a) = e^{Q(i,a)/τ} / Σ_{a'} e^{Q(i,a')/τ}
τ = "temperature"; τ starts very high and settles ("cools") down over time. So we start out random and become more deterministic over time.
Dynamic Programming with incomplete information
Non-Bayesian: indirect estimation (estimate the model from data). The direct approach is Q-learning (model-free RL).
Barto et al., 1990
indirect estimate
P_{ij}(u) = η_{ij}^{u} / Σ_{l∈S} η_{il}^{u}
where η_{ij}^{u} is the number of observed transitions from i to j after taking action u
(maximum likelihood estimate)
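A quick sketch of this maximum-likelihood estimate, keeping counts η from observed (i, u, j) transitions; the observations and helper names are purely illustrative:

    from collections import defaultdict

    counts = defaultdict(int)            # counts[(i, u, j)] = eta_ij^u

    def record(i, u, j):
        counts[(i, u, j)] += 1

    def P_hat(i, u, j, states):
        """Estimated P_ij(u) = eta_ij^u / sum_l eta_il^u (maximum likelihood)."""
        total = sum(counts[(i, u, l)] for l in states)
        return counts[(i, u, j)] / total if total else 0.0

    # Hypothetical observed transitions (i, u, j)
    for obs in [(0, 1, 2), (0, 1, 2), (0, 1, 0), (0, 0, 1)]:
        record(*obs)
    print(P_hat(0, 1, 2, states=range(3)))   # 2/3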
Q-Learning
Off-line Q-learning
Adaptive Q-learning
Real-time Q-learning (also just called "Q-learning"; original Q-learning)
off-line Q-learning
Q_{k+1}(i,u) = (1 − α_{k}(i,u)) Q_{k}(i,u) + α_{k}(i,u) [ c_{i}(u) + γ min_{u'} Q_{k}(succ(i,u), u') ]
where α is the learning rate; rewritten as:
ΔQ_{k+1}(i,u) = α_{k}(i,u) [ c_{i}(u) + γ min_{u'} Q_{k}(succ(i,u), u') − Q_{k}(i,u) ]
So, since we update many times and only learn a little (α) each time, we can sample many successors and learn the most probable ones.
In the real-time case we have no model of the world, so we simply take whatever successor state the real world gives us (the prevailing conditions).
α should decrease over time. If each state–action pair is updated infinitely often and α decreases appropriately, the algorithm converges with probability 1.
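A small sketch of the real-time Q-learning update above; the dict-based Q-table and the env_step placeholder are assumptions for illustration, not part of the notes:

    from collections import defaultdict

    Q = defaultdict(float)               # lookup-table Q(i, u), defaults to 0
    gamma = 0.9

    def q_update(i, u, cost, j, actions, alpha):
        """Q(i,u) += alpha * [c_i(u) + gamma * min_u' Q(j, u') - Q(i,u)]."""
        best_next = min(Q[(j, u2)] for u2 in actions)
        Q[(i, u)] += alpha * (cost + gamma * best_next - Q[(i, u)])

    # Usage with a hypothetical environment step (i, u) -> (cost, successor j):
    # cost, j = env_step(i, u)
    # q_update(i, u, cost, j, actions=range(num_actions), alpha=0.1)
    # alpha should decrease over time, e.g. alpha = 1.0 / visit_count[(i, u)]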
Q-Learning                            | Value Iteration
more space                            | only need value of each state
update value many times               | one-step updates
each step fast (good for real-time)   | each step slow (due to the Σ over j)
Real-time and off-line can be mixed: "contemplate" (run off-line updates) after actions are learned.
Reference:
Barto, A.G., Bradtke, S.J., and Singh, S.P. (1995) Learning to act using real-time dynamic programming, Artificial Intelligence 72: 81–138.
(see http://www.umass.edu/neuro/faculty/barto.html for more on Andrew Barto)
Storing Value functions (Q or J)
Exact Method
Lookup table representation, i.e.,
State | Action | Q-value
  …   |   …    |   …
Approximate methods:
Paper presented by Jonathon Marjamaa
"Self-Improving Reactive Agents Based On Reinforcement Learning, Planning, and Teaching"
By Long-Ji Lin, Carnegie Mellon University 1992