CECS 477 : Neural Networks
Lecture of February 16, 2000
By Jim Ries, JimR@acm.org
Markov Decision Process
given state i, choose action u ∈ U(i)
i → cost: our task is to minimize total cost
i → reward: our task is to maximize total reward
given i, u → j (new state): Pij(u)
where Pij(u) is the probability of going from state i to j given action u
a policy μ: i → u (maps each state i to an action u)
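A minimal sketch of one way to hold these pieces in code; the dictionary layout and the state/action names are illustrative assumptions, not from the lecture.

# Hypothetical MDP encoded with plain dictionaries (names are made up)
states = ["s0", "s1"]
U = {"s0": ["a", "b"], "s1": ["a"]}            # U(i): admissible actions in state i
cost = {("s0", "a"): 1.0, ("s0", "b"): 0.5,    # ci(u): cost of taking u in state i
        ("s1", "a"): 0.0}
P = {("s0", "a"): {"s0": 0.2, "s1": 0.8},      # Pij(u): probability of i -> j under u
     ("s0", "b"): {"s0": 0.9, "s1": 0.1},
     ("s1", "a"): {"s1": 1.0}}
mu = {"s0": "a", "s1": "a"}                    # a policy mu: i -> u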
Jμ(i) - estimate of the value of plan μ starting from state i
infinite horizon: Jμ(i) = Eμ[ Σ (t=0 to ∞) γ^t ct | s0 = i ]
ct = cost in time step t
s0 = start state
γ = discount factor (if a reward is in the future, it matters less)
0 < γ ≤ 1 (the weight γ^t decreases with time, so later steps matter less)
The finite horizon case is the same, but the sum runs only to t = M (the horizon), so the discount factor can be 1 without problem.
Sometimes, we want average cost, e.g.,
Eμ[ Σ (t=0 to M) γ^t ct | s0 = i ] / M
(or the lim as M → ∞ of the above in the infinite case)
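A small sketch of the discounted return for one observed sequence of costs; the helper name and the numbers are illustrative, not from the lecture.

def discounted_return(costs, gamma=0.9):
    # sum of gamma^t * c_t over the observed cost sequence
    return sum((gamma ** t) * c for t, c in enumerate(costs))

# finite horizon M: pass only the first M costs; average cost: divide by M
# e.g. discounted_return([1.0, 0.5, 0.25]) -> 1.0 + 0.9*0.5 + 0.81*0.25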
We use dynamic programming to find optimal policy. Reinforcement learning is a variant of dynamic programming.
offline - traditional dynamic programming
online - reinforcement learning
Jk+1(i) = min over u ∈ U(i) [ ci(u) + γ Σ (j∈S) Pij(u) Jk(j) ]; the bracketed term is Q(i,u)
above equation without "k's" is Bellman equation
"one-step backup" - uses next state to update estimate of current state.
J(i) = ci(u) + γ Σ (j∈S) Pij(u) J(j)
where u is the action chosen by the policy, u = μ(i)
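A one-line sketch of this backup, assuming the hypothetical dictionary-style cost and P structures sketched earlier:

def one_step_backup(J, i, u, cost, P, gamma=0.9):
    # use the successor states' values to update the estimate for state i
    J[i] = cost[(i, u)] + gamma * sum(p * J[j] for j, p in P[(i, u)].items())
    return J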
update policy, go back, …
init current policy
while current policy ≠ next policy
learn value function of policy
update policy (greedy actions based on current learned value function)
Process will converge to optimal policy
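A compact policy-iteration sketch of the loop above, again assuming the hypothetical dictionary MDP from earlier; a fixed number of iterative sweeps stands in for an exact policy evaluation.

def policy_iteration(states, U, cost, P, gamma=0.9, eval_sweeps=100):
    mu = {i: U[i][0] for i in states}                   # init current policy
    while True:
        # learn the value function of the current policy (iterative evaluation)
        J = {i: 0.0 for i in states}
        for _ in range(eval_sweeps):
            J = {i: cost[(i, mu[i])]
                    + gamma * sum(p * J[j] for j, p in P[(i, mu[i])].items())
                 for i in states}
        # update policy: greedy actions based on the learned value function
        new_mu = {i: min(U[i], key=lambda u: cost[(i, u)]
                         + gamma * sum(p * J[j] for j, p in P[(i, u)].items()))
                  for i in states}
        if new_mu == mu:                                # policies agree: converged
            return mu, J
        mu = new_mu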
Claim 1: Only policies that are greedy with regard to their own value function are optimal policies.
QJ*(i, μ*(i)) = min over u ∈ U(i) of QJ*(i,u)
μ* is the optimal policy
J* is the value function of μ*
Claim 2: A necessary and sufficient condition for a value function to be the value function of the optimal policy is:
J(i) = min over u ∈ U(i) [ ci(u) + γ Σ (j∈S) Pij(u) J(j) ]   (Bellman equation)
Thus, a solution to Bellman's equation leads to an optimal policy.
Synchronous update: Jk+1(i) = min over u of Q(i,u) (update all states in parallel)
Asynchronous update: update each state iteratively.
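A value-iteration sketch of the synchronous update, with the same hypothetical dictionary MDP; an asynchronous version would instead update J[i] in place, one state at a time.

def value_iteration(states, U, cost, P, gamma=0.9, tol=1e-6):
    J = {i: 0.0 for i in states}
    while True:
        # synchronous update: Jk+1(i) = min over u of Q(i,u), all states at once
        J_next = {i: min(cost[(i, u)]
                         + gamma * sum(p * J[j] for j, p in P[(i, u)].items())
                         for u in U[i])
                  for i in states}
        if max(abs(J_next[i] - J[i]) for i in states) < tol:
            return J_next                               # approximate Bellman solution
        J = J_next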
Real-time Dynamic Programming (an instance of on-line)
Prob(i,a) = e^(Q(i,a)/τ) / Σ (over all a') e^(Q(i,a')/τ)
τ = "temperature"; τ starts very high and settles ("cools") down after a while. So, we start random and make it more deterministic over time.
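A sketch of this exploration rule; the exponent follows the formula as written above (for a cost-minimizing Q one would use -Q instead), and the function name and cooling schedule are assumptions.

import math, random

def boltzmann_action(Q, i, actions, tau):
    # pick action a with probability proportional to e^(Q(i,a)/tau)
    weights = [math.exp(Q[(i, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

# tau starts high (nearly uniform, random choices) and is lowered over time,
# making the choice increasingly deterministic.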
Dynamic Programming with incomplete information
Non-Bayesian - indirect estimate. The direct estimate is Q-learning (model-free RL).
Barto, et al. 1990
Pij(u) = ηiju / Σ (l∈S) ηilu
where ηiju is the # of observed transitions from i to j after action u
(maximum likelihood estimate)
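A sketch of the maximum-likelihood estimate, assuming the transition counts are kept in a dictionary keyed by (i, u, j) (a hypothetical layout):

def estimate_P(counts, states):
    # Pij(u) = n_iju / sum over l of n_ilu, with n_iju = counts[(i, u, j)]
    P = {}
    for (i, u, j), n in counts.items():
        total = sum(counts.get((i, u, l), 0) for l in states)
        P[(i, u, j)] = n / total
    return P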
Real-time Q-learning (also just called "Q-learning"; original Q-learning)
Qk+1(i,u) = (1 - αk(i,u)) Qk(i,u) + αk(i,u) [ ci(u) + γ min over u' of Qk(succ(i,u), u') ]
where α is the learning rate; rewrite as:
ΔQk+1(i,u) = αk(i,u) [ ci(u) + γ min over u' of Qk(succ(i,u), u') - Qk(i,u) ]
So, since we update many times and only learn a little (α) each time, we can try many successors and learn the most probable ones.
In real time, we have no model of the world, so we just get the successor from the real world (prevailing conditions).
α should decrease over time. If each (state, action) pair is updated infinitely often and α decreases appropriately over time, the algorithm will converge with probability 1.
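A sketch of one real-time Q-learning backup; the observed successor j, its admissible actions, and the function name are assumptions made for illustration.

def q_update(Q, i, u, c, j, actions_j, alpha, gamma=0.9):
    # Q(i,u) <- (1 - alpha) Q(i,u) + alpha [ ci(u) + gamma * min over u' of Q(j, u') ]
    # where j = succ(i,u) is the successor observed in the world and c = ci(u)
    target = c + gamma * min(Q[(j, up)] for up in actions_j)
    Q[(i, u)] = (1 - alpha) * Q[(i, u)] + alpha * target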
Model-based backups: only need the value of each state; each step is slow (due to the Σ over successors).
Q-learning: must update values many times; each step is fast (good for real-time).
Real-time and offline can be mixed. "Contemplate" after actions are learned.
Barto, A.G., Bradtke, S.J., and Singh, S.P. (1995) Learning to act using real-time dynamic programming, Artificial Intelligence 72: 81–138.
(see http://www.umass.edu/neuro/faculty/barto.html for more on Andrew Barto)
Storing Value functions (Q or J)
Lookup table representation, i.e., one entry per state for J (or per state-action pair for Q)
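A sketch of the lookup-table representation (the state and action names are made up):

# one table entry per state for J, one per (state, action) pair for Q
J = {i: 0.0 for i in ["s0", "s1"]}
Q = {(i, u): 0.0 for i in ["s0", "s1"] for u in ["a", "b"]}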
Paper presented by Jonathon Marjamaa
"Self-Improving Reactive Agents Based On Reinforcement Learning, Planning, and Teaching"
By Long-Ji Lin, Carnegie Mellon University, 1992. http://riesj.hmi.missouri.edu/CECS477/RL/