Course Notes
CECS 477 : Neural Networks
Lecture of February 16, 2000
By Jim Ries, JimR@acm.org
Reinforcement Learning
Markov Decision Process
t=0,1,2,…
s=(1,2,…}
given state I, choose action u Î U(I)
iè U(I)
iè cost, our task is to minimize total cost
or
iè reward, our task is to maximize total reward
given i,u è j (new state) Pij(u)
where Pij(u) is the probability of going from state i to j given action u
action policy:
m i iè u
Jm (i) - estimate of plan m starting form state i
infinite horizon: Jm (i) = Em [ for t=0 to t=¥ å g tct | s0=i]
ct = cost in time step t
s0 = start state
g t = discount factor (if reward is in future, it matters less)
0 < g £ 1 (typically, g would be decreasing with time)
finite horizon is same, but iterates to M (limit) only, and thus discount factor can be 1 without problem.
Sometimes, we want average cost, e.g.,
Em [for t=0 to t=¥ å g tct | s0=i] / M
(or the lim Mè ¥ of above in infinite case)
We use dynamic programming to find optimal policy. Reinforcement learning is a variant of dynamic programming.
value iteration
policy iteration
offline - traditional dynamic programming
online - reinforcement learning
Bellman equation
Jk+1(i) = min where uÎ U(i) [Cg (u) + g * jÎ s å Pij(u) Jk(i)] = Q(i,u)
above equation without "k's" is Bellman equation
greedy policy
"one-step backup" - uses next state to update estimate of current state.
Policy Iteration
J(i) = cg (u) + g * jÎ s å Pij(u) J(j)
where u is action chose u = m (i)
update policy, go back, …
algorithm:
init current policy
while current policy ¹ next policy
learn value function of policy
update policy (greedy actions based on current learned value function)
endwhile
Process will converge to optimal policy
Claim 1: Only policies that are greedy with regard to their own value function are optimal policies.
QJ*(i,m *(i)) = min uÎ U(i) QJ*(i,u)
m * is optimal policy
J* is value function of m *
Claim 2: A necessary and sufficient condition for value function to be value function of the optimal policy is:
J(i) = min uÎ U(i) [Ci(u) + g * jÎ s å Pij(u) J(j)] (Bellman Equation)
Thus, a solution to Bellman's equation leads to an optimal policy.
Synchronous update: Jk+1(i) = min Q(i,u) (update all states in parallel)
Asynchronous update: update each state iteratively.
Both converge.
Real-time Dynamic Programming (an instance of on-line)
Random exploration
Prob(i,a) = eQ(i,a)/t/å for all a eQ(i,a)/t
t = "temperature"; t starts very high and settles ("cools") down after a while. So, we start random and make it more deterministic over time.
Dynamic Programming with incomplete information
Non-baysian - indirect estimate. Direct estimate is Q-Learning (model-free RL)
Bartol, et. al. 1990
indirect estimate
Pij(u) = h iju / å lÎ s h ilu
where h iju is # of observed transitions from i to j after action u
(maximum likelihood estimate)
Q-Learning
Off-line Q-learning
Adaptive Q-learning
Real-time Q-learning (also just called "Q-learning"; original Q-learning)
off-line Q-learning
Qk+1(i,u) = (1- a k(i,u))Qk(i,u) + a k(i,u)[ci(u) + g * min over u' Q(succ(i,u), u')]
where a is the learning rate; rewrite as:
D Qk+1(i,u) = a k(i,u)[ci(u) + g * min over u' Q(succ(i,u), u') - Qk(i,u)]
So, since we update many times, and only learn a little (a ) each time, we can try many successors and learn most probable ones.
In real-time, we have no model of world so we just get successor from the real world (prevailing conditions).
a should decrease over time. If each pair updated infinitely often and a decreases over time, algorithm will converge with probability 1.
Q-Learning |
Value Iteration |
more space |
only need value of each state |
update value many times |
one-step updates |
each step fast (good for real-time) |
each step slow (due to å ) |
Real-time and offline can be mixed. "Contemplate" after actions learned.
Reference:
Barto, A.G., Bradtke, S.J., and Singh, S.P. (1995) Learning to act using real-time dynamic programming, Artificial Intelligence 72: 81–138.
(see
http://www.umass.edu/neuro/faculty/barto.html for more on Andrew Barto)Storing Value functions (Q or J)
Exact Method
Lookup table representation, i.e.,
State |
Action |
Q-value |
… |
… |
… |
Approximate methods:
Paper presented by Jonathon Marjamaa
"Self-Improving Reactive Agents Based On Reinforcement Learning, Planning, and Teaching"
By Long-Ji Lin, Carnegie Mellon University 1992
http://riesj.hmi.missouri.edu/CECS477/RL/