Course Notes

CECS 477 : Neural Networks

Lecture of February 16, 2000

By Jim Ries, JimR@acm.org

Reinforcement Learning

• AI machine learning dates back to the late 1950's (most work in the late 1980's and early 1990's)
• Operations Research (control)
• Markov decision processes
• Make decisions based on the current state. The state of the world changes; the new state is determined stochastically by the previous state and current conditions, not by anything further in the past (the Markov property).

Markov Decision Process

• Discrete time, stochastic system with discrete state space

t = 0, 1, 2, …

s ∈ {1, 2, …}

given state i, choose action u ∈ U(i)

i → U(i)

i → cost, our task is to minimize total cost

or

i → reward, our task is to maximize total reward

given i, u → j (new state) with probability Pij(u)

where Pij(u) is the probability of going from state i to j given action u
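As a concrete illustration, Pij(u) can be stored as a nested table. The states, actions, and numbers below are hypothetical, chosen only to show the shape of the data:

```python
# One way to store Pij(u) for a tiny 2-state MDP (hypothetical numbers).
# P[i][u][j] = probability of moving from state i to state j under action u.
P = {0: {"a": [0.8, 0.2], "b": [0.1, 0.9]},
     1: {"a": [0.3, 0.7], "b": [0.5, 0.5]}}

# Sanity check: for each (i, u), Pij(u) is a probability distribution over j.
for i in P:
    for u in P[i]:
        assert abs(sum(P[i][u]) - 1.0) < 1e-9
```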

action policy:

μ: i → u

Jμ(i) - estimate of the value of policy μ starting from state i

infinite horizon: Jμ(i) = Eμ[ Σ_{t=0..∞} γ^t ct | s0 = i ]

ct = cost in time step t

s0 = start state

γ = discount factor; γ^t weights the cost at time step t (if a cost or reward is in the future, it matters less)

0 < γ ≤ 1 (γ itself is fixed; the weight γ^t decreases with time)

The finite horizon case is the same, but the sum only runs to a limit M, so the discount factor can be 1 without the sum diverging.

Sometimes, we want the average cost, e.g.,

Eμ[ Σ_{t=0..M} γ^t ct | s0 = i ] / M

(or the lim as M → ∞ of the above in the infinite case)
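The discounted sum Σ γ^t ct can be sketched numerically; with a constant cost of 1 per step it approaches 1/(1 − γ). (A toy illustration; the function name is mine, not from the lecture.)

```python
# Discounted total cost of one trajectory of per-step costs: sum_t gamma^t * c_t.
def discounted_cost(costs, gamma=0.9):
    return sum((gamma ** t) * c for t, c in enumerate(costs))

print(discounted_cost([1, 1, 1], gamma=0.5))   # 1 + 0.5 + 0.25 = 1.75
```

With gamma = 0.9 and a long run of unit costs, the sum converges toward 1/(1 − 0.9) = 10.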

We use dynamic programming to find optimal policy. Reinforcement learning is a variant of dynamic programming.

• value iteration
• policy iteration
• online - reinforcement learning

Bellman equation

• idea: improve the estimate of the policy's value through many iterations (successive approximations) and, at the same time, improve the policy itself.

Jk+1(i) = min_{u∈U(i)} [ ci(u) + γ · Σ_{j∈S} Pij(u) Jk(j) ] = min_{u∈U(i)} Qk(i,u)

The above equation without the k's is the Bellman equation.

greedy policy

"one-step backup" - uses next state to update estimate of current state.

Policy Iteration

• attempts to decouple learning of policy from learning of the value function

J(i) = ci(u) + γ · Σ_{j∈S} Pij(u) J(j)

where u is the action chosen by the policy: u = μ(i)

update policy, go back, …

algorithm:

init current policy

while current policy ≠ next policy

learn value function of policy

update policy (greedy actions based on current learned value function)

endwhile

Process will converge to optimal policy

• In value iteration, we use greedy policy at every step. Since value function is changing, so is the policy. Once we learn value function, the policy will stop changing.
• In policy iteration, we learn the value function of the current policy fully before updating the policy.
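The policy-iteration loop above can be sketched on the same style of tabular MDP (hypothetical numbers and names, for illustration only):

```python
# Policy iteration on a tiny 2-state MDP (hypothetical numbers).
P = {0: {"a": [0.8, 0.2], "b": [0.1, 0.9]},
     1: {"a": [0.3, 0.7], "b": [0.5, 0.5]}}
c = {0: {"a": 1.0, "b": 2.0},
     1: {"a": 0.5, "b": 0.2}}
gamma = 0.9

def evaluate_policy(mu, sweeps=500):
    # Iterate J(i) <- c_i(mu(i)) + gamma * sum_j P_ij(mu(i)) J(j) to a fixed point.
    J = {i: 0.0 for i in P}
    for _ in range(sweeps):
        J = {i: c[i][mu[i]] + gamma * sum(P[i][mu[i]][j] * J[j] for j in P)
             for i in P}
    return J

def policy_iteration():
    mu = {i: sorted(P[i])[0] for i in P}          # arbitrary initial policy
    while True:
        J = evaluate_policy(mu)                   # learn value of current policy
        new_mu = {i: min(P[i], key=lambda u: c[i][u] + gamma *
                         sum(P[i][u][j] * J[j] for j in P))
                  for i in P}
        if new_mu == mu:                          # policy stopped changing: done
            return mu, J
        mu = new_mu
```

Since there are finitely many policies and each greedy update can only improve the value, the loop terminates at a policy that is greedy with respect to its own value function.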

Claim 1: Only policies that are greedy with regard to their own value function are optimal policies.

QJ*(i, μ*(i)) = min_{u∈U(i)} QJ*(i, u)

μ* is the optimal policy

J* is the value function of μ*

Claim 2: A necessary and sufficient condition for a value function J to be the value function of the optimal policy is:

J(i) = min_{u∈U(i)} [ ci(u) + γ · Σ_{j∈S} Pij(u) J(j) ] (Bellman equation)

Thus, a solution to Bellman's equation leads to an optimal policy.

Synchronous update: Jk+1(i) = min_{u∈U(i)} Qk(i,u) (update all states in parallel)

Asynchronous update: update each state iteratively.

Both converge.

Real-time Dynamic Programming (an instance of on-line)

• learn value function while interacting with the world
• update only states we "visit" (asynchronous updating)
• intermediate results will be used in control
• must update every state infinitely often to converge (a random exploration scheme can be used to assure all states are explored). We can also estimate a confidence for each region and intentionally visit rarely visited states.

Random exploration

Prob(i,a) = e^{Q(i,a)/τ} / Σ_{a'} e^{Q(i,a')/τ}

τ = "temperature"; τ starts very high and settles ("cools") down after a while. So, we start random and make the choice more deterministic over time.

• this is essentially "simulated annealing"
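A sketch of Boltzmann (softmax) action selection as given above. The function name is mine; the notes' form e^{Q/τ} is the reward-maximizing version (with costs one would use e^{−Q/τ}):

```python
import math
import random

# Boltzmann (softmax) exploration over Q-values: Prob(a) ∝ exp(Q(i,a)/tau).
def boltzmann_action(Q_i, tau, rng=random):
    """Q_i: dict mapping action -> Q-value in the current state; tau: temperature."""
    weights = {a: math.exp(Q_i[a] / tau) for a in Q_i}
    total = sum(weights.values())
    r = rng.random() * total
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a                     # guard against floating-point round-off
```

With a high τ the choice is nearly uniform; as τ "cools" the choice concentrates on the action with the best Q-value, as in simulated annealing.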

Dynamic Programming with incomplete information

• e.g., we don't have probability of transition
• Bayesian or non-Bayesian estimates of the transition probabilities

Non-Bayesian - indirect estimate (build a model of the transition probabilities). The direct, model-free approach is Q-learning.

Barto, et al. 1990

indirect estimate

P̂ij(u) = n_iju / Σ_{l∈S} n_ilu

where n_iju is the # of observed transitions from i to j after action u

(maximum likelihood estimate)
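The maximum-likelihood estimate above is just transition counting; a minimal sketch (names are mine, for illustration):

```python
from collections import defaultdict

# Indirect (model-building) estimate of Pij(u) from observed transitions.
counts = defaultdict(int)          # counts[(i, u, j)] = n_iju

def observe(i, u, j):
    # Record one observed transition i --u--> j.
    counts[(i, u, j)] += 1

def P_hat(i, u, j, states):
    # P_hat_ij(u) = n_iju / sum_l n_ilu (0 if (i, u) was never tried).
    total = sum(counts[(i, u, l)] for l in states)
    return counts[(i, u, j)] / total if total else 0.0
```

For example, observing action "a" in state 0 lead to state 1 three times and to state 0 once gives an estimate of 3/4 for that transition.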

Q-Learning

Off-line Q-learning

Real-time Q-learning (also just called "Q-learning"; original Q-learning)

off-line Q-learning

Qk+1(i,u) = (1 − αk(i,u)) Qk(i,u) + αk(i,u) [ ci(u) + γ · min_{u'} Qk(succ(i,u), u') ]

where α is the learning rate; rewrite as:

ΔQk+1(i,u) = αk(i,u) [ ci(u) + γ · min_{u'} Qk(succ(i,u), u') − Qk(i,u) ]

So, since we update many times and only learn a little (α) each time, we can try many successors and the estimate comes to reflect the most probable ones.

In real-time, we have no model of world so we just get successor from the real world (prevailing conditions).

α should decrease over time. If each state-action pair is updated infinitely often and α decreases appropriately, the algorithm will converge with probability 1.
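A real-time tabular Q-learning sketch of the ΔQ update above. Here a tiny hypothetical MDP stands in for "the real world": the learner only receives sampled successors and never looks at P in its update.

```python
import random

# Real-time tabular Q-learning on a hypothetical 2-state MDP.
random.seed(0)
P = {0: {"a": [0.8, 0.2], "b": [0.1, 0.9]},    # hidden from the learner
     1: {"a": [0.3, 0.7], "b": [0.5, 0.5]}}
c = {0: {"a": 1.0, "b": 2.0},
     1: {"a": 0.5, "b": 0.2}}
gamma = 0.9

def step(i, u):
    # Stand-in for the real world: sample successor j ~ Pij(u).
    return 0 if random.random() < P[i][u][0] else 1

Q = {(i, u): 0.0 for i in P for u in P[i]}
i = 0
for k in range(1, 20001):
    u = random.choice(list(P[i]))              # random exploration
    j = step(i, u)                             # successor from the "world"
    alpha = 1.0 / (1.0 + k / 100.0)            # learning rate decreasing over time
    target = c[i][u] + gamma * min(Q[(j, v)] for v in P[j])
    Q[(i, u)] += alpha * (target - Q[(i, u)])  # the delta-Q update from the notes
    i = j
```

Each Q-value stays within [0, c_max / (1 − γ)], and with enough visits to every state-action pair the table approaches the optimal Q-function.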

Q-Learning vs. Value Iteration:

• Q-Learning: more space (a value for each state-action pair); one-step updates, so each value is updated many times; each step is fast (good for real-time).
• Value Iteration: only needs a value for each state; each step is slow (due to the Σ over successor states).

Real-time and off-line Q-learning can be mixed: "contemplate" (run off-line updates) after acting in the world.

Reference:

Barto, A.G., Bradtke, S.J., and Singh, S.P. (1995) Learning to act using real-time dynamic programming, Artificial Intelligence 72: 81–138.

(see http://www.umass.edu/neuro/faculty/barto.html for more on Andrew Barto)

Storing Value functions (Q or J)

Exact Method

Lookup table representation, i.e.,

State    Action    Q-value
…        …         …

Approximate methods:

• linear, weighted sums
• back propagation net
• decision trees