asalin14
Posts: 12
Posted 19:13 Oct 01, 2018 |

In the "Adventure in Machine Learning" article, they define the Q learning rule as :

Q(s,a) = Q(s,a) + α(r + γ·max_a′ Q(s′,a′) − Q(s,a))

where α is the learning rate.

 

In their implementation, the reward is not affected by the learning rate:

q_table[s, a] += r + lr *(y*np.max(q_table[new_s, :]) - q_table[s, a])

 

I would think the equation would be:

q_table[s, a] += lr *(r + y*np.max(q_table[new_s, :]) - q_table[s, a])

where r is multiplied by the learning rate. Is the article's implementation correct? Will the final answer eventually be the same?

rabbott
Posts: 1649
Posted 21:52 Oct 01, 2018 |

The article shows the equation as including the reward in the portion affected by the learning rate.

 

Q(s,a) = Q(s,a) + α(r + γ·max_a′ Q(s′,a′) − Q(s,a))

But the code puts the reward outside the learning-rate term, as you say.

There is an inconsistency in the field in how the reward is understood. Should it be considered associated with the state the agent is coming from? That is, has the agent already earned the reward and is simply getting credit for it on its next step? Or is the reward part of the actual next step? The former approach would not have the reward affected by the learning rate; the latter would. I think in the end it won't make any difference: the process should converge either way, as long as you are consistent about it. I prefer to consider the reward part of the step.
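
To illustrate, here is a minimal sketch (not the article's code; the tiny MDP, the names P, R and run_q_learning, and the hyperparameters are all made up for this post) that runs both forms of the update on the same problem. With the reward inside the learning-rate term the values settle near the usual r + γ·max Q targets; with the article's form they settle at a larger scale (roughly the reward divided by the learning rate), but the greedy policy that falls out is the same in this toy case.

import numpy as np

# A tiny deterministic MDP invented for this comparison (P and R are hypothetical):
# 2 states, 2 actions; the agent gets reward 1 whenever its next state is state 1.
P = np.array([[0, 1],   # next state for (state 0, action 0/1)
              [1, 0]])  # next state for (state 1, action 0/1)
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def run_q_learning(reward_inside_lr, steps=20000, lr=0.1, y=0.9, eps=0.3, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((2, 2))
    s = 0
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = rng.integers(2)
        else:
            a = int(np.argmax(q_table[s]))
        new_s, r = P[s, a], R[s, a]
        if reward_inside_lr:
            # reward treated as part of the step: scaled by the learning rate
            q_table[s, a] += lr * (r + y * np.max(q_table[new_s, :]) - q_table[s, a])
        else:
            # article's form: reward added in full, only the rest scaled by lr
            q_table[s, a] += r + lr * (y * np.max(q_table[new_s, :]) - q_table[s, a])
        s = new_s
    return q_table

for inside in (True, False):
    q = run_q_learning(reward_inside_lr=inside)
    print("reward inside lr:" if inside else "reward outside lr:")
    print(np.round(q, 2), "greedy policy:", np.argmax(q, axis=1))

Either way, the point above holds: pick one convention and apply it consistently.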