rabbott
Posts: 1649
Posted 14:30 Dec 01, 2018 |

Look at the extract from one of the Berkeley slides (attached below).

The equation in the blue rectangle shows the definition of Q(s, a). This is the q-value of taking action a in state s. This is the same q-value whether we are talking about state-based learning or feature-based learning. It is the value of taking a given action in a given state. In feature-based learning, since we don't keep track of states, this will be an approximation defined as the sum of the products of features and weights.

Each feature is shown as a function of a state and an action. In reality, especially for the problems we are working on, we "factor out" the action and can think of the equation on the slide as if it read:

Q(s, a) = R(s, a) + w1f1(s') + w2f2(s') + ... + wnfn(s'),

where R(s, a) is the reward for taking action a in state s, and s' is the state resulting from taking action a in state s.
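To make the factored form concrete, here is a minimal Python sketch of computing Q(s, a) as the reward plus the weighted sum of the features of the resulting state. The function and argument names are just for illustration, not from any particular codebase.

# A minimal sketch of the factored-out form above, assuming the caller can
# supply R(s, a) and the feature values f1(s'), ..., fn(s') of the resulting
# state s'.
def q_value(reward, next_state_features, weights):
    """Q(s, a) = R(s, a) + w1*f1(s') + w2*f2(s') + ... + wn*fn(s')."""
    return reward + sum(w * f for w, f in zip(weights, next_state_features))

# Example: reward of -1 per step, two features of s', and the weights learned so far.
q = q_value(-1.0, [0.3, 1.0], [2.5, 0.8])   # -1.0 + 2.5*0.3 + 0.8*1.0 = 0.55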

In state-based q-learning, one updates one's estimate of Q(s, a) as shown on the slide. In feature-based learning, we don't keep track of the states. Instead we update the weights as shown on the slide.
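For reference, here is a rough Python sketch of the two updates as I read them off the slide: the state-based table update and the feature-based weight update (using the factored form above, so the feature values are those of s'). The parameters alpha and gamma are the usual learning rate and discount; the data structures are only illustrative.

# State-based: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
# Q is assumed to be a dict mapping (state, action) pairs to values.
def q_table_update(Q, s, a, r, s_prime, actions, alpha, gamma):
    sample = r + gamma * max(Q.get((s_prime, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample

# Feature-based: w_i <- w_i + alpha * (sample - Q(s, a)) * f_i(s'),
# where sample = r + gamma * max_a' Q(s', a') and q_estimate is the current Q(s, a).
def weight_update(weights, next_state_features, q_estimate, sample, alpha):
    difference = sample - q_estimate
    return [w + alpha * difference * f for w, f in zip(weights, next_state_features)]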

A couple of things are worth emphasizing.

1. Q(s, a) includes the reward for taking action a in state s.

2. Features are extracted from the states without regard to the action that one might take in those states. In other words, a feature is a function of a state alone, whereas Q(s, a) is a function of a state and action.

The features are intended to be abstractions that isolate the important aspects of states. For example, in the taxi problem a feature may be whether the passenger is on the taxi. That feature applies, either true or false, to many different states. It identifies an important property of all of them.
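As a toy illustration of features being functions of the state alone, here are two possible taxi-problem features written in Python. The state fields (passenger_on_taxi, taxi_pos, dest_pos) are made up for this example, not from any particular implementation.

# True/false feature shared by many concrete states.
def f_passenger_on_taxi(state):
    return 1.0 if state["passenger_on_taxi"] else 0.0

# Manhattan distance from the taxi to the destination: another state-only feature.
def f_distance_to_destination(state):
    (tx, ty), (dx, dy) = state["taxi_pos"], state["dest_pos"]
    return abs(tx - dx) + abs(ty - dy)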

If the features have a relatively small number of possible discrete values, one could use the feature values themselves as states
(the state would be a tuple of the feature values) and do q-learning using the state-based approach!

Last edited by rabbott at 19:53 Dec 01, 2018.
rabbott
Posts: 1649
Posted 19:56 Dec 01, 2018 |

The last two lines of the previous post are not correct. (I left them but put a strike-through line through them because some people may already have seen them.) The problem is that an action taken in a "state" defined by a set of features that includes, for example, distanceToPassPickup will not work the same way from one game to the next. In one game going North may reduce that distance, whereas in another game the taxi might have to go South to reduce that distance.

Last edited by rabbott at 19:58 Dec 01, 2018.