rabbott
Posts: 1649
Posted 14:43 Nov 27, 2018 |

I found some features that do a pretty good job of solving the cart-pole problem. Two interesting things about them.

1. They are quite simple and straightforward. They focus on indicators that the system is in a more stable state, i.e., the various parameters are closer to zero.

2. The way the features are expressed makes a difference. Some ways of expressing what look like the same sorts of properties work while others don't. I find that disappointing, but that seems to be the way it is. If you think you have good features, try expressing them differently and see if it helps. For example, if you have a feature x, try (constant - x). Or if you are dividing a feature by a constant to get it to be less than 1, try a different constant. The value range that seemed to work reasonably well for me was approximately 0.2 to 0.5. Try expressing your features so that they fall in that range and see if that helps; there's a small sketch of this kind of rescaling below.
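
Here's a minimal sketch of what this sort of rescaling might look like. The function names and constants are placeholders, not the features I actually used; the only non-arbitrary number is the pole-angle threshold, which matches the environment's limit.

import math

THETA_THRESHOLD = 12 * 2 * math.pi / 360      # the pole-angle limit the environment uses

def theta_feature(theta):
    # Raw feature: |theta| scaled by its threshold, so it lies in [0, 1].
    # Smaller values mean a more stable (closer-to-zero) state.
    return abs(theta) / THETA_THRESHOLD

def theta_feature_flipped(theta, constant=1.0):
    # The (constant - x) re-expression: large near 0, small near the failure limit.
    return constant - theta_feature(theta)

def theta_feature_rescaled(theta, low=0.2, high=0.5):
    # Squeeze the flipped feature into roughly the 0.2 - 0.5 range mentioned above.
    return low + (high - low) * theta_feature_flipped(theta)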

Last edited by rabbott at 15:31 Nov 27, 2018.
rabbott
Posts: 1649
Posted 15:28 Nov 27, 2018 |

P.S. When I say that the features were relatively simple and straightforward, I mean that they reflect progress toward the goal(s). For example, for the taxi problem a feature might be how close the taxi is to its current destination. For the cart-pole problem, a feature might be how close the various parameters are to 0. For Capture-the-Flag, a feature might be how much food has been eaten (if you are an offensive agent) or how many enemies are in your territory (if you are a defensive agent).
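
To make that concrete for cart-pole, a progress-style feature vector might just measure how close each state variable is to zero. This is only a sketch: the position and angle limits are the environment's, but the velocity scales are arbitrary placeholders.

import math

def closeness_features(state):
    # state = (x, x_dot, theta, theta_dot). Each feature is 1 when the
    # corresponding variable is 0 and falls toward 0 as it approaches a limit.
    x, x_dot, theta, theta_dot = state
    x_threshold = 2.4
    theta_threshold = 12 * 2 * math.pi / 360
    x_dot_scale = 2.0        # arbitrary placeholder scale, not from the environment
    theta_dot_scale = 2.0    # arbitrary placeholder scale, not from the environment
    return (1 - min(abs(x) / x_threshold, 1.0),
            1 - min(abs(x_dot) / x_dot_scale, 1.0),
            1 - min(abs(theta) / theta_threshold, 1.0),
            1 - min(abs(theta_dot) / theta_dot_scale, 1.0))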

Last edited by rabbott at 15:32 Nov 27, 2018.
rabbott
Posts: 1649
Posted 18:51 Nov 27, 2018 |

P.P.S. It turns out that my cart-pole solution had features with values in the range 0.07 to 0.3. To move them closer to the target I mentioned above, I added 0.15 to each. I tried a number of other similar transformations. Some worked, and some didn't, even though they were all in the same range and seemed qualitatively similar. One of the variants I tried was to double each value and subtract 0.1. It got good results at first, but like some others, it found good weights only to lose them after more training. The attached plot shows the result of running the problem after every hundred training episodes from 100 to 1000, plus an extra run at the end. As you can see, it ran for 2000 steps after 200 training runs and then again after 500-900 training runs. (I stopped each run after 2000 steps if it made it that far.) Then it lost it for the final two runs. (Even these last two runs would have satisfied the original problem, which was to last for 200 steps.) I think either David Silver or Charles and Michael talked about this phenomenon.
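
For anyone who wants to produce this kind of plot, the schedule is just: train for 100 episodes, run one evaluation episode capped at 2000 steps, and repeat up to 1000 episodes. A rough sketch, with a placeholder agent standing in for the actual feature-based learner (PlaceholderAgent and its methods are made up, not from Gym or from my code):

import gym

class PlaceholderAgent:
    # Stand-in for the learning agent; swap in the real feature-based learner.
    def __init__(self, action_space):
        self.action_space = action_space
    def train_episode(self, env):
        pass                                 # real training would go here
    def act(self, obs):
        return self.action_space.sample()    # a real agent would act greedily on its features

env = gym.make('CartPole-v0').unwrapped      # unwrapped to avoid the default 200-step cap
agent = PlaceholderAgent(env.action_space)

for block in range(10):                      # evaluate after 100, 200, ..., 1000 training episodes
    for _ in range(100):
        agent.train_episode(env)
    obs, steps, done = env.reset(), 0, False
    while not done and steps < 2000:         # stop the evaluation run at 2000 steps
        obs, reward, done, _ = env.step(agent.act(obs))
        steps += 1
    print((block + 1) * 100, steps)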

Attachments:
Last edited by rabbott at 18:55 Nov 27, 2018.
rabbott
Posts: 1649
Posted 07:54 Nov 28, 2018 |

See next post.

Last edited by rabbott at 08:03 Nov 28, 2018.
rabbott
Posts: 1649
Posted 08:00 Nov 28, 2018 |

This code has been extracted and slightly edited from the cart-pole environment and offers an easy way to anticipate the effect of an action on a state.
 

import math


def exceedsLimit(x, theta):
    # True when the cart position or the pole angle is outside the allowed range.
    theta_threshold_radians = 12 * 2 * math.pi / 360
    x_threshold = 2.4
    return abs(x) > x_threshold or abs(theta) > theta_threshold_radians


def step(state, action):
    # Given a state (x, x_dot, theta, theta_dot) and an action (0 = push left,
    # 1 = push right), return the predicted (next_state, reward, done, info).
    (x, x_dot, theta, theta_dot) = state
    done = exceedsLimit(x, theta)
    if done:
        reward = 0
        return (None, reward, done, None)

    gravity = 9.8
    masscart = 1.0
    masspole = 0.1
    total_mass = masspole + masscart
    length = 0.5  # actually half the pole's length
    polemass_length = masspole * length
    force_mag = 10.0
    tau = 0.02  # seconds between state updates
    force = force_mag if action == 1 else -force_mag
    costheta = math.cos(theta)
    sintheta = math.sin(theta)
    temp = (force + polemass_length * theta_dot * theta_dot * sintheta) / total_mass
    thetaacc = ((gravity * sintheta - costheta * temp) /
                (length * (4.0/3.0 - masspole * costheta * costheta / total_mass)))
    xacc = temp - polemass_length * thetaacc * costheta / total_mass
    # Euler integration with time step tau.
    x = x + tau * x_dot
    x_dot = x_dot + tau * xacc
    theta = theta + tau * theta_dot
    theta_dot = theta_dot + tau * thetaacc
    state = (x, x_dot, theta, theta_dot)
    reward = 1
    return (state, reward, done, None)
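
For example, here is one way to use it for a one-step lookahead (just a sketch; the feature_value argument stands in for whatever evaluation of a state you are using):

def better_action(state, feature_value):
    # Predict the next state for each action with step() and pick the action
    # whose predicted state scores higher under the supplied evaluation.
    best_action, best_score = 0, float('-inf')
    for action in (0, 1):
        next_state, reward, done, _ = step(state, action)
        score = float('-inf') if done else feature_value(next_state)
        if score > best_score:
            best_action, best_score = action, score
    return best_action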
Last edited by rabbott at 20:53 Nov 28, 2018.