Double counting in temporal difference learning


I am working through a temporal difference learning example (https://www.youtube.com/watch?v=XrxgdpduWOU), and I'm having trouble with the update equation in my Python implementation: I seem to be double counting the reward and Q.

Suppose the grid below is coded as a 2D array, my current location is (2, 2), the goal is (2, 3), and the maximum reward is 1. Let Q(t) be the running average value for my current location. Then r(t+1) is 1, and if I assume max Q(t+1) is also 1, Q(t) ends up close to 2 (with a gamma of 1). Is this correct, or should I assume that Q(n) is 0, where n is the end point?
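To make the two options concrete, here is a minimal sketch of the one-step Q-learning target, r + gamma * max Q(next state), for the step from (2, 2) into the goal. The names `td_target` and `is_terminal` are hypothetical and not part of the code in this post. The sketch assumes the usual convention that no further reward can be collected from a terminal state, so its value is taken as 0.

```python
def td_target(reward, max_q_next, gamma=1.0, is_terminal=False):
    """One-step Q-learning target: r + gamma * max_a Q(s', a).

    Assumption: a terminal state has value 0, so when the next state
    is terminal the target is just the immediate reward.
    """
    if is_terminal:
        return reward
    return reward + gamma * max_q_next


# Step from (2, 2) into the goal (2, 3): reward 1, gamma 1.
target_terminal_zero = td_target(1.0, max_q_next=0.0, is_terminal=True)
# Same step, but treating Q(goal) as 1 instead of 0.
target_terminal_one = td_target(1.0, max_q_next=1.0)

print(target_terminal_zero, target_terminal_one)
```

With Q(goal) treated as 1, the target is 2 rather than 1, which is the double counting described above: the goal reward is included once as r(t+1) and once more through max Q(t+1).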

Edit to include code: I modified the get_max_q function to return 0 at the end point, and the values are now all below 1, which I assume is more correct since the reward is just 1. I'm not sure this is the right approach, though (previously it returned 1 at the end point).

import random

import numpy as np

# alpha, epsilon, MAX_ROWS, MAX_COLS, available_moves and get_new_pos
# are assumed to be defined elsewhere in the script (not shown).

# not sure if this is correct
def get_max_q(q, pos):
    # end point: not sure if this should return 0 or 1
    if pos == (MAX_ROWS - 1, MAX_COLS - 1):
        return 0
    return max(q[(pos, am)] for am in available_moves(pos))

def learn(q, old_pos, action, reward):
    new_pos = get_new_pos(old_pos, action)
    max_q_next_move = get_max_q(q, new_pos)

    # TD update with gamma = 1; the 0.04 step cost is subtracted after the update
    q[(old_pos, action)] = q[(old_pos, action)] + alpha * (reward + max_q_next_move - q[(old_pos, action)]) - 0.04

def move(q, curr_pos):
    moves = available_moves(curr_pos)
    if random.random() < epsilon:
        # explore: pick a random move
        action = random.choice(moves)
    else:
        # exploit: q is keyed by (position, action), not by action alone
        index = np.argmax([q[(curr_pos, m)] for m in moves])
        action = moves[index]

    new_pos = get_new_pos(curr_pos, action)

    # end point
    if new_pos == (MAX_ROWS - 1, MAX_COLS - 1):
        reward = 1
    else:
        reward = 0

    learn(q, curr_pos, action, reward)
    return new_pos

=======================
OUTPUT
Average value (after I set Q(end point) to 0)
defaultdict(float,
            {((0, 0), 'DOWN'): 0.5999999999999996,
             ((0, 0), 'RIGHT'): 0.5999999999999996,
              ...
             ((2, 2), 'UP'): 0.7599999999999998})

Average value (after I set Q(end point) to 1)
defaultdict(float,
        {((0, 0), 'DOWN'): 1.5999999999999996,
         ((0, 0), 'RIGHT'): 1.5999999999999996,
         ....
         ((2, 2), 'LEFT'): 1.7599999999999998,
         ((2, 2), 'RIGHT'): 1.92,
         ((2, 2), 'UP'): 1.7599999999999998})
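The printed values can be checked by hand. The update in learn stops changing q once alpha * (reward + max_q_next - q) equals 0.04, that is, at q = reward + max_q_next - 0.04 / alpha. The sketch below applies that fixed point along a shortest path back from the goal. It assumes alpha = 0.5, which is not stated in the post but is consistent with the numbers above, and the helper names `fixed_point` and `chain` are hypothetical.

```python
def fixed_point(reward, max_q_next, alpha=0.5, penalty=0.04):
    """Value at which q + alpha * (reward + max_q_next - q) - penalty == q.

    alpha = 0.5 is an assumption inferred from the printed output.
    """
    return reward + max_q_next - penalty / alpha


def chain(q_terminal, steps, alpha=0.5):
    """Converged Q values for the best action 1, 2, ..., `steps` moves from the goal."""
    values = []
    max_next = q_terminal
    for step in range(steps):
        # Only the move that enters the goal earns the reward of 1.
        reward = 1.0 if step == 0 else 0.0
        max_next = fixed_point(reward, max_next, alpha)
        values.append(max_next)
    return values


with_zero = [round(v, 2) for v in chain(0.0, 5)]
with_one = [round(v, 2) for v in chain(1.0, 5)]
print(with_zero)
print(with_one)
```

Under these assumptions the third and fifth entries match the printed values for ((2, 2), 'UP') and for (0, 0), which is five moves from the goal: 0.76 and 0.6 with Q(end point) = 0, and 1.76 and 1.6 with Q(end point) = 1. Every value in the second run is exactly 1 higher than in the first, so the extra 1 comes entirely from the value assigned to the end point.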

edited 37 mins ago

asked 1 hour ago

Dan Tang


Show your code, show your desired output, and show the incorrect output that you're getting instead. – Tom Karzes 47 mins ago

@TomKarzes Thank you and I have included my code and output! – Dan Tang 36 mins ago

Browse other questions tagged python machine-learning reinforcement-learning temporal-difference or ask your own question.

asked	today
viewed	17 times
