I am a bit confused about the implementation of n-step learning. Let's assume we have n=4 and a constant reward of +1, and let's also set gamma=1 for simplicity. Suppose we have the following two trajectories:
s_1 = [1,1,1,1] (far from the target)
s_2 = [1,1,1,T] (close to the target, T = terminal)
The n-step returns would then be:
s_1: 1+1+1+1 = 4
s_2: 1+1+1 = 3
So state s_1 would look much more desirable, even though s_2 is close to the goal of the task. Or let me put the question differently: how would you calculate the n-step return for state s_2, or for a state s_3 = [1,T,1,1]?
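For concreteness, here is a minimal sketch of the computation I have in mind (the function name is mine, not from any library). My current understanding is that the return simply truncates at the terminal step, so any rewards listed after T belong to the next episode and are not summed:

```python
def n_step_return(rewards, gamma=1.0, n=4):
    """Sum up to n discounted rewards; the list is assumed to stop
    at the terminal transition, so a shorter list means the episode
    ended early and the sum just truncates there."""
    g = 0.0
    for k, r in enumerate(rewards[:n]):
        g += (gamma ** k) * r
    return g

# s_1 = [1,1,1,1]: four full reward steps
print(n_step_return([1, 1, 1, 1]))  # 4.0
# s_2 = [1,1,1,T]: episode terminates after 3 rewards
print(n_step_return([1, 1, 1]))     # 3.0
# s_3 = [1,T,...]: episode terminates after 1 reward
print(n_step_return([1]))           # 1.0
```

Is this truncation the right way to handle it, or is there a bootstrap/correction term I am missing for the terminal case?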