Suppose we have a computational graph like this:
P = NN policy
D = NN dynamics
S0 = Variable()
A1 = P(S0)
S1 = D(S0, A1)
A2 = P(S1)
S2 = D(S1, A2)
L = cost(S1) + cost(S2)
L.backward()
P.step()  # update policy P (i.e. the optimizer step on P's parameters)
Since each call of the dynamics D splits the backward path into two branches (one through its state input and one through its action input), will the backward function automatically add their gradients?
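A minimal way to check this is a scalar sketch, assuming PyTorch's autograd and a single hypothetical parameter `w` that is used twice in the graph, the same way P is used at two time steps. The names `w`, `x`, `y`, and `z` are made up for illustration and are not part of the setup above.

```python
import torch

# Hypothetical scalar "parameter" that appears twice in the graph.
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)

y = w * x   # first use of w
z = w * y   # second use of w, so z = w^2 * x

z.backward()

# Two backward paths reach w:
#   direct path through z = w * y   contributes y     = 6
#   path through y = w * x          contributes w * x = 6
# Autograd accumulates both into w.grad.
print(w.grad)
```

Analytically dz/dw = 2 * w * x, which is the sum of the two path contributions, so a shared parameter receives the total gradient after a single `backward()` call.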
What I am doing now is making a clone of the policy, giving P1 and P_clone, and using P1 for the first action selection and P_clone for all subsequent time steps.
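One way to test whether the clone is needed is to compare the gradient of a single shared policy with the gradients collected by the two copies. The sketch below assumes small `nn.Linear` layers as stand-ins for P and D, a quadratic `cost`, and a helper `rollout_loss`; all of these are hypothetical choices for illustration, not the actual networks.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: 3-dim state, 2-dim action.
P = nn.Linear(3, 2)        # policy: state -> action
D = nn.Linear(3 + 2, 3)    # dynamics: (state, action) -> next state

# Copies of P made before any backward pass, so they start without gradients.
P1 = copy.deepcopy(P)
P_clone = copy.deepcopy(P)


def cost(s):
    # Hypothetical quadratic cost on the state.
    return (s ** 2).sum()


def rollout_loss(policy_first, policy_rest, S0):
    # Two-step rollout matching the graph in the question.
    A1 = policy_first(S0)
    S1 = D(torch.cat([S0, A1]))
    A2 = policy_rest(S1)
    S2 = D(torch.cat([S1, A2]))
    return cost(S1) + cost(S2)


S0 = torch.randn(3)

# (a) One shared policy used at every time step.
rollout_loss(P, P, S0).backward()
grad_shared = P.weight.grad.clone()

# (b) Separate copies for the first and the later time steps.
# (D's gradients also accumulate across both calls; they are not inspected here.)
rollout_loss(P1, P_clone, S0).backward()
grad_split_sum = P1.weight.grad + P_clone.weight.grad

# The shared policy's gradient should equal the sum over the two copies.
print(torch.allclose(grad_shared, grad_split_sum, atol=1e-6))
```

In this sketch the shared policy already receives the contributions of both time steps in one `backward()` call. With the split version, the gradient from later time steps lives in `P_clone.weight.grad`, so it only enters the update if P_clone's parameters are also handed to the optimizer or its gradient is added to P1's by hand.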