Hi, I’m training a recurrent network and I want to inspect the intermediate gradients of the network’s output over time. However, the gradients I get are different depending on whether I retain them on each per-step output independently or on the single tensor produced by `torch.stack`, and I don’t fully understand why. Below is an example where the two versions give different values for `dE_dz`. Can someone help me understand it? Thank you
```python
# Version 1: retain the grad of each per-step output
for t in range(T):
    out, state = network(inputs[:, t], state)
    out.retain_grad()
    spikes.append(out)
...
loss.backward()
dE_dz = torch.stack([s.grad for s in spikes], 1)
```
```python
# Version 2: retain the grad of the stacked tensor
for t in range(T):
    out, state = network(inputs[:, t], state)
    spikes.append(out)
...
spikes = torch.stack(spikes, 1)  # stack along the same dim as version 1
spikes.retain_grad()
loss.backward()
dE_dz = spikes.grad
```
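To make the question easier to reproduce, here is a minimal self-contained toy sketch (my own simplification, not the actual network: a single weight `w` whose output feeds back into the state). My understanding is that `spikes.grad` on the stacked tensor only captures the gradient arriving at the `stack` node itself, while `out.retain_grad()` on each step also picks up the gradient flowing through the recurrent state into later time steps, which is why the numbers differ:

```python
import torch

T = 3
w = torch.tensor(0.5, requires_grad=True)

# Method 1: retain_grad on each per-step output.
state = torch.tensor(1.0)
spikes = []
for t in range(T):
    out = w * state   # toy "network" step
    state = out       # output feeds back into the state (recurrence)
    out.retain_grad()
    spikes.append(out)
loss = torch.stack(spikes).sum()
loss.backward()
# Includes gradient paths through the recurrent state:
grads_per_step = torch.stack([s.grad for s in spikes])

# Method 2: retain_grad on the stacked tensor.
state = torch.tensor(1.0)
spikes = []
w.grad = None
for t in range(T):
    out = w * state
    state = out
    spikes.append(out)
spikes = torch.stack(spikes)
spikes.retain_grad()
loss = spikes.sum()
loss.backward()
# Only the gradient flowing into the stack node (all ones for a sum loss):
grads_stacked = spikes.grad

print(grads_per_step)  # tensor([1.7500, 1.5000, 1.0000])
print(grads_stacked)   # tensor([1., 1., 1.])
```

With the sum loss, method 2 always yields ones, whereas method 1 additionally accumulates the contributions each `out` makes to later outputs through `state`.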