Hi, I’m training a recurrent network and I want to inspect the intermediate gradients of the network’s output over time. However, the gradients I get are different depending on whether I retain them on each per-step output independently or on the single tensor produced by `torch.stack`, and I don’t fully understand why. Below is an example where the two versions give different values for `dE_dz`. Can someone help me understand it? Thank you
```python
# Version 1: retain the grad of each per-step output
for t in range(T):
    out, state = network(inputs[:, t], state)
    out.retain_grad()
    spikes.append(out)
...
loss.backward()
dE_dz = torch.stack([s.grad for s in spikes], 1)
```
```python
# Version 2: retain the grad of the stacked tensor
for t in range(T):
    out, state = network(inputs[:, t], state)
    spikes.append(out)
...
spikes = torch.stack(spikes, 1)  # stack along the same dim as version 1
spikes.retain_grad()
loss.backward()
dE_dz = spikes.grad
```
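To make the question easier to reproduce, here is a minimal self-contained toy sketch (my own simplification, not the actual network: a single weight `w` whose output feeds back into the state). My understanding is that `spikes.grad` on the stacked tensor only captures the gradient arriving at the `stack` node itself, while `out.retain_grad()` on each step also picks up the gradient flowing through the recurrent state into later time steps, which is why the numbers differ:

```python
import torch

T = 3
w = torch.tensor(0.5, requires_grad=True)

# Method 1: retain_grad on each per-step output.
state = torch.tensor(1.0)
spikes = []
for t in range(T):
    out = w * state   # toy "network" step
    state = out       # output feeds back into the state (recurrence)
    out.retain_grad()
    spikes.append(out)
loss = torch.stack(spikes).sum()
loss.backward()
# Includes gradient paths through the recurrent state:
grads_per_step = torch.stack([s.grad for s in spikes])

# Method 2: retain_grad on the stacked tensor.
state = torch.tensor(1.0)
spikes = []
w.grad = None
for t in range(T):
    out = w * state
    state = out
    spikes.append(out)
spikes = torch.stack(spikes)
spikes.retain_grad()
loss = spikes.sum()
loss.backward()
# Only the gradient flowing into the stack node (all ones for a sum loss):
grads_stacked = spikes.grad

print(grads_per_step)  # tensor([1.7500, 1.5000, 1.0000])
print(grads_stacked)   # tensor([1., 1., 1.])
```

With the sum loss, method 2 always yields ones, whereas method 1 additionally accumulates the contributions each `out` makes to later outputs through `state`.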