I am having a hard time understanding the backpropagation cost of the two functionally equivalent approaches below. I have a function `model` which takes n inputs and returns n outputs, and I want to compute the gradient of each output with respect to the model parameters. I would expect approach 2, where I run the forward pass n times, to be slower than approach 1, which runs it once. However, in my example approach 1 takes 9.3s vs 2.6s for approach 2.
Is there something about the inner workings of autograd that would explain such a large difference?
- Approach 1: computing the gradients by looping over the outputs:

```python
outputs = model(inputs)
grads = [torch.autograd.grad(output, model.parameters(),
                             only_inputs=True, retain_graph=True, allow_unused=True)
         for output in outputs.unbind(0)]
```
- Approach 2: computing the gradients by looping over the batch inputs (note `inputs.shape[0]`, not `inputs.shape`, to iterate over the batch dimension):

```python
grads = [torch.autograd.grad(model(inputs[i]), model.parameters(),
                             only_inputs=True, retain_graph=True, allow_unused=True)
         for i in range(inputs.shape[0])]
```
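For reference, here is a minimal self-contained sketch of the two approaches, using an assumed toy MLP and input sizes (not my actual model) so the comparison can be reproduced end to end:

```python
import time
import torch

torch.manual_seed(0)
n, d = 64, 32  # assumed batch size and feature dimension for illustration
model = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.Tanh(), torch.nn.Linear(d, 1))
inputs = torch.randn(n, d)

# Approach 1: a single batched forward pass, then one backward pass per
# output, retaining the (large, batched) graph across all n calls.
start = time.perf_counter()
outputs = model(inputs)
grads1 = [torch.autograd.grad(o, model.parameters(), retain_graph=True, allow_unused=True)
          for o in outputs.unbind(0)]
t1 = time.perf_counter() - start

# Approach 2: one forward + backward pass per sample; each iteration builds
# and frees a small single-sample graph.
start = time.perf_counter()
grads2 = [torch.autograd.grad(model(inputs[i]), model.parameters(), allow_unused=True)
          for i in range(inputs.shape[0])]
t2 = time.perf_counter() - start

print(f"approach 1: {t1:.3f}s, approach 2: {t2:.3f}s")
```

Both approaches produce the same per-sample gradients, so any timing gap comes purely from how autograd traverses the graph in each case.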