I would like to ask a question about something I thought PyTorch had taken care of, but apparently it has not. The main training loop for a network looks like this:
outputs = model(images)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
I thought the parameter update (optimizer.step()) happened during the gradient computation (loss.backward()). Take a 3-layer network as an example: once the gradients of layer 3 have been computed, its parameters could be updated right away, in parallel with the gradient computation for layer 2. Instead, optimizer.step() runs only after loss.backward() has finished completely, which costs some efficiency. In fact, when I remove optimizer.step(), GPU utilization improves (say from 70% to 80%). My theory is that optimizer.step() keeps the training pipeline idle for quite some time, and the GPU goes hungry.
So, is overlapping the parameter updates with the gradient computation a hard thing to do, or does PyTorch simply choose not to do it for some reason? Thank you!