I would like to ask a question about something I thought PyTorch had taken care of, but apparently it has not. The main training loop for a network looks like this:
outputs = model(images)
loss = F.cross_entropy(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
I thought the parameter update (optimizer.step()) happened during the gradient computation (loss.backward()). Take a 3-layer network as an example: once the gradients of layer 3 have been computed, its parameters could be updated right away, in parallel with the gradient computation for layer 2. Instead, optimizer.step() runs only after loss.backward() has finished completely, which costs some efficiency. In fact, when I remove optimizer.step(), GPU utilization improves (say from 70% to 80%). My theory is that optimizer.step() keeps the training pipeline idle for quite some time, and the GPU goes hungry.
So, is overlapping the parameter updates with the gradient computation a hard thing to do, or does PyTorch simply choose not to do it for some reason? Thank you!