I was trying to measure the training time with different batch sizes. It shows that the backward() and step() times increase with larger batch sizes (the forward time also increases, but that’s expected). The following times are all averaged over all iterations in the epoch. (I believe I measured the time correctly with torch.cuda.synchronize().)

Could you show how you use synchronize? Your forward is suspiciously fast compared to the backward.
Also, the forward/backward being slower for larger batch sizes is expected, as there is more work to do!
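For reference, here is a minimal sketch of how each phase can be timed correctly. The model, sizes, and optimizer are placeholders for illustration, not the original poster's setup; the key point is synchronizing before reading the clock on both sides of the measured region, because CUDA kernels launch asynchronously:

```python
import time
import torch

# Toy model and data, just to illustrate the timing pattern.
model = torch.nn.Linear(512, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

def timed(fn):
    # CUDA kernels run asynchronously, so synchronize before reading
    # the clock on both sides of the region being measured.
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return out, time.perf_counter() - t0

loss, t_fwd = timed(lambda: torch.nn.functional.cross_entropy(model(x), y))
_, t_bwd = timed(lambda: loss.backward())
_, t_step = timed(lambda: opt.step())
print(f"forward {t_fwd:.4f}s  backward {t_bwd:.4f}s  step {t_step:.4f}s")
```

Without the synchronize calls, `time.perf_counter()` would often measure only the kernel launch, which is one common reason the forward looks implausibly fast.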

Thank you very much for the reply! Yes, I was expecting the forward time to increase with larger batch sizes, but I thought the backward and step times would stay the same. Here is the code

And the backward time will increase along with the forward time: the more work you do in the forward, the more work needs to be done in the backward.

Thanks! Is it right that forward() calculates the gradients for parameters with requires_grad=True, and backward() accumulates them? If so, it makes sense that backward() increases with batch size. But it still doesn’t make sense that step() also takes longer… Any idea?

As I understand it, in forward(), it computes the function, stores the inputs, and keeps a placeholder for the gradient coming from the next node (for the chain rule).

In backward(), starting from the gradient of the loss function, it computes all the gradients needed for the optimization.

In step(), it updates all the parameters using those gradients, depending on your optimization method (SGD, Adam, RMSProp, etc.).

So when you increase the batch size, all the computation costs rise, which leads to an increase in time.
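The three phases above can be sketched in one minimal training iteration (toy model and data, for illustration only):

```python
import torch

# Toy setup, not the original poster's model.
model = torch.nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randn(32, 1)
w_before = model.weight.detach().clone()

pred = model(x)                                # forward: compute outputs, save inputs for backward
loss = torch.nn.functional.mse_loss(pred, y)
loss.backward()                                # backward: compute and accumulate gradients via the chain rule
opt.step()                                     # step: update parameters from the stored gradients
opt.zero_grad()                                # clear gradients so they don't leak into the next batch
```

Note that zero_grad() matters here precisely because backward() accumulates into `.grad` rather than overwriting it.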

You are right: the forward only evaluates the function (and saves the necessary info for the backward).
The backward computes the gradients and accumulates them.
The step only performs the gradient update (this does not depend on the batch size, as the gradient accumulation is done during the backward).
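You can check this directly: whatever the batch size, each parameter's `.grad` has the parameter's own shape, so step() always touches the same number of elements. A small sketch (toy model, illustrative batch sizes):

```python
import torch

model = torch.nn.Linear(100, 10)
loss_fn = torch.nn.MSELoss()
elems = {}

for batch in (8, 1024):
    model.zero_grad()
    x = torch.randn(batch, 100)
    loss_fn(model(x), torch.randn(batch, 10)).backward()
    # The gradient tensors match the parameter shapes, independent of batch size.
    elems[batch] = sum(p.grad.numel() for p in model.parameters())
    print(batch, elems[batch])
```

Both batch sizes yield the same gradient element count (weight 100×10 plus bias 10), so any extra time in step() for larger batches points to a measurement issue, e.g. missing synchronization.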