Why do loss.backward() and optimizer.step() take longer with a larger batch size?

I was trying to measure the training time with different batch sizes. It shows that the backward() and step() times increase with a larger batch size (the forward time also increases, but that's expected). The following times are all averaged over all iterations in the epoch. (I believe I measured time correctly, with torch.cuda.synchronize().)

batch size = 96
Train Epoch #1: 100%|██████████| 3336/3336 [19:59<00:00, 2.78it/s, loss=5.85, top1=5.28, top5=15, data_time=0.00287, forward_time=0.0669, backward_time=0.197, cuda_time=0.00804, step_time=0.0417]
batch size = 64
Train Epoch #1: 100%|█████████▉| 4995/5004 [19:58<00:02, 3.90it/s, loss=6.16, top1=2.9, top5=9.46, data_time=0.00134, forward_time=0.0444, backward_time=0.135, cuda_time=0.00387, step_time=0.0225]
batch size = 32
Train Epoch #1: 100%|██████████| 10008/10008 [26:11<00:00, 6.37it/s, loss=6.84, top1=0.266, top5=1.22, data_time=0.000503, forward_time=0.0267, backward_time=0.0792, cuda_time=0.00204, step_time=0.0193]
Could you show how you use synchronize? Your forward is suspiciously fast compared to the backward :smiley:
Also, the forward/backward being slower for larger batch sizes is expected, as there is more work to do!
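For context, the usual pattern for timing individual training phases on GPU is to call torch.cuda.synchronize() before each time.time() read, because CUDA kernels launch asynchronously and would otherwise only be timed up to the launch. A minimal sketch (the model, batch, and shapes here are illustrative, not taken from the snippet below):

```python
import time
import torch

# Illustrative model and batch; any nn.Module works the same way.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)
images = torch.randn(64, 512, device=device)
labels = torch.randint(0, 10, (64,), device=device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def now():
    # Without synchronize, time.time() only measures kernel *launch* time.
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time()

t0 = now()
output = model(images)
loss = criterion(output, labels)
forward_time = now() - t0

t1 = now()
optimizer.zero_grad()
loss.backward()
backward_time = now() - t1

t2 = now()
optimizer.step()
step_time = now() - t2
```

The same pattern extends to any number of phases; the only rule is that every phase boundary must be a synchronized timestamp.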

Thank you very much for the reply! Yes, I was expecting the forward time to increase with a larger batch size, but I thought the backward and step times would stay the same. Here is the code:

            time0 = time.time()
            images, labels = images.cuda(), labels.cuda()
            target = labels
            torch.cuda.synchronize()
            cuda_time.update(time.time() - time0)

            # soft target
            if args.kd_ratio > 0:
                with torch.no_grad():
                    soft_logits = args.teacher_model(images).detach()
                    soft_label = F.softmax(soft_logits, dim=1)

            # clear gradients
            run_manager.optimizer.zero_grad()

            time1 = time.time()
            output = run_manager.net(images)
            loss = run_manager.train_criterion(output, labels)
            torch.cuda.synchronize()
            forward_time.update(time.time() - time1)

            time2 = time.time()
            loss.backward()
            torch.cuda.synchronize()
            backward_time.update(time.time() - time2)

            time3 = time.time()
            run_manager.optimizer.step()
            torch.cuda.synchronize()
            step_time.update(time.time() - time3)

Ok the timing looks good.

And the backward time will increase along with the forward time. The more work you do in the forward, the more there is to do in the backward :smiley:

Thanks! Is it right that forward() calculates the gradients for parameters with requires_grad=True, and backward() accumulates them? If so, it makes sense that backward() takes longer with a larger batch size. But it still doesn’t make sense that step() also takes longer… Any idea?

As I understand it, forward() computes the function, stores the inputs, and keeps a placeholder for the gradient coming from the next node (for the chain rule).

In backward(), starting from the gradient of the loss function, it computes all the gradients so they are ready for the optimizer.

In step(), the optimizer updates all the parameters using those gradients, depending on your optimization method (SGD, Adam, RMSprop, etc.).

So when you increase the batch size, all the computation costs rise, which leads to an increase in time.
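This division of labour can be checked directly: gradients appear only after backward(), and parameters change only after step(). A toy example with an illustrative layer:

```python
import torch

layer = torch.nn.Linear(3, 1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(8, 3)
loss = layer(x).pow(2).mean()

# After the forward: no gradients yet, only the graph has been recorded.
assert layer.weight.grad is None

loss.backward()
# After backward: .grad is populated for every parameter with requires_grad=True.
assert layer.weight.grad is not None

before = layer.weight.detach().clone()
opt.step()
# After step: the parameters have moved; gradients stay until zero_grad().
assert not torch.equal(before, layer.weight)
```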

You are right: the forward only evaluates the function (and saves the info needed for the backward).
The backward computes the gradients and accumulates them.
The step only performs the gradient update, and this does not depend on the batch size, because the reduction over the batch already happened during the backward.
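One way to see why step() should not depend on the batch size: the gradient tensor handed to the optimizer has the shape of the parameter, not of the batch, since the reduction over samples happens during backward(). A toy check (layer and sizes are illustrative):

```python
import torch

layer = torch.nn.Linear(10, 2)

for batch_size in (32, 64, 128):
    layer.zero_grad()
    loss = layer(torch.randn(batch_size, 10)).mean()
    loss.backward()
    # The optimizer only ever sees tensors of this fixed, parameter-sized
    # shape, so the work done in step() is the same for every batch size.
    assert layer.weight.grad.shape == layer.weight.shape
```

So if step() appears to grow with batch size in your measurements, the usual culprit is a missing synchronize: the step timer ends up absorbing kernels still queued from the backward.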
