Speed with different batch sizes

I’m having trouble understanding how batch size affects speed when the input requires grad.

Here is my code,

import time
import torch

def test_generate_time(model, epoch_num, bs):
    # Freeze the model weights; only the inputs are optimized
    for p in model.parameters():
        p.requires_grad = False
    model = model.cuda()

    imgs = torch.zeros((bs, 3, 224, 224)).float().cuda()
    imgs.requires_grad = True      # want to update inputs
    labels = torch.ones(bs).long().cuda()
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam([imgs], lr=1e-6)

    for epoch in range(epoch_num):
        epoch_begin = time.time()

        forward_start = time.time()
        out = model(imgs)
        loss = criterion(out, labels)
        forward_time = time.time() - forward_start

        backward_start = time.time()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        backward_time = time.time() - backward_start
        epoch_time = time.time() - epoch_begin
        print('Bs {} forward time {:.3f} backward time {:.3f} epoch_time {:.3f}'.format(bs, forward_time, backward_time, epoch_time))

And here are the results for batch size 1 and batch size 32:
Bs 1 forward time 0.008 backward time 0.017 epoch_time 0.025
Bs 32 forward time 0.021 backward time 0.296 epoch_time 0.317

Batch size seems to have a big influence on speed here, which contradicts what I know: when training a network’s weights (instead of the inputs), batch size has little impact on speed. I also ran the same timing test for ordinary image classification:
Bs 1 forward time 0.0018 backward time 0.006
Bs 32 forward time 0.0021 backward time 0.006

I really don’t know what’s happening here. Does anyone know why they differ? Or how to speed up training the inputs with a large batch size?

It looks like you’re running all your code on CUDA but you don’t do any synchronization when timing.
CUDA operations are asynchronous: a call returns once the kernel has been queued, not when it has finished. You need to manually call torch.cuda.synchronize() before each time.time() call if you want to measure the actual runtime.
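For example, a minimal sketch of synchronized timing for one update step could look like the code below. The helper names sync_if_cuda and time_input_step are made up for illustration, and the small network in the usage part is only a stand-in for your model, so treat this as a starting point rather than a drop-in replacement.

```python
import time

import torch


def sync_if_cuda(device):
    # Wait for all queued CUDA kernels to finish; does nothing on CPU.
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def time_input_step(model, imgs, labels, criterion, optimizer):
    """Hypothetical helper: time one forward pass and one update of imgs.

    Returns (forward_seconds, backward_seconds).
    """
    device = imgs.device

    sync_if_cuda(device)  # make sure earlier work is not counted
    start = time.perf_counter()
    loss = criterion(model(imgs), labels)
    sync_if_cuda(device)  # wait for the forward kernels to finish
    forward_time = time.perf_counter() - start

    start = time.perf_counter()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    sync_if_cuda(device)  # wait for backward and the optimizer step
    backward_time = time.perf_counter() - start

    return forward_time, backward_time


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Tiny stand-in network (an assumption, not the model from the question).
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
        torch.nn.AdaptiveAvgPool2d(1),
        torch.nn.Flatten(),
        torch.nn.Linear(8, 10),
    ).to(device)
    for p in model.parameters():
        p.requires_grad_(False)  # only the inputs are optimized

    criterion = torch.nn.CrossEntropyLoss()
    for bs in (1, 32):
        imgs = torch.zeros(bs, 3, 224, 224, device=device, requires_grad=True)
        labels = torch.ones(bs, dtype=torch.long, device=device)
        optimizer = torch.optim.Adam([imgs], lr=1e-6)

        # One untimed warm-up step so one-time setup cost is excluded.
        time_input_step(model, imgs, labels, criterion, optimizer)
        fwd, bwd = time_input_step(model, imgs, labels, criterion, optimizer)
        print("Bs {} forward time {:.4f} backward time {:.4f}".format(bs, fwd, bwd))
```

Without the synchronize calls, the forward timer mostly measures how long it takes to queue the kernels, and the pending work gets charged to whichever later call happens to block.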

Thanks a lot! That’s very helpful.