How to increase batch size with limited GPU memory

I’m facing a challenge working on an NLP application, where I can only use a batch size of at most 2 due to memory issues (I’m using an 8 GB GPU).

I want to increase my batch size because the model is not converging well with a small batch size.
My question is: instead of using gradient accumulation, can I use the following procedure?

`batch_size` is the required batch size, say 16
`max_batch_size` is the maximum batch size that fits in the limited memory, say 2

1) Calculate the loss for the batch of 16 in a forward pass without gradients.
2) Then pass batches of size 2 and calculate the gradients.
3) Update the gradients by averaging on each iteration.


for epoch in range(epochs):

    for idx, sample in enumerate(dataloader):
        # dataloader with batch_size=16
        # here calculating the loss for the batch of 16
        with torch.no_grad():
            outputs = model(inputs)
        loss = loss_fn(outputs, labels)

        # pass batches of 2 (because of the memory constraint) and calculate the gradients
        for idx in range(0, batch_size, max_batch_size):
            inputs, labels = sample[idx:idx + max_batch_size]

            # forward pass
            outputs = model(inputs)
            model.grads /= 2


No, your approach won’t work, since the loss was created from an output tensor which is not attached to any computation graph, so calling loss.backward() will fail. Besides that, your approach does indeed look like gradient accumulation, since you are executing the forward pass multiple times with smaller batch sizes, so you could directly call backward on these output tensors.
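For reference, standard gradient accumulation keeps the forward pass inside the graph and scales each micro-batch loss before calling backward. Here is a minimal self-contained sketch (the linear model, MSE loss, and random data are stand-ins, not the OP's actual setup):

```python
import torch
from torch import nn

# Toy stand-ins for the real model, loss, and data (assumptions for illustration)
torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(16, 10)   # one "large" batch of 16
labels = torch.randn(16, 1)

max_batch_size = 2                              # what fits in memory
num_accumulation_steps = 16 // max_batch_size   # = 8

optimizer.zero_grad()
for start in range(0, 16, max_batch_size):
    x = inputs[start:start + max_batch_size]
    y = labels[start:start + max_batch_size]
    out = model(x)                                    # forward WITH grad tracking
    loss = loss_fn(out, y) / num_accumulation_steps   # scale so grads average out
    loss.backward()                                   # gradients accumulate in .grad
optimizer.step()                                      # one update for the full batch of 16
```

Because each micro-batch loss is divided by `num_accumulation_steps`, the accumulated gradients match what a single forward/backward pass over the full batch of 16 would produce (batchnorm-style layers aside, as discussed below in the thread).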

Thank you very much for your quick response.

One more doubt: is there any difference in the output if I use gradient accumulation with NUM_ACCUMULATION_STEPS = 4 and a batch size of 4, instead of directly defining a batch size of 16?

Because in gradient accumulation the gradients are added and the loss is divided by NUM_ACCUMULATION_STEPS.

Whereas when defining a batch size of 16, the gradients calculated on the loss for a sample size of 16 will be averaged. How are these two things the same?

What I tried to do in the code above was to replicate the latter.
Is my thought process correct? Is it possible to do this?

Since you are scaling the losses during gradient accumulation, the gradients should be the same at the end. However, some layers use the batch size in their computation, such as batchnorm layers, which calculate the stats from the incoming activation tensor; these would show different behavior in the gradient accumulation case vs. the large-batch case.
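A quick toy check of that batchnorm caveat (my own illustration, not from the thread): in training mode, BatchNorm1d normalizes using the statistics of whatever batch it sees, so feeding 16 samples at once gives different outputs than feeding the same samples in micro-batches of 2.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4, affine=False)  # no learnable scale/shift, to isolate the stats
x = torch.randn(16, 4)

bn.train()
full = bn(x)  # normalized with the mean/var of all 16 samples

# Same samples, but stats computed per micro-batch of 2
micro = torch.cat([bn(x[i:i + 2]) for i in range(0, 16, 2)])

print(torch.allclose(full, micro))  # the outputs differ
```

This is why gradient accumulation is not exactly equivalent to a larger batch size once batch-statistics layers are involved.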