You have to reduce your batch size. But reducing the batch size too far might hurt model training performance. In that case you should use the gradient accumulation technique, with which your model's loss converges with a small batch size as if the batch size were larger.
Let’s say you want to set your batch size to “original_batch_size”, but you only get stable training without any “CUDA out of memory” errors at a smaller batch size of “new_batch_size”. You can then train your model in the following way:
# clear gradients left over from the last step
optimizer.zero_grad()
accumulations = original_batch_size // new_batch_size
scaled_loss = 0
for accumulated_step_i in range(accumulations):
    out = model(x)                  # forward pass on a small batch
    loss = some_loss(out, y)
    loss.backward()                 # gradients accumulate across iterations
    scaled_loss += loss.item()
# now update the model: the effective batch size is new_batch_size * accumulations
# (i.e. original_batch_size)
optimizer.step()
# scaled_loss was summed over the accumulated batches, so average it for logging
actual_loss = scaled_loss / accumulations
More about it here
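For completeness, here is a minimal, self-contained sketch of the same idea inside a regular PyTorch training loop. The model, data, loader, criterion, and accumulation_steps names are placeholder assumptions for illustration, not part of the snippet above.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder data and model, just for illustration
X = torch.randn(1024, 20)
y = torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=8)   # small batch that fits in memory

model = nn.Linear(20, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4              # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (batch_x, batch_y) in enumerate(loader):
    out = model(batch_x)
    loss = criterion(out, batch_y)
    # scale the loss so the accumulated gradient matches one big batch
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()            # update weights once per effective batch
        optimizer.zero_grad()       # reset gradients for the next accumulation window

Dividing the loss by accumulation_steps before backward() is a common variation: since backward() sums gradients across calls, the division keeps the gradient magnitude comparable to a single large batch, so the learning rate does not need to be retuned.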