How can you train your model on large batches when your GPUs can only hold a couple of small batches?

Hi @ptrblck

I am writing this message for you as you always have helped me with very good answers.

I am doing Kaggle competitions, but I always run into the problem that I can't fit a bigger batch size, and with the small batch sizes I get really bad results.

I have two 2080 Ti GPUs with 11 GB of memory each. Training on 300x300 images with a batch size of 8 gives me very bad results, and with 16 it always tells me that CUDA ran out of memory…

Can you help me, please?

Hi,

If you have two GPUs, you can use the nn.DataParallel module to run the model on both of them, which will allow you to double the batch size.
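
A minimal sketch of what that looks like, assuming model, criterion, opt and dataloader are already defined (those names are just placeholders here):

import torch.nn as nn

# Wrap the existing model; DataParallel splits each input batch across
# the visible GPUs and gathers the outputs back on the default device.
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

for batch, target in dataloader:
    batch, target = batch.cuda(), target.cuda()
    output = model(batch)             # each GPU processes half of the batch
    loss = criterion(output, target)
    loss.backward()
    opt.step()
    opt.zero_grad()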

Another approach is to compute the whole gradient in multiple steps:

# effective batch size = 4 * the dataloader's batch size
processed_batch = 0
for batch in dataloader:
    loss = get_loss(batch)   # forward pass on one small batch
    loss.backward()          # gradients accumulate in the .grad buffers
    processed_batch += 1
    if processed_batch == 4:
        opt.step()           # update with the accumulated gradients
        opt.zero_grad()
        processed_batch = 0

@albanD, I tried that, but the biggest batch size I can fit is still 8… Is the code you posted done with torch.utils.checkpoint to trade compute for memory? I don't know how to implement it…

Thanks.

checkpoint is different from what I posted.

batch_size_you_want = 64
max_batch_size = 8  # create the dataloader with this batch size

processed_samples = 0
for batch in dataloader:
    loss = get_loss(batch)             # forward pass on one small batch
    loss.backward()                    # gradients accumulate in the .grad buffers
    processed_samples += max_batch_size
    if processed_samples >= batch_size_you_want:
        opt.step()                     # update with the accumulated gradients
        opt.zero_grad()
        processed_samples = 0

This code will train as if you were using a batch size of 64, without ever using more memory than a batch size of 8 requires.
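
One detail worth noting (not shown in the snippet above): if your loss uses the default mean reduction, you may want to divide each small-batch loss by the number of accumulated batches so that the summed gradients match the average over the full effective batch, e.g.

accumulation_steps = batch_size_you_want // max_batch_size  # 8 in this example
loss = get_loss(batch) / accumulation_steps                 # scale before backward
loss.backward()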

Checkpointing is a bit different. You will need to modify your model's forward pass to wrap groups of operations in a torch.utils.checkpoint call. This frees the memory used by the intermediate buffers between the ops you handed to checkpoint, at the cost of recomputing them during the backward pass.
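
A minimal sketch of what that wrapping can look like, using a made-up model just to show where the checkpoint call goes:

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    # Hypothetical model; only the checkpoint call in forward() matters.
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64, 10)

    def forward(self, x):
        # Run the stem normally so the checkpointed segment gets an input
        # that requires grad (the raw data usually does not).
        x = self.stem(x)
        # The intermediate activations inside self.block are not stored;
        # they are recomputed during backward, trading compute for memory.
        x = checkpoint(self.block, x)
        x = x.mean(dim=[2, 3])  # global average pooling
        return self.head(x)

For purely sequential models there is also torch.utils.checkpoint.checkpoint_sequential, which splits an nn.Sequential into segments and checkpoints each one for you.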