Can distributed training be used to solve the CUDA out of memory error

I am finetuning DialoGPT-2 and VGG-16 together for a particular task, but due to the heavy nature of both models, the training loop crashes before the first iteration even completes. I have been reading up on distributed training. Can it be used to solve this issue? If not, what other solutions can I apply?

Data-Parallel: Distributed training, either DistributedDataParallel (DDP) or DataParallel, across multiple GPUs (or multiple nodes with DDP) lets you use a larger overall batch, which could help. For example, if you run out of memory with a batch of 16, you could instead use a batch of 4 per GPU across 4 GPUs. One caveat: if your per-GPU batch becomes very small and you use regular batch norm, the batch statistics are computed per GPU and are not synchronized unless you switch to SyncBatchNorm, which is noticeably slower.
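Roughly, a DDP training loop looks like the sketch below. This is a minimal example assuming a launch such as torchrun --nproc_per_node=4 train.py; the linear model and random dataset are placeholders standing in for your actual DialoGPT/VGG-16 pipeline.

# Minimal DDP sketch (assumes torchrun sets LOCAL_RANK and the process group env vars)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 2).cuda(local_rank)  # placeholder model
# convert any BatchNorm layers to SyncBatchNorm (a no-op for this toy model)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(640, 128), torch.randint(0, 2, (640,)))
sampler = DistributedSampler(dataset)  # shards the data across ranks
loader = DataLoader(dataset, batch_size=4, sampler=sampler)  # 4 per GPU * 4 GPUs = 16

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):
	sampler.set_epoch(epoch)  # reshuffle consistently across ranks each epoch
	for x, y in loader:
		x, y = x.cuda(local_rank), y.cuda(local_rank)
		optimizer.zero_grad()
		loss = criterion(model(x), y)
		loss.backward()  # DDP all-reduces the gradients here
		optimizer.step()

dist.destroy_process_group()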

Model-Parallel: However, if you are running out of memory even with a batch size of 1, you could investigate whether the model can be split across multiple GPUs, with different parts of the network living on different devices.
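A minimal sketch of that idea, assuming 2 GPUs are available; the two sub-networks here are placeholders (e.g. you might put the VGG-16 encoder on one device and DialoGPT on the other):

import torch
import torch.nn as nn

class SplitModel(nn.Module):
	def __init__(self):
		super().__init__()
		# first half of the model lives on GPU 0, second half on GPU 1
		self.part1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
		self.part2 = nn.Sequential(nn.Linear(256, 2)).to("cuda:1")

	def forward(self, x):
		x = self.part1(x.to("cuda:0"))
		# move the intermediate activations to the second device
		return self.part2(x.to("cuda:1"))

model = SplitModel()
out = model(torch.randn(8, 128))  # output tensor lives on cuda:1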

Synthetic Batch -> Get a larger effective batch without adding GPUs

This will run slower, but you can accumulate gradients across several forward passes and only step the optimizer every num_steps_per_grad passes (gradient accumulation). For example, if you want an effective batch of 16 but can only fit a batch of 2:

# How many forward passes to accumulate before each optimizer update
num_steps_per_grad = 8

step_count = 0
for data in dataloader:
	step_count += 1
	# ... forward pass and loss computation for this mini-batch ...
	# scale the loss so the accumulated gradient averages over the effective batch
	loss = loss / num_steps_per_grad
	loss.backward()
	# with batch_size == 2 and num_steps_per_grad == 8, effective batch size ~= 16
	if step_count == num_steps_per_grad:
		step_count = 0
		optimizer.step()
		optimizer.zero_grad()
		lr_sched.step()