I am finetuning DialoGPT-2 and VGG-16 together for a particular task, but due to the heavy nature of both models, the training loop is crashing before the first iteration even completes. I was reading up about distributed training. Can it be used to solve this issue? If not, what other solutions can I apply?
Data-Parallel: Distributed training with either DistributedDataParallel (DDP) or DataParallel across multiple GPUs (or across nodes, with DDP) should let you use a bigger total batch, which could help. For example, if you run out of memory with a batch of 16, you could switch to a batch of 4 per GPU across 4 GPUs. One caveat: if the per-GPU batch gets too small and you are using regular batch norm, the batch statistics are computed per GPU and not synchronized, unless you use SyncBatchNorm, which is quite a bit slower.
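A minimal sketch of what a DDP setup could look like, assuming you launch with `torchrun` (which sets `RANK`/`LOCAL_RANK`/`WORLD_SIZE`); the `nn.Linear` model and random tensors are just placeholders for your DialoGPT-2/VGG-16 pipeline and data:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU; torchrun provides the env vars used here
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda()  # placeholder for your combined model
    # If the model contains BatchNorm layers, convert them so statistics are
    # synchronised across GPUs (slower, but correct for small per-GPU batches)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,)))
    sampler = DistributedSampler(dataset)              # shards data across ranks
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)  # 4 per GPU

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                       # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()                            # gradients are averaged across GPUs
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You would launch this with something like `torchrun --nproc_per_node=4 train_ddp.py` (script name is hypothetical); DDP then averages the gradients across the 4 processes on every `backward()`.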
Model-Parallel: However, if you are running out of memory even with a batch size of 1, then you could investigate whether the model can be split across multiple GPUs.
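If you go down that route, a very simplified sketch of manual model parallelism looks like this; it assumes at least two visible GPUs, and the two `nn.Sequential` halves are stand-ins for your real sub-models, not an actual DialoGPT-2/VGG-16 split:

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # first half of the network lives on cuda:0
        self.part1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to("cuda:0")
        # second half lives on cuda:1
        self.part2 = nn.Linear(256, 2).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # move the intermediate activations to the second GPU
        return self.part2(x.to("cuda:1"))

model = TwoDeviceModel()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 128)
y = torch.randint(0, 2, (8,), device="cuda:1")  # labels on the same device as the output
loss = loss_fn(model(x), y)
loss.backward()                                  # autograd handles the cross-device graph
optimizer.step()
```

The main cost is that the GPUs work mostly sequentially (one sits idle while the other computes) unless you add pipeline-style scheduling on top.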
Synthetic Batch (gradient accumulation): Get a larger effective batch without adding GPUs.
This will run slower, but you can accumulate gradients across several small batches and only step your optimizer every `num_steps_per_grad` forward passes. For example, if you want an effective batch of 16 but can only fit a batch of 2:
# How many forward passes to accumulate before updating the weights
num_steps_per_grad = 8

step = 0  # renamed from `iter` to avoid shadowing the built-in
optimizer.zero_grad()
for data in dataloader:
    step += 1
    # ... forward pass on `data`, producing `loss` for this mini-batch ...
    # Scale the loss so the accumulated gradient matches one large batch
    # (assumes the loss is averaged over the mini-batch, i.e. reduction="mean")
    (loss / num_steps_per_grad).backward()
    # if batch_size == 2 and num_steps_per_grad == 8, effective batch size ~ 16
    if step == num_steps_per_grad:
        step = 0
        optimizer.step()
        optimizer.zero_grad()
        lr_sched.step()