I need to train multiple small models in parallel to speed up the overall training process on a node with four GPUs. Each model's training uses only about 50% of a single GPU.
In my use case, all the models are independent, so there is no synchronization between any of them.
I have tried multiple methods, but nothing is working, so I would appreciate any advice.
The following are the methods I tried and the problems I ran into:
- DDP: Training a single model with DDP takes longer than training it without DDP, even when I turn off synchronization between models. I am not sure where the additional overhead comes from.
- Multiprocessing: The behavior of multiprocessing with CUDA is strange. I created multiple threads, each of which runs one training job, but the program only works some of the time; when it does not, it hangs without printing any error message.
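For what it's worth, a hang like that is a known symptom of mixing CUDA with the default "fork" start method. A common workaround is to launch one fully independent process per model via the "spawn" context, pinning each process to a GPU. Below is a minimal structural sketch of that pattern; `train_one_model`, the two-models-per-GPU assignment, and the result message are placeholder assumptions (in a real run you would call `torch.cuda.set_device(gpu_id)` and your actual training loop inside the worker):

```python
import multiprocessing as mp

def train_one_model(gpu_id, model_id, result_queue):
    # Placeholder worker: in a real script, pin this process to one GPU
    # (e.g. torch.cuda.set_device(gpu_id)), build model `model_id`, and
    # run its training loop here. No synchronization with other workers.
    result_queue.put((model_id, f"trained on GPU {gpu_id}"))

def main():
    # "spawn" gives every worker a fresh interpreter. The default "fork"
    # start method can deadlock once the parent process has initialized
    # CUDA, which matches the silent hang described above.
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()

    # Assumption: two models per GPU, since each uses ~50% of a card.
    procs = []
    for model_id in range(8):
        gpu_id = model_id % 4
        p = ctx.Process(target=train_one_model,
                        args=(gpu_id, model_id, result_queue))
        p.start()
        procs.append(p)

    # Drain the queue before joining so workers never block on a full pipe.
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    main()
```

Since the models are independent, DDP buys you nothing here; plain per-process isolation like this (or simply launching eight separate scripts with `CUDA_VISIBLE_DEVICES` set per process) avoids DDP's collective-communication overhead entirely.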
Any advice is appreciated!