I need to train multiple small models in parallel to speed up the overall training process on a node with four GPUs. Each model's training uses only about 50% of a single GPU.
In my use case, all the models are independent, so there is no synchronization between any of them.
I have tried multiple methods, but nothing is working, so I would appreciate any advice.
The following are the methods I tried and the problems I ran into:
- DDP: Training a single model with DDP takes longer than training it without DDP, even when I turn off synchronization between models. I am not sure where the additional overhead comes from.
- Multiprocessing: The behavior of multiprocessing with CUDA is strange. I created multiple threads, each of which runs one training job, but the program only works some of the time; when it does not, it hangs without printing any error message.
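For what it's worth, a hang like that is a known symptom of mixing CUDA with the default "fork" start method. A common workaround is to launch one fully independent process per model via the "spawn" context, pinning each process to a GPU. Below is a minimal structural sketch of that pattern; `train_one_model`, the two-models-per-GPU assignment, and the result message are placeholder assumptions (in a real run you would call `torch.cuda.set_device(gpu_id)` and your actual training loop inside the worker):

```python
import multiprocessing as mp

def train_one_model(gpu_id, model_id, result_queue):
    # Placeholder worker: in a real script, pin this process to one GPU
    # (e.g. torch.cuda.set_device(gpu_id)), build model `model_id`, and
    # run its training loop here. No synchronization with other workers.
    result_queue.put((model_id, f"trained on GPU {gpu_id}"))

def main():
    # "spawn" gives every worker a fresh interpreter. The default "fork"
    # start method can deadlock once the parent process has initialized
    # CUDA, which matches the silent hang described above.
    ctx = mp.get_context("spawn")
    result_queue = ctx.Queue()

    # Assumption: two models per GPU, since each uses ~50% of a card.
    procs = []
    for model_id in range(8):
        gpu_id = model_id % 4
        p = ctx.Process(target=train_one_model,
                        args=(gpu_id, model_id, result_queue))
        p.start()
        procs.append(p)

    # Drain the queue before joining so workers never block on a full pipe.
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    main()
```

Since the models are independent, DDP buys you nothing here; plain per-process isolation like this (or simply launching eight separate scripts with `CUDA_VISIBLE_DEVICES` set per process) avoids DDP's collective-communication overhead entirely.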
Any advice is appreciated!