Training multiple small models in parallel

I need to train multiple small models in parallel to speed up training, using a node with four GPUs. Training a single model uses only about 50% of one GPU, so I would like to run two models per GPU.
In my use case, all the models are independent, so there is no synchronization between any of them.
I tried multiple methods, but nothing works. I'd appreciate any advice on this.

The following are the methods I tried and the problems with them:

  1. DDP
    Training a single model with DDP takes longer than training it without DDP, even when I turn off synchronization between models. I am not sure where the additional overhead comes from.

  2. Multiprocessing
    The behavior of multiprocessing with CUDA is strange.
    I created multiple worker processes, each of which runs one training job.
    But the program only works some of the time; when it fails, it hangs without raising any error.
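For what it's worth, the intermittent hang in the multiprocessing approach is most often caused by using the default `fork` start method after CUDA has already been initialized in the parent process. Below is a minimal sketch of the pattern that usually avoids it: use the `spawn` start method and keep all CUDA work inside the workers. The function names (`assign_jobs`, `train_one_model`) and the 4 GPUs / 2 models-per-GPU split are assumptions for illustration, not anyone's actual code.

```python
import multiprocessing as mp

def assign_jobs(num_gpus, models_per_gpu):
    """Round-robin each independent model onto a GPU."""
    return [(m, m % num_gpus) for m in range(num_gpus * models_per_gpu)]

def train_one_model(model_id, gpu_id):
    # Placeholder for the real training loop. In the actual script you
    # would do something like:
    #   import torch
    #   torch.cuda.set_device(gpu_id)
    #   model = build_model().to(f"cuda:{gpu_id}")  # hypothetical builder
    #   ... run the training loop ...
    # The key point: keep all torch/CUDA work inside the worker, so the
    # parent process never touches CUDA before the workers are spawned.
    return model_id, gpu_id

def main():
    # CUDA state does not survive fork(); on Linux the default 'fork'
    # start method is a classic cause of silent hangs, so request 'spawn'.
    ctx = mp.get_context("spawn")
    jobs = assign_jobs(num_gpus=4, models_per_gpu=2)  # ~50% GPU per model
    with ctx.Pool(processes=len(jobs)) as pool:
        results = pool.starmap(train_one_model, jobs)
    return results

if __name__ == "__main__":
    print(main())
```

If you use `torch.multiprocessing` instead of the standard library module, `torch.multiprocessing.spawn` enforces the spawn start method for you.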

Any advice is appreciated!


Hi, I'm facing the same challenge here. Did you find any solution? Thank you.