Distributed training of multiple models on multiple nodes (CPU only)

Hi all,

I have been trying to figure out how to train a population of models on multiple nodes (which do not have GPUs, but that’s not the main point; I’m happy training on CPUs). Ideally, I would like a single process per model, each running on its own CPU. I can request hundreds or thousands of CPUs, and each model is fully self-contained: I don’t need to share any parameters across nodes; I just want each model to train independently on its own CPU.

I have tried using a worker pool from torch.multiprocessing and passing the models to the training function. I train each model for one epoch, then perform some processing in the main process, and then map the models to the worker pool again to train them for another epoch, and so on. That works fine when I run everything on a single machine, but it doesn’t scale to multiple nodes because torch.multiprocessing is not aware of the additional nodes (I requested 256 CPUs on the cluster, which translates to 8 nodes with 32 CPUs each, but 7 of those nodes remained idle).
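
For concreteness, the single-node loop I have now looks roughly like this (make_model, get_dataloader, and process_population stand in for my actual code):

```python
import torch
import torch.multiprocessing as mp

# make_model, get_dataloader, and process_population stand in for my real code.
def train_one_epoch(model):
    loader = get_dataloader()                      # same data for every model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for x, y in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    population = [make_model() for _ in range(64)]
    with mp.Pool(processes=16) as pool:            # limited to one machine's CPUs
        for epoch in range(10):
            population = pool.map(train_one_epoch, population)
            population = process_population(population)  # done in the main process
```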

As far as I can tell, all the examples I have found (for example, using torch.distributed here) assume that you have a single large model and want to spread the training of that one model across multiple workers. That is not my case: my models are small, and I would like to train them in parallel but independently of each other. They are, however, trained on the same task using the same data, in case that is relevant.

Any help would be appreciated! Apologies if I’m missing something obvious.

IIUC, DistributedDataParallel does not fit this use case, because you have a population of independent models to train on the same set of data, rather than one big model on different splits of the input data. The experimental torch.distributed.rpc package might be helpful here; it would at least take care of the communication for you. But you would still need to write the code that dispatches models to the workers in the pool.
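
A rough sketch of what that dispatch code could look like with torch.distributed.rpc, assuming one process per node launched with the usual RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables set (make_model, train_one_epoch, and process_population are placeholders for your own code):

```python
import os
import torch.distributed.rpc as rpc

# make_model, train_one_epoch, and process_population are placeholders;
# train_one_epoch(model) is assumed to train one small model and return it.
POPULATION_SIZE = 64
EPOCHS = 10

def run(rank, world_size):
    # Every process joins the same RPC group; init_rpc reads
    # MASTER_ADDR / MASTER_PORT from the environment by default.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # Rank 0 owns the population and dispatches one epoch of training
        # per model to the other workers, round-robin.
        population = [make_model() for _ in range(POPULATION_SIZE)]
        for _ in range(EPOCHS):
            futures = [
                rpc.rpc_async(f"worker{1 + i % (world_size - 1)}",
                              train_one_epoch, args=(model,))
                for i, model in enumerate(population)
            ]
            population = [f.wait() for f in futures]      # gather trained models
            population = process_population(population)   # your per-epoch step

    # Ranks > 0 simply serve incoming RPCs; shutdown() blocks until all work is done.
    rpc.shutdown()

if __name__ == "__main__":
    run(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))
```

The idea is that rank 0 keeps the population and does the per-epoch processing locally, while the remaining ranks only execute the training calls they receive over RPC, so no parameters are ever shared between models.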