Predefined model and repeated training calls using Distributed Data Parallel


I am trying to implement DDP to speed up training in my project. I attempted to follow multiple tutorials (example). In all these tutorials, the model and dataloaders are created inside the spawned processes. This does not work for me, as I have to produce some metrics for the model and modify it between epochs, and repeat this process numerous times. Is it possible to create the model in the main process, where I can perform the inter-epoch modifications, and have the workers run the training loop again upon command from the main process?

For those curious, I am trying to use DDP with the Hardware Aware Quantization framework.

If I understand correctly, you are trying to modify the model architecture between epochs during training. This doesn’t fit well into the DDP paradigm: DDP expects the model to be replicated identically across all workers, so any architectural change needs to be synced across them correctly.

I think RPC could be a potential solution for you. You can store the model on the main process (rank 0), perform updates there, and use RPC calls, passing in `model.state_dict()`, to have workers execute the training loop.

As an alternative, you can re-wrap your model with DDP after modifying it.
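A sketch of the re-wrap approach, using the `gloo` backend on CPU with `world_size=1` for illustration; the modification shown (appending a `ReLU`) is just a stand-in for whatever inter-epoch change you make:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)
# ... run the training loop for an epoch ...

# Modify the underlying module (ddp_model.module), not the DDP wrapper.
modified = torch.nn.Sequential(ddp_model.module, torch.nn.ReLU())
# Re-wrap so DDP re-registers its gradient hooks for the new parameter set.
ddp_model = DDP(modified)

dist.destroy_process_group()
```

The key point is that the re-wrap has to happen on every rank, with every rank holding an identical copy of the modified model before `DDP(...)` is called.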


I’ll look into this.

That would be an ideal solution, but the modifications will happen only on the master process. How do I communicate the updated model to the other processes?
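One possible way to do that, sketched here with the `gloo` backend and `world_size=1` for illustration, is to broadcast the updated `state_dict` from rank 0 to all ranks before re-wrapping (this assumes every rank already holds the same modified architecture):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)

# Rank 0 supplies the modified weights; other ranks pass a placeholder.
obj = [model.state_dict() if dist.get_rank() == 0 else None]
dist.broadcast_object_list(obj, src=0)
# After the broadcast, every rank loads the same weights.
model.load_state_dict(obj[0])

dist.destroy_process_group()
```

With more than one rank, each process would run this same code; only the `src=0` rank's `state_dict` survives the broadcast.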