Regarding the timing of synchronization of a model's weights when trained using DDP

tanabe.d.zc32s · February 1, 2023, 9:20am

Hello.
I have recently been trying out data-parallel training with Distributed Data Parallel (DDP).

I understand that training with DDP works by creating a replica of the model on each device (e.g. each GPU), splitting the data across the replicas, training each replica on its own share, and synchronizing the weights. My question is: when are the weights synchronized between replicas? At each epoch?

I would appreciate it if you could enlighten me.
Best Regards.

ptrblck · February 1, 2023, 10:26am

DDP will broadcast the state_dict during the construction of the DDP object as described in the DDP - Internal Design docs.
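
To make the timing concrete: as far as the DDP design notes describe it, the weights themselves are copied only once, by that broadcast at construction. After that, the gradients are averaged across replicas (all-reduce) during every backward pass, so each replica applies the same optimizer update and the weights stay identical after every iteration, not just at epoch boundaries. The sketch below is a plain-Python simulation of that pattern, not DDP itself. The helper names (`broadcast`, `local_gradient`, `all_reduce_mean`, `train`) and the toy model y = w * x are assumptions made for illustration and do not use or mirror the torch API.

```python
# Toy simulation of the DDP synchronization pattern (no torch required).
# Hypothetical helpers; the "model" is a single weight w in y = w * x.

def broadcast(source_weight, world_size):
    """Construction time: every replica starts from rank 0's weight."""
    return [source_weight for _ in range(world_size)]

def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    """Stand-in for the gradient all-reduce: every rank gets the average."""
    avg = sum(grads) / len(grads)
    return [avg for _ in grads]

def train(shards, steps=20, lr=0.01, init_weight=0.0):
    # One-time weight synchronization, as at DDP construction.
    weights = broadcast(init_weight, len(shards))
    history = []
    for _ in range(steps):
        # Each replica computes a gradient on its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Gradients (not weights) are synchronized on every iteration.
        grads = all_reduce_mean(grads)
        # Identical weights + identical gradients => identical updates.
        weights = [w - lr * g for w, g in zip(weights, grads)]
        history.append(list(weights))
    return history

# Two simulated ranks, each holding a different shard of data from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
history = train(shards)
in_sync = all(len(set(step)) == 1 for step in history)
print("replicas identical after every step:", in_sync)
```

In this simulation the two replicas see different data, yet their weights never diverge, because the only quantity that differs between them (the local gradient) is averaged before the update.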

tanabe.d.zc32s · February 3, 2023, 12:27am

Thank you for your reply.
I now understand the broadcasting step.