I’m studying distributed training using PyTorch DDP.
I have a question about data parallelism. When GPUs communicate with each other (all_reduce), do they exchange gradients or parameters? I thought it was gradients.
But when I set the bucket size to 25 MB in PyTorch DDP, I see that the Avg size is 92 MB, not 25 MB. I think 92 MB is the size of the transferred model parameters, divided into buckets. Which is right?
(I used a VGG16 model, which is about 528 MB, and I profiled 10 steps, so the number of all_reduce calls per step is 6.)
I would also like to know the relationship between model parameter size and gradient size.
I saw your comment in the other thread, but could you provide a minimal script to reproduce the behavior?
I’m sorry, but I can’t share the script. I’m just wondering why the bucket size and the Avg size (of transferred data) are different. I expected them to be the same, and I want to understand where my reasoning went wrong.
Unfortunately, it is hard for me to say for sure without seeing the code or at least the full model definition. Would it be possible to print out the model and provide the parameter sizes?
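If it helps, something along these lines would list the sizes. It is only a minimal sketch: `summarize_parameters` and `print_summary` are hypothetical helper names, and the code assumes nothing beyond each parameter exposing `.numel()` and `.element_size()`, as PyTorch tensors do. A dense gradient has the same shape and dtype as its parameter, so the byte counts it reports should also be the gradient sizes.

```python
def summarize_parameters(named_params):
    """Return (name, numel, nbytes) rows for an iterable of (name, tensor) pairs.

    Works with anything exposing .numel() and .element_size(), such as the
    tensors yielded by model.named_parameters() on a torch.nn.Module.
    """
    rows = []
    for name, p in named_params:
        n = p.numel()
        rows.append((name, n, n * p.element_size()))
    return rows


def print_summary(rows):
    """Print one line per parameter and return the total size in bytes."""
    total = 0
    for name, numel, nbytes in rows:
        total += nbytes
        print(f"{name:<30} {numel:>12,d} elems {nbytes / 2**20:>10.2f} MiB")
    print(f"total: {total / 2**20:.2f} MiB")
    return total


# With PyTorch installed, usage would look like:
#   print(model)
#   print_summary(summarize_parameters(model.named_parameters()))
```

Posting that output should be enough to see which parameters dominate, without sharing the training script.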
One possibility off the top of my head is that some parameters are individually large, which may cause a bucket to exceed the 25 MB limit. For your model, do you think this could be happening?
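To make that concrete, here is a rough sketch of how it could play out for VGG16. It rests on several assumptions: the standard torchvision-style VGG16 layer shapes with 1000 classes and float32 weights, a simplified greedy rule that walks parameters in reverse registration order and closes a bucket once its accumulated size reaches the cap, and a smaller 1 MiB cap for the first bucket. The names `vgg16_param_bytes` and `greedy_buckets` are made up for illustration; this is not DDP's actual bucketing code, and DDP may also rebuild its buckets after the first iteration.

```python
MIB = 2**20
BYTES_PER_ELEM = 4  # assumption: float32 parameters and gradients

# Assumed VGG16 layout: 13 conv layers with 3x3 kernels, then 3 linear layers.
CONV_CHANNELS = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
FC_SHAPES = [(25088, 4096), (4096, 4096), (4096, 1000)]


def vgg16_param_bytes():
    """Per-parameter sizes in bytes, in registration order (weight, then bias)."""
    sizes = []
    for c_in, c_out in CONV_CHANNELS:
        sizes.append(c_in * c_out * 9 * BYTES_PER_ELEM)  # 3x3 conv weight
        sizes.append(c_out * BYTES_PER_ELEM)             # conv bias
    for n_in, n_out in FC_SHAPES:
        sizes.append(n_in * n_out * BYTES_PER_ELEM)      # linear weight
        sizes.append(n_out * BYTES_PER_ELEM)             # linear bias
    return sizes


def greedy_buckets(sizes, cap_bytes, first_cap_bytes=None):
    """Simplified model: close a bucket as soon as it reaches its cap.

    A bucket can therefore overshoot the cap by up to one whole parameter.
    """
    limit = cap_bytes if first_cap_bytes is None else first_cap_bytes
    buckets, current = [], 0
    for nbytes in reversed(sizes):
        current += nbytes
        if current >= limit:
            buckets.append(current)
            current = 0
            limit = cap_bytes  # only the first bucket uses the smaller cap
    if current:
        buckets.append(current)
    return buckets


sizes = vgg16_param_bytes()
total = sum(sizes)
buckets = greedy_buckets(sizes, 25 * MIB, first_cap_bytes=1 * MIB)

print(f"total gradient bytes: {total:,d} ({total / MIB:.1f} MiB)")
for i, b in enumerate(buckets):
    print(f"bucket {i}: {b / MIB:7.1f} MiB")
print(f"average per bucket: {total / len(buckets) / 1e6:.1f} MB")
```

Under these assumptions the sketch produces six buckets, and the first fully connected layer's weight alone (25088 x 4096 float32 values, about 392 MiB) lands in a single bucket far above the 25 MB cap. The total of roughly 553 MB spread over six buckets averages about 92 MB, which would be consistent with the numbers reported above, although it does not confirm that DDP bucketed the model exactly this way.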