Hi,
I’m using PyTorch DDP with VGG16. I set the bucket size to 25 MB, and VGG16 is about 528 MB, but the all_reduce kernel is launched only 6 times (5 times with a 50 MB bucket size, and 3 times with a 10 MB bucket size). 6 x 25 MB = 150 MB, which does not match 528 MB.
Also, the Communication Operations Stats table in the Distributed tab of TensorBoard reports an Avg Size (bytes) for all_reduce of nearly 92 MB. As I understand it, the Avg Size column is the size of the transferred data, and 92 MB does not match the 25 MB bucket size either.
My understanding is that all_reduce is launched once a bucket is full of gradients. Is the model size different from the gradient size? I’m confused. Any help would be greatly appreciated.
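For reference, here is a minimal sketch of the arithmetic behind my expectation. It is not DDP's actual bucketing logic. It assumes the standard 1000-class VGG16 layer shapes, fp32 parameters, and 1 MB = 1024 * 1024 bytes; the names (`CONV_CHANNELS`, `FC_SHAPES`, `tensor_sizes`) are made up for illustration.

```python
import math

# Assumed: standard VGG16 layout, 3x3 convolutions, 1000-class classifier.
CONV_CHANNELS = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 256),
                 (256, 512), (512, 512), (512, 512),
                 (512, 512), (512, 512), (512, 512)]
FC_SHAPES = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
BYTES_PER_PARAM = 4          # assumed fp32 gradients
MIB = 1024 * 1024
BUCKET_MIB = 25              # the bucket size I configured


def tensor_sizes():
    """Return (name, element count) for every weight and bias tensor."""
    sizes = []
    for i, (cin, cout) in enumerate(CONV_CHANNELS):
        sizes.append((f"conv{i}.weight", cin * cout * 3 * 3))
        sizes.append((f"conv{i}.bias", cout))
    for i, (fin, fout) in enumerate(FC_SHAPES):
        sizes.append((f"fc{i}.weight", fin * fout))
        sizes.append((f"fc{i}.bias", fout))
    return sizes


sizes = tensor_sizes()
total_params = sum(count for _, count in sizes)
total_bytes = total_params * BYTES_PER_PARAM

# Naive estimate: total gradient bytes split evenly into full buckets.
naive_buckets = math.ceil(total_bytes / (BUCKET_MIB * MIB))

# Largest single tensor, to compare against the bucket size.
largest_name, largest_count = max(sizes, key=lambda s: s[1])

print(f"total parameters: {total_params}")
print(f"gradient size: {total_bytes / MIB:.1f} MiB")
print(f"naive {BUCKET_MIB} MiB bucket count: {naive_buckets}")
print(f"largest tensor: {largest_name}, "
      f"{largest_count * BYTES_PER_PARAM / MIB:.1f} MiB")
```

Under these assumptions the naive estimate is 22 buckets, far more than the 6 all_reduce launches I observe. The sketch also shows that the first fully connected weight alone is 392 MiB, much larger than a single bucket, which may be relevant to the mismatch.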
Thank you.