PyTorch DDP bucket size does not match model size


I’m using PyTorch DDP with VGG16. I set `bucket_cap_mb=25`, and the size of VGG16 is about 528 MB, but the all_reduce kernel is launched 6 times. (And 5 times for a 50 MB bucket size, 3 times for a 10 MB bucket size.) I think 6 x 25 does not match 528.

Also, the Communication Operations Stats in the Distributed tab on TensorBoard say the Avg Size (bytes) of all_reduce is nearly 92 MB. I understand the Avg Size column to be the size of the transferred data, and 92 MB also does not match the 25 MB bucket size.

I understand that all_reduce is launched when a bucket is full of gradients. Is the model size different from the gradient size? I’m confused. Any help would be greatly appreciated.
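For reference, a minimal sketch of how the parameter/gradient payload can be measured (using a tiny stand-in model, not the actual VGG16 from this post): for dense fp32 training, every parameter tensor has a gradient of the same shape and dtype, so the total gradient payload per step equals the total parameter payload.

```python
import torch
import torch.nn as nn

# A small stand-in model (the real post uses a custom VGG16-BN).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(True),
)

# Gradients have the same shape and dtype as their parameters, so the
# total gradient payload per step equals the total parameter payload.
total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"total parameter/gradient size: {total_bytes / 2**20:.3f} MB")
```

Running the same sum over the full VGG16-BN should come out near the 528 MB figure mentioned above.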

Thank you.

Hi @yuri123, thanks for your question! Could you share a small repro? Specifically, how you set the bucket size at initialization, and which model implementation you’re using?

Hi @aazzolini, Thanks for your reply.
I just set the bucket size like below:

    model = VGGD.vgg16bn().cuda()
    ddp_model = DDP(model, device_ids=[rank], broadcast_buffers=True,
                    bucket_cap_mb=25, gradient_as_bucket_view=True)

And I’m sorry, but I don’t understand what exactly “model implementation” means. ;(

As you can see in the code above, I made the model myself, and it looks like this:

import torch
import torch.nn as nn
from collections import OrderedDict

class vgg16bn(torch.nn.Module):
    def __init__(self):
        super(vgg16bn, self).__init__()
        self.model = nn.Sequential(OrderedDict([
            ("block1_conv1", nn.Conv2d(3, 64, 3, padding=1)),
            ("block1_conv1_batchnorm", nn.BatchNorm2d(64)),
            ("block1_conv1_ReLU", nn.ReLU(True)),
            ("block1_conv2", nn.Conv2d(64, 64, 3, padding=1)),
            # ... (remaining blocks omitted)

To clarify my question: when I set the bucket size to 25 MB, the all_reduce kernel is launched 6 times per step.
(The model size is about 528 MB, so I think all_reduce needs to be executed 528 MB / 25 MB = 21 or 22 times per step… please let me know where I’m wrong ;_; …)
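One thing that may be worth checking here (this is an assumption on my part, not something confirmed in the thread): `bucket_cap_mb` is a soft cap, and DDP does not split a single parameter tensor across buckets, so a layer whose one tensor is larger than the cap still lands in a single oversized bucket. VGG16’s first fully connected layer is the usual suspect; a quick sketch of its size:

```python
import torch.nn as nn

# Hypothetical stand-in for VGG16's first classifier layer (25088 x 4096).
fc = nn.Linear(25088, 4096)

# Size of this one weight tensor in MB. If a single tensor exceeds
# bucket_cap_mb, DDP keeps it in one bucket rather than splitting it,
# so that bucket is much larger than the 25 MB cap.
weight_mb = fc.weight.numel() * fc.weight.element_size() / 2**20
print(f"{weight_mb:.1f} MB")  # 25088 * 4096 * 4 bytes = 392.0 MB
```

A few tensors this large would reduce the bucket count well below 528 / 25, which could also explain the large average all_reduce size.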

Thank you!