Error with distributed training

Hi all,

I’m attempting to distribute the training of my network with the distributed data parallel strategy. However, I’m getting a runtime error:

RuntimeError: The size of tensor a (3) must match the size of tensor b (0) at non-singleton dimension 1

I found the function that raised the error, _distributed_broadcast_coalesced(self, tensors, buffer_size, authoritative_rank=0), and printed the shapes of the tensors passed into it. This is the output:

torch.Size([30, 240])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([1, 30])
torch.Size([1])
torch.Size([30, 240])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([1, 30])
torch.Size([1])
torch.Size([30, 240])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([30, 30])
torch.Size([30])
torch.Size([1, 30])
torch.Size([1])
torch.Size([1, 1])
torch.Size([1, 16])
torch.Size([1, 1, 1, 1])
torch.Size([1, 1, 1, 1])
torch.Size([1, 1, 4, 1])
torch.Size([1, 1, 1, 8])
torch.Size([3, 3])
torch.Size([3, 3])
torch.Size([0, 3])

I’m not really sure where the error is coming from.

Not sure if you’ve solved this already, but I was running into the same error. The issue for me was a parameter with an empty dimension. It looks like the last tensor in your output, torch.Size([0, 3]), has the same issue. I believe the DDP constructor coalesces tensors in the same order as they are returned by model.named_parameters(), so you could use that to figure out which parameter specifically is empty.
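A minimal sketch of that check, in case it helps. The helper name find_empty_tensors and the Toy module are hypothetical and only there for illustration; the [0, 3] parameter mirrors the last shape in your printout. I also walk named_buffers(), on the assumption that the broadcast list may include buffers as well as parameters, which I have not verified for your PyTorch version.

```python
import torch
from torch import nn


def find_empty_tensors(module: nn.Module):
    """Return (kind, name, shape) for every parameter or buffer with zero elements."""
    empty = []
    for name, param in module.named_parameters():
        if param.numel() == 0:
            empty.append(("parameter", name, tuple(param.shape)))
    # Assumption: buffers may also be broadcast, so check them too.
    for name, buf in module.named_buffers():
        if buf.numel() == 0:
            empty.append(("buffer", name, tuple(buf.shape)))
    return empty


class Toy(nn.Module):
    """Hypothetical stand-in for the real model, with one empty parameter."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(240, 30)
        # Zero-sized first dimension, like the torch.Size([0, 3]) tensor above.
        self.weights = nn.Parameter(torch.zeros(0, 3))
        self.register_buffer("table", torch.zeros(3, 3))


# Lists the name and shape of every empty tensor in the toy model.
print(find_empty_tensors(Toy()))
```

Running the same helper on your actual model before wrapping it in DDP should point to the attribute that ends up with a zero-sized dimension.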

Hi Norman,

Thanks for your response! I haven’t solved this issue, but it does work with PyTorch Lightning’s implementation of the ‘Bagua’ distributed training strategy. In my model, it looks like my MSE loss has 0 parameters. Could this be causing the issue?

  | Name      | Type       | Params
0 | H_network | Sequential | 9.1 K
1 | C_network | Sequential | 9.1 K
2 | S_network | Sequential | 9.1 K
3 | mse       | MSELoss    | 0
4 | nn        | ANIModel   | 27.4 K
5 | model     | Sequential | 27.4 K