PyTorch DDP replica size mismatch error

I am trying to train a GNN using DDP across multi-node GPUs. I am using PyTorch 1.7 with the Gloo backend. I get the error below on machine 2.

-- Process 0 terminated with the following error:
Traceback (most recent call last):
......
  File "/home/karthi/venvv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 532, in _ddp_init_helper
    self.gradient_as_bucket_view)
RuntimeError: replicas[0][0] in this process with sizes [64, 5] appears not to match sizes of the same param in process 0.

But the same code works with single-machine multi-GPU DDP. Please help me.

Hey @karthi0804, could you please share a repro?

This error is from here, indicating that parameter sizes/order do not match across processes. Could you please verify whether that is the case by printing the sizes of the params in model.parameters()?
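A minimal sketch of that check, using a hypothetical stand-in model (the actual GNN is not shown in the thread): print every parameter's name and shape on each machine, then diff the outputs to spot the mismatch.

```python
import torch.nn as nn

# Hypothetical model standing in for the GNN; the first Linear layer
# produces a [64, 5] weight like the one in the error message.
model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))

# Print name and shape of every parameter, in registration order.
# Run this on each machine and compare the outputs line by line.
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
```

If the printed shapes (or their order) differ between machines, DDP will raise exactly this replicas error during initialization.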

cc @Yanli_Zhao

My bad, extremely sorry! Silly mistake: I had the wrong model architecture on the other machine!

Hi, I am also getting this replicas error…
I am using Windows 10, torch 1.7.1, and pytorch-lightning 1.1.7 with 3 GPUs.

Model training was working well with DDP and 2 GPUs on another machine (Windows 10, torch 1.7.1, and pl 1.1.7).

The code crashed after printing the following error message:

self.reducer = dist.Reducer(

RuntimeError: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.

Please help!

My issue was that I failed to update the model on the second machine to the same version as on the first machine. Please check by printing and comparing the model params.

Thanks for checking back @karthi0804. I copied the exact model and all supporting code over from the previous machine, because this new machine was just built from scratch. I did a lot of other tests; the new machine also failed with the same replicas error when tested with two GPUs (by physically disconnecting the third one).

However, after tinkering around and setting pl.Trainer with accelerator='ddp_spawn' instead of 'ddp', it works, even though the PyTorch Lightning docs explicitly warn against using the former. I don't know exactly which limitations of 'ddp_spawn' they warn about; so far I can't observe any with the naked eye. Hopefully we can fix the error with 'ddp' at some point and switch back.

Hi, I encountered this issue when using the following 2D grouped convolution in my code:

nn.Conv2d(n_feat, n_feat, kernel_size=5, padding=2, groups=n_feat)

The 2D grouped convolution worked well in my previous code with DDP mode. After transferring my model to a new framework, it reported the replica mismatch error. I also tested other models with the 2D grouped convolution in the new framework, and they all worked well.
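For reference, with groups=n_feat this layer is a depthwise convolution, so its weight tensor has shape [n_feat, 1, 5, 5] rather than the dense [n_feat, n_feat, 5, 5]. A quick sketch (n_feat=12 is an illustrative value, not from the original post) for checking that both processes construct the layer with the same groups argument:

```python
import torch
import torch.nn as nn

n_feat = 12  # illustrative channel count, not from the original post

# groups=n_feat makes this depthwise: each input channel gets its own
# 5x5 filter, so the weight is [n_feat, 1, 5, 5] instead of the dense
# [n_feat, n_feat, 5, 5]. A groups mismatch between processes would
# change this shape and trigger the replicas error.
conv = nn.Conv2d(n_feat, n_feat, kernel_size=5, padding=2, groups=n_feat)
print(tuple(conv.weight.shape))  # (12, 1, 5, 5)

# padding=2 with kernel_size=5 preserves spatial dimensions.
x = torch.randn(1, n_feat, 32, 32)
print(tuple(conv(x).shape))  # (1, 12, 32, 32)
```

Comparing this printed weight shape across the old and new framework is one way to narrow down whether the new framework alters the layer's configuration.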

The environment is based on the docker image ‘pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime’.

I really need help!