I am trying to train a GNN with multi-node, multi-GPU DDP. I am using PyTorch 1.7 with gloo as the backend. I get the error below on machine 2.
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/karthi/venvv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 532, in _ddp_init_helper
RuntimeError: replicas in this process with sizes [64, 5] appears not to match sizes of the same param in process 0.
But the same code works with single-machine multi-GPU DDP. Please help me.
Hey @karthi0804, could you please share code of a repro?
This error is from here, indicating that parameter sizes/order do not match across processes. Could you please verify if that is the case by printing sizes of params in
My bad. Extremely sorry! Silly mistake: the other machine had the wrong model architecture!
Hi, I am also getting this replicas error…
I am using Windows 10, torch 1.7.1, and pytorch-lightning 1.1.7 with 3 GPUs.
The model training was working well with ddp and 2 GPUs on another machine (Win10, torch 1.7.1 and PL 1.1.7).
The code crashed after printing the following error message:
self.reducer = dist.Reducer(
RuntimeError: replicas in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.
My issue was that I had not updated the model on the second machine to the same version as on the first machine. Please check by printing and comparing the model params.
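One way to do that comparison is sketched below. It is only an illustration: the helper names `param_fingerprint` and `first_mismatch` and the sample parameter names are made up for this example, and the shapes are stand-in data rather than anything from the models in this thread. The idea is to reduce the ordered list of (name, shape) pairs to a short hash that is easy to compare between machines, and to report the first entry that differs.

```python
import hashlib


def param_fingerprint(named_shapes):
    """Return a short hash over the ordered (name, shape) pairs."""
    h = hashlib.sha256()
    for name, shape in named_shapes:
        h.update(f"{name}:{tuple(shape)};".encode("utf-8"))
    return h.hexdigest()[:12]


def first_mismatch(shapes_a, shapes_b):
    """Return (index, entry_a, entry_b) for the first difference, or None."""
    a, b = list(shapes_a), list(shapes_b)
    for i in range(max(len(a), len(b))):
        entry_a = a[i] if i < len(a) else None
        entry_b = b[i] if i < len(b) else None
        if entry_a != entry_b:
            return i, entry_a, entry_b
    return None


# Stand-in data (hypothetical names and shapes) for two machines.
machine_1 = [("conv1.weight", (64, 5)), ("conv1.bias", (64,))]
machine_2 = [("conv1.weight", (64, 7)), ("conv1.bias", (64,))]

print(param_fingerprint(machine_1), param_fingerprint(machine_2))
print(first_mismatch(machine_1, machine_2))
```

With a real model, the list can be built on each machine with `[(n, tuple(p.shape)) for n, p in model.named_parameters()]` before the model is wrapped in DDP. If the fingerprints printed on the two machines differ, the model definitions are not the same.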
Thanks for checking back, @karthi0804. I copied the exact model and all supporting code over from the previous machine, because this new machine was just built from scratch. I did a lot of other tests, and the new machine also failed with the same replicas error when tested with two GPUs (by physically disconnecting the third one).
However, after tinkering around and setting pl.Trainer with ‘accelerator=ddp_spawn’ instead of ‘ddp’, it works, even though the PyTorch Lightning docs explicitly warn against using the former. I don’t know exactly what limitations of ‘ddp_spawn’ they warn about; so far I cannot observe one with the naked eye. Hopefully we can fix the error with ‘ddp’ here or there and switch back.