I am trying to train a GNN with DDP across multiple nodes (multi-GPU), using PyTorch 1.7 and gloo as the backend. I get an error like the one below on machine 2.
-- Process 0 terminated with the following error:
Traceback (most recent call last):
......
File "/home/karthi/venvv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 532, in _ddp_init_helper
self.gradient_as_bucket_view)
RuntimeError: replicas[0][0] in this process with sizes [64, 5] appears not to match sizes of the same param in process 0.
But the same code works with single-machine multi-GPU DDP. Please help me.
Hey @karthi0804, could you please share a minimal repro?
This error is raised from here, indicating that parameter sizes or ordering do not match across processes. Could you please verify whether that is the case by printing the sizes of the params in model.parameters()?
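A minimal sketch of that check, using a hypothetical `nn.Linear(5, 64)` stand-in for the real GNN (it happens to produce a `[64, 5]` weight like the one in the error). Run the same loop on every machine and diff the output; DDP requires both the sizes and the order of parameters to match across all ranks:

```python
import torch.nn as nn

# Hypothetical small model standing in for the actual GNN.
model = nn.Linear(5, 64)

# Collect (name, shape) pairs in the same order DDP will see them.
shapes = [(name, tuple(p.shape)) for name, p in model.named_parameters()]
for name, shape in shapes:
    print(name, shape)
```

If the printed lists differ between the two machines (different shapes, missing layers, or a different order), the replica-mismatch error is expected.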
My issue was that I had failed to update the model on the second machine to the same version as on the first machine. Please check this by printing and comparing the model params.
Thanks for checking back @karthi0804. I copied the exact model and all supporting code over from the previous machine, since this new machine was just built from scratch. After a lot of other tests, the new machine still failed with the same replicas error when tested with two GPUs (by physically disconnecting the third one).
However, after tinkering around and setting pl.Trainer with ‘accelerator=ddp_spawn’ instead of ‘ddp’, it works, even though the PyTorch Lightning docs explicitly warn against using the former. I don’t know exactly which limitations of ‘ddp_spawn’ they warn about; so far I can’t observe any with the naked eye. Hopefully we can fix the error with ‘ddp’ at some point and switch back.
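For reference, the workaround described above looks roughly like this, assuming a PyTorch Lightning ~1.x Trainer (in later Lightning versions the same option moved to the `strategy` argument):

```python
import pytorch_lightning as pl

# 'ddp' launches one process per GPU via subprocesses; 'ddp_spawn'
# uses torch.multiprocessing.spawn instead. The spawn variant has
# known caveats (e.g. the model must be picklable), which is why
# the Lightning docs discourage it, but it sidestepped the
# replica-mismatch error in this case.
trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn")
```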
The 2d group convolution worked well in my previous code under DDP. After transferring my model to a new framework, it reported the replica-mismatch error. I also tested other models with 2d group convolutions in the new framework, and they all worked well.
The environment is based on the Docker image ‘pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime’.
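Since the mismatch appeared with a 2d group convolution, one thing worth checking is the `groups` setting on each process: the weight shape of `nn.Conv2d` is `[out_channels, in_channels // groups, kH, kW]`, so a framework port that silently changes `groups` produces differently sized replicas. A small sketch with made-up channel counts:

```python
import torch.nn as nn

# Same in/out channels and kernel size, different group counts.
conv_g1 = nn.Conv2d(8, 16, kernel_size=3, groups=1)
conv_g4 = nn.Conv2d(8, 16, kernel_size=3, groups=4)

print(tuple(conv_g1.weight.shape))  # (16, 8, 3, 3)
print(tuple(conv_g4.weight.shape))  # (16, 2, 3, 3)
```

If one rank builds the layer with `groups=1` and another with `groups=4`, DDP would report exactly this kind of size mismatch even though both models "look" the same at a glance.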