DDP: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes

Hi,

I started getting this replicas error today.
Setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7, 3 GPUs.

Training worked well with ddp and 2 GPUs on another machine (same setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7).

The code crashed after printing the following error message:

self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.

Please help!

This happens if the model parameters are not the same across all replicas in DDP. Have you tried printing the sizes of all the params in the model from each rank (using model.parameters())? That would be the first way to check for mismatched sizes.

Can you also provide your code to repro?
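
A minimal sketch of that check, in case it helps. The helper names `param_fingerprint` and `report` are made up for illustration. The code only assumes the model exposes `named_parameters()` (as any `torch.nn.Module` does) and that each parameter has a `.shape`. It hashes the exact list of names and shapes, so a rounded parameter count cannot hide a difference:

```python
import hashlib


def param_fingerprint(model):
    """Return (shapes, digest) for every named parameter of `model`.

    `model` is anything with a `named_parameters()` method, e.g. a
    torch.nn.Module or a LightningModule.
    """
    shapes = [(name, tuple(p.shape)) for name, p in model.named_parameters()]
    # Hash names and shapes together so two ranks can be compared at a glance.
    digest = hashlib.sha256(repr(shapes).encode("utf-8")).hexdigest()
    return shapes, digest


def report(model, rank):
    """Build a per-rank report: one fingerprint line, then one line per parameter."""
    shapes, digest = param_fingerprint(model)
    lines = [f"[rank {rank}] fingerprint={digest} n_params={len(shapes)}"]
    lines += [f"[rank {rank}] {name}: {shape}" for name, shape in shapes]
    return "\n".join(lines)
```

Call `print(report(model, rank))` in every process right after the model is constructed and before it is handed to DDP (or to `trainer.fit`). How you obtain `rank` depends on your launcher, so that part is left open here. If the fingerprints differ between processes, the per-parameter lines show which tensor has the [12, 6] shape on one rank and a different shape on another.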

Hi, thanks for the quick reply!
I am using pytorch-lightning 1.1.7 on top of torch 1.7.1. I don’t call the torch APIs directly; I use Lightning’s Trainer and trainer.fit.

Under the hood, Lightning did print the model parameter counts, and all three processes show the same numbers (although they are all rounded to the nearest thousand).

The weird thing is that training worked very well on the first machine, which has two GPUs. The problem appeared on the second machine, which has 3 GPUs. Even after I removed one GPU from that machine, training with 2 GPUs there still fails with the replicas error.

This is how the Lightning Trainer is initialized and how fit is called:
"""
import pytorch_lightning as pl

# Inside my training wrapper class:
self.trainer = pl.Trainer(
    max_epochs=configs["max_epochs"],
    gpus=[0, 1],
    accelerator='ddp',
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback, checkpoint_callback],
)

model = …

self.trainer.fit(
    model,
    train_dataloader=self.train_dataloader,
    val_dataloaders=self.val_dataloader,
)
"""

Are there any findings on this?
I later tried accelerator=‘ddp_spawn’, and the replicas error seemingly disappeared.
But training with ‘ddp_spawn’ very easily gets stuck or crashes after a few epochs, with error messages like this:

  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 720, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 828, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 850, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 104, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\core\lightning.py", line 1158, in backward
    loss.backward(*args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: bad allocation

Additional info: I am using the ‘gloo’ backend with init_method=“file:/// …”.
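
For anyone reproducing this file-based setup: the PyTorch distributed docs write a Windows file init method as `file:///d:/tmp/some_file` and warn that a rendezvous file left over from a previous run should not be reused. Below is a small sketch of two helpers for that. The names `remove_stale_file` and `to_file_uri` and the example path are hypothetical, and nothing here is a confirmed cause of the errors above:

```python
import os
from pathlib import PureWindowsPath


def remove_stale_file(path):
    """Delete a leftover rendezvous file; return True if one was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


def to_file_uri(windows_path):
    """Convert a Windows path string into a file:/// URI with forward slashes.

    Note: this does not percent-encode special characters such as spaces.
    """
    return "file:///" + PureWindowsPath(windows_path).as_posix()
```

The idea is to call `remove_stale_file` once, before any process is launched, and to pass the same `to_file_uri(...)` string as `init_method` to every rank.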