DDP: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes

BlockWaving · February 5, 2021, 12:05pm

Hi,

I am getting this replicas error today.
Setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7, with 3 GPUs.

Training the same model works well with ddp and 2 GPUs on another machine (same setup: Windows 10, torch 1.7.1, and pytorch-lightning 1.1.7).

On this machine, the code crashed after printing the following error message:

self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.

Please help!

osalpekar · February 5, 2021, 7:08pm

This happens if the model parameters are not the same across all replicas in DDP. Have you tried printing the sizes of all the params in the model from each rank (using model.parameters())? That would be the first step in confirming whether the sizes actually differ.
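
One way to run that check, as a minimal sketch: hash every parameter's name and shape, in iteration order, and print the digest from each rank. The helper name `param_fingerprint` is made up for illustration and is not a PyTorch or Lightning API. It only needs (name, tensor-like) pairs such as those yielded by `model.named_parameters()`, so it does not import torch itself.

```python
import hashlib


def param_fingerprint(named_params):
    """Return (digest, lines) describing parameter names and shapes in order.

    `named_params` is any iterable of (name, obj) pairs where obj has a
    `.shape` attribute, e.g. model.named_parameters().
    Hypothetical helper for illustration only.
    """
    # One line per parameter, in iteration order, so a reordering changes the hash.
    lines = [f"{name}\t{tuple(p.shape)}" for name, p in named_params]
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return digest, lines


# Usage inside each DDP process (rank comes from whatever launcher you use):
#   digest, lines = param_fingerprint(model.named_parameters())
#   print(f"rank {rank}: {len(lines)} params, fingerprint {digest[:12]}")
```

If the digests differ between ranks, the per-rank `lines` lists can be compared to find the first parameter whose name or shape is different.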

Can you also provide your code to repro?

BlockWaving · February 6, 2021, 2:03am

Hi, thanks for the quick reply!
I am using pytorch-lightning 1.1.7 on top of torch 1.7.1. I don't call the torch APIs directly; I use Lightning's Trainer and trainer.fit.

Under the hood, Lightning did print out the model parameter counts, and all three are the same (they are all rounded to the nearest thousand, though).

The weird thing is that the training worked very well on the first machine, which has two GPUs. The problem happens on the second machine, which has 3 GPUs. Even after I removed 1 GPU from the machine, training with 2 GPUs on this machine still fails with the replicas error.

BlockWaving · February 6, 2021, 2:54am

This is how the Lightning Trainer is initialized and how fit is then called:
"""
self.trainer = pl.Trainer(
    max_epochs=configs["max_epochs"],
    gpus=[0, 1],
    accelerator='ddp',
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback, checkpoint_callback],
)

model = ...

self.trainer.fit(
    model,
    train_dataloader=self.train_dataloader,
    val_dataloaders=self.val_dataloader,
)
"""

BlockWaving · February 11, 2021, 4:09am

Are there any findings on this?
I later tried accelerator='ddp_spawn', and the replicas error seemingly disappeared.
But training with 'ddp_spawn' very easily gets stuck or crashes after a few epochs, with error messages like this:

  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 720, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 828, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 850, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 104, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\core\lightning.py", line 1158, in backward
    loss.backward(*args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: bad allocation

Additional info: I am using the 'gloo' backend and init_method="file:/// …".

huahuanZ · April 29, 2021, 11:59am

PyTorch DDP requires the params on all GPUs to follow the same order, e.g.
GPU0: weight [4,4], bias [4]
GPU1: bias [4], weight [4,4]
This is invalid.
Try printing all the parameter names and their sizes to text files, one per rank, and make sure they follow the same order.

with open(f"params_{args.rank}.txt", "w") as fo:
    # named_parameters() yields (name, param) pairs; parameters() yields only the tensors
    for name, param in model.named_parameters():
        fo.write(f"{name}\t{param.size()}\n")
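
Once each rank has written its file, the dumps can be compared line by line rather than by eye. A rough sketch, assuming the `params_{rank}.txt` naming used above; `first_mismatch` is a made-up helper name, and it only uses the standard library:

```python
from itertools import zip_longest
from pathlib import Path


def first_mismatch(path_a, path_b):
    """Return (line_no, line_a, line_b) for the first differing line, or None.

    Each file is expected to hold one "name<TAB>size" line per parameter.
    If one file is shorter, the missing side is reported as None.
    """
    lines_a = Path(path_a).read_text().splitlines()
    lines_b = Path(path_b).read_text().splitlines()
    for line_no, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return line_no, a, b
    return None


# Example (file names follow the dump snippet above):
#   print(first_mismatch("params_0.txt", "params_1.txt") or "parameter lists match")
```

A mismatch at some line means the two ranks disagree on either the order or the shape of that parameter, which is the condition the replicas error complains about.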