Are there any findings on this?
I later tried with accelerator='ddp_spawn', and the replicas error seemingly disappeared.
However, training with 'ddp_spawn' very easily gets stuck or crashes after a few epochs, with error messages like this:
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 720, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 828, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 850, in backward
    result.closure_loss = self.trainer.accelerator_backend.backward(
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 104, in backward
    model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\pytorch_lightning\core\lightning.py", line 1158, in backward
    loss.backward(*args, **kwargs)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\installed\anaconda3\envs\TorchB\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: bad allocation
Additional info: I am using the 'gloo' backend, with init_method="file:///…".
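In case it helps to see the overall shape of my setup, here is a minimal sketch (the toy model, random data, and device count below are placeholders for illustration, not my actual code); it only shows how the Trainer is configured with accelerator='ddp_spawn', while the 'gloo' backend and file:// init_method mentioned above are the rendezvous settings the spawned processes use:

```python
# Minimal sketch of the script structure; ToyModel, the random data, and
# gpus=2 are placeholders, not my real model/data/device count.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader():
    # Random data standing in for the real dataset.
    x = torch.randn(256, 32)
    y = torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=16, num_workers=0)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=2,                    # placeholder device count
        accelerator="ddp_spawn",   # spawn-based DDP instead of plain 'ddp'
        max_epochs=10,
    )
    trainer.fit(ToyModel(), make_loader())
```

The "bad allocation" crash shown above happens a few epochs into trainer.fit().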