loss.backward() error in multi-GPU training on a single node

When I try to train my model on 4 GPUs, I get the error below. I launch the script with
"CUDA_VISIBLE_DEVICES=0,1,2,3 python my_script.py" and the error is raised at "loss.backward()".

Here is the error:
"/opt/conda/envs/Rohit/lib/python3.7/site-packages/torch/nn/modules/module.py:1082: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using non-full backward hooks on a Module that does not return a "
Error executing job with overrides: ['model=default']
Traceback (most recent call last):
  File "experiments/mobileRobot/mobile_robot_hiprssm.py", line 87, in my_app
    exp = Experiment(model_cfg)
  File "experiments/mobileRobot/mobile_robot_hiprssm.py", line 93, in __init__
    self._experiment()
  File "experiments/mobileRobot/mobile_robot_hiprssm.py", line 193, in _experiment
    transformer_learn.train(train_obs, train_act, train_targets, train_task_idx, cfg.learn.epochs, cfg.learn.batch_size, test_obs, test_act,test_targets, test_task_idx)
  File "./learning/transformer_trainer.py", line 583, in train
    batch_size)
  File "./learning/transformer_trainer.py", line 244, in train_step
    loss.backward()
  File "/opt/conda/envs/Rohit/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/opt/conda/envs/Rohit/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
  File "/opt/conda/envs/Rohit/lib/python3.7/site-packages/torch/nn/modules/module.py", line 61, in __call__
    raise RuntimeError("You are trying to call the hook of a dead Module!")
RuntimeError: You are trying to call the hook of a dead Module!
"
I can train exactly the same code on the CPU and on a single GPU without problems, i.e.,
"CUDA_VISIBLE_DEVICES=0 python my_script.py" works fine.

As references for data parallelism, I used "Optional: Data Parallelism" (PyTorch Tutorials 1.13.1+cu117 documentation) and "Multi-GPU Training in Pytorch: Data and Model Parallelism" (Glass Box).

This is a bit difficult to debug without any visible code. Could you post a minimal version of your script (e.g., with data loading replaced with randomly generated values) that reproduces your issue?
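
For example, a skeleton along these lines is usually enough to start from. It is only a sketch: `TinyModel`, `run_one_step`, the tensor shapes and the optimizer are placeholders, not taken from your code. The optional `register_backward_hook` call is a guess based on the "non-full backward hooks" warning in your log, so please replace it with whatever hook registration your model or a library you use actually performs.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the real model."""

    def __init__(self, in_dim: int = 8, out_dim: int = 2) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 16),
            nn.ReLU(),
            nn.Linear(16, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def run_one_step(use_hook: bool = False) -> float:
    """Run one forward/backward/optimizer step on random data and return the loss."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = TinyModel()

    if use_hook:
        # Assumption: the original code registers a non-full backward hook
        # somewhere (the UserWarning in the log points in that direction).
        model.net.register_backward_hook(lambda module, grad_in, grad_out: None)

    # DataParallel uses every GPU listed in CUDA_VISIBLE_DEVICES; with zero or
    # one visible device it simply calls the wrapped module directly.
    model = nn.DataParallel(model).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Random tensors replace the real data loading.
    x = torch.randn(32, 8, device=device)
    y = torch.randn(32, 2, device=device)

    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    print("visible GPUs:", torch.cuda.device_count())
    print("loss without hook:", run_one_step(use_hook=False))
    print("loss with hook:", run_one_step(use_hook=True))
```

Running it once with "CUDA_VISIBLE_DEVICES=0" and once with "CUDA_VISIBLE_DEVICES=0,1,2,3" should show whether the error depends on the hook, on the multi-GPU wrapper, or on something specific to your model.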