Multi-GPU socket address assignment issue and leaked semaphore resource tracker issue

I am facing another problem with my DistributedDataParallel code when running on 8 GPUs; let me share the error below. Could you please tell me what is going wrong? All of the following scripts run perfectly on my single-GPU machine, but not on the high-configuration GPU cluster.
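
For reference, all of these scripts follow the standard mp.spawn + DistributedDataParallel recipe. The snippet below is only a minimal sketch of that pattern, assuming the usual localhost:12345 rendezvous visible in the warnings below and a stand-in Linear model instead of the real GMM/GCN layers; the actual files differ:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size):
    # every worker joins the same process group through a TCP rendezvous on localhost:12345
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    model = torch.nn.Linear(1433, 7).to(rank)                   # stand-in for the real model
    model = DistributedDataParallel(model, device_ids=[rank])   # the line that raises the NCCL error
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()   # 8 on this cluster
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)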

(dl) root@211c0e5f7017:/data# python -m graph.main_gmm_train_eval_multi_gpu
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Let's use 8 GPUs!
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:12345 (errno: 99 - Cannot assign requested address).
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Entered
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
data set information->  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], edge_attr=[10556, 2])
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  1433  out_channels->  16  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
in_channels->  16  out_channels->  7  dim->  2  kernel_size->  8  separate_gaussians->  False  root_weight->  True
Traceback (most recent call last):
  File "/root/miniconda3/envs/dl/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/dl/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/graph/main_gmm_train_eval_multi_gpu.py", line 140, in <module>
    mp.spawn(run, args=(world_size, dataset), nprocs=world_size, join=True)
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/graph/main_gmm_train_eval_multi_gpu.py", line 94, in run
    model = DistributedDataParallel(model, device_ids=[rank])
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/miniconda3/envs/dl/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.

(dl) root@211c0e5f7017:/data# python -m graph.main_gatGmm_train_eval_multi_gpu
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=2, lr=0.01, runs=10, weight_decay=0.0005)
Let's use 8 GPUs!
/root/miniconda3/envs/dl/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
(dl) root@211c0e5f7017:/data# python -m graph.main_gcn_CTI_multi_gpu
Namespace(dataset='ctiRaw', device_idx=0, dropout=0.5, early_stopping=50, epochs=10, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Let's use 8 GPUs!
/root/miniconda3/envs/dl/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
(dl) root@211c0e5f7017:/data# python -m graph.main_gcn_train_eval_multi_gpu
Namespace(dataset='CORA', device_idx=0, dropout=0.5, early_stopping=50, epochs=2, hidden=16, kernel_size=8, lr=0.1, runs=10, weight_decay=0.01)
Let's use 8 GPUs!
/root/miniconda3/envs/dl/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)

Could you try prefixing your run command with export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; NCCL_DEBUG=INFO and re-running? The NCCL_DEBUG=INFO output should show where NCCL's setup is failing.
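
Equivalently, if editing the script is easier than prefixing the shell command, the same variables can be set from Python before mp.spawn is called (a sketch; the spawned workers inherit the parent environment, so NCCL in each worker will pick them up):

import os

# Must run before mp.spawn / init_process_group so every worker inherits these settings.
os.environ['NCCL_IB_DISABLE'] = '1'    # disable the InfiniBand transport
os.environ['NCCL_P2P_DISABLE'] = '1'   # disable GPU peer-to-peer (NVLink/PCIe) transport
os.environ['NCCL_DEBUG'] = 'INFO'      # log NCCL setup details to locate the failure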