PyTorch distributed elastic: Socket Timeout

Master Node Error:
I can see why the ncclInternalError was happening: the connection to the worker node times out.
I disabled the ufw firewall on both computers, but that doesn't imply there is no other firewall in the way (a quick reachability check is sketched after the traceback below).
I got this error after registering the distributed process:

>> TORCH_DISTRIBUTED_DEBUG=INFO NCCL_SOCKET_IFNAME=enp3s0 python -m torch.distributed.run --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=172.16.96.59 models/EfficientNetb3/AutoEncoder.py --debug
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connect to 172.16.96.60<56713> failed : Connection timed out
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 253472) of binary: /home/windows/miniconda3/envs/ray/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 307.3943808078766 seconds
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
models/EfficientNetb3/AutoEncoder.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-15_20:37:23
  host      : hostssh
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 253472)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
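
The root cause above is the NCCL bootstrap connection to 172.16.96.60<56713> timing out. As a first sanity check (my own sketch, not part of the original logs), a plain TCP connect from the master towards the worker shows whether anything between the machines is dropping traffic; note that 56713 is just the ephemeral port NCCL happened to pick for this run, so any test port you open on the worker works equally well:

# check_reachability.py - minimal sketch, run from the master node
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    # Returns True if a TCP connection to host:port succeeds within timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"connect to {host}:{port} failed: {exc}")
        return False

if __name__ == "__main__":
    # The worker-side port below is hypothetical: NCCL chooses a new
    # ephemeral port on every run, so open a known test port to verify.
    print(can_connect("172.16.96.60", 56713))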

Worker Node:

>> TORCH_DISTRIBUTED_DEBUG=INFO  NCCL_DEBUG=INFO  TORCH_CPP_LOG_LEVEL=INFO  NCCL_SOCKET_IFNAME=enp3s0 python -m torch.distributed.run --nnodes=2 --nproc_per_node=1  --node_rank=1 --master_addr=172.16.96.59  models/EfficientNetb3/AutoEncoder.py --debug
[I debug.cpp:49] [c10d] The debug level is set to INFO.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (172.16.96.59, 29500).
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:172.16.96.59]:29500 on [hostssh68]:34672.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (172.16.96.59, 29500).
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:172.16.96.59]:29500 on [hostssh68]:34678.
[I debug.cpp:49] [c10d] The debug level is set to INFO.
{'max_epochs': 100, 'benchmark': True, 'weights_summary': 'full', 'precision': 16, 'gradient_clip_val': 0.0, 'auto_lr_find': True, 'auto_scale_batch_size': True, 'check_val_every_n_epoch': 1, 'fast_dev_run': False, 'enable_progress_bar': True, 'detect_anomaly': True}
1
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (172.16.96.59, 29500).
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:172.16.96.59]:29500 on [hostssh68]:41222.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (172.16.96.59, 29500).
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:172.16.96.59]:29500 on [hostssh68]:41234.
[I ProcessGroupNCCL.cpp:669] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:835] [Rank 1] NCCL watchdog thread started!
hostssh68:170982:170982 [0] NCCL INFO cudaDriverVersion 11070
hostssh68:170982:170982 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.60<0>
hostssh68:170982:170982 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hostssh68:170982:183832 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hostssh68:170982:183832 [0] NCCL INFO NET/Socket : Using [0]enp3s0:172.16.96.60<0>
hostssh68:170982:183832 [0] NCCL INFO Using network Socket

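To rule out anything specific to models/EfficientNetb3/AutoEncoder.py, a minimal NCCL smoke test can be launched with the same torch.distributed.run arguments on both nodes. This is a sketch under the assumption that torch.distributed.run exports RANK/LOCAL_RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT as usual; the file name nccl_smoke_test.py is mine:

# nccl_smoke_test.py - minimal sketch, not from the original post
# Launch on both nodes, e.g.:
#   NCCL_SOCKET_IFNAME=enp3s0 python -m torch.distributed.run --nnodes=2 \
#     --nproc_per_node=1 --node_rank=<0|1> --master_addr=172.16.96.59 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    # torch.distributed.run sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE,
    # so the default env:// init method picks them up automatically.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce: if this hangs or times out, the problem is the
    # network/firewall setup rather than the training script.
    t = torch.ones(1, device=f"cuda:{local_rank}") * dist.get_rank()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce sum = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()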
Double post from here.