ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)

After setting up a Ray cluster with 2 single-GPU nodes (and also with a direct PyTorch distributed run on the same nodes), my distributed processes get registered and training starts with 2 processes using the nccl backend.

NCCL INFO :

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=423719, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) distributed_backend=nccl
(RayExecutor pid=508760) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) 
(RayExecutor pid=508760) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=508760) NCCL version 2.14.3+cuda11.7

But right after this message I get an ncclInternalError: Internal check failed.

RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, 
repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", 
line 52, in execute
    return fn(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", 
line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1172, in _run
    self.__setup_profiler()
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", 
line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: 
/opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, 
internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)
(ray) windows@hostssh:~/Video-Detection$ nvidia-smi
Tue Mar 14 20:40:29 2023  

I am running on an on-premise cluster without any containerization. The single-GPU code works successfully (with batch size 16), so I now need to do model parallel training.

Could you rerun your code with NCCL_DEBUG=INFO as well as TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=INFO to get more information about the error, please?
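If it is easier than prefixing the command line, the same three variables can also be set from inside the training script. This is only a minimal sketch: the helper name `enable_distributed_debug` is made up for illustration, and it assumes the call happens before the process group (and therefore NCCL) is initialized. With a launcher such as Ray, the variables may also need to reach every worker process and not just the driver, which I have not verified here.

```python
import os

# Variables and values taken from the suggestion above.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",
    "TORCH_CPP_LOG_LEVEL": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
}


def enable_distributed_debug(env=os.environ):
    """Set the debug variables, keeping any value already exported in the shell."""
    for name, value in DEBUG_ENV.items():
        env.setdefault(name, value)
    # Return the effective settings so they can be logged or inspected.
    return {name: env[name] for name in DEBUG_ENV}


# Hypothetical usage: call this at the very top of the training script,
# before torch.distributed is initialized.
```

Using `setdefault` means a value passed on the command line still wins, so the script and the shell do not fight over the log level.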

(ray) windows@hostssh:~/Video-Detection$ NCCL_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO python3 models/EfficientNetb3/AutoEncoder.py
[I debug.cpp:49] [c10d] The debug level is set to INFO.
2023-03-14 22:23:21,750 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.16.96.59:6379...
2023-03-14 22:23:21,753 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8000 
{'max_epochs': 100, 'weights_summary': 'full', 'precision': 16, 'gradient_clip_val': 0.0, 'auto_lr_find': True, 'auto_scale_batch_size': True, 'check_val_every_n_epoch': 1, 'fast_dev_run': False, 'enable_progress_bar': True, 'detect_anomaly': True}
1
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py:48: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(RayExecutor pid=615244) Use get_node_id() instead
(RayExecutor pid=615244)   return ray.get_runtime_context().node_id.hex(), ray.get_gpu_ids()
(RayExecutor pid=427230, ip=172.16.0.2) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py:48: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(RayExecutor pid=427230, ip=172.16.0.2) Use get_node_id() instead
(RayExecutor pid=427230, ip=172.16.0.2)   return ray.get_runtime_context().node_id.hex(), ray.get_gpu_ids()
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=615244)   new_rank_zero_deprecation(
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=615244)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=615244) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=427230, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=615244) NCCL version 2.14.3+cuda11.7
(RayExecutor pid=615244) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=615244) distributed_backend=nccl
(RayExecutor pid=615244) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=615244) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=615244) 
(RayExecutor pid=615244) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/windows/Video-Detection/models/EfficientNetb3/AutoEncoder.py:170 in <module>               │
│                                                                                                  │
│   167 │   │   │   │   │   strategy=strategy                                                      │
│   168 │   │   │   │   │   )                                                                      │
│   169 │                                                                                          │
│ ❱ 170 │   trainer.fit(model, dataset)                                                            │
│   171 │                                                                                          │
│   172 │                                                                                          │
│   173 │   model.encoder.finalize()                                                               │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer │
│ .py:770 in fit                                                                                   │
│                                                                                                  │
│    767 │   │   │   datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn  │
│    768 │   │   """                                                                               │
│    769 │   │   self.strategy.model = model                                                       │
│ ❱  770 │   │   self._call_and_handle_interrupt(                                                  │
│    771 │   │   │   self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_  │
│    772 │   │   )                                                                                 │
│    773                                                                                           │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer │
│ .py:721 in _call_and_handle_interrupt                                                            │
│                                                                                                  │
│    718 │   │   """                                                                               │
│    719 │   │   try:                                                                              │
│    720 │   │   │   if self.strategy.launcher is not None:                                        │
│ ❱  721 │   │   │   │   return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **  │
│    722 │   │   │   else:                                                                         │
│    723 │   │   │   │   return trainer_fn(*args, **kwargs)                                        │
│    724 │   │   # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7  │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launc │
│ her.py:58 in launch                                                                              │
│                                                                                                  │
│    55 │   │   This function is run on the driver process.                                        │
│    56 │   │   """                                                                                │
│    57 │   │   self.setup_workers()                                                               │
│ ❱  58 │   │   ray_output = self.run_function_on_workers(                                         │
│    59 │   │   │   function, *args, trainer=trainer, **kwargs)                                    │
│    60 │   │                                                                                      │
│    61 │   │   if trainer is None:                                                                │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launc │
│ her.py:249 in run_function_on_workers                                                            │
│                                                                                                  │
│   246 │   │                                                                                      │
│   247 │   │   trainer.model = model                                                              │
│   248 │   │                                                                                      │
│ ❱ 249 │   │   results = process_results(self._futures, self.tune_queue)                          │
│   250 │   │   return results[0]                                                                  │
│   251 │                                                                                          │
│   252 │   def _wrapping_function(                                                                │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/util.py:64 in       │
│ process_results                                                                                  │
│                                                                                                  │
│    61 │   │   if queue:                                                                          │
│    62 │   │   │   _handle_queue(queue)                                                           │
│    63 │   │   ready, not_ready = ray.wait(not_ready, timeout=0)                                  │
│ ❱  64 │   │   ray.get(ready)                                                                     │
│    65 │   ray.get(ready)                                                                         │
│    66 │                                                                                          │
│    67 │   if queue:                                                                              │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py: │
│ 105 in wrapper                                                                                   │
│                                                                                                  │
│   102 │   │   │   # we only convert init function if RAY_CLIENT_MODE=1                           │
│   103 │   │   │   if func.__name__ != "init" or is_client_mode_enabled_by_default:               │
│   104 │   │   │   │   return getattr(ray, func.__name__)(*args, **kwargs)                        │
│ ❱ 105 │   │   return func(*args, **kwargs)                                                       │
│   106 │                                                                                          │
│   107 │   return wrapper                                                                         │
│   108                                                                                            │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:2380 in    │
│ get                                                                                              │
│                                                                                                  │
│   2377 │   │   │   │   if isinstance(value, ray.exceptions.ObjectLostError):                     │
│   2378 │   │   │   │   │   worker.core_worker.dump_object_store_memory_usage()                   │
│   2379 │   │   │   │   if isinstance(value, RayTaskError):                                       │
│ ❱ 2380 │   │   │   │   │   raise value.as_instanceof_cause()                                     │
│   2381 │   │   │   │   else:                                                                     │
│   2382 │   │   │   │   │   raise value                                                           │
│   2383                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=615244, ip=172.16.96.59, 
repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f421afc6770>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", 
line 52, in execute
    return fn(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", 
line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1172, in _run
    self.__setup_profiler()
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", 
line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: 
/opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, 
internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

I don't see much change… Do I need to run anything else? Or could you give me any code that verifies multi-node GPU torch NCCL is working correctly?
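For what it's worth, one way to check multi-node NCCL independently of Lightning and Ray is a tiny all-reduce script. The following is a sketch under assumptions, not a definitive test: the names `expected_sum` and `run_check` are hypothetical, and it assumes the script is started by a launcher that sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT` and `LOCAL_RANK` (torchrun does this).

```python
import os


def expected_sum(world_size):
    """Each rank contributes rank + 1, so the reduced value is 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) // 2


def run_check():
    # Imported lazily so the helper above works on machines without PyTorch.
    import torch
    import torch.distributed as dist

    # With no init_method given, rank and rendezvous info come from the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # One GPU per process; LOCAL_RANK selects the device on this node.
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    value = torch.tensor([float(rank + 1)], device="cuda")
    dist.all_reduce(value)  # default reduction is SUM
    ok = int(value.item()) == expected_sum(world_size)
    print(f"rank {rank}/{world_size}: all_reduce -> {value.item()} ({'OK' if ok else 'MISMATCH'})")

    dist.destroy_process_group()
    return ok


# Only run when a launcher has set RANK; importing the file stays side-effect free.
if __name__ == "__main__" and "RANK" in os.environ:
    run_check()
```

If this small script fails with the same "Connect" error, the problem is most likely in the network path between the nodes and not in the training code.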

It seems the communication fails in:

Proxy Call to rank 0 failed (Connect)

which might be related to e.g. this issue.


Can you explain what I need to try for this? It is simply failing to connect. Is there any command or log file I should show you? That would make it easier for you to sort out the problem.
Actually, I tried setting NCCL_SOCKET_IFNAME=enp3s0, which is our lab's local network interface, but that doesn't work.
Is there any code to check whether a port is getting blocked?
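On the port question, a plain TCP probe with the standard library can show whether one node can reach a given port on the other. This is a minimal sketch; `can_connect` and `listen_once` are hypothetical helper names, and the port you test is your own choice. Note that the ports in the NCCL errors in this thread differ between runs (42979, 56713), so one open port does not prove that NCCL's connections will get through; a firewall has to allow traffic between the nodes more generally.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


def listen_once(port, host="0.0.0.0"):
    """Accept a single connection on the port, then return the peer's IP address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, addr = server.accept()  # blocks until the other node connects
        conn.close()
        return addr[0]
```

A possible way to use it: run `listen_once(50000)` on one node, then call `can_connect("<that node's IP>", 50000)` from the other. A `False` result while the listener is running points at something between the two machines (firewall rules, routing, wrong interface).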

I will try to provide you the PyTorch distributed output rather than the Ray output… give me 2 minutes.

While running through the Ray cluster I am getting an extra line in the output:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL 
version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connect to 127.0.0.1<42979> failed : Connection refused

connection refused

Could you please check this?
This is the output log when I run it with torch distributed:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL 
version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connect to 172.16.96.60<56713> failed : Connection timed out
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 253472) of binary: /home/windows/miniconda3/envs/ray/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 307.3943808078766 seconds
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
models/EfficientNetb3/AutoEncoder.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-15_20:37:23
  host      : hostssh
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 253472)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
==========================================================

Sorry, I don't know what needs to change in your setup and why the processes cannot connect, but maybe @kwen2501 might be able to point you towards the issue.

Yeah, thanks for your help. Just an hour ago I found that the cause was iptables… uninstalling it solved the problem. I had already disabled the ufw firewall, but the error continued; only uninstalling iptables resolved it.
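For anyone who would rather inspect the rules than uninstall iptables, a rough sketch follows. It lists the rules with `iptables -S` and picks out the ones that drop or reject traffic. The function names are hypothetical, the parsing is deliberately simple (it only looks for `-P <chain> DROP` policies and `-j DROP` / `-j REJECT` targets), and I have not tested it against the setup in this thread.

```python
import subprocess


def find_blocking_rules(rules_text):
    """Return the lines of `iptables -S` style output that drop or reject traffic."""
    blocking = []
    for line in rules_text.splitlines():
        tokens = line.split()
        # Chain policy, e.g. "-P INPUT DROP".
        is_drop_policy = len(tokens) == 3 and tokens[0] == "-P" and tokens[2] == "DROP"
        # Rule target, e.g. "... -j REJECT ..." or "... -j DROP".
        has_blocking_target = any(
            flag == "-j" and target in ("DROP", "REJECT")
            for flag, target in zip(tokens, tokens[1:])
        )
        if is_drop_policy or has_blocking_target:
            blocking.append(line.strip())
    return blocking


def current_rules():
    """Dump the active rules as text; this usually needs root privileges."""
    result = subprocess.run(["iptables", "-S"], capture_output=True, text=True, check=True)
    return result.stdout
```

Running `find_blocking_rules(current_rules())` on each node gives a short list of candidate rules to review; whether any of them actually affects the traffic between the two machines still has to be judged by hand.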

If you see an NCCL error in the future, it helps to set the environment variable NCCL_DEBUG=INFO to get more logs. It prints the call stack in NCCL that reports the error.

Yeah, but I was actually running distributed through Ray… which didn't show any NCCL-related logs… so I needed to run it as plain PyTorch distributed with NCCL_DEBUG=INFO to get the logs.
Thank you @ptrblck @kwen2501 for your time.

how do you run it as pytorch distributed?
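Not the original poster, but the log earlier in the thread goes through `torch/distributed/run.py`, which is what the `torchrun` command invokes. A minimal sketch of the launch commands for two single-GPU nodes is below. The helper `torchrun_command` is hypothetical, the master address is the rank 0 IP from the logs above, port 29500 is an assumption, and the script must set up plain DDP itself and not launch its workers through Ray.

```python
import shlex


def torchrun_command(node_rank, script, master_addr, nnodes=2, nproc_per_node=1, master_port=29500):
    """Build the torchrun invocation for one node of a static multi-node job."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


# Print one command per node; run each on its own machine.
for rank in range(2):
    cmd = torchrun_command(rank, "models/EfficientNetb3/AutoEncoder.py", "172.16.96.59")
    print(f"node {rank}: {shlex.join(cmd)}")
```

Each node runs its own command with a different `--node_rank`; prefixing the command with `NCCL_DEBUG=INFO` then gives the NCCL logs directly on the terminal.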


Same question, but I encountered this error on a single node…

ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed.
...
Last error: Proxy Call to rank 1 failed (Setup)