ncclInternalError: Internal check failed. Proxy Call to rank 0 failed (Connect)

NavinKumarMNK · March 14, 2023, 3:25pm

After setting up a Ray cluster with 2 single-GPU nodes, and also with a direct PyTorch distributed run … on the same nodes, my distributed processes get registered, starting with 2 processes with the nccl backend.

NCCL INFO :

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=423719, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) distributed_backend=nccl
(RayExecutor pid=508760) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=508760) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=508760) 
(RayExecutor pid=508760) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=508760) hostssh:508760:508760 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=508760) NCCL version 2.14.3+cuda11.7

But right after this message I get an ncclInternalError: Internal check failed

RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=508760, ip=172.16.96.59, 
repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7fa16a4327d0>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", 
line 52, in execute
    return fn(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", 
line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1172, in _run
    self.__setup_profiler()
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", 
line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: 
/opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, 
internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)
(ray) windows@hostssh:~/Video-Detection$ nvidia-smi
Tue Mar 14 20:40:29 2023

I am running on an on-premise cluster without any containerization, and the single-GPU code works successfully (with a batch size of 16), so now I need to do model parallel.

ptrblck · March 14, 2023, 4:50pm

Could you rerun your code with NCCL_DEBUG=INFO as well as TORCH_CPP_LOG_LEVEL=INFO and TORCH_DISTRIBUTED_DEBUG=INFO to get more information about the error, please?

NavinKumarMNK · March 14, 2023, 4:54pm

(ray) windows@hostssh:~/Video-Detection$ NCCL_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO python3 models/EfficientNetb3/AutoEncoder.py
[I debug.cpp:49] [c10d] The debug level is set to INFO.
2023-03-14 22:23:21,750 INFO worker.py:1364 -- Connecting to existing Ray cluster at address: 172.16.96.59:6379...
2023-03-14 22:23:21,753 INFO worker.py:1544 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8000 
{'max_epochs': 100, 'weights_summary': 'full', 'precision': 16, 'gradient_clip_val': 0.0, 'auto_lr_find': True, 'auto_scale_batch_size': True, 'check_val_every_n_epoch': 1, 'fast_dev_run': False, 'enable_progress_bar': True, 'detect_anomaly': True}
1
Using 16bit native Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py:48: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(RayExecutor pid=615244) Use get_node_id() instead
(RayExecutor pid=615244)   return ray.get_runtime_context().node_id.hex(), ray.get_gpu_ids()
(RayExecutor pid=427230, ip=172.16.0.2) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py:48: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
(RayExecutor pid=427230, ip=172.16.0.2) Use get_node_id() instead
(RayExecutor pid=427230, ip=172.16.0.2)   return ray.get_runtime_context().node_id.hex(), ray.get_gpu_ids()
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=615244)   new_rank_zero_deprecation(
(RayExecutor pid=615244) /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=615244)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=615244) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=427230, ip=172.16.0.2) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO Bootstrap : Using enp3s0:172.16.96.59<0>
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
(RayExecutor pid=615244) hostssh:615244:615244 [0] NCCL INFO cudaDriverVersion 11070
(RayExecutor pid=615244) NCCL version 2.14.3+cuda11.7
(RayExecutor pid=615244) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=615244) distributed_backend=nccl
(RayExecutor pid=615244) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=615244) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=615244) 
(RayExecutor pid=615244) GPU available: True (cuda), used: True (Please ignore the previous info [GPU used: False]).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/windows/Video-Detection/models/EfficientNetb3/AutoEncoder.py:170 in <module>               │
│                                                                                                  │
│   167 │   │   │   │   │   strategy=strategy                                                      │
│   168 │   │   │   │   │   )                                                                      │
│   169 │                                                                                          │
│ ❱ 170 │   trainer.fit(model, dataset)                                                            │
│   171 │                                                                                          │
│   172 │                                                                                          │
│   173 │   model.encoder.finalize()                                                               │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer │
│ .py:770 in fit                                                                                   │
│                                                                                                  │
│    767 │   │   │   datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.Lightn  │
│    768 │   │   """                                                                               │
│    769 │   │   self.strategy.model = model                                                       │
│ ❱  770 │   │   self._call_and_handle_interrupt(                                                  │
│    771 │   │   │   self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_  │
│    772 │   │   )                                                                                 │
│    773                                                                                           │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer │
│ .py:721 in _call_and_handle_interrupt                                                            │
│                                                                                                  │
│    718 │   │   """                                                                               │
│    719 │   │   try:                                                                              │
│    720 │   │   │   if self.strategy.launcher is not None:                                        │
│ ❱  721 │   │   │   │   return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **  │
│    722 │   │   │   else:                                                                         │
│    723 │   │   │   │   return trainer_fn(*args, **kwargs)                                        │
│    724 │   │   # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7  │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launc │
│ her.py:58 in launch                                                                              │
│                                                                                                  │
│    55 │   │   This function is run on the driver process.                                        │
│    56 │   │   """                                                                                │
│    57 │   │   self.setup_workers()                                                               │
│ ❱  58 │   │   ray_output = self.run_function_on_workers(                                         │
│    59 │   │   │   function, *args, trainer=trainer, **kwargs)                                    │
│    60 │   │                                                                                      │
│    61 │   │   if trainer is None:                                                                │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launc │
│ her.py:249 in run_function_on_workers                                                            │
│                                                                                                  │
│   246 │   │                                                                                      │
│   247 │   │   trainer.model = model                                                              │
│   248 │   │                                                                                      │
│ ❱ 249 │   │   results = process_results(self._futures, self.tune_queue)                          │
│   250 │   │   return results[0]                                                                  │
│   251 │                                                                                          │
│   252 │   def _wrapping_function(                                                                │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/util.py:64 in       │
│ process_results                                                                                  │
│                                                                                                  │
│    61 │   │   if queue:                                                                          │
│    62 │   │   │   _handle_queue(queue)                                                           │
│    63 │   │   ready, not_ready = ray.wait(not_ready, timeout=0)                                  │
│ ❱  64 │   │   ray.get(ready)                                                                     │
│    65 │   ray.get(ready)                                                                         │
│    66 │                                                                                          │
│    67 │   if queue:                                                                              │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py: │
│ 105 in wrapper                                                                                   │
│                                                                                                  │
│   102 │   │   │   # we only convert init function if RAY_CLIENT_MODE=1                           │
│   103 │   │   │   if func.__name__ != "init" or is_client_mode_enabled_by_default:               │
│   104 │   │   │   │   return getattr(ray, func.__name__)(*args, **kwargs)                        │
│ ❱ 105 │   │   return func(*args, **kwargs)                                                       │
│   106 │                                                                                          │
│   107 │   return wrapper                                                                         │
│   108                                                                                            │
│                                                                                                  │
│ /home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray/_private/worker.py:2380 in    │
│ get                                                                                              │
│                                                                                                  │
│   2377 │   │   │   │   if isinstance(value, ray.exceptions.ObjectLostError):                     │
│   2378 │   │   │   │   │   worker.core_worker.dump_object_store_memory_usage()                   │
│   2379 │   │   │   │   if isinstance(value, RayTaskError):                                       │
│ ❱ 2380 │   │   │   │   │   raise value.as_instanceof_cause()                                     │
│   2381 │   │   │   │   else:                                                                     │
│   2382 │   │   │   │   │   raise value                                                           │
│   2383                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=615244, ip=172.16.96.59, 
repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f421afc6770>)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/utils.py", 
line 52, in execute
    return fn(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/ray_lightning/launchers/ray_launcher.py", 
line 301, in _wrapping_function
    results = function(*args, **kwargs)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1172, in _run
    self.__setup_profiler()
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 
2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp_spawn.py", 
line 215, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
2084, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File 
"/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line
1400, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: 
/opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, 
internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Proxy Call to rank 0 failed (Connect)

I don’t see many changes… Do I need to run anything else? Or could you give me any code that ensures multi-node GPU torch NCCL works correctly?

ptrblck · March 14, 2023, 4:59pm

It seems the communication fails in:

Proxy Call to rank 0 failed (Connect)

which might be related to e.g. this issue.

NavinKumarMNK · March 14, 2023, 5:05pm

Can you explain what I need to try for this? It is simply failing to connect. Is there any command or log file I should show you? That would make it easier for you to sort out the problem.
Actually, I tried setting NCCL_SOCKET_IFNAME=enp3s0, which is our lab’s local network interface, but that doesn’t work.
Is there any code to check whether a port is getting blocked?

I will try to provide you the PyTorch distributed output rather than Ray’s … give me 2 minutes.
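
On the question of checking whether a port is blocked: a minimal sketch using only the Python standard library, assuming you can run a small script on both nodes. The helper names `can_connect` and `open_listener` are hypothetical, not part of PyTorch or NCCL.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers both "connection refused" and "connection timed out".
        return False


def open_listener(port=0, bind_addr="0.0.0.0"):
    """Open a listening TCP socket; port=0 lets the OS pick a free port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((bind_addr, port))
    server.listen(1)
    return server
```

One way to use it: call `open_listener(29500)` on one node and keep the process alive, then call `can_connect("<other-node-ip>", 29500)` from the other node (29500 is only an example port). Note that the NCCL logs in this thread show connections to high, changing ports (42979, 56713), so a single open port does not prove that all the ports NCCL may pick are reachable.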

NavinKumarMNK · March 14, 2023, 6:17pm

While running through the Ray cluster I am getting an extra line in the output:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL 
version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connect to 127.0.0.1<42979> failed : Connection refused

connection refused
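
The address in that log line is 127.0.0.1, which another node cannot reach. One possible cause, not confirmed anywhere in this thread, is that the machine's hostname resolves to a loopback address. A minimal sketch for checking that on each node; the function names are hypothetical.

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True for loopback addresses such as 127.0.0.1 or ::1."""
    return ipaddress.ip_address(address).is_loopback


def resolved_host_address():
    """Return the IPv4 address that this machine's hostname resolves to."""
    return socket.gethostbyname(socket.gethostname())


def report():
    """Print the resolved address and flag it if it is a loopback address."""
    address = resolved_host_address()
    note = " (loopback, unreachable from other nodes)" if is_loopback(address) else ""
    print(f"{socket.gethostname()} -> {address}{note}")
```

Calling `report()` on both nodes shows which address each hostname maps to. Treat this as a side check only, since the failure in this thread was eventually traced to firewall rules.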

NavinKumarMNK · March 15, 2023, 3:34pm

Sir, could you please check this?
This is the output log when I run with torch distributed:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525541990/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL 
version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connect to 172.16.96.60<56713> failed : Connection timed out
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 253472) of binary: /home/windows/miniconda3/envs/ray/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 307.3943808078766 seconds
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/windows/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
models/EfficientNetb3/AutoEncoder.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-15_20:37:23
  host      : hostssh
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 253472)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
==========================================================

ptrblck · March 16, 2023, 5:25am

Sorry, I don’t know what needs to change in your setup and why the processes cannot connect, but maybe @kwen2501 might be able to point towards the issue.

NavinKumarMNK · March 16, 2023, 5:27am

Yeah, thanks for your help. Just an hour ago I found that the reason is iptables… uninstalling it solved the problem. I had actually disabled the ufw firewall, but the error continued even then; only uninstalling iptables solved it.
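
A less drastic alternative to uninstalling iptables may be to look at which rules are dropping or rejecting traffic. The sketch below assumes the `iptables` binary is on the PATH and that the script has enough privileges to run `iptables -S`; `dump_rules` and `blocking_rules` are hypothetical helper names.

```python
import subprocess


def dump_rules():
    """Return the active rules as text; usually needs root privileges."""
    result = subprocess.run(
        ["iptables", "-S"], capture_output=True, text=True, check=True
    )
    return result.stdout


def blocking_rules(rules_text):
    """Return the rule lines whose policy or target is DROP or REJECT."""
    hits = []
    for line in rules_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "-P" and tokens[-1] == "DROP":
            # Default chain policy, e.g. "-P INPUT DROP".
            hits.append(line)
        elif "-j" in tokens:
            target_index = tokens.index("-j") + 1
            if target_index < len(tokens) and tokens[target_index] in ("DROP", "REJECT"):
                hits.append(line)
    return hits
```

Running `blocking_rules(dump_rules())` on each node lists candidate rules to review. Whether any of them is the one that blocked NCCL here is not something the thread confirms.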

kwen2501 · March 16, 2023, 4:33pm

If you see an NCCL error in the future, it would be helpful to set the environment variable NCCL_DEBUG=INFO to see more logs. It would print out the call stack in NCCL that reports this error.

NavinKumarMNK · March 16, 2023, 10:20pm

Yeah, but I was actually running distributed using Ray… which didn’t show any NCCL-related logs… so I needed to run it as plain PyTorch distributed with NCCL_DEBUG=INFO to get the logs.
Thank you @ptrblck @kwen2501 for your time.
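
For anyone who wants to stay on Ray: one possible reason for missing NCCL logs, not confirmed in this thread, is that variables exported in the driver's shell do not necessarily reach the Ray worker processes. A minimal sketch that forwards the debug variables through Ray's `runtime_env`; the helper names are hypothetical and `address="auto"` assumes an already running cluster.

```python
# Debug variables mentioned in this thread, forwarded to every Ray worker.
NCCL_DEBUG_VARS = {
    "NCCL_DEBUG": "INFO",
    "TORCH_CPP_LOG_LEVEL": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
}


def build_runtime_env(extra_vars=None):
    """Return a Ray runtime_env dict that exports the debug variables."""
    env_vars = dict(NCCL_DEBUG_VARS)
    env_vars.update(extra_vars or {})
    return {"env_vars": env_vars}


def connect_with_debug_logging(address="auto"):
    """Connect to an existing Ray cluster with the debug variables set."""
    # Imported lazily so build_runtime_env works without Ray installed.
    import ray

    return ray.init(address=address, runtime_env=build_runtime_env())
```

Extra variables can be passed the same way, for example `build_runtime_env({"NCCL_SOCKET_IFNAME": "enp3s0"})` with the interface name from the earlier post.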

zhangjunyi · August 19, 2023, 6:18am

How do you run it as PyTorch distributed?
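
The earlier log shows `torch/distributed/run.py`, which is the module behind `torchrun`, so the plain PyTorch run was presumably launched that way. The original command is not shown in the thread, so the following is only a minimal sketch of a standalone NCCL check script, assuming one GPU per node; the file name `check_nccl.py` is hypothetical.

```python
import os


def rank_info(env=os.environ):
    """Read the variables that torchrun exports for each worker process."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def main():
    # Imported lazily so rank_info can be used without a GPU install.
    import torch
    import torch.distributed as dist

    info = rank_info()
    torch.cuda.set_device(info["local_rank"])
    # The default env:// rendezvous reads MASTER_ADDR and MASTER_PORT,
    # which torchrun also sets.
    dist.init_process_group(backend="nccl")
    tensor = torch.ones(1, device="cuda") * (info["rank"] + 1)
    dist.all_reduce(tensor)  # default reduction is SUM
    print(f"rank {info['rank']}/{info['world_size']}: sum = {tensor.item()}")
    dist.destroy_process_group()


# Only start when launched by torchrun, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

On each node it could be launched with something like `NCCL_DEBUG=INFO torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> --master_addr=<ip of node 0> --master_port=29500 check_nccl.py`, where the address and port are placeholders. If the `all_reduce` completes on both ranks, NCCL can communicate between the nodes.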

Jimskns · February 1, 2024, 11:25am

Same question, but I encountered this error on a single node…

ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed.
...
Last error: Proxy Call to rank 1 failed (Setup)