NCCL error when running distributed training

My code used to work in PyTorch 1.6, but I recently upgraded to 1.9. When I try to train in distributed mode (on a single PC with 2 GPUs, not multiple machines), the following error happens. Sorry for the long log; I've never seen this before and I'm totally lost.

$ python -m torch.distributed.run --standalone --nnodes=1 --nproc_per_node=2 train_dist_2.py
[INFO] 2021-08-13 18:21:14,035 run: Running torch.distributed.run with args: ['/usr/lib/python3.9/site-packages/torch/distributed/run.py', '--standalone', '--nnodes=1', '--nproc_per_node=2', 'train_dist_2.py']
[INFO] 2021-08-13 18:21:14,036 run:


Rendezvous info:
--rdzv_backend=c10d --rdzv_endpoint=localhost:29400 --rdzv_id=5c6a0ec7-2728-407d-8d25-7dde979518e6


[INFO] 2021-08-13 18:21:14,036 run: Using nproc_per_node=2.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO] 2021-08-13 18:21:14,036 api: Starting elastic_operator with launch configs:
entrypoint : train_dist_2.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : 5c6a0ec7-2728-407d-8d25-7dde979518e6
rdzv_backend : c10d
rdzv_endpoint : localhost:29400
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

[INFO] 2021-08-13 18:21:14,059 c10d_rendezvous_backend: Process 25097 hosts the TCP store for the C10d rendezvous backend.
[INFO] 2021-08-13 18:21:14,060 local_elastic_agent: log directory set to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog
[INFO] 2021-08-13 18:21:14,060 api: [default] starting workers for entrypoint: python
[INFO] 2021-08-13 18:21:14,060 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:14,060 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:14,277 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 0 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:21:14,278 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=cnn
master_port=36965
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:14,278 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:14,278 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/0/error.json
[INFO] 2021-08-13 18:21:14,278 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.793745 25104 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.793756 25154 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.801612 25103 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.801617 25157 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:21:24.402045 25154 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:24.403731 25157 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:29,303 api: failed (exitcode: 1) local_rank: 0 (pid: 25103) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:29,303 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:29,303 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:29,303 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:29,422 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 1 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:29,423 api: [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=cnn
master_port=53181
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:29,423 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:29,423 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/0/error.json
[INFO] 2021-08-13 18:21:29,423 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.944895 25196 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.944903 25245 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.954780 25195 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.954794 25248 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:21:39.523105 25248 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
I20210813 18:21:39.523314 25245 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:44,447 api: failed (exitcode: 1) local_rank: 0 (pid: 25195) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:44,447 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:44,447 api: [default] Worker group FAILED. 2/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:44,447 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:44,448 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 2 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:44,449 api: [default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=cnn
master_port=35757
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:44,449 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:44,449 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/0/error.json
[INFO] 2021-08-13 18:21:44,449 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.939616 25287 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.939630 25330 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.949156 25286 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.949156 25333 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:21:53.513854 25330 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:53.513996 25333 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:54,472 api: failed (exitcode: 1) local_rank: 0 (pid: 25286) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:54,472 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:54,472 api: [default] Worker group FAILED. 1/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:54,472 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:54,473 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 3 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:54,474 api: [default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=cnn
master_port=44399
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:54,474 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:54,474 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/0/error.json
[INFO] 2021-08-13 18:21:54,474 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.975812 25356 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:22:03.975847 25408 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.977819 25411 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
I20210813 18:22:03.977841 25355 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:22:04.551221 25408 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:22:04.554879 25411 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:22:09,499 api: failed (exitcode: 1) local_rank: 0 (pid: 25355) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:22:09,499 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:22:09,500 api: Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:22:09,500 api: Done waiting for other agents. Elapsed: 0.00022149085998535156 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 0, "group_rank": 0, "worker_id": "25355", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 1, "group_rank": 0, "worker_id": "25356", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cnn", "state": "SUCCEEDED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
[INFO] 2021-08-13 18:22:09,501 dynamic_rendezvous: The node 'cnn_25097_0' has closed the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

           CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 25355 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


     train_dist_2.py FAILED        

=======================================
Root Cause:
[0]:
time: 2021-08-13_18:22:09
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 25355)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-08-13_18:22:09
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 25356)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Any help is appreciated

You could rerun the script with export NCCL_DEBUG=INFO and check the logs for NCCL errors.
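If it helps, the same can be done from inside the script before the process group is created (just a sketch; exporting NCCL_DEBUG=INFO in the shell before the python -m torch.distributed.run command works equally well):

# sketch: enable verbose NCCL logging before any NCCL communicator is created
import os
os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL init/transport info to stderr
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: include all NCCL subsystems

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # NCCL reads the env vars when the communicator is created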

Hi,

I'm using a cluster with multiple nodes that have different GPUs (2080, Quadro, Tesla, …), and I get an error only on the nodes with Quadro cards.
Here is the output with NCCL_DEBUG=INFO for a node without problems:

| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
compute-05:2612510:2612510 [0] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612510:2612510 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612510:2612510 [0] NCCL INFO NET/IB : No device found.
compute-05:2612510:2612510 [0] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612510:2612510 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
compute-05:2612511:2612511 [1] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612513:2612513 [3] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612513:2612513 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612512:2612512 [2] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO NET/IB : No device found.
compute-05:2612513:2612513 [3] NCCL INFO NET/IB : No device found.
compute-05:2612512:2612512 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612511:2612511 [1] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO Using network Socket
compute-05:2612512:2612512 [2] NCCL INFO NET/IB : No device found.
compute-05:2612513:2612513 [3] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612513:2612513 [3] NCCL INFO Using network Socket
compute-05:2612512:2612512 [2] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612512:2612512 [2] NCCL INFO Using network Socket
compute-05:2612510:2612567 [0] NCCL INFO Channel 00/02 :    0   1   2   3
compute-05:2612511:2612568 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
compute-05:2612510:2612567 [0] NCCL INFO Channel 01/02 :    0   1   2   3
compute-05:2612510:2612567 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
compute-05:2612511:2612568 [1] NCCL INFO Setting affinity for GPU 1 to 070007
compute-05:2612510:2612567 [0] NCCL INFO Setting affinity for GPU 0 to 070007
compute-05:2612512:2612570 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
compute-05:2612513:2612569 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
compute-05:2612511:2612568 [1] NCCL INFO Channel 00 : 1[3000] -> 2[81000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 00 : 2[81000] -> 3[82000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Channel 00 : 0[2000] -> 1[3000] via direct shared memory
compute-05:2612511:2612568 [1] NCCL INFO Channel 01 : 1[3000] -> 2[81000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 01 : 2[81000] -> 3[82000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Channel 01 : 0[2000] -> 1[3000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 00 : 3[82000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 01 : 3[82000] -> 0[2000] via direct shared memory
compute-05:2612511:2612568 [1] NCCL INFO Connected all rings
compute-05:2612510:2612567 [0] NCCL INFO Connected all rings
compute-05:2612512:2612570 [2] NCCL INFO Connected all rings
compute-05:2612511:2612568 [1] NCCL INFO Channel 00 : 1[3000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Connected all rings
compute-05:2612511:2612568 [1] NCCL INFO Channel 01 : 1[3000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 00 : 3[82000] -> 2[81000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 01 : 3[82000] -> 2[81000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Connected all trees
compute-05:2612510:2612567 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612510:2612567 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO Channel 00 : 2[81000] -> 1[3000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 01 : 2[81000] -> 1[3000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Connected all trees
compute-05:2612513:2612569 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612513:2612569 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612511:2612568 [1] NCCL INFO Connected all trees
compute-05:2612511:2612568 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612511:2612568 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO Connected all trees
compute-05:2612512:2612570 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612512:2612570 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO comm 0x7f3054002fb0 rank 2 nranks 4 cudaDev 2 busId 81000 - Init COMPLETE
compute-05:2612513:2612569 [3] NCCL INFO comm 0x7f686c002fb0 rank 3 nranks 4 cudaDev 3 busId 82000 - Init COMPLETE
compute-05:2612511:2612568 [1] NCCL INFO comm 0x7f94fc002fb0 rank 1 nranks 4 cudaDev 1 busId 3000 - Init COMPLETE
compute-05:2612510:2612567 [0] NCCL INFO comm 0x7f8410002fb0 rank 0 nranks 4 cudaDev 0 busId 2000 - Init COMPLETE
compute-05:2612510:2612510 [0] NCCL INFO Launch mode Parallel

and the output from the Quadro node:

| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
| distributed init (rank 3): env://
compute-11:1358800:1358800 [0] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358800:1358800 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358800:1358800 [0] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358800:1358800 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
compute-11:1358801:1358801 [1] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358801:1358801 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358803:1358803 [3] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358803:1358803 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358801:1358801 [1] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358801:1358801 [1] NCCL INFO Using network IB
compute-11:1358802:1358802 [2] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO Using network IB
compute-11:1358803:1358803 [3] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358803:1358803 [3] NCCL INFO Using network IB
compute-11:1358800:1358860 [0] NCCL INFO Channel 00/04 :    0   1   2   3
compute-11:1358800:1358860 [0] NCCL INFO Channel 01/04 :    0   3   2   1
compute-11:1358800:1358860 [0] NCCL INFO Channel 02/04 :    0   1   2   3
compute-11:1358800:1358860 [0] NCCL INFO Channel 03/04 :    0   3   2   1
compute-11:1358801:1358862 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
compute-11:1358800:1358860 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
compute-11:1358801:1358862 [1] NCCL INFO Setting affinity for GPU 1 to 07,00000007
compute-11:1358800:1358860 [0] NCCL INFO Setting affinity for GPU 0 to 07,00000007
compute-11:1358802:1358865 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
compute-11:1358802:1358865 [2] NCCL INFO Setting affinity for GPU 2 to 07,00000007
compute-11:1358803:1358866 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
compute-11:1358803:1358866 [3] NCCL INFO Setting affinity for GPU 3 to 07,00000007
compute-11:1358800:1358860 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1c000] via P2P/IPC
compute-11:1358801:1358862 [1] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] via P2P/IPC
compute-11:1358802:1358865 [2] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via P2P/IPC
compute-11:1358803:1358866 [3] NCCL INFO Channel 00 : 3[1e000] -> 0[1a000] via P2P/IPC
compute-11:1358800:1358860 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[1c000] via P2P/IPC
compute-11:1358801:1358862 [1] NCCL INFO Channel 02 : 1[1c000] -> 2[1d000] via P2P/IPC
compute-11:1358802:1358865 [2] NCCL INFO Channel 02 : 2[1d000] -> 3[1e000] via P2P/IPC

compute-11:1358801:1358862 [1] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358801:1358862 [1] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358801:1358862 [1] NCCL INFO transport.cc:111 -> 1
compute-11:1358801:1358862 [1] NCCL INFO init.cc:778 -> 1
compute-11:1358801:1358862 [1] NCCL INFO init.cc:904 -> 1
compute-11:1358801:1358862 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
compute-11:1358803:1358866 [3] NCCL INFO Channel 02 : 3[1e000] -> 0[1a000] via P2P/IPC

compute-11:1358802:1358865 [2] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358802:1358865 [2] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358802:1358865 [2] NCCL INFO transport.cc:111 -> 1
compute-11:1358802:1358865 [2] NCCL INFO init.cc:778 -> 1
compute-11:1358802:1358865 [2] NCCL INFO init.cc:904 -> 1
compute-11:1358802:1358865 [2] NCCL INFO group.cc:72 -> 1 [Async thread]

compute-11:1358803:1358866 [3] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358803:1358866 [3] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358803:1358866 [3] NCCL INFO transport.cc:111 -> 1
compute-11:1358803:1358866 [3] NCCL INFO init.cc:778 -> 1
compute-11:1358803:1358866 [3] NCCL INFO init.cc:904 -> 1
compute-11:1358803:1358866 [3] NCCL INFO group.cc:72 -> 1 [Async thread]

compute-11:1358800:1358860 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358800:1358860 [0] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358800:1358860 [0] NCCL INFO transport.cc:111 -> 1
compute-11:1358800:1358860 [0] NCCL INFO init.cc:778 -> 1
compute-11:1358800:1358860 [0] NCCL INFO init.cc:904 -> 1
compute-11:1358800:1358860 [0] NCCL INFO group.cc:72 -> 1 [Async thread]

and my error is:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180487213/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3 ncclUnhandledCudaError: Call to CUDA function failed.
Any idea why the behavior differs between the nodes and how to solve it?

This could point towards a driver issue on your machine, so try to update the driver to match your CUDA toolkit as described here.
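If you want to double-check what each node actually reports, something like this (just a sketch; run it on both a healthy and a problematic node) would show the PyTorch build, CUDA/cuDNN versions, and the driver:

import subprocess
import torch

# what PyTorch was built against and what the node exposes
print("torch:", torch.__version__)
print("CUDA runtime (PyTorch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU 0:", torch.cuda.get_device_name(0))

# driver version as reported by nvidia-smi
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("driver:", driver)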

The driver version and CUDA version are the same on all nodes, so I don't think that's where the problem comes from. Have you heard of anything directly related to Quadro cards?

No, I’m not aware of Quadro-specific issues.

This sounds a bit concerning as you are pointing to identical system setups where one node apparently crashes?
Are the NCCL tests running fine on all nodes (in particular the problematic and a healthy one)?
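As a lightweight alternative to the official nccl-tests binaries, a minimal PyTorch-level smoke test could look like this (a sketch; the filename and launch command are only an example):

# nccl_smoke_test.py  (hypothetical filename)
# launch: python -m torch.distributed.run --standalone --nproc_per_node=<num_gpus> nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torch.distributed.run
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # every rank contributes 1.0; after all_reduce the value should equal the world size
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()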

From the outputs, the first difference happens here:

for the healthy node and here:

for the problematic one.
From my understanding, the problematic node hits a problem in transport/p2p.cc, whereas that path isn't needed on the healthy node, which uses direct shared memory.
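Since the failing path goes through transport/p2p.cc, one extra check that might help (a sketch, not something from the original posts) is whether the CUDA driver on that node reports peer access between the GPUs at all:

import torch

# query CUDA peer-to-peer capability between every GPU pair on this node
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access = {ok}")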

I will run the tests and get back with the results.

Hi, here is some more information regarding the cluster and my environment:

----------------------  ----------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
numpy                   1.21.2
PyTorch                 1.10.1 @/home/rvandeghen/anaconda3/envs/SNv3-detection/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Quadro RTX 6000 (arch=7.5)
Driver version          450.57
CUDA_HOME               /home/rvandeghen/anaconda3/envs/SNv3-detection
Pillow                  8.4.0
torchvision             0.11.2 @/home/rvandeghen/anaconda3/envs/SNv3-detection/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2                     4.5.4
----------------------  ----------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.6
  - Built with CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always-faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

Maybe this gives you more information about the CUDA and cuDNN versions?

I encountered this error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I deleted this line, and then it worked:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
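For reference, with torch.distributed.run it is usually enough to let the launcher control device visibility and pick the GPU from LOCAL_RANK inside the script; a minimal sketch (the model here is just a placeholder, not the poster's code):

import os
import torch
import torch.distributed as dist
import torch.nn as nn

local_rank = int(os.environ["LOCAL_RANK"])   # injected by torch.distributed.run / torchrun
torch.cuda.set_device(local_rank)            # bind this process to its own GPU
dist.init_process_group(backend="nccl")

model = nn.Linear(10, 10).cuda(local_rank)   # placeholder model
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])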