NCCL error when running distributed training

My code used to work in PyTorch 1.6, but I recently upgraded to 1.9. When I try to train in distributed mode (on a single PC with 2 GPUs, not multiple machines), the following error happens. Sorry for the long log; I've never seen this before and I'm totally lost.

$ python -m torch.distributed.run --standalone --nnodes=1 --nproc_per_node=2 train_dist_2.py
[INFO] 2021-08-13 18:21:14,035 run: Running torch.distributed.run with args: ['/usr/lib/python3.9/site-packages/torch/distributed/run.py', '--standalone', '--nnodes=1', '--nproc_per_node=2', 'train_dist_2.py']
[INFO] 2021-08-13 18:21:14,036 run:


Rendezvous info:
--rdzv_backend=c10d --rdzv_endpoint=localhost:29400 --rdzv_id=5c6a0ec7-2728-407d-8d25-7dde979518e6


[INFO] 2021-08-13 18:21:14,036 run: Using nproc_per_node=2.


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO] 2021-08-13 18:21:14,036 api: Starting elastic_operator with launch configs:
entrypoint : train_dist_2.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : 5c6a0ec7-2728-407d-8d25-7dde979518e6
rdzv_backend : c10d
rdzv_endpoint : localhost:29400
rdzv_configs : {'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

[INFO] 2021-08-13 18:21:14,059 c10d_rendezvous_backend: Process 25097 hosts the TCP store for the C10d rendezvous backend.
[INFO] 2021-08-13 18:21:14,060 local_elastic_agent: log directory set to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog
[INFO] 2021-08-13 18:21:14,060 api: [default] starting workers for entrypoint: python
[INFO] 2021-08-13 18:21:14,060 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:14,060 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:14,277 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 0 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:21:14,278 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=cnn
master_port=36965
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:14,278 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:14,278 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/0/error.json
[INFO] 2021-08-13 18:21:14,278 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_0/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.793745 25104 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.793756 25154 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:23.801612 25103 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:23.801617 25157 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:21:24.402045 25154 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:24.403731 25157 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:29,303 api: failed (exitcode: 1) local_rank: 0 (pid: 25103) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:29,303 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:29,303 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:29,303 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:29,303 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:29,422 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 1 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:29,423 api: [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=cnn
master_port=53181
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:29,423 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:29,423 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/0/error.json
[INFO] 2021-08-13 18:21:29,423 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_1/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.944895 25196 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.944903 25245 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:38.954780 25195 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:38.954794 25248 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:21:39.523105 25248 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
I20210813 18:21:39.523314 25245 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:44,447 api: failed (exitcode: 1) local_rank: 0 (pid: 25195) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:44,447 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:44,447 api: [default] Worker group FAILED. 2/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:44,447 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:44,447 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:44,448 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 2 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:44,449 api: [default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=cnn
master_port=35757
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:44,449 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:44,449 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/0/error.json
[INFO] 2021-08-13 18:21:44,449 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_2/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.939616 25287 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.939630 25330 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:21:52.949156 25286 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:21:52.949156 25333 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
I20210813 18:21:53.513854 25330 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:21:53.513996 25333 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:21:54,472 api: failed (exitcode: 1) local_rank: 0 (pid: 25286) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:21:54,472 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:21:54,472 api: [default] Worker group FAILED. 1/3 attempts left; will restart worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Stopping worker group
[INFO] 2021-08-13 18:21:54,472 api: [default] Rendezvous'ing worker group
[INFO] 2021-08-13 18:21:54,472 dynamic_rendezvous: The node 'cnn_25097_0' attempts to join the next round of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
[INFO] 2021-08-13 18:21:54,473 dynamic_rendezvous: The node 'cnn_25097_0' has joined round 3 of the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6' as rank 0 in a world of size 1.
[INFO] 2021-08-13 18:21:54,474 api: [default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=cnn
master_port=44399
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

[INFO] 2021-08-13 18:21:54,474 api: [default] Starting worker group
[INFO] 2021-08-13 18:21:54,474 init: Setting worker0 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/0/error.json
[INFO] 2021-08-13 18:21:54,474 init: Setting worker1 reply file to: /tmp/torchelastic_ra_2ujgp/5c6a0ec7-2728-407d-8d25-7dde979518e6_5pfgxyog/attempt_3/1/error.json
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.975812 25356 ProcessGroupNCCL.cpp:480] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
I20210813 18:22:03.975847 25408 ProcessGroupNCCL.cpp:580] [Rank 1] NCCL watchdog thread started!
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20210813 18:22:03.977819 25411 ProcessGroupNCCL.cpp:580] [Rank 0] NCCL watchdog thread started!
I20210813 18:22:03.977841 25355 ProcessGroupNCCL.cpp:480] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: UNSET
Device check. Running model on cuda

Device check. Running model on cuda

Traceback (most recent call last):
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 285, in <module>
    main(args)
  File "/home/xxx/Desktop/MobileFaceNet-PyTorch/train_dist_2.py", line 135, in main
    model = nn.parallel.DistributedDataParallel(model, device_ids = [args.local_rank])
  File "/usr/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 20.9.9
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I20210813 18:22:04.551221 25408 ProcessGroupNCCL.cpp:582] [Rank 1] NCCL watchdog thread terminated normally
I20210813 18:22:04.554879 25411 ProcessGroupNCCL.cpp:582] [Rank 0] NCCL watchdog thread terminated normally
[ERROR] 2021-08-13 18:22:09,499 api: failed (exitcode: 1) local_rank: 0 (pid: 25355) of binary: /usr/bin/python
[ERROR] 2021-08-13 18:22:09,499 local_elastic_agent: [default] Worker group failed
[INFO] 2021-08-13 18:22:09,500 api: Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/usr/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2021-08-13 18:22:09,500 api: Done waiting for other agents. Elapsed: 0.00022149085998535156 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 0, "group_rank": 0, "worker_id": "25355", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": 1, "group_rank": 0, "worker_id": "25356", "role": "default", "hostname": "cnn", "state": "FAILED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": "{\"message\": \"\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "5c6a0ec7-2728-407d-8d25-7dde979518e6", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cnn", "state": "SUCCEEDED", "total_run_time": 55, "rdzv_backend": "c10d", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
[INFO] 2021-08-13 18:22:09,501 dynamic_rendezvous: The node 'cnn_25097_0' has closed the rendezvous '5c6a0ec7-2728-407d-8d25-7dde979518e6'.
/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

           CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 25355 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
# do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/usr/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


     train_dist_2.py FAILED        

=======================================
Root Cause:
[0]:
time: 2021-08-13_18:22:09
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 25355)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-08-13_18:22:09
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 25356)
error_file: <N/A>
msg: "Process failed with exitcode 1"


Any help is appreciated

You could rerun the script with export NCCL_DEBUG=INFO and check the logs for NCCL errors.
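If it helps, the same can be done from inside the script before the process group is created (just a sketch; exporting NCCL_DEBUG=INFO in the shell before the python -m torch.distributed.run command works equally well):

# sketch: enable verbose NCCL logging before any NCCL communicator is created
import os
os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL init/transport info to stderr
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: include all NCCL subsystems

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # NCCL reads the env vars when the communicator is created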

Hi,

I'm using a cluster with multiple nodes that have different GPUs (2080, Quadro, Tesla, …), and I get an error only on the nodes with Quadro cards.
Here is the output with NCCL_DEBUG=INFO for a node without problems:

| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
compute-05:2612510:2612510 [0] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612510:2612510 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612510:2612510 [0] NCCL INFO NET/IB : No device found.
compute-05:2612510:2612510 [0] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612510:2612510 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
compute-05:2612511:2612511 [1] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612513:2612513 [3] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612513:2612513 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612512:2612512 [2] NCCL INFO Bootstrap : Using enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO NET/IB : No device found.
compute-05:2612513:2612513 [3] NCCL INFO NET/IB : No device found.
compute-05:2612512:2612512 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-05:2612511:2612511 [1] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612511:2612511 [1] NCCL INFO Using network Socket
compute-05:2612512:2612512 [2] NCCL INFO NET/IB : No device found.
compute-05:2612513:2612513 [3] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612513:2612513 [3] NCCL INFO Using network Socket
compute-05:2612512:2612512 [2] NCCL INFO NET/Socket : Using [0]enp4s0f0:<0>
compute-05:2612512:2612512 [2] NCCL INFO Using network Socket
compute-05:2612510:2612567 [0] NCCL INFO Channel 00/02 :    0   1   2   3
compute-05:2612511:2612568 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
compute-05:2612510:2612567 [0] NCCL INFO Channel 01/02 :    0   1   2   3
compute-05:2612510:2612567 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
compute-05:2612511:2612568 [1] NCCL INFO Setting affinity for GPU 1 to 070007
compute-05:2612510:2612567 [0] NCCL INFO Setting affinity for GPU 0 to 070007
compute-05:2612512:2612570 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
compute-05:2612513:2612569 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
compute-05:2612511:2612568 [1] NCCL INFO Channel 00 : 1[3000] -> 2[81000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 00 : 2[81000] -> 3[82000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Channel 00 : 0[2000] -> 1[3000] via direct shared memory
compute-05:2612511:2612568 [1] NCCL INFO Channel 01 : 1[3000] -> 2[81000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 01 : 2[81000] -> 3[82000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Channel 01 : 0[2000] -> 1[3000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 00 : 3[82000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 01 : 3[82000] -> 0[2000] via direct shared memory
compute-05:2612511:2612568 [1] NCCL INFO Connected all rings
compute-05:2612510:2612567 [0] NCCL INFO Connected all rings
compute-05:2612512:2612570 [2] NCCL INFO Connected all rings
compute-05:2612511:2612568 [1] NCCL INFO Channel 00 : 1[3000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Connected all rings
compute-05:2612511:2612568 [1] NCCL INFO Channel 01 : 1[3000] -> 0[2000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 00 : 3[82000] -> 2[81000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Channel 01 : 3[82000] -> 2[81000] via direct shared memory
compute-05:2612510:2612567 [0] NCCL INFO Connected all trees
compute-05:2612510:2612567 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612510:2612567 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO Channel 00 : 2[81000] -> 1[3000] via direct shared memory
compute-05:2612512:2612570 [2] NCCL INFO Channel 01 : 2[81000] -> 1[3000] via direct shared memory
compute-05:2612513:2612569 [3] NCCL INFO Connected all trees
compute-05:2612513:2612569 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612513:2612569 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612511:2612568 [1] NCCL INFO Connected all trees
compute-05:2612511:2612568 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612511:2612568 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO Connected all trees
compute-05:2612512:2612570 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
compute-05:2612512:2612570 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
compute-05:2612512:2612570 [2] NCCL INFO comm 0x7f3054002fb0 rank 2 nranks 4 cudaDev 2 busId 81000 - Init COMPLETE
compute-05:2612513:2612569 [3] NCCL INFO comm 0x7f686c002fb0 rank 3 nranks 4 cudaDev 3 busId 82000 - Init COMPLETE
compute-05:2612511:2612568 [1] NCCL INFO comm 0x7f94fc002fb0 rank 1 nranks 4 cudaDev 1 busId 3000 - Init COMPLETE
compute-05:2612510:2612567 [0] NCCL INFO comm 0x7f8410002fb0 rank 0 nranks 4 cudaDev 0 busId 2000 - Init COMPLETE
compute-05:2612510:2612510 [0] NCCL INFO Launch mode Parallel

and the output from the Quadro node:

| distributed init (rank 2): env://
| distributed init (rank 0): env://
| distributed init (rank 1): env://
| distributed init (rank 3): env://
compute-11:1358800:1358800 [0] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358800:1358800 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358800:1358800 [0] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358800:1358800 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
compute-11:1358801:1358801 [1] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358801:1358801 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358803:1358803 [3] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO Bootstrap : Using eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358803:1358803 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
compute-11:1358801:1358801 [1] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358801:1358801 [1] NCCL INFO Using network IB
compute-11:1358802:1358802 [2] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358802:1358802 [2] NCCL INFO Using network IB
compute-11:1358803:1358803 [3] NCCL INFO NET/IB : Using [0]i40iw1:1/RoCE ; OOB eno1:<0>
compute-11:1358803:1358803 [3] NCCL INFO Using network IB
compute-11:1358800:1358860 [0] NCCL INFO Channel 00/04 :    0   1   2   3
compute-11:1358800:1358860 [0] NCCL INFO Channel 01/04 :    0   3   2   1
compute-11:1358800:1358860 [0] NCCL INFO Channel 02/04 :    0   1   2   3
compute-11:1358800:1358860 [0] NCCL INFO Channel 03/04 :    0   3   2   1
compute-11:1358801:1358862 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] -1/-1/-1->1->2
compute-11:1358800:1358860 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
compute-11:1358801:1358862 [1] NCCL INFO Setting affinity for GPU 1 to 07,00000007
compute-11:1358800:1358860 [0] NCCL INFO Setting affinity for GPU 0 to 07,00000007
compute-11:1358802:1358865 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 1/-1/-1->2->3
compute-11:1358802:1358865 [2] NCCL INFO Setting affinity for GPU 2 to 07,00000007
compute-11:1358803:1358866 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] 2/-1/-1->3->0
compute-11:1358803:1358866 [3] NCCL INFO Setting affinity for GPU 3 to 07,00000007
compute-11:1358800:1358860 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1c000] via P2P/IPC
compute-11:1358801:1358862 [1] NCCL INFO Channel 00 : 1[1c000] -> 2[1d000] via P2P/IPC
compute-11:1358802:1358865 [2] NCCL INFO Channel 00 : 2[1d000] -> 3[1e000] via P2P/IPC
compute-11:1358803:1358866 [3] NCCL INFO Channel 00 : 3[1e000] -> 0[1a000] via P2P/IPC
compute-11:1358800:1358860 [0] NCCL INFO Channel 02 : 0[1a000] -> 1[1c000] via P2P/IPC
compute-11:1358801:1358862 [1] NCCL INFO Channel 02 : 1[1c000] -> 2[1d000] via P2P/IPC
compute-11:1358802:1358865 [2] NCCL INFO Channel 02 : 2[1d000] -> 3[1e000] via P2P/IPC

compute-11:1358801:1358862 [1] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358801:1358862 [1] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358801:1358862 [1] NCCL INFO transport.cc:111 -> 1
compute-11:1358801:1358862 [1] NCCL INFO init.cc:778 -> 1
compute-11:1358801:1358862 [1] NCCL INFO init.cc:904 -> 1
compute-11:1358801:1358862 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
compute-11:1358803:1358866 [3] NCCL INFO Channel 02 : 3[1e000] -> 0[1a000] via P2P/IPC

compute-11:1358802:1358865 [2] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358802:1358865 [2] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358802:1358865 [2] NCCL INFO transport.cc:111 -> 1
compute-11:1358802:1358865 [2] NCCL INFO init.cc:778 -> 1
compute-11:1358802:1358865 [2] NCCL INFO init.cc:904 -> 1
compute-11:1358802:1358865 [2] NCCL INFO group.cc:72 -> 1 [Async thread]

compute-11:1358803:1358866 [3] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358803:1358866 [3] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358803:1358866 [3] NCCL INFO transport.cc:111 -> 1
compute-11:1358803:1358866 [3] NCCL INFO init.cc:778 -> 1
compute-11:1358803:1358866 [3] NCCL INFO init.cc:904 -> 1
compute-11:1358803:1358866 [3] NCCL INFO group.cc:72 -> 1 [Async thread]

compute-11:1358800:1358860 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
compute-11:1358800:1358860 [0] NCCL INFO transport/p2p.cc:238 -> 1
compute-11:1358800:1358860 [0] NCCL INFO transport.cc:111 -> 1
compute-11:1358800:1358860 [0] NCCL INFO init.cc:778 -> 1
compute-11:1358800:1358860 [0] NCCL INFO init.cc:904 -> 1
compute-11:1358800:1358860 [0] NCCL INFO group.cc:72 -> 1 [Async thread]

and my error is:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180487213/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3 ncclUnhandledCudaError: Call to CUDA function failed.
Any idea why the behavior differs between the nodes and how to solve it?

This could point towards a driver issue on your machine, so try to update the driver to match your CUDA toolkit as described here.
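If you want to double-check what each node actually reports, something like this (just a sketch; run it on both a healthy and a problematic node) would show the PyTorch build, CUDA/cuDNN versions, and the driver:

import subprocess
import torch

# what PyTorch was built against and what the node exposes
print("torch:", torch.__version__)
print("CUDA runtime (PyTorch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU 0:", torch.cuda.get_device_name(0))

# driver version as reported by nvidia-smi
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("driver:", driver)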

The driver version and CUDA version are the same on all nodes, so I don't think that's where the problem comes from. Have you heard of anything directly related to Quadro cards?

No, I’m not aware of Quadro-specific issues.

This sounds a bit concerning as you are pointing to identical system setups where one node apparently crashes?
Are the NCCL tests running fine on all nodes (in particular the problematic and a healthy one)?
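As a lightweight alternative to the official nccl-tests binaries, a minimal PyTorch-level smoke test could look like this (a sketch; the filename and launch command are only an example):

# nccl_smoke_test.py  (hypothetical filename)
# launch: python -m torch.distributed.run --standalone --nproc_per_node=<num_gpus> nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torch.distributed.run
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # every rank contributes 1.0; after all_reduce the value should equal the world size
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()} (expected {dist.get_world_size()})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()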

From the outputs, the first difference happens here:

for the healthy node and here:

for the problematic one.
From my understanding, the problematic node hits a problem in transport/p2p.cc, whereas that path isn't needed on the healthy node, which uses direct shared memory.
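Since the failing path goes through transport/p2p.cc, one extra check that might help (a sketch, not something from the original posts) is whether the CUDA driver on that node reports peer access between the GPUs at all:

import torch

# query CUDA peer-to-peer capability between every GPU pair on this node
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access = {ok}")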

I will run the tests and get back with the results.

Hi, here is some more information regarding the cluster and my environment:

----------------------  ----------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
numpy                   1.21.2
PyTorch                 1.10.1 @/home/rvandeghen/anaconda3/envs/SNv3-detection/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Quadro RTX 6000 (arch=7.5)
Driver version          450.57
CUDA_HOME               /home/rvandeghen/anaconda3/envs/SNv3-detection
Pillow                  8.4.0
torchvision             0.11.2 @/home/rvandeghen/anaconda3/envs/SNv3-detection/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2                     4.5.4
----------------------  ----------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.6
  - Built with CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always-faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

Maybe this gives you more information about the CUDA and cuDNN versions?

I encountered this error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484809662/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I deleted this line, and then it worked:

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
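For reference, with torch.distributed.run it is usually enough to let the launcher control device visibility and pick the GPU from LOCAL_RANK inside the script; a minimal sketch (the model here is just a placeholder, not the poster's code):

import os
import torch
import torch.distributed as dist
import torch.nn as nn

local_rank = int(os.environ["LOCAL_RANK"])   # injected by torch.distributed.run / torchrun
torch.cuda.set_device(local_rank)            # bind this process to its own GPU
dist.init_process_group(backend="nccl")

model = nn.Linear(10, 10).cuda(local_rank)   # placeholder model
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])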