NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp

I am trying to run distributed DeiT training with PyTorch and hit an NCCL error during process-group initialization. Full log below:
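For context, the job uses the standard `env://` init method: `torch.distributed.launch` exports rank/world-size environment variables for each worker, which `utils.init_distributed_mode` then passes to `init_process_group` before calling `barrier()` (where the error below is raised). A minimal sketch of that variable handling (hypothetical helper, shown only to illustrate the convention; the real `utils.init_distributed_mode` in the DeiT repo may differ):

```python
import os

def read_dist_env(environ=os.environ):
    """Read the variables torch.distributed.launch exports for each
    worker process when the env:// init method is used."""
    rank = int(environ["RANK"])
    world_size = int(environ["WORLD_SIZE"])
    local_rank = int(environ["LOCAL_RANK"])
    return rank, world_size, local_rank

# These values would then feed:
#   torch.distributed.init_process_group(
#       backend="nccl", init_method="env://",
#       world_size=world_size, rank=rank)
#   torch.distributed.barrier()   # <-- the call that fails below
```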

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
| distributed init (rank 2): env://
| distributed init (rank 1): env://
| distributed init (rank 3): env://
| distributed init (rank 0): env://
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO Bootstrap : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO NET/IB : No device found.
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO NET/Socket : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO Bootstrap : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO NET/IB : No device found.
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO NET/Socket : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO Using network Socket
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO Bootstrap : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO NET/IB : No device found.
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO NET/Socket : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO Using network Socket
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO Bootstrap : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO NET/IB : No device found.
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO NET/Socket : Using [0]xgbe0:10.88.150.11<0>
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO Using network Socket
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Channel 00/02 :    0   1   2   3
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Channel 01/02 :    0   1   2   3
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Setting affinity for GPU 0 to 3f
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Setting affinity for GPU 3 to 0fc0
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Setting affinity for GPU 1 to 3f
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Setting affinity for GPU 2 to 0fc0
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Channel 00 : 0[3000] -> 1[4000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Channel 00 : 1[4000] -> 2[83000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Channel 00 : 3[84000] -> 0[3000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Channel 00 : 2[83000] -> 3[84000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Channel 00 : 3[84000] -> 2[83000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Channel 00 : 1[4000] -> 0[3000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO Channel 01 : 0[3000] -> 1[4000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Channel 01 : 3[84000] -> 0[3000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Channel 00 : 2[83000] -> 1[4000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Channel 01 : 1[4000] -> 2[83000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Channel 01 : 2[83000] -> 3[84000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO Channel 01 : 3[84000] -> 2[83000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO Channel 01 : 1[4000] -> 0[3000] via direct shared memory
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
yq01-sys-hic-k8s-k40-0163:3415:3441 [3] NCCL INFO comm 0x7f40b8000e00 rank 3 nranks 4 cudaDev 3 busId 84000 - Init COMPLETE
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO Channel 01 : 2[83000] -> 1[4000] via direct shared memory

yq01-sys-hic-k8s-k40-0163:3415:3415 [3] enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'
yq01-sys-hic-k8s-k40-0163:3415:3415 [3] NCCL INFO group.cc:282 -> 1
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
Traceback (most recent call last):
  File "main.py", line 420, in <module>
    main(args)
yq01-sys-hic-k8s-k40-0163:3412:3438 [0] NCCL INFO comm 0x7f5388000e00 rank 0 nranks 4 cudaDev 0 busId 3000 - Init COMPLETE

  File "main.py", line 172, in main
    utils.init_distributed_mode(args)
  File "/root/paddlejob/workspace/deit/utils.py", line 236, in init_distributed_mode
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO Launch mode Parallel
    world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier

yq01-sys-hic-k8s-k40-0163:3412:3412 [0] enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'
yq01-sys-hic-k8s-k40-0163:3412:3412 [0] NCCL INFO group.cc:282 -> 1
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
Traceback (most recent call last):
  File "main.py", line 420, in <module>
    main(args)
  File "main.py", line 172, in main
    utils.init_distributed_mode(args)
  File "/root/paddlejob/workspace/deit/utils.py", line 236, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
yq01-sys-hic-k8s-k40-0163:3413:3439 [1] NCCL INFO comm 0x7fd948000e00 rank 1 nranks 4 cudaDev 1 busId 4000 - Init COMPLETE

yq01-sys-hic-k8s-k40-0163:3413:3413 [1] enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'
yq01-sys-hic-k8s-k40-0163:3413:3413 [1] NCCL INFO group.cc:282 -> 1
Traceback (most recent call last):
  File "main.py", line 420, in <module>
    main(args)
  File "main.py", line 172, in main
    utils.init_distributed_mode(args)
  File "/root/paddlejob/workspace/deit/utils.py", line 236, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
yq01-sys-hic-k8s-k40-0163:3414:3440 [2] NCCL INFO comm 0x7fe3b0000e00 rank 2 nranks 4 cudaDev 2 busId 83000 - Init COMPLETE

yq01-sys-hic-k8s-k40-0163:3414:3414 [2] enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'
yq01-sys-hic-k8s-k40-0163:3414:3414 [2] NCCL INFO group.cc:282 -> 1
Traceback (most recent call last):
  File "main.py", line 420, in <module>
    main(args)
  File "main.py", line 172, in main
    utils.init_distributed_mode(args)
  File "/root/paddlejob/workspace/deit/utils.py", line 236, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/_internal/cpython-3.7.0/bin/python3', '-u', 'main.py', '--model', 'deit_small_patch16_224', '--batch-size', '64', '--data-path', 'data/ILSVRC2012', '--output_dir', 'checkpoints']' returned non-zero exit status 1.

Full environment:

nodes: 1
GPUs per node: 4
GPU model: Tesla K40m
PyTorch version: 1.7.1
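The `Cuda failure 'invalid device function'` warning usually means the installed binaries contain no kernels compiled for the GPU's compute capability; the Tesla K40m is sm_35, which the official PyTorch 1.7.x cu102 wheels no longer target. You can compare `torch.cuda.get_arch_list()` against `torch.cuda.get_device_capability()` on the failing machine. A self-contained sketch of that check (the arch list below is an illustrative assumption, not necessarily what your build reports):

```python
def build_supports_device(arch_list, capability):
    """Return True if a binary built for `arch_list` (the format reported
    by torch.cuda.get_arch_list()) can run on a GPU with the given
    (major, minor) compute capability.

    Rule of thumb: an sm_XY cubin runs only on capability X.Y exactly,
    while a compute_XY (PTX) entry can be JIT-compiled for X.Y and newer.
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, num = arch.partition("_")
        a_major, a_minor = int(num[:-1]), int(num[-1])
        if kind == "sm" and (a_major, a_minor) == (major, minor):
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False

# Illustrative arch list for a cu102 wheel (assumed; check with
# torch.cuda.get_arch_list() on the failing install):
archs = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(build_supports_device(archs, (3, 5)))   # K40m is sm_35 -> False
print(build_supports_device(archs, (7, 0)))   # V100 is sm_70 -> True
```

If the K40m's capability is indeed missing from the arch list, the usual fixes are installing an older PyTorch build that still ships sm_35 kernels or building PyTorch from source with `TORCH_CUDA_ARCH_LIST="3.5"`.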

@itisianlee Could you share your complete training script that reproduces the problem?