torch.distributed.barrier stuck with CUDA 11.4

Hi, I have a problem using torch.distributed in a CUDA 11.4 environment (NVIDIA RTX A6000).

It seems that torch.distributed and DDP don't work properly. I tested with the simple code below.

import logging
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel  # imported for the full DDP setup, unused in this repro

def get_local_rank() -> int:
    return int(os.environ.get('LOCAL_RANK', 0))

def init_distributed() -> bool:
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    distributed = world_size > 1
    if distributed:
        backend = 'nccl' if torch.cuda.is_available() else 'gloo'
        dist.init_process_group(backend=backend, init_method='env://')
        if backend == 'nccl':
            print('set nccl correctly!')
            # bind this process to its own GPU before any collective call
            torch.cuda.set_device(get_local_rank())
        else:
            logging.warning('Running on CPU only!')
        assert torch.distributed.is_initialized()
    return distributed

dist_stat = init_distributed()
local_rank = get_local_rank()
print('hi', local_rank)
dist.barrier(device_ids=[get_local_rank()])
print('bye', local_rank)

I run the above code with the following command:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu test_dist.py

Then I get this output:
```
set nccl correctly!
set nccl correctly!set nccl correctly!

set nccl correctly!
hi 0
hi 1
hi 2
hi 3
```
The run then hangs at dist.barrier() forever. However, when I use only two GPUs, everything works fine. What should I look at in order to solve this problem?
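For comparison, the two-GPU run that completes is launched the same way, roughly as below (assuming the GPUs are restricted via CUDA_VISIBLE_DEVICES, as in the four-GPU command above):

```
# same script, only two GPUs visible, so torch.distributed.run spawns two processes
export CUDA_VISIBLE_DEVICES=0,1
python -u -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu test_dist.py
```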

The Python version is 3.9.7 and torch is 1.10.1+cu113.
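(For reference, these versions were checked roughly as follows; torch.version.cuda reports the CUDA version the wheel was built against, not the driver's CUDA 11.4:)

```
python --version                                     # Python 3.9.7
python -c "import torch; print(torch.__version__)"   # 1.10.1+cu113
python -c "import torch; print(torch.version.cuda)"  # 11.3
```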

I experience the same issue with a very similar setup.