Hi, I have a problem using convert_sync_batchnorm.
When I train with plain DDP everything works fine, but as soon as I turn on the sync_bn mode, the training process starts and then gets stuck right away…
Here’s some info:
PyTorch version: 1.8.0
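For context, this is roughly how I understand the sync BN conversion to work (a minimal sketch with a placeholder model, not the actual YOLOv5 code):

import torch
import torch.nn as nn

# placeholder model containing BatchNorm layers (not the real YOLOv5 model)
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# convert_sync_batchnorm replaces every BatchNorm*d layer with SyncBatchNorm,
# which synchronizes batch statistics across the DDP processes
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)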
# How I run the script:
python -m torch.distributed.launch \
--nproc_per_node 4 \
--master_addr $master_addr \
--master_port $port train.py \
--batch 256 --weights yolov5s.pt --device 0,1,2,3 \
--sync_bn
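As far as I understand, torch.distributed.launch spawns one process per GPU and passes --local_rank=<n> to each of them, so train.py parses it roughly like this (a simplified sketch of the argument parsing, not the full option list):

import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch should inject --local_rank for each spawned process;
# the default of -1 means the script was not started by the launcher
parser.add_argument('--local_rank', type=int, default=-1)
parser.add_argument('--sync_bn', action='store_true')
opt, _ = parser.parse_known_args()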
# How I init:
# note: opt.local_rank is -1 here
import torch
import torch.distributed as dist

if opt.local_rank != -1:
    assert torch.cuda.device_count() > opt.local_rank
    torch.cuda.set_device(opt.local_rank)
    device = torch.device('cuda', opt.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
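After that, the order I apply SyncBatchNorm and DDP in is roughly this (a simplified sketch continuing from the init snippet above; model and device are placeholders, in my case the network loaded from yolov5s.pt):

from torch.nn.parallel import DistributedDataParallel as DDP

model = model.to(device)

if opt.sync_bn and opt.local_rank != -1:
    # replace BatchNorm layers with SyncBatchNorm so stats are shared across ranks
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

if opt.local_rank != -1:
    # wrap the model for distributed training on this process's GPU
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)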
I would like to know why, when I turn on sync batch normalization, the training process stops at the very beginning of training… Thanks