Not able to run the training process with convert_sync_batchnorm

Hi, I have a problem with using convert_sync_batchnorm. When I try plain DDP everything works fine, but when I turn on the sync_bn mode the training process starts and gets stuck right away…

Here’s some info:

PyTorch version: 1.8.0
# How I run the script:
python -m torch.distributed.launch \
    --nproc_per_node 4 \
    --master_addr $master_addr \
    --master_port $port \
    train.py \
    --batch 256 --weights yolov5s.pt --device 0,1,2,3 \
    --sync_bn
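
For reference, torch.distributed.launch in PyTorch 1.8 passes a --local_rank argument to each worker process (unless --use_env is given), so train.py is expected to parse it. A minimal sketch of that parsing, assuming an argparse-based opt object (not the poster's exact code):

import argparse

parser = argparse.ArgumentParser()
# torch.distributed.launch appends --local_rank=<rank> to each worker's command line;
# the default of -1 means "not launched in distributed mode"
parser.add_argument('--local_rank', type=int, default=-1)
opt = parser.parse_args()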

# How I init:
import torch
import torch.distributed as dist

# note: opt.local_rank is -1 here
if opt.local_rank != -1:
    assert torch.cuda.device_count() > opt.local_rank
    torch.cuda.set_device(opt.local_rank)
    device = torch.device('cuda', opt.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
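
For context, the SyncBatchNorm conversion normally sits after init_process_group and before the DDP wrap. A minimal sketch, assuming an already-built model plus opt.sync_bn / opt.local_rank fields (not the poster's exact code):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Convert every BatchNorm layer to SyncBatchNorm; this requires the default
# process group created by dist.init_process_group above to already exist.
if opt.sync_bn and opt.local_rank != -1:
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)

# Wrap with DDP, pinning the module to this process's GPU
model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)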

I would like to know why, when I turn on sync batch normalization, the training process stops at the beginning of training… Thanks

I did not see anything wrong with your init_process_group, and I do not think sync batch norm could cause an initialization failure in common cases.

Do you want to share a reproducible code snippet?


Sorry for the late reply. I actually raised a new issue here: yolov5.