I’m using DDP (one process per GPU) to train a 3D UNet. I converted all batchnorm layers in the network to SyncBatchNorm with nn.SyncBatchNorm.convert_sync_batchnorm. When running validation at the end of every training epoch on rank 0, it always freezes at the same validation step. I think it is becau…
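For readers unfamiliar with the conversion step mentioned above, here is a minimal sketch of what nn.SyncBatchNorm.convert_sync_batchnorm does to a model. The small network and the helper names (`make_tiny_block`, `count_layers`) are hypothetical stand-ins, not the poster's 3D UNet, and the sketch does not start any processes or reproduce the hang. It only shows that every BatchNorm3d layer is swapped for a SyncBatchNorm layer.

```python
import torch.nn as nn


def make_tiny_block() -> nn.Module:
    # Hypothetical stand-in for the 3D UNet from the post (assumption).
    return nn.Sequential(
        nn.Conv3d(1, 4, kernel_size=3, padding=1),
        nn.BatchNorm3d(4),
        nn.ReLU(inplace=True),
        nn.Conv3d(4, 4, kernel_size=3, padding=1),
        nn.BatchNorm3d(4),
    )


def count_layers(model: nn.Module, layer_type: type) -> int:
    # Count submodules of the given type, including nested ones.
    return sum(isinstance(m, layer_type) for m in model.modules())


model = make_tiny_block()
bn_before = count_layers(model, nn.BatchNorm3d)

# Replace every BatchNorm*d layer with SyncBatchNorm. The call returns the
# converted module, so the result has to be reassigned.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

sync_after = count_layers(model, nn.SyncBatchNorm)
bn_after = count_layers(model, nn.BatchNorm3d)
print(bn_before, sync_after, bn_after)
```

In an actual DDP run the converted model would then be moved to the process's GPU and wrapped in torch.nn.parallel.DistributedDataParallel after the process group is initialized; that part is omitted here because it needs multiple processes.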

Could you update to the latest stable release or the nightly binary and check whether you are still facing the error? 1.1.0 is quite old by now, and this issue might already have been fixed.

Validation hangs up when using DDP and syncbatchnorm

sunshichen (Shichen) December 14, 2020, 6:19am 10

Actually, I ran into another problem after upgrading to 1.7.0: the results are much worse than they were on 1.1. Could you help me with that?