My server has 4x GTX 1080 Ti GPUs. When I use DistributedDataParallel with sync batchnorm on 4 GPUs, it leads to a lost GPU. But when I use apex sync batchnorm, the program works fine. Sync batchnorm also works fine when run on 3 GPUs. What's the problem? How can I deal with it?
Could you explain a bit what “GPU lost” means?
Is a GPU disappearing from the system, i.e. is it not detectable in
Yes. The server has to be rebooted before the GPU can be used again. If I switch SyncBatchNorm to BatchNorm, the problem disappears.
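
For reference, here is a minimal sketch of how the switch between BatchNorm and SyncBatchNorm is usually toggled, which may help narrow the problem down. The toy network and the `build_model` / `use_sync_bn` names are hypothetical and stand in for the real model; only `torch.nn.SyncBatchNorm.convert_sync_batchnorm` is the actual PyTorch call.

```python
import torch.nn as nn


def build_model(use_sync_bn: bool) -> nn.Module:
    """Build a toy network, optionally converting BatchNorm to SyncBatchNorm."""
    # Hypothetical stand-in for the real model from the question.
    model = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    )
    if use_sync_bn:
        # Recursively replaces every BatchNorm*d layer with SyncBatchNorm.
        # The conversion itself does not need an initialized process group.
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return model
```

In a real run, each process would then move the converted model to its own GPU and wrap it in `torch.nn.parallel.DistributedDataParallel` with `device_ids=[rank]`, one process per GPU. Flipping `use_sync_bn` while keeping everything else fixed is one way to confirm that the SyncBatchNorm path is what triggers the failure.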