I additionally found that there is no problem when using only a single GPU; the issue above occurs when I use DataParallel with 2 GPUs. According to the DataParallel example (https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html), half of the inputs go to cuda:0 and the other half to cuda:1.
How can I adapt my custom batchnorm implementation to work with DataParallel?
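For context, here is a minimal sketch of the situation I mean (the class name `MyBatchNorm` is just illustrative, not my actual code): DataParallel replicates the module on each GPU and splits the batch along dim 0, so a custom batchnorm computes statistics only over its own half of the batch on each device.

```python
import torch
import torch.nn as nn

class MyBatchNorm(nn.Module):
    """Illustrative custom batchnorm: normalizes over the batch dimension."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        # Under DataParallel, each replica sees only its shard of the batch,
        # so mean/var here are per-GPU statistics, not full-batch statistics.
        mean = x.mean(dim=0, keepdim=True)
        var = x.var(dim=0, unbiased=False, keepdim=True)
        return self.weight * (x - mean) / torch.sqrt(var + self.eps) + self.bias

model = MyBatchNorm(8)
if torch.cuda.device_count() > 1:
    # Batch dim 0 is split across the available GPUs
    model = nn.DataParallel(model).cuda()

x = torch.randn(16, 8)
out = model(x)
```

With 2 GPUs, each replica normalizes a batch of 8 using its own mean/var, which is where my results start to differ from the single-GPU run.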