Yeah, batch normalization is tricky to get right in a multi-GPU setting. This is mainly because BN computes per-mini-batch statistics (mean and variance), so each device needs information about the tensors sitting on the other GPUs, and communication (sharing) between GPUs is costly.
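To illustrate why synchronization matters, here is a minimal sketch (plain Python, hypothetical numbers) showing that per-GPU batch means diverge from the true global batch mean unless the devices exchange their statistics:

```python
# Hypothetical activations for a batch of 8, split across 2 GPUs (4 samples each)
gpu0 = [1.0, 2.0, 3.0, 4.0]
gpu1 = [10.0, 20.0, 30.0, 40.0]

def mean(xs):
    return sum(xs) / len(xs)

# Without synchronization, each GPU normalizes with its own local mean
local_means = [mean(gpu0), mean(gpu1)]  # [2.5, 25.0]

# A synchronized BN would instead all-reduce the sums and counts first,
# so every device normalizes with the same global statistics
global_mean = (sum(gpu0) + sum(gpu1)) / (len(gpu0) + len(gpu1))  # 13.75

print(local_means, global_mean)
```

The gap between the local means (2.5 vs. 25.0) and the global mean (13.75) is exactly the information that a synchronized BN layer has to communicate across devices, which is where the cost comes from.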