How to implement accumulated gradient?

Gopal_Sharma · October 12, 2018, 1:35pm

Yeah. Batch normalization is tricky to get right in a multi-GPU setting. This is mainly because BN computes the mean (and variance) over the whole mini-batch, so each GPU needs information about the tensors on the other GPUs. Communication (sharing) between GPUs is costly.
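To make the point concrete, here is a rough pure-Python sketch (no real GPUs involved) of why the per-device statistics are not enough on their own. The names `local_stats` and `combine_stats` are made up for illustration and are not part of any library. Each "device" holds one shard of the mini-batch, and the full-batch mean and variance can only be recovered once every device has shared its count, mean, and variance.

```python
def local_stats(shard):
    """Count, mean, and biased variance of the values on one device."""
    n = len(shard)
    mean = sum(shard) / n
    var = sum((x - mean) ** 2 for x in shard) / n
    return n, mean, var


def combine_stats(stats):
    """Merge per-device (count, mean, var) into full mini-batch statistics.

    This merge is the step that needs cross-device communication.
    """
    total = sum(n for n, _, _ in stats)
    mean = sum(n * m for n, m, _ in stats) / total
    # E[x^2] on each shard is var + mean^2; average it, then subtract mean^2.
    mean_sq = sum(n * (v + m * m) for n, m, v in stats) / total
    return total, mean, mean_sq - mean * mean


# Hypothetical mini-batch of 6 values split across two devices.
shards = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
per_device = [local_stats(s) for s in shards]
batch_n, batch_mean, batch_var = combine_stats(per_device)

print("per-device means:", [m for _, m, _ in per_device])
print("full-batch mean:", batch_mean, "variance:", batch_var)
```

If each device normalized with only its own mean and variance, the two shards would be normalized differently from the full batch. The same thing happens with accumulated gradients: each small forward pass sees only its own chunk's statistics, not those of the larger effective batch.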