I want to expand a 4-GPU training scheme to 8 GPUs, and I'm wondering whether I can use the adjustment rules from the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour". The scheme proposed in the paper is for distributed synchronous SGD. Even though optim.SGD in PyTorch is not a distributed version, I'm assuming it behaves synchronously when used for multi-GPU training. I'm not sure if that's the right assumption.
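For concreteness, here is a minimal sketch of the arithmetic behind the paper's two rules (linear lr scaling plus gradual warmup), assuming the per-GPU batch size stays fixed so the effective minibatch doubles going from 4 to 8 GPUs. The base lr of 0.1 and the 5-epoch warmup are placeholder values, not from my actual setup.

```python
def scaled_lr(base_lr: float, base_gpus: int, new_gpus: int) -> float:
    """Linear scaling rule: multiply the lr by the minibatch growth factor.

    Going 4 -> 8 GPUs with fixed per-GPU batch size doubles the effective
    minibatch, so the factor is 2.
    """
    return base_lr * (new_gpus / base_gpus)


def warmup_lr(epoch: int, base_lr: float, target_lr: float,
              warmup_epochs: int = 5) -> float:
    """Gradual warmup: ramp linearly from base_lr to target_lr over the
    first warmup_epochs epochs, then hold target_lr."""
    if epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr


target = scaled_lr(0.1, base_gpus=4, new_gpus=8)  # 0.1 * 2 = 0.2

# In PyTorch this would be applied each epoch by updating the optimizer,
# e.g.:
#   for g in optimizer.param_groups:
#       g["lr"] = warmup_lr(epoch, 0.1, target)
```

Whether this lr math is valid still hinges on the multi-GPU training being synchronous, which is the assumption I'm asking about.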