Hi! I’ve faced the same issue. The solution was to increase the learning rate in proportion to the increase in the total batch size, as explained in detail in this great article.
In particular, when I used the same learning rate for training on 1 GPU and on 4 GPUs, there was no speedup at all. But when I multiplied the learning rate by 4 for the 4-GPU case, it converged much faster. The speedup was almost 4x, so roughly linear in the number of GPUs.
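For reference, here is a minimal sketch of that scaling rule in plain Python. The function name, the base learning rate of 0.1, and the per-GPU batch size of 64 are made-up values for illustration, not the ones from my runs, and it assumes the per-GPU batch size stays fixed so the total batch size grows with the number of GPUs.

```python
def scaled_learning_rate(base_lr, per_gpu_batch_size, num_gpus, base_num_gpus=1):
    """Scale the learning rate in proportion to the total batch size.

    Hypothetical helper: assumes base_lr was tuned for base_num_gpus GPUs
    and that every GPU processes per_gpu_batch_size samples per step.
    """
    base_total_batch = per_gpu_batch_size * base_num_gpus
    new_total_batch = per_gpu_batch_size * num_gpus
    return base_lr * new_total_batch / base_total_batch


# Illustrative values only (assumptions, not measured settings).
lr_1gpu = scaled_learning_rate(base_lr=0.1, per_gpu_batch_size=64, num_gpus=1)
lr_4gpu = scaled_learning_rate(base_lr=0.1, per_gpu_batch_size=64, num_gpus=4)
print(lr_1gpu, lr_4gpu)
```

The result for the 4-GPU case is simply the single-GPU learning rate times 4, which is the value you would then pass to your optimizer. Whether this scaling holds for very large batch sizes likely depends on the model and may need a warmup period, so treat it as a starting point.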