Multi-GPU training: DataParallel vs. Apex DDP for semantic segmentation

Hello. I am currently using two GPU machines in the lab: the first has 2 Titan Vs and the second has 4 Titan Vs.
When I train on the same dataset with the 4-GPU machine (using a 2× larger batch size), the 2-Titan-V machine still gives a better result.
The network is based on MobileNetV3.
I have read some articles saying that synchronized batch normalization could help, so I use the Apex version of it:

```python
model = apex.parallel.convert_syncbn_model(model)
```
However, the 2-GPU run still gives a better result.
So I kept reading, and other articles say that in some cases DDP (DistributedDataParallel) is required when using many GPUs.
Does anyone have the same experience with multi-GPU training?

My second thought is about multi-scale training. In published papers with code, I have seen multi-scale training used together with multiple GPUs. Does this affect the segmentation result?
Thank you.

Did you play around with some hyperparameters, e.g. did you try to lower the learning rate for the multi-GPU setup?
I assume your model doesn’t converge as well using 4 GPUs compared to the 2-GPU run?

I use the same hyperparameters, the same loss function, the same model, and the same initialization technique…
As you may remember, my dataset is quite imbalanced.
For this reason, I use the Focal-Tversky loss (in my experiments so far it gives the best result).
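For context, the loss looks roughly like this — a minimal binary-segmentation sketch with illustrative hyperparameter values, not my exact settings:

```python
import torch

def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """Focal-Tversky loss for binary segmentation.

    probs:   predicted foreground probabilities, shape (N, H, W)
    targets: binary ground-truth masks, same shape
    alpha/beta weight false negatives vs. false positives in the
    Tversky index; gamma is the focal exponent. The defaults here are
    common illustrative choices, not tuned values.
    """
    probs = probs.flatten(1)
    targets = targets.flatten(1).float()
    tp = (probs * targets).sum(dim=1)          # true positives
    fn = ((1 - probs) * targets).sum(dim=1)    # false negatives
    fp = (probs * (1 - targets)).sum(dim=1)    # false positives
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return ((1 - tversky) ** gamma).mean()
```

A perfect prediction drives the Tversky index to 1 and the loss toward 0; the focal exponent `gamma` down-weights easy examples.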
However, if I use more GPUs, should I reduce the learning rate?
Could you explain why reducing it should help?
Thank you for your answer.

I’ve seen some experiments on large-scale systems where the learning rate was adapted to the batch size, as seen in Training ImageNet in 1 hour.
However, this effect should be much smaller in your setup. It might still be worth lowering it to see if it changes the convergence.
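The rule from that paper is just proportional scaling of the learning rate with the global (summed over all GPUs) batch size; which direction you adjust depends on which run you treat as the tuned baseline. A tiny sketch with made-up base values:

```python
def scaled_lr(base_lr, base_global_batch, new_global_batch):
    """Linear scaling rule: keep the learning rate proportional
    to the global batch size."""
    return base_lr * new_global_batch / base_global_batch

# Hypothetical example: if a base lr of 0.01 was tuned for the 2-GPU
# run with a global batch of 16, a 4-GPU run with global batch 32
# would use twice the learning rate:
print(scaled_lr(0.01, 16, 32))  # 0.02
```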

Yes, I will try that. These days I am realizing how important hands-on experience is for optimization in this field.
Thank you.