I have a question about the allocation of a fixed batch size across different numbers of GPUs.

I set the batch size to 80 and use 4 GPUs to train the model; the loss over the first several steps follows the red line in the image.
However, when I train with 5 GPUs, the loss follows the green line in the image instead.

My GPUs are 40 GB A100s, my torch version is 1.10.0, and my loss function does not depend on other samples.
I have also trained the model for more epochs with different numbers of GPUs, and the gap in loss value and performance grows larger (5 GPUs is better than 4).
I would sincerely appreciate an answer and explanation for this phenomenon. Thanks in advance.

My batch size is fixed at 80 from the beginning to the end of training.

In theory, why would the same batch size produce different results?
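
To make the setup concrete, here is a simplified sketch of how a fixed global batch of 80 is typically split across ranks with DistributedDataParallel and a DistributedSampler (the helper name and loader arguments below are illustrative, not my exact code): with 4 GPUs each rank sees 20 samples per step, with 5 GPUs it sees 16.

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

GLOBAL_BATCH_SIZE = 80

def build_loader(dataset):
    # Illustrative helper: divide the fixed global batch evenly across ranks.
    world_size = dist.get_world_size()                 # 4 or 5 in my runs
    per_gpu_batch = GLOBAL_BATCH_SIZE // world_size    # 20 vs. 16 samples per GPU
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=True)
    return DataLoader(dataset,
                      batch_size=per_gpu_batch,
                      sampler=sampler,
                      num_workers=4,
                      pin_memory=True)
```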

Do you have code we could take a look at? What do you use for model training? Did you use DistributedDataParallel, FSDP, or something else? And what data loader are you using?
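
For reference, a typical DDP training loop looks roughly like the sketch below (names and arguments are illustrative, not your code). Knowing how your setup differs from this, in particular how the per-GPU batch size is computed and whether the sampler drops or pads the last batch, would help narrow this down.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, sampler, optimizer, loss_fn, epochs, device):
    # Wrap the model so gradients are averaged across ranks each step.
    model = DDP(model.to(device), device_ids=[device.index])
    for epoch in range(epochs):
        sampler.set_epoch(epoch)              # reshuffle per epoch and per rank
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                   # DDP all-reduces (averages) gradients here
            optimizer.step()
```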