Comparing Accuracy: Single GPU vs. 8 GPUs

Would PyTorch yield different accuracy when executed on 8 GPUs (distributed mode) compared to running on 1 GPU? Is it expected to observe variations in results? For instance, on the DTD dataset with ViT-B/16, the accuracy on a single GPU is 50.1%, whereas with 8 GPUs it is reported as 54.1%.

That difference seems too large to be expected. Could you share more details of the setup (the batch size in each case, the types of GPUs used, whether quantization or mixed precision is enabled), and whether you are observing this difference in training, validation, or both?
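One common source of such gaps is that the effective batch size grows with the number of processes under data-parallel training, which changes the optimization dynamics unless the learning rate is adjusted. A minimal sketch of the arithmetic (the per-GPU batch size and base learning rate below are hypothetical, not taken from the question):

```python
# Under DDP, each of the N processes sees its own mini-batch, so the
# gradient step effectively averages over N times more samples.
world_size = 8
per_gpu_batch = 32                             # hypothetical per-process batch
effective_batch = per_gpu_batch * world_size   # 256 with 8 GPUs vs. 32 on one

# The linear scaling rule multiplies the learning rate by the world size
# to compensate for the larger effective batch.
base_lr = 1e-3
scaled_lr = base_lr * world_size

print(f"effective batch: {effective_batch}, scaled lr: {scaled_lr}")
```

If the 8-GPU run kept the single-GPU learning rate (or vice versa), the two runs are optimizing under genuinely different hyperparameters, which can easily account for a few points of accuracy. Data shuffling via `DistributedSampler` and per-process random seeds are further, smaller sources of run-to-run variation.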