I have a quick question. I noticed that training on the GPU leads the model to overfit, while training on the CPU gives good results. My data is very small, so I kept the model basic. Interestingly, the CPU gave good results, and then I wanted to run the model on the GPU so training would be faster, but on the GPU I'm seeing dramatically worse val_loss, while on the CPU the results are still reasonably good.
I read somewhere that the CPU accidentally regularizes the model because it updates the gradients more slowly. Is that correct? What are your thoughts?
I did as @Naming-isDifficult mentioned. But as I kept investigating, I discovered that the results of every single training run are different from each other. I thought the randomness would be the same in each run if I set the seeds to a fixed number, but somehow some results are dramatically different. It makes it very annoying to pin down the best model evaluation. Even without changing anything between these two runs, I got these scores.
Note: "testing accuracy" should say "validation accuracy"; it is labeled wrong.
My data goes through image augmentation (a cv2-based compose) selected with random.choice, the data is split into train and val randomly (seed = 42), and the DataLoader shuffles the data when batching it into torch tensors.
Is there a proper way to get the model to give the same results in each run? Did I miss anything in this concept?
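Roughly, the pipeline looks like this (a sketch to make the randomness sources concrete; the transforms, dataset class, and shapes are placeholders, not my exact code):

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, random_split

# Placeholder augmentations standing in for the cv2-based transforms.
def flip(img):
    return np.fliplr(img).copy()

def identity(img):
    return img

AUGMENTATIONS = [flip, identity]

class ImageDataset(Dataset):
    def __init__(self, images, labels, augment=True):
        self.images, self.labels, self.augment = images, labels, augment

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.augment:
            # randomness source 1: Python's `random` picks the augmentation
            img = random.choice(AUGMENTATIONS)(img)
        return torch.from_numpy(img).float(), self.labels[idx]

images = np.random.rand(100, 32, 32).astype(np.float32)
labels = torch.randint(0, 2, (100,))
dataset = ImageDataset(images, labels)

# randomness source 2: the train/val split (seeded with 42)
train_set, val_set = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(42)
)

# randomness source 3: DataLoader shuffling (uses torch's global RNG
# unless a generator is passed explicitly)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
```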
Could you try initializing a model (on CPU or GPU), saving the initialized state of the weights, and running it without shuffling or augmentation, then loading that same initial state onto the second device and repeating the run?
It should come back identical (I think).
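Something along these lines (a sketch only; the small MLP, synthetic data, hyperparameters, and file path are placeholders, swap in your own model and dataset):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_once(device, init_path="init_weights.pt"):
    """Train from a fixed initial state, with no shuffling or augmentation."""
    torch.manual_seed(0)  # fix any remaining torch randomness (e.g. dropout)

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    try:
        # reuse the saved initialization so both devices start identically
        model.load_state_dict(torch.load(init_path))
    except FileNotFoundError:
        torch.save(model.state_dict(), init_path)
    model.to(device)

    # synthetic stand-in data, generated with a fixed seed so it is the
    # same on every call
    g = torch.Generator().manual_seed(123)
    x = torch.randn(256, 16, generator=g)
    y = torch.randint(0, 2, (256,), generator=g)
    loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=False)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(5):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return loss.item()

cpu_loss = train_once("cpu")
gpu_loss = train_once("cuda") if torch.cuda.is_available() else None
print(cpu_loss, gpu_loss)  # compare; tiny float differences can remain on GPU
```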
You might want to see this paper about randomness and results: Hadges, A., & Bellur, S. (2025). Statistical Validity of Neural-Net Benchmarks. IEEE Open Journal of the Computer Society, 6(1), 211-222.