Training loss is different when using different PyTorch versions

I want to upgrade PyTorch from 1.7.1 to 1.11.0.

Before doing that, I want to confirm that 1.7.1 and 1.11.0 produce the same results.

I tested my GPT-2 based network with several PyTorch versions, but the results come out as shown above.

I set the same random seed (0), but the loss graphs are slightly different.

I used the same code for every test, and each PyTorch build uses a CUDA 11.x toolkit (1.7.1 with CUDA 11.0, 1.8.0 and 1.9.0 with CUDA 11.1, 1.11.0 with CUDA 11.3).

The base Docker image for testing is CUDA 11.3 on Ubuntu 18.04.

I tried to find differences between 1.7.1 and 1.11.0, but I could not find anything significant…

Is there any possible solution or advice for solving this problem?

Hi, I am not sure what your network architecture is, but if you are not explicitly initializing the weights of your layers, the default initial weights may differ from version to version. Some other functions may also have been changed or improved between releases.
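
As an illustration, here is a minimal sketch of explicitly re-initializing the built-in layers so their starting weights do not depend on a particular version's defaults. The init choices and the `MyGPT2Model` name are placeholders, not your actual setup:

```python
import torch.nn as nn

def init_weights(module):
    # Explicitly set the starting values of built-in layers so they do not
    # depend on the default reset_parameters() of a particular PyTorch version.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# model = MyGPT2Model(...)   # placeholder for your network
# model.apply(init_weights)  # applies init_weights recursively to every submodule
```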

I used xavier_uniform and xavier_normal in my custom attention layer, and the reset_parameters function of the predefined PyTorch layers, e.g. nn.Linear, nn.LayerNorm, nn.Dropout.

Your answer relates to the reset_parameters function of the built-in torch layers.

Thanks for replying to my question. I will try it!

If anyone else has a solution or advice, please share it as well!

I was just wondering if you have tried the seed control as discussed here.

I only used torch.manual_seed, random.seed, np.random.seed, torch.backends.cudnn.deterministic = True, and torch.backends.cudnn.benchmark = False.

Your link describes additional methods for reproducibility.
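
If I understand the notes correctly, the extra controls would look roughly like this (a sketch based on the public reproducibility docs; the CUBLAS_WORKSPACE_CONFIG variable and the DataLoader worker seeding are taken from there and not yet tested on my model):

```python
import os
import random

import numpy as np
import torch

SEED = 0

# What I already use:
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Additional controls from the reproducibility notes:
# raise an error when a nondeterministic op is used
# (in 1.7.1 the equivalent call is torch.set_deterministic(True)).
torch.use_deterministic_algorithms(True)
# required for deterministic cuBLAS matmuls on CUDA >= 10.2
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Seed DataLoader workers and the shuffle generator as well.
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(SEED)

# loader = torch.utils.data.DataLoader(dataset, batch_size=..., shuffle=True,
#                                      num_workers=4, worker_init_fn=seed_worker,
#                                      generator=g)
```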

Thanks! I will try it!

I have not found a meaningful solution yet…

I guess the CUDA versions bundled with each PyTorch release are different, so the models are trained in a slightly different manner.

I am still looking for more advice or possible solutions…

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

There are numerous factors affecting reproducibility.
What about the quantitative results in terms of accuracy?
You haven’t mentioned them. Could you please share whether they are more or less equal?

Are you saying that even if I train models in the same machine environment (same CPU and GPU), completely reproducible results are still not guaranteed across PyTorch releases?

This is the validation result (validation RMSE).

This is validation loss.

There are two points:

  1. Reproducibility across different PyTorch releases, individual commits, or platforms is not guaranteed, even when you use the same CPU/GPU.

  2. If you run the same code in CPU mode vs. GPU mode, reproducibility is still not guaranteed.

You can check whether the quantitative results (RMSE, accuracy, mAP, F1 score, etc.) are close enough across all the versions.
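
For example, such a check could be as simple as the sketch below; the 1% tolerance and the variable names are arbitrary placeholders:

```python
import math

def rmse(preds, targets):
    # Root-mean-square error over paired prediction/target values.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def close_enough(a, b, rel_tol=0.01):
    # True if two metric values agree within the given relative tolerance.
    return math.isclose(a, b, rel_tol=rel_tol)

# Example: compare the final validation RMSE of the 1.7.1 run vs. the 1.11.0 run.
# close_enough(rmse(preds_171, targets), rmse(preds_1110, targets), rel_tol=0.01)
```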