PyTorch Model Reproducibility Issue: RTX 2080 Ti vs. RTX 4090


Background:
I’m training the same PyTorch model on two different GPU setups and observing significant performance differences: the RTX 4090 setup appears to overfit and fails to match the results achieved on the RTX 2080 Ti.

Environment & Hardware Details

1. RTX 2080 Ti (Original Training Setup)

{
    'CUDA available': True,
    'CUDA_HOME': None,
    'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0',
    'GPU 0': 'NVIDIA GeForce RTX 2080 Ti',
    'LibMTL': 'modify 1.1.6',
    'Numpy': '1.21.5',
    'Platform': 'linux',
    'PyTorch': '1.8.0',
    'Python': '3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0]',
    'TorchVision': '0.9.0'
}

Epoch 5 Results (2080 Ti):

[2022-11-29 12:02:56,707][INFO][recorder.py][85] - Epoch: 5 ----- Mode: train
{'buchwald': {'loss': 0.53, 'metrics': {'R2': 0.715}},
 'time': {'backward_time': 12.35, 'data_time': 1.145, 'forward_time': 8.286}}
----------------------------------------------------------------------------------------------------
[2022-11-29 12:02:59,903][INFO][recorder.py][85] - Epoch: 5 ----- Mode: val
{'buchwald': {'loss': 0.537, 'metrics': {'R2': 0.698}},
 'time': {'data_time': 0.014, 'forward_time': 2.049}}
----------------------------------------------------------------------------------------------------
[2022-11-29 12:03:05,232][INFO][recorder.py][85] - Epoch: 5 ----- Mode: test
{'buchwald': {'loss': 0.502, 'metrics': {'R2': 0.749}},
 'time': {'data_time': 0.025, 'forward_time': 4.171}}

2. RTX 4090 (New Training Setup)

{
    'CUDA available': True,
    'CUDA_HOME': '/usr/local/cuda',
    'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0',
    'GPU 0,1,2,3,4,5': 'NVIDIA GeForce RTX 4090',
    'LibMTL': 'modify 1.1.6',
    'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89',
    'Numpy': '2.0.2',
    'Platform': 'linux',
    'PyTorch': '2.7.1+cu128',
    'Python': '3.9.23 (main, Jun 5 2025, 13:40:20) [GCC 11.2.0]',
    'TorchVision': '0.22.1+cu128'
}

Epoch 5 Results (4090):

[2025-07-17 18:02:06,172][INFO][recorder.py][85] - Epoch: 5 ----- Mode: train
{'buchwald': {'loss': np.float64(0.284), 'metrics': {'R2': 0.924}},
 'time': {'backward_time': 5.906, 'data_time': 0.033, 'forward_time': 10.398}}
----------------------------------------------------------------------------------------------------
[2025-07-17 18:02:08,721][INFO][recorder.py][85] - Epoch: 5 ----- Mode: val
{'buchwald': {'loss': np.float64(0.631), 'metrics': {'R2': 0.593}},
 'time': {'data_time': 0.001, 'forward_time': 2.539}}
----------------------------------------------------------------------------------------------------
[2025-07-17 18:02:14,147][INFO][recorder.py][85] - Epoch: 5 ----- Mode: test
{'buchwald': {'loss': np.float64(0.549), 'metrics': {'R2': 0.71}},
 'time': {'data_time': 0.0, 'forward_time': 5.41}}

Key Observations:

  1. Overfitting on the 4090:
  • The training loss (0.284) is much lower than on the 2080 Ti (0.53), but the validation (0.631 vs. 0.537) and test (0.549 vs. 0.502) losses are worse.
  • The R2 metric drops significantly on validation (0.593 vs. 0.698) and test (0.71 vs. 0.749).
  2. Failure to Reproduce Original Performance:
  • Even after hyperparameter tuning, the model on the 4090 cannot reach the previous best R2 of 0.95 achieved on the 2080 Ti.

Possible Causes & Questions:

  • Numerical Differences: Are there any known non-determinism issues between PyTorch 1.8.0 (2080 Ti) and 2.7.1 (4090)?
  • Hardware Differences: Does the 4090’s architecture (Ada Lovelace vs. Turing) affect training dynamics?
Any insights or suggestions would be greatly appreciated!



The differences between the two setups are quite large: PyTorch 1.8.0 was released in March 2021, while 2.7.1 was released in May 2025. It could help to double-check the outputs using intermediate versions between these releases, and also to measure the run-to-run reproducibility of each setup, to make sure something indeed changed and you are not just measuring random noise.
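The noise-measurement suggestion above can be sketched as a small harness: train a few short runs per setup with different seeds and compare the spread of a metric against the gap observed between the two machines. Here `train_and_eval` is a hypothetical placeholder for the actual training loop, and the stubbed scores exist only to show the interpretation:

```python
import statistics

def train_and_eval(seed: int) -> float:
    """Placeholder: run training with `seed`, return validation R2."""
    raise NotImplementedError("wire this to your actual training loop")

def measure_spread(seeds, runner=train_and_eval):
    """Return (mean, stdev) of the metric over several seeded runs."""
    scores = [runner(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

# Illustrative stub standing in for three real runs on one machine:
stub_scores = {0: 0.698, 1: 0.691, 2: 0.703}
mean_r2, std_r2 = measure_spread([0, 1, 2], runner=stub_scores.__getitem__)

# If the cross-machine gap (e.g. val R2 0.698 vs. 0.593) is many times
# larger than std_r2, the difference is likely real rather than noise.
```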