Strange behavior observed for different GPUs

We implemented a DeepLab V3+ integrated with a gradient reversal layer (GRL for short) and a domain classifier in order to achieve domain adaptation for segmentation tasks. The code is at

We initially trained on a single RTX 3090, and it converged as we wished.


Later we began training the model on a single RTX A6000 with exactly the same environment (same PyTorch version, CUDA toolkit version, and even Python version) as before. But the model did not seem to converge at all.

We did some initial debugging: we first trained the original DeepLab V3+ model, then added the components one by one (the GRL, then some other tricks for further performance optimization), training at each step to see whether it still converged. It turned out that the original DeepLab V3+ converged normally, but the model stopped converging right after we added the GRL, which is implemented using autograd.Function. So we suspect the GRL is the cause, but we have still failed to find a way to solve this.

So could you help us deal with this problem? Many thanks for your help!

P.S. The dataset is not open-sourced for security reasons, so you may have to test on other open-source datasets for transfer learning/domain adaptation, which I believe is not the cause of the different convergence behavior on different GPUs. Sorry.

What did you try so far to isolate the issue? It seems GRL refers to a custom autograd.Function you developed, which appears to cause the divergence?

Yes, GRL is a custom autograd.Function for reversing the gradients.
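For reference, a minimal sketch of what such a gradient reversal layer typically looks like (the names `GradReverse`/`grad_reverse` and the `lambd` scaling factor are assumptions, not the poster's actual code):

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal layer: identity in the forward pass,
    negated (and scaled) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)  # identity

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient; return None for the lambd argument
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

A quick sanity check of the backward behavior on a small tensor (e.g. `x.grad` should equal `-lambd * ones` after `grad_reverse(x, lambd).sum().backward()`) can help rule out a bug in the layer itself before blaming the hardware.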

And by “isolate”, we only mean that we tried removing the GRL and the corresponding domain classifier to test whether the original DeepLab V3+ model converges, and it converged normally. We have also tried some open-source implementations of GRL from GitHub, and the GRL-integrated model still didn’t converge. However, we didn’t try testing our custom GRL in other models, since our GPU resources are limited.

We suspect it might be some compatibility issue between different hardware architectures, since the same code behaves quite differently on the 3090 and the A6000.

I don’t know what “compatibility of different hardware architectures” means; to debug this further we would need more information and, ideally, a minimal and executable code snippet showing the difference in convergence.
As a quick test you could disable TF32 via torch.backends.cudnn.allow_tf32 = False (matmuls are not using TF32 by default, but to make sure it’s also disabled set torch.backends.cuda.matmul.allow_tf32 = False) and check whether this makes a difference.
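For completeness, both flags set together look like this (placed once at the top of the training script, before any CUDA work):

```python
import torch

# Disable TF32 for cuDNN convolutions (enabled by default on Ampere GPUs
# such as the 3090 and A6000)
torch.backends.cudnn.allow_tf32 = False

# Also disable TF32 for matmuls; it is off by default in recent PyTorch
# versions, but setting it explicitly rules it out
torch.backends.cuda.matmul.allow_tf32 = False
```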

OK, I’ll try that. Thank you.

With torch.backends.cudnn.allow_tf32 = False, training on the A6000 still did not converge, and the loss exhibited similar patterns as before.

By the way, we freeze the backbone of our model for the first 50 epochs and then unfreeze it to adjust the parameters of the whole model for the remaining iterations. So a “jump” in the loss around epoch 50 is also expected, just like in the original post.
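For clarity, a freeze/unfreeze schedule like the one described usually toggles requires_grad on the backbone parameters. A minimal sketch, assuming the model exposes a `backbone` attribute (the attribute name and helper are hypothetical, not the poster's actual code):

```python
import torch.nn as nn

class SegModel(nn.Module):
    """Stand-in for a DeepLab-style model with a backbone and a head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)  # placeholder for the real encoder
        self.head = nn.Linear(4, 2)      # placeholder for the decoder/classifier

def set_backbone_trainable(model, trainable):
    # Toggle requires_grad on the backbone only; head params are untouched
    for p in model.backbone.parameters():
        p.requires_grad = trainable

# At each epoch boundary:
#   set_backbone_trainable(model, epoch >= 50)
```

One thing worth checking when using such a schedule: if the optimizer was constructed only over the trainable parameters at epoch 0, the unfrozen backbone parameters at epoch 50 may never receive updates unless the optimizer is rebuilt or was given all parameters up front.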