AMP twice as slow when using a different GPU

I have an NVIDIA RTX A5000 and an NVIDIA Titan RTX card.

When using PyTorch’s native AMP, an epoch takes around 40 minutes on the Titan RTX as opposed to 2 hours on the RTX A5000. Nothing else changes: I run the same script, merely switching `CUDA_VISIBLE_DEVICES` from 0 to 1 to select the other GPU.
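For context, this is a minimal sketch of the kind of AMP training loop I mean (the model, data, and hyperparameters here are placeholders, not my actual script; it falls back to CPU with AMP disabled if no GPU is visible):

```python
import torch
from torch import nn

# Placeholder setup -- my real script uses a larger model and dataset.
device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Linear(32, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(64, 32, device=device)
y = torch.randint(0, 10, (64,), device=device)

for _ in range(3):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass under autocast; backward pass through the gradient scaler.
    with torch.autocast(device_type=device, enabled=use_amp):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The only thing that differs between the two runs is which physical GPU `CUDA_VISIBLE_DEVICES` exposes as device 0.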

This seems very strange to me. What am I missing?

Could you post a minimal, executable code snippet as well as the output of `python -m torch.utils.collect_env`, please?