I’m using the same PyTorch build (with CUDA 12.7) on both GPUs.
I’m training the same model on both the 3090 and the 5090 — there are no problems on the 3090, but on the 5090, the performance metrics suddenly drop significantly.
I suspect the cause is the combination of a small value range and AMP.
When I use a large value range (around 200 to 3000), AMP doesn’t cause any issues. However, when I normalize the values to a range of -1 to 1 or 0 to 1, the train_loss remains stable during training, but the test_loss suddenly spikes.
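For context, here is roughly how my AMP loop is set up (a minimal, self-contained sketch with a toy model and random data; my actual model, data, and loss are different):

```python
import torch
import torch.nn as nn
from torch.amp import autocast, GradScaler

# Toy stand-ins for my real model/data (placeholders, not my actual code)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

scaler = GradScaler("cuda")

for step in range(100):
    # Inputs/targets normalized to [0, 1] -- the range that triggers the problem
    inputs = torch.rand(32, 64, device="cuda")
    targets = torch.rand(32, 1, device="cuda")

    optimizer.zero_grad(set_to_none=True)

    # float16 autocast: with values in [0, 1], losses and gradients sit near
    # the bottom of fp16's dynamic range and can underflow
    with autocast("cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # GradScaler rescales gradients to avoid underflow in backward,
    # but it does not protect small activations in the forward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```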
If this also happened on the 3090, I’d assume it was a bug in my code. But since it only happens on the 5090, and disappears when I use a large value range, I think the small value range combined with AMP is the cause.
I haven’t tested training without AMP yet because I can’t fit the model into VRAM without it.
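One thing I’m considering as a test: since bfloat16 has the same memory footprint as float16 but a much wider dynamic range (fp32’s 8-bit exponent), swapping the autocast dtype would let me keep the VRAM savings while ruling out fp16 underflow. A minimal sketch, reusing the names (model, inputs, targets, criterion, optimizer) from the toy loop above:

```python
# bfloat16 costs the same memory as float16 but keeps fp32's exponent range,
# so values normalized to [-1, 1] or [0, 1] are far from underflow.
# GradScaler is unnecessary with bf16 because gradients don't underflow.
with autocast("cuda", dtype=torch.bfloat16):
    outputs = model(inputs)
    loss = criterion(outputs, targets)

loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```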
How can I resolve this issue?