Potential gradient precision problem for RTX 2080Ti?

I am experiencing strange loss behavior after migrating my code to an RTX 2080 Ti machine from a Titan Xp machine.

I am getting large loss fluctuations once the loss becomes small. I suspect it's a precision problem, but I didn't get any underflow errors during training.
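As a side note, float32 can lose precision silently without any underflow error being raised: an update much smaller than the parameter's magnitude is simply rounded away. A standalone sketch (plain NumPy, not your training code) illustrating this:

```python
import numpy as np

# float32 carries ~7 decimal digits; an update far below the parameter's
# magnitude is silently rounded away, with no error or warning.
w32 = np.float32(1.0)
tiny = np.float32(1e-8)       # e.g. lr * grad with lr=1e-4 and grad=1e-4
print(w32 + tiny == w32)      # True: the update vanishes without any error

# the same update is representable in float64
w64 = np.float64(1.0)
print(w64 + 1e-8 == w64)      # False: float64 keeps the update
```

That said, this effect should be identical on both cards as long as both runs use float32, so it only explains the fluctuation if the two setups compute in different precisions somewhere.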

I have read some posts about GPUs that are defective from the factory, but I have tried identical code on 4 separate GPUs, and they all show similar problems.

The loss is like this:

The four colors represent different RTX 2080 Ti GPUs.

However, the validation loss looks fine; I can't upload that since I'm a new user :frowning:
I am using PyTorch 1.2 and CUDA 10.0 from Anaconda3 on an Ubuntu 18.04 machine.
Any comments are welcome, as this has bothered me for quite some time.

I was using SGD with LR 1e-4 and momentum 0.5.
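For reference, with those settings each step applies the classic momentum update, as documented for `torch.optim.SGD` (dampening 0, no weight decay). A plain-NumPy sketch of the update rule, not the actual training loop:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.5):
    """One SGD-with-momentum step: v <- momentum*v + grad; w <- w - lr*v."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

w = np.float32(0.5)
v = np.float32(0.0)
w, v = sgd_momentum_step(w, np.float32(0.2), v)
print(w, v)  # v = 0.2, w = 0.5 - 1e-4 * 0.2 = 0.49998
```

Note that with LR 1e-4 the per-step weight change is tiny relative to the weights, which is exactly the regime where float32 rounding in the update becomes visible.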

Could you post the loss on your Titan XP?
Are there any other differences in the CUDA, cudnn, or PyTorch versions between your Titan Xp and RTX 2080 Ti runs?
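Something like this, run on both machines, would make the comparison easy (assuming a CUDA build of PyTorch; `torch.backends.cudnn.version()` returns `None` on CPU-only builds):

```python
import torch

# Library versions that can change numerical behavior between machines
print("PyTorch:", torch.__version__)
print("CUDA   :", torch.version.cuda)
print("cudnn  :", torch.backends.cudnn.version())
```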

It's something like this.

As you can see, it's much smoother at small losses and converges much better.
I only have one Titan Xp GPU, though.

I used CUDA 9.0 with cudnn 7501 and PyTorch 1.1.

The RTX 2080 Ti used cudnn 7602, according to torch.backends.cudnn.version().

Do you think it's a version problem? I suppose I wouldn't be able to downgrade to CUDA 9.2 for the RTX 2080 Ti.

Yeah, you shouldn't use CUDA versions older than 10.0 for your RTX 2080 Ti, but could you rerun the script on your Titan Xp using CUDA 10.0, cudnn 7.6.2, and PyTorch 1.1?
It would give us an idea of where the difference might come from.
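To rule out cudnn simply picking different (non-deterministic) algorithms on the two cards, you could also try forcing deterministic kernels for a few runs. This is just a config fragment to drop in before training; expect a performance hit:

```python
import torch

# Disable cudnn autotuning and force deterministic algorithms, so every
# run (and every GPU) uses the same kernels.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```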

I haven’t been able to do as many steps as the blue curve, but here’s the result:

It already seems a lot less jumpy than the 2080 Ti when the loss gets small, though it's rougher than the blue curve.

I think that suggests there's something wrong with the 2080 Ti?

I have also noticed that one of the GPUs runs significantly hotter than the others. I don't know if it's related.


I'm not sure at this point how to interpret the loss curves.
The blue (good) curve looks like the x-axis is sampled differently.
Especially at the beginning, it looks like the step size is not constant and you have some interpolated lines.
Are you using the same step size (e.g. wall time, iterations)?

Also, does the y-axis have the same range in all plots? If the blue line had some really high loss at the beginning, the rest might look smoother as it’s just smaller.

The blue one does have a lot more steps; sorry for not including the step count.

But I think it's also clear that the RTX 2080 Ti runs have a lot more fluctuation than the Titan Xp.
I took one of the runs from the RTX 2080 Ti and removed the smoothing for comparison.

And these are the Titan Xp results without smoothing, with identical versions and data.
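For anyone comparing the plots: TensorBoard-style smoothing is (roughly) an exponential moving average, which can hide fairly large raw fluctuations, so comparisons are only fair with smoothing removed on both curves. A minimal sketch of that kind of smoothing:

```python
def ema_smooth(values, weight=0.6):
    """Exponential moving average, like the TensorBoard smoothing slider."""
    smoothed = []
    last = values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

raw = [1.0, 0.2, 1.2, 0.1, 1.1, 0.0]   # noisy losses
print(ema_smooth(raw))                  # much flatter than the raw curve
```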