Unstable training on P100

Hi everyone,

I have been seeing unstable training of a large network on a P100. I managed to train this network without problems a couple of months ago but, after updating the NVIDIA drivers to v387.26, training in PyTorch has become unstable: the losses go down much more slowly than before and the validation loss shows large oscillations.

Around that time I also updated PyTorch to 0.3.0, so at first I thought it was related to the PyTorch version and some subtle API change. However, after many tests, I think I have isolated the problem to the change in the NVIDIA driver version, because training (with PyTorch 0.3 and the very same code) is smooth on a Titan X with NVIDIA drivers 367.48.

Here are the details of the system; the results are independent of the CUDA version (it also happens with CUDA 9.0):

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                  Off |
| N/A   30C    P0    24W / 250W |     29MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Has anyone seen this kind of instability with this specific combination of driver and PyTorch versions?

Thanks!

I’ve had similar problems: no trouble training for a few months, then I updated to 0.3.0 and started training some new models, which went wildly and unexpectedly unstable; the slight changes to my code made it hard to pin the problem on anything in particular. It disappeared for a while as I was training smaller models, but I’ve been running a ResNeXt-50 on ImageNet and it has had crazy, unexpected swings (e.g. randomly jumping from 20% top-1 validation error to 60%, then slowly going back down, then randomly jumping up to 99%). I’m downgrading to 0.2.0 and running it again, and will report back in a few days on whether it goes unstable or not.

Yeah, I just finished a run after switching back to 0.2.0, and it was completely stable, whereas the run on 0.3.0 with identical code was wildly unstable. Not sure what to make of it; I should probably try running master at some point.

After some days I finally found the time to run the very same code on v0.2.0, and I don’t see these instabilities there. This happens not only on the P100 but also on a Titan X, where training the same network on v0.3.0 and v0.2.0 gives completely different losses and performance (with 0.3.0 being much worse). I will run some tests to see if I can isolate the problem. A priori I don’t know what could be going on, because my networks are pretty simple ones using conv2d, batch norm and ReLU layers.
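For reference, by “pretty simple” I mean plain stacks of convolution + batch norm + ReLU. A minimal illustrative block (not my exact architecture, just the kind of building block involved):

import torch.nn as nn

# Illustrative conv -> batchnorm -> relu block, the kind of simple layer
# stack described above (not the exact network from this thread).
def conv_bn_relu(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )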

Hey, did you find the root cause of this issue?
I have a similar problem after updating PyTorch from 0.3.0 to 0.4.1.
The training becomes unstable on the P100.
Do you recall the reason for this?

thanks

I never found the cause of this issue, but I did find that increasing the value of eps in the BatchNorm2d layers made the convergence smooth again. My guess is that it is related to some precision problem when computing the batch mean and variance, which produces a somewhat erratic behavior of the BatchNorm2d layer.
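For anyone who wants to try the same workaround, this is just the eps argument of BatchNorm2d (default 1e-5); the value below is only an example and would need tuning for your model:

import torch.nn as nn

# Default eps is 1e-5; raising it (the value here is just an example) makes the
# normalization less sensitive to small or imprecisely computed variances.
bn = nn.BatchNorm2d(64, eps=1e-3)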

OK, thanks!
I’ll give this a try.

Were you using cuDNN by any chance?

Yes, I was. Maybe the non-deterministic calculations are the problem, but I don’t know why this only happened on the P100.
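One check I could try is to force cuDNN onto deterministic algorithms and see whether the behavior changes; as far as I know that is just a couple of flags (a quick sketch, not something I have verified yet):

import torch

# Ask cuDNN to pick deterministic algorithms and disable the benchmark
# autotuner; this can be slower but removes one source of run-to-run variation.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False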

I was asking because you mentioned that changing the BN eps parameter helped. PyTorch binaries of different versions ship with different cuDNN versions (later PyTorch releases ship with newer cuDNN, of course), so it might be cuDNN related as well. If memory is not too big an issue, you may want to run with cuDNN disabled and see if that helps.
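For reference, disabling cuDNN globally should just be a matter of one flag (a minimal sketch; the slowdown and extra memory use depend on your model):

import torch

# Fall back to PyTorch's native kernels instead of cuDNN; usually slower and
# more memory hungry, but a useful A/B test for cuDNN-related problems.
torch.backends.cudnn.enabled = False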

I will try and check. Thanks.