Simple test for mixed precision on RTX 2070?


(Eric Perbos-Brinck) #1

Greetings,

I’m currently on the Fast.ai MOOC, using their fastai V1 library built on top of PyTorch 1.0.
It works like a charm on a 1080 Ti + Ryzen 1700X, on Ubuntu 16.04 with Nvidia driver 410.73.

With the new generation of Nvidia RTX cards offering Tensor Cores, and the possibility of FP16 training via mixed precision, I got hold of an RTX 2070.
When I try to run my usual Jupyter notebook in mixed precision on the RTX 2070, it crashes the kernel, without a specific error message to track down the issue, just “The kernel appears to have died. It will restart automatically.”

So I thought the first step would be to go down one level to pure PyTorch code and run a “basic test” that would check whether/how it triggers the Tensor Cores and FP16 training.

Is that possible, and if so, how should I proceed?

Best regards,

EricPB


#2

If your Jupyter Notebook kernel just dies, you could try to download your notebook as a Python script (.py) and run it in a terminal. This will usually yield a better error message.
Note that CUDA operations are executed asynchronously, so you might need to run your script with:

CUDA_LAUNCH_BLOCKING=1 python script.py args
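
As a minimal pure-PyTorch FP16 sanity check (just a sketch; the sizes and iteration count are arbitrary), you could time a half-precision matmul on the GPU:

import time
import torch

# Half-precision tensors on the GPU; dimensions that are multiples of 8
# are what the Tensor Cores prefer.
a = torch.randn(4096, 4096, device='cuda', dtype=torch.half)
b = torch.randn(4096, 4096, device='cuda', dtype=torch.half)

torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    c = torch.matmul(a, b)
torch.cuda.synchronize()
print('100 FP16 matmuls took {:.3f}s'.format(time.time() - start))

If this runs without crashing, the FP16 kernels work at all; repeating it with dtype=torch.float gives a rough idea of the speedup.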

(Eric Perbos-Brinck) #3

Thank you @ptrblck for your fast and explicit reply!

I will try your tip asap and report back.

BR,

EPB


(Eric Perbos-Brinck) #4

I ran my notebook as a Python script, and right at the start of the epoch the error message is:

Floating point exception (core dumped)


#5

Thanks for the information!
Could you try to get a backtrace (see the commands below)?
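
A typical way, assuming gdb is installed (replace script.py and its arguments with your own):

gdb --args python script.py args
(gdb) run
(gdb) backtrace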

As far as I’ve understood your question, the script runs fine without FP16?


(Eric Perbos-Brinck) #6

Yes, the script works fine without FP16.

I got this traceback in two screenshots:

[first screenshot of the traceback]


(Eric Perbos-Brinck) #7

[second screenshot of the traceback]
#8

Thanks for the backtrace.
Skimming through it, could it be that you are feeding torch.float data into torch.half layers somewhere?
Could you post your model definition?
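
For illustration only (not your model), this is the kind of mismatch meant here; a hypothetical FP16 layer fed an FP32 tensor:

import torch
import torch.nn as nn

layer = nn.Linear(10, 10).cuda().half()   # weights in torch.half
x = torch.randn(1, 10, device='cuda')     # input still torch.float

# layer(x) would raise a RuntimeError about Half vs Float (exact message varies by version);
# casting the input fixes this example:
out = layer(x.half())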


(Eric Perbos-Brinck) #9

This might take some time to answer, as I’m using a CIFAR-10 script with a wrn_22() model from the high-level fastai library (like Keras for PyTorch) used in the current MOOC.


(Eric Perbos-Brinck) #10

I think this repo (NVIDIA apex) is dedicated to exploring mixed precision with PyTorch.

I’m running its Word Language Model example and can see a slight performance boost (+15%) with --fp16 on the 2070.


#11

Yes, apex makes sure mixed precision models work fine, i.e. potentially unsafe ops are performed in FP32, while other operations are performed using FP16.
I wanted to post this as the next suggestion, but you were faster. :wink:
Were you able to run your script with apex?
I’m not sure how easy that would be using the fast.ai wrapper.
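
For reference, a rough sketch of the apex amp workflow (based on apex’s documented API; entry points may differ between apex versions, and the model/optimizer here are just placeholders):

import torch
import torch.nn as nn
from apex import amp   # assumes NVIDIA apex is installed

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# "O1" keeps potentially unsafe ops in FP32 and runs the rest in FP16
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

data = torch.randn(32, 1024, device='cuda')
loss = model(data).sum()

# scale_loss applies loss scaling and unscales the gradients on exit
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()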


(Eric Perbos-Brinck) #12

Maybe another tip, via a ticket on the PyTorch GitHub for the same error:

Floating point exception (core dumped)
https://github.com/pytorch/pytorch/issues/9465

The cause is a bug in cuDNN 7.1.4; it didn’t exist in 7.1.2 and was fixed in 7.2.

When I check my current pytorch-nightly package, it’s named “1.0.0.dev20181024-py3.7_cuda9.2.148_cudnn7.1.4_0 pytorch [cuda92]”, i.e. built against cuDNN 7.1.4.
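
For reference, the build versions can also be checked from Python:

import torch

print(torch.__version__)                  # e.g. 1.0.0.dev20181024
print(torch.version.cuda)                 # CUDA version the binary was built with
print(torch.backends.cudnn.version())     # e.g. 7104 for cuDNN 7.1.4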


#13

+1, this works like a charm.
It also implements loss scaling, which I found to be necessary.
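
For context, loss scaling in its simplest (static) form looks roughly like the sketch below; the scale factor, model and data are placeholders, and a full mixed-precision setup would also keep FP32 master weights:

import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda().half()                      # placeholder FP16 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.randn(32, 1024, device='cuda', dtype=torch.half)  # placeholder FP16 input

loss_scale = 128.0                 # placeholder static scale factor
loss = model(data).float().mean()  # do the reduction in FP32

# Scale the loss up so small FP16 gradients don't underflow to zero
(loss * loss_scale).backward()

# Unscale the gradients again before the optimizer step
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.div_(loss_scale)
optimizer.step()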

On small models you won’t see much of an uplift, but on a big ImageNet model like ResNet-18 or ResNet-50 you should see ~2x the performance (at least on a V100).

Also make sure you have the latest cuDNN; they’re up to 7.3.1 now.

Edit: Apex works like a charm