I often run into a problem where PyTorch reports Error 30 after I run train.py or infer.py. I have usually resolved it in one of three ways:
- Reboot the system; the problem usually disappears.
- Reinstall the NVIDIA driver and CUDA if rebooting doesn't resolve the problem.
- Reinstall the system. Sometimes rebooting and reinstalling the driver still don't resolve the problem, and I have to reinstall the whole system. I don't think it's a problem with the NVIDIA driver or CUDA, because both nvidia-smi and deviceQuery work normally after I reboot or reinstall the driver. However, as soon as PyTorch reports the error while running train.py or infer.py, deviceQuery starts reporting Error 30 as well. So I think PyTorch triggers this problem, but I don't know why or how to resolve it; it has bothered me for a long time. Can anyone give me some advice? Thanks!
- NVIDIA 1080 or 1080 Ti
- CUDA 8.0, 9.0, or 10.0
- Ubuntu 16.04 or 18.04
- PyTorch 0.4.0 or 1.0.0
I have encountered this problem in all of the above environments.
Could you update CUDA and PyTorch to the latest versions?
Are you building from source or using the binaries?
I previously updated the NVIDIA driver to the newest version and CUDA from 8.0 to 10.0, but the problem still existed. I haven't updated PyTorch because I wasn't sure whether my code would still work with the latest PyTorch version, but I can try it. Thanks!
I want to understand the relation between PyTorch and CUDA: can PyTorch break the CUDA environment? In the worst case, deviceQuery shows PASS after I reboot the system; then PyTorch reports an unknown error when I run it, and afterwards deviceQuery shows Error 30. Rebooting and reinstalling the driver and CUDA can't resolve the problem, so I have to reinstall the Ubuntu system. If I knew the relation between PyTorch and CUDA, I could back up the files that PyTorch modifies and restore them after the CUDA environment breaks; at least restoring files would be better than reinstalling the system.
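As a first check, the relationship can be probed from Python itself. A minimal sketch, assuming the pip/conda binaries (which bundle their own CUDA runtime and only rely on the system's NVIDIA driver):

```python
import torch

# The binary builds ship their own CUDA runtime; only the NVIDIA
# driver is taken from the system, so a broken toolkit install
# usually does not affect them.
print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA runtime the binary was built with
print(torch.cuda.is_available())  # False usually points at a driver problem
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `torch.cuda.is_available()` flips to `False` at the same moment deviceQuery starts failing, that points at the driver/GPU state rather than any file PyTorch writes.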
I have never encountered the issue that you would have to reinstall anything.
For now I would suggest using docker to avoid reinstalling your complete system, while I ask around whether somebody has seen this issue before.
Thanks a lot!
The problem occurred again just now. I am updating PyTorch to a newer version and will post the test results with the newer version later.
Which Ubuntu version are you using?
I would highly recommend using docker for now. Reinstalling Ubuntu every time after a crash is more than painful.
I have used both Ubuntu 16.04 and 18.04. Reinstalling the system doesn't take much time because Ubuntu runs on KVM. By the way, can PyTorch run stably in a KVM environment? I think there should be no problem, because many PyTorch-based algorithms I use work correctly under KVM virtualization.
Added: I used Anaconda3 to create the PyTorch environment.
I have never used KVM and don't know whether it might cause any issues.
Thank you for your reply. I suspect this problem may be related to memory. I gave only 40GB of memory to the Ubuntu system; as training proceeds step by step, memory usage grows higher until it is nearly exhausted, then training breaks or hangs, and this problem occurs.
By “give 40GB to Ubuntu” I assume you mean you share your host RAM with the KVM guest?
Are you storing large amounts of data in your script?