My local (host) environment is as follows:
- Nvidia driver version: 525.89.02
I run PyTorch inside a Docker container, with the source code mounted into a directory in the container.
My container environment is as follows:
- torch: 2.0.1
- cuda: 11.7
Everything works fine as long as a run finishes and exits with code 0. During such runs, GPU utilization stays at about 99% almost all the time.
But whenever I stop a run (by clicking the PyCharm stop button) or a run stops because of an error, I can't use the GPU for the following runs.
Those runs still fill the GPU memory, but utilization stays at 0% the whole time, and training takes about as long as it would on the CPU.
I tried the commands below, following another forum thread, but they didn't help:
$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
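In case it helps with diagnosing this, here is a minimal sketch (Python, standard library only) of how one could check whether a process from the stopped run is still holding the GPU. It assumes `nvidia-smi` is on the PATH and is run on the host; inside a container the process list may be empty or incomplete. `parse_compute_apps` and `list_gpu_processes` are just illustrative helper names, not part of any library.

```python
import subprocess

# Ask nvidia-smi for the compute processes currently using the GPU.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse CSV rows into (pid, process_name, used_memory) tuples.

    used_memory is kept as a string because nvidia-smi may report
    a non-numeric placeholder in some environments.
    """
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0])
        used_memory = parts[-1]
        name = ",".join(parts[1:-1])  # tolerate commas in the name
        apps.append((pid, name, used_memory))
    return apps


def list_gpu_processes():
    """Run nvidia-smi and return the parsed list of GPU processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    try:
        processes = list_gpu_processes()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
    else:
        if not processes:
            print("no compute processes reported")
        for pid, name, used_memory in processes:
            print(f"pid={pid} name={name} used_memory={used_memory}")
```

If a PID from the aborted run still shows up here after the run was stopped, that would suggest the old process was never fully terminated; if nothing shows up and utilization is still 0%, the cause is probably elsewhere.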
Is there any other way to solve a problem like this?
For now I keep rebooting my PC whenever the symptom occurs…