Torch GPU utilization stays at 0% if the previous run failed with an error or was stopped

ken.dbc.public · November 8, 2023, 8:29am

My local environment is as follows:

Ubuntu20.04
Nvidia driver version: 525.89.02

I run torch inside a Docker container, with the source code mounted into a directory in the container.
The container environment is as follows:

Ubuntu20.04
torch: 2.0.1
cuda: 11.7

Everything works fine when a run finishes and exits with code 0; in that case GPU utilization stays at about 99% almost the whole time.
But whenever I stop a run (by clicking the PyCharm stop button) or a run stops because of an error, I can’t use the GPU in subsequent runs.
Those runs fill the GPU memory completely, but utilization stays at 0% the whole time, and training takes about as long as training on the CPU.

I tried the commands below, following another forum thread, but they didn’t work.
$ sudo rmmod nvidia_uvm
$ sudo modprobe nvidia_uvm
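
One thing that may be worth checking before reloading the module is whether a process from the stopped run is still holding the GPU. The sketch below is only a diagnostic aid, not a confirmed fix: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_compute_apps` and `list_gpu_processes` are hypothetical, made up for this illustration.

```python
import csv
import io
import subprocess

# Fields requested from `nvidia-smi --query-compute-apps`.
QUERY = "pid,process_name,used_memory"


def parse_compute_apps(text):
    """Parse CSV rows of `pid, process_name, used_memory` into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 3:
            continue  # skip blank or malformed lines
        pid, name, mem = (field.strip() for field in rec)
        if not pid.isdigit():
            continue  # skip anything that is not a process row
        # Memory is kept as a string because nvidia-smi may report "[N/A]".
        rows.append({"pid": int(pid), "name": name, "used_mib": mem})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the processes it reports on the GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--query-compute-apps={QUERY}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` on the host may be more informative than calling it inside the container, since processes from other PID namespaces may not be listed there. If a PID from a stopped run still appears, ending that process before the next run might be enough, without a reboot.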

Is there any other way to solve a problem like this?
For now I keep rebooting my PC whenever the symptom occurs…