[resolved] Old matplotlib (2.2.3) somehow prevents PyTorch from recognizing the CUDA device (or driver?)

TL;DR: After upgrading matplotlib to 3.0.3, the problem went away.

I want to use a GPU memory reporting function mem(msg) built with @contextmanager, which would look like the code below.

Is there a proper way to do this? (Something like using with memreport(): to monitor GPU memory while the model runs.)

from contextlib import contextmanager

import torch


@contextmanager
def mem(msg):
    # with torch.cuda.device(0):
    b4 = torch.cuda.memory_cached()  # cached (reserved) bytes before the block runs
    yield
    freed = (b4 - torch.cuda.memory_cached()) / 1e9  # positive if the cache shrank
    print(msg, freed, 'GB freed')
    print(msg, torch.cuda.memory_allocated() / 1e9, 'GB currently allocated')

...

class Model(nn.Module):
    ...
    def forward(self, input):
        with mem('during the forward'):
            torch.cuda.empty_cache()

# expects the freed cache and currently allocated GPU memory to be printed out
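For reference, a minimal standalone script exercising the same idea outside a model would look roughly like the sketch below (assuming a working CUDA setup; mem() is the context manager defined above, which lives in my model.py):

import torch

from model import mem   # the context manager defined above

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')   # allocate something on the GPU
    with mem('after freeing x'):
        del x
        torch.cuda.empty_cache()   # release cached blocks back to the driver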

but it results in an error like this:

AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from

Detailed stack trace:

Traceback (most recent call last):
  File "train.py", line 462, in <module>
    main(args)
  File "train.py", line 149, in main
    batched_features, _ = encoder(batched_images) #[b, num_img, 2048]
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/workhere/Focalizer/model.py", line 155, in forward
    with mem('EncoderStory forward'):
  File "/root/anaconda3/envs/dl/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/workhere/Focalizer/model.py", line 33, in mem
    b4 = torch.cuda.memory_cached()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 426, in memory_cached
    device = _get_device_index(device, optional=True)
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/_utils.py", line 28, in _get_device_index
    return torch.cuda.current_device()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 341, in current_device
    _lazy_init()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")

It looks like PyTorch doesn't see the GPU devices inside the with statement I defined.

PyTorch v1.0.1 | Ubuntu 16.04 | CUDA 9.1.85

EDIT: in pdb, the situation is exactly the same.

It looks like I cannot call torch.cuda.memory_allocated() or similar monitoring functions while debugging with python -m pdb train.py. It shows exactly the same AssertionError as reported above.

I tried with torch.cuda.device(0): torch.cuda.memory_allocated() in the pdb shell, but it also fails with the following message:

*** RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /opt/conda/conda-bld/pytorch_1549628766161/work/torch/csrc/cuda/Module.cpp:53

Gotcha!

A Python decorator (e.g. @timeit) applied to def forward() was causing the problem. I think anything involving a context manager could mess up the CUDA context (is that the right term?) so that it produces the error reported above. How could this happen!

Is this because I called something like torch.cuda.somefunction() inside the decorator routine?
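For context, the decorator I mean is roughly like the sketch below (a hypothetical reconstruction, not my exact code); the torch.cuda.synchronize() calls are the kind of torch.cuda.somefunction() usage I'm asking about:

import functools
import time

import torch


def timeit(fn):
    # Hypothetical sketch: print the wall-clock time of the wrapped function.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Synchronizing makes GPU timings meaningful, but it also touches
        # the CUDA context from inside the decorator.
        torch.cuda.synchronize()
        start = time.time()
        result = fn(*args, **kwargs)
        torch.cuda.synchronize()
        print(fn.__name__, 'took', time.time() - start, 'seconds')
        return result
    return wrapper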

Something goofy is still happening at the code level, not the hardware level. Please help!

Does this have something to do with nvidia-docker?

Things I tried:

rm ~/.nv and then reboot

tested another codebase: it runs fine on the GPU

removed all the @timeit and @contextmanager things

Hi,

I would first check whether nvidia-smi works. Do you see your GPUs there?
If nvidia-smi fails to connect to the nvidia driver as well, then the "simplest" fix on Ubuntu-like machines is to completely uninstall the nvidia drivers and then reinstall them (whichever way you installed them in the first place, or google how to do it).

If nvidia-smi works fine, you can start a python shell and do

import torch
torch.cuda.is_available()
torch.rand(10, device="cuda")

Does it work? Does it return True and a CUDA tensor?
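(On a healthy setup that looks roughly like this, with the actual random values in place of the ellipsis and possibly a different device index:)

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.rand(10, device="cuda")
tensor([...], device='cuda:0')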


Yeah, I checked those already. Resetting the GPU didn't make it work, but downgrading PyTorch with conda install pytorch=0.4.1=py36_cuda9.0.176_cudnn7.1.2_1 -c pytorch just solved the problem. I mean… 1.0.1 (pytorch=1.0.1=py3.6_cuda9.0.176_cudnn7.4.2_2 -c pytorch) also worked great before I tried the things above (the @contextmanager stuff).

I'm happy the problems seem to be gone, but I don't really understand what the problem actually was or what solved it…

I'm using nvidia-docker with the driver configuration below. Does this kind of thing happen often? I'm not getting it at all.

#image I used
$docker images 
nvidia/cuda   9.1-base   ab8ac75abb13    6 months ago   133MB

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0 Off |                  N/A |
| 23%   28C    P8    16W / 250W |     22MiB / 12194MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 23%   23C    P8     8W / 250W |      2MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

$nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

If you noticed, my nvcc version is 9.1.85, but a PyTorch build from -c pytorch against that same CUDA version (like 0.4.0 with 9.1.85) didn't work at all. Things are getting ridiculous for me.
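For reference, the CUDA and cuDNN versions a given PyTorch build actually ships with can be checked from Python, independently of the system nvcc (just a quick sanity check, nothing nvidia-docker specific):

import torch

print(torch.__version__)                # installed PyTorch version, e.g. 1.0.1
print(torch.version.cuda)               # CUDA toolkit version PyTorch was built with
print(torch.backends.cudnn.version())   # bundled cuDNN version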

I don’t have any experience with nvidia-docker :confused:
Maybe @ngimel will have a better idea of what’s happening here?

My problem turns out to be related to the matplotlib.pyplot import. It doesn't always prevent the CUDA driver from finding the GPUs, but in some state it does. I'm trying to reproduce it (once I can reproduce it, I will report it for sure).
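The kind of minimal reproduction I'm aiming for is roughly the sketch below (assumptions: matplotlib 2.2.3 from conda-forge, pytorch 1.0.1 from -c pytorch, and the Agg backend since there is no display inside the container):

# Does importing matplotlib.pyplot before touching CUDA change what PyTorch sees?
import matplotlib
matplotlib.use('Agg')             # no display inside the container (assumption)
import matplotlib.pyplot as plt   # noqa: F401  -- the import itself is the point

import torch

print(torch.cuda.is_available())
print(torch.rand(3, device='cuda'))   # on the broken setup, CUDA calls were failing
                                      # with the "Found no NVIDIA driver" error above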

Thank you so much for taking care of my issue post.

It was a problem with old matplotlib 2.2.3 (conda-forge) combined with pytorch 1.0.1-cuda_9.0.176 (conda -c pytorch). Upgrading matplotlib to 3.0.3 resolved the problem.