TL;DR: After upgrading matplotlib to 3.0.3, the problem went away.
I want to write a GPU memory reporting function `mem(msg)` using `@contextmanager`, which would look like the code below. Is there a proper way to do this (something like using `with memreport():` to monitor the running GPU)?
```python
import torch
from torch import nn
from contextlib import contextmanager

@contextmanager
def mem(msg):
    # with torch.cuda.device(0):
    b4 = torch.cuda.memory_cached()  # bytes held by the caching allocator before the block
    yield
    # b4 - after, so a positive value means cache was actually freed
    freed = (b4 - torch.cuda.memory_cached()) / 1e9
    print(msg, freed, 'GiB freed')
    print(msg, torch.cuda.memory_allocated() / 1e9, 'GiB currently allocated')
...
class Model(nn.Module):
    ...
    def forward(self, input):
        with mem('during the forward'):
            torch.cuda.empty_cache()
        # expects freed cache and currently allocated GPU mem to be printed out
```
but it results in an error like this:

```
AssertionError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from
```

Detailed stack trace:
```
Traceback (most recent call last):
  File "train.py", line 462, in <module>
    main(args)
  File "train.py", line 149, in main
    batched_features, _ = encoder(batched_images) #[b, num_img, 2048]
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/workhere/Focalizer/model.py", line 155, in forward
    with mem('EncoderStory forward'):
  File "/root/anaconda3/envs/dl/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/workhere/Focalizer/model.py", line 33, in mem
    b4 = torch.cuda.memory_cached()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 426, in memory_cached
    device = _get_device_index(device, optional=True)
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/_utils.py", line 28, in _get_device_index
    return torch.cuda.current_device()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 341, in current_device
    _lazy_init()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/root/anaconda3/envs/dl/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
```
It looks like the CUDA calls can't see any GPU device inside the `with` statement I defined.
PyTorch v1.0.1 | Ubuntu 16.04 | CUDA 9.1.875
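In the meantime, a guarded variant of the context manager could at least degrade gracefully when no driver is visible. A minimal sketch (the `mem_safe` name and the `torch.cuda.is_available()` guard are my additions, not part of the failing code above):

```python
import torch
from contextlib import contextmanager

@contextmanager
def mem_safe(msg):
    # Skip the CUDA queries entirely when no driver/device is visible;
    # calling them would go through _lazy_init() and raise the AssertionError.
    if not torch.cuda.is_available():
        print(msg, '- no CUDA device visible, skipping memory report')
        yield
        return
    b4 = torch.cuda.memory_cached()
    yield
    freed = (b4 - torch.cuda.memory_cached()) / 1e9
    print(msg, freed, 'GiB freed')
    print(msg, torch.cuda.memory_allocated() / 1e9, 'GiB currently allocated')
```

This only hides the symptom, of course; it doesn't explain why the driver is invisible inside the forward pass.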
EDIT: in pdb, the situation is exactly the same. It looks like I cannot call torch.cuda.memory_allocated() or similar monitoring functions while debugging with `python -m pdb train.py`; it fails with exactly the same AssertionError as reported above.
I tried `with torch.cuda.device(0): torch.cuda.memory_allocated()` in the pdb shell, but it also fails, with the following message:
```
*** RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /opt/conda/conda-bld/pytorch_1549628766161/work/torch/csrc/cuda/Module.cpp:53
```
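Since the trace only fails inside `_lazy_init()`, one more thing to try in the pdb shell is forcing CUDA initialization up front before querying the memory stats. A sketch, assuming the driver really is reachable from the debugged process (in my case it apparently is not):

```python
import torch

# PyTorch initializes its CUDA state lazily; force it here so the
# memory queries below don't have to trigger _lazy_init() themselves.
torch.cuda.init()  # or: torch.zeros(1, device='cuda')

print(torch.cuda.current_device())                           # device the stats refer to
print(torch.cuda.memory_allocated() / 1e9, 'GiB allocated')
print(torch.cuda.memory_cached() / 1e9, 'GiB cached')
```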