Error when profiling with utils.bottleneck

kowshik_thopalli · March 8, 2019, 6:57am

Hi,
I was trying to profile my code following the instructions at https://pytorch.org/docs/stable/bottleneck.html

In general, the code runs without errors with CUDA on a single GPU.

However, whenever I try to profile with the command
python -m torch.utils.bottleneck /path/to/source/script.py [args]

I get the following error:

Traceback (most recent call last):
  File "/home/kowshik/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/kowshik/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/kowshik/anaconda3/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 234, in <module>
    main()
  File "/home/kowshik/anaconda3/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 213, in main
    autograd_prof_cpu, autograd_prof_cuda = run_autograd_prof(code, globs)
  File "/home/kowshik/anaconda3/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 107, in run_autograd_prof
    result.append(run_prof(use_cuda=True))
  File "/home/kowshik/anaconda3/lib/python3.6/site-packages/torch/utils/bottleneck/__main__.py", line 100, in run_prof
    with profiler.profile(use_cuda=use_cuda) as prof:
  File "/home/kowshik/anaconda3/lib/python3.6/site-packages/torch/autograd/profiler.py", line 180, in __enter__
    torch.autograd._enable_profiler(profiler_kind)
**RuntimeError: /opt/conda/conda-bld/pytorch_1544174967633/work/torch/csrc/autograd/profiler.h:72: all CUDA-capable devices are busy or unavailable**

How do I avoid this?

kowshik_thopalli · March 9, 2019, 5:32am

Can someone help me with this, please?

ptrblck · March 9, 2019, 12:15pm

Could this be the issue?

kowshik_thopalli · March 11, 2019, 5:09pm

Thanks @ptrblck.

I have only one GPU. The other GPU is for display purposes and PyTorch doesn't support it.

ptrblck · March 11, 2019, 5:12pm

Thanks for the information!
Let’s try to narrow down the source of the problem.

Are other processes working on the GPUs?
Are you able to create a tensor on all devices?
If so, is nn.DataParallel running successfully?
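
One way to answer the tensor question is a small probe script. This is only a sketch: the function name probe_devices is made up for illustration, and it checks tensor creation only, not nn.DataParallel. It tries to allocate a one-element tensor on every device PyTorch can see and reports which ones fail.

```python
try:
    import torch
except ImportError:
    # Lets the probe report a clear message if PyTorch is missing
    # from the interpreter that is actually being used.
    torch = None


def probe_devices():
    """Try to create a tiny tensor on each visible CUDA device.

    Returns a dict mapping device index -> status string.
    (probe_devices is a hypothetical helper name, not a PyTorch API.)
    """
    results = {}
    if torch is None:
        return results
    for idx in range(torch.cuda.device_count()):
        try:
            name = torch.cuda.get_device_name(idx)
            # .item() copies the value back to the host, which forces the
            # CUDA context to be created and the allocation to happen.
            value = torch.ones(1, device="cuda:{}".format(idx)).item()
            results[idx] = "ok ({}, read back {})".format(name, value)
        except RuntimeError as err:
            results[idx] = "failed: {}".format(err)
    return results


if __name__ == "__main__":
    if torch is None:
        print("PyTorch is not importable from this interpreter")
    for idx, status in sorted(probe_devices().items()):
        print("cuda:{} -> {}".format(idx, status))
```

If a device shows up as failed here, the profiler error is probably not specific to bottleneck.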

kowshik_thopalli · March 11, 2019, 5:27pm

Thanks Peter. I have updated my response with a new image. I have only one GPU.

ptrblck · March 11, 2019, 5:31pm

That’s interesting. Maybe PyTorch tried to create the CUDA context on GPU0, which might fail.
Could you try to run your script from the terminal using:

CUDA_VISIBLE_DEVICES=1 python -m torch.utils.bottleneck script.py args
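
Note that the index given to CUDA_VISIBLE_DEVICES follows CUDA's own enumeration, which by default (fastest device first) does not have to match the numbering in nvidia-smi. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes the two agree. A rough sketch of launching bottleneck with both variables set from Python; the helper name bottleneck_command and the device value "0" are only placeholders:

```python
import os
import subprocess
import sys


def bottleneck_command(script, script_args=(), device="0"):
    """Build the command and environment for a bottleneck run.

    bottleneck_command is a hypothetical helper, not part of PyTorch.
    """
    env = dict(os.environ)
    # Enumerate GPUs by PCI bus ID so indices match nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the chosen GPU to the profiled process.
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    cmd = [sys.executable, "-m", "torch.utils.bottleneck", script]
    cmd.extend(script_args)
    return cmd, env


if __name__ == "__main__":
    cmd, env = bottleneck_command("script.py", device="0")
    print("would run:", " ".join(cmd))
    print("with CUDA_DEVICE_ORDER =", env["CUDA_DEVICE_ORDER"])
    print("with CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
    # Uncomment to actually launch the profiler on an existing script:
    # subprocess.run(cmd, env=env, check=True)
```

The same effect can be had by exporting both variables in the shell before running the command above.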

kowshik_thopalli · March 11, 2019, 5:42pm

CUDA_VISIBLE_DEVICES=1 doesn't work for me. Instead, I have to use CUDA_VISIBLE_DEVICES=0 for it to run on the GPU.
However, now when I run with CUDA_VISIBLE_DEVICES=1 python -m torch.utils.bottleneck script.py args, I get the following error:
RuntimeError: /opt/conda/conda-bld/pytorch_1544174967633/work/torch/csrc/autograd/profiler.h:72: out of memory

ptrblck · March 11, 2019, 5:44pm

Yeah, the order of your GPUs might be different than shown in nvidia-smi.
OK, so at least we now get an OOM error.
Could you create a dummy script with a low memory usage and try to run bottleneck with it just to see if the first issue is solved?
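
Something along these lines should do as a dummy script. It is a minimal sketch with arbitrary sizes (a single small linear layer and a few training steps), so it should need very little GPU memory:

```python
import torch
import torch.nn as nn


def run(steps=5, batch_size=8, features=16):
    """Train a tiny linear model for a few steps and return the last loss."""
    # Falls back to the CPU if no CUDA device is usable.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(features, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss = None
    for _ in range(steps):
        # Random data is enough here; only the profiler run matters.
        inputs = torch.randn(batch_size, features, device=device)
        targets = torch.randn(batch_size, 1, device=device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return loss.item()


if __name__ == "__main__":
    print("final loss:", run())
```

Saved as, say, dummy.py (the file name is arbitrary), it can be profiled with python -m torch.utils.bottleneck dummy.py. If this runs cleanly, the remaining OOM most likely comes from the memory footprint of the original script under the profiler rather than from device selection.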