A PTX JIT compilation failed

Hi everyone, I got this error on a remote computer. Can you please help me solve it?

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1523240155148/work/torch/lib/THC/generic/THCStorage.cu line=58 error=78 : a PTX JIT compilation failed
Traceback (most recent call last):
  File "/home/youcef/.vscode-server/extensions/ms-python.python-2019.10.41019/pythonFiles/ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "/home/youcef/.vscode-server/extensions/ms-python.python-2019.10.41019/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/youcef/.vscode-server/extensions/ms-python.python-2019.10.41019/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/runpy.py", line 252, in run_path
    return _run_module_code(code, init_globals, run_name, path_name)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/runpy.py", line 82, in _run_module_code
    mod_name, mod_fname, mod_loader, pkg_name)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/youcef/hyps/HYMC/HSIGCN_11102019 (ssh)/main.py", line 88, in <module>
    model = model.cuda()
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 216, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 146, in _apply
    module._apply(fn)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 152, in _apply
    param.data = fn(param.data)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 216, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/home/youcef/don/yes/envs/HYmenv2/lib/python2.7/site-packages/torch/cuda/__init__.py", line 387, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: cuda runtime error (78) : a PTX JIT compilation failed at /opt/conda/conda-bld/pytorch_1523240155148/work/torch/lib/THC/generic/THCStorage.cu:58

I ran the same program on my desktop and it completed without any errors. Before posting, I checked the available answers on similar topics; none of them worked.
Thanks in advance

So the PTX JIT compilation only kicks in when the CUDA compute arch of your hardware isn’t covered by the binaries you are running, and the PTX JIT jumps in to bridge that gap.
So the questions would be:

  • What is the compute arch of your hardware? (If you don’t know, Wikipedia has the information for “sales name -> arch”.)
  • What are the compute archs included in your PyTorch?
  • Is something up with the CUDA installation that makes it fail?
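To answer the first question without leaving Python, you can query the device directly. This is a minimal sketch using `torch.cuda.get_device_capability`, which returns the compute capability as a `(major, minor)` tuple (for example `(6, 1)` for a GTX 1080 Ti):

```python
import torch

# Query the compute capability of the first visible GPU, if any.
# get_device_capability(i) returns a (major, minor) tuple, e.g. (6, 1).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("Compute capability: %d.%d" % (major, minor))
else:
    print("No CUDA device visible")
```

If CUDA initialization itself fails here, that already points at the third question (a broken CUDA installation) rather than a missing arch.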

Typically, official PyTorch binaries ship with binaries for all supported archs compiled in, while a self-compiled PyTorch by default only includes the arch of the hardware you compile on.
You can use cuobjdump /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so | grep 'arch' | sort | uniq to check what PyTorch has.
For example, on my GTX1080Ti, a self-compiled PyTorch will have only arch = 6.1 while the 1.2 wheel from the PyTorch site has 30,35,50,60,61,70,75.
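As an alternative to cuobjdump, newer PyTorch releases can report the compiled arch list from Python directly (note: this attribute does not exist in the old 0.4-era build shown in the traceback above, so the `hasattr` guard is deliberate):

```python
import torch

# Newer PyTorch versions expose the archs the binary was compiled for,
# e.g. ['sm_37', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75'].
# Older builds lack this API, in which case cuobjdump is the fallback.
if hasattr(torch.cuda, "get_arch_list"):
    print(torch.cuda.get_arch_list())
else:
    print("get_arch_list() not available; use cuobjdump on libtorch instead")
```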

The easiest remedy is likely to deploy a PyTorch that includes the right arch binaries.

Best regards

Thomas


Thanks for your explanation @tom, it is solved!

Hi, I have a similar error. Could you tell me your solution?


I was just missing some files on the remote computer; that was the problem.