We have a point cloud vision model that fails to run under torch.jit with nvFuser. Unfortunately I am unable to share the model or code publicly, but I am hoping I can get some generic guidance that I can investigate further.
I have tested with both PyTorch 1.12 and 1.13 and the same error message appears in each. Unexpectedly, the difference is that in 1.12 it fails only in the backward pass (the model can run forwards-only), whereas in 1.13 the same error message already appears during the forward pass.
In this case, the ScriptModule is created using torch.jit.script, and forward-pre-hooks are removed beforehand as they are not JIT compatible.
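For context, the hook removal looks roughly like this; the model here is a hypothetical stand-in, since I cannot share the real one:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real point-cloud network.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())

# A forward-pre-hook of the kind that is not TorchScript compatible.
model.register_forward_pre_hook(lambda mod, inp: None)

# Strip all forward-pre-hooks from every submodule before scripting.
for m in model.modules():
    m._forward_pre_hooks.clear()

scripted = torch.jit.script(model)
```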
With PyTorch 1.12 and the default environment settings, torch.jit gives the following warning:
/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-12/lib/python3.10/site-packages/torch/autograd/__init__.py:173: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
(Triggered internally at /opt/conda/conda-bld/pytorch_1659484808560/work/torch/csrc/jit/codegen/cuda/manager.cpp:329.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
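Following the warning's suggestion, I disable the fallback path so the real error surfaces. A minimal sketch (the variable must be set before torch is imported, so that nvFuser sees it at initialization):

```python
import os

# Must be set before `import torch` so the nvFuser codegen picks it up.
os.environ["PYTORCH_NVFUSER_DISABLE"] = "fallback"
```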
Setting PYTORCH_NVFUSER_DISABLE=fallback gives the following detailed traceback:
Traceback (most recent call last):
File "/*****/train*****.py", line 262, in <module>
main(args, runtime_manager)
File "/*****/train*****.py", line 195, in main
scaler.scale(loss).backward()
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-12/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-12/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Vectorized dim has to be from a contiguous inner most position: T46_l[ iblockIdx.y292{T8.size[1]}, sbS221{( ceilDiv(1, gridDim.z) )}, iS229{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(T8.size[0], 4) ), blockDim.x) ), 1) ), gridDim.x) )}, iS293{T8.size[3]}, sbblockIdx.z220{gridDim.z}, iblockIdx.x228{gridDim.x}, ithreadIdx.x225{blockDim.x}_p, iUS227{1}, iV223{4} ] ca_pos( 8 )
In PyTorch 1.13 the no-fallback error message is:
Traceback (most recent call last):
File "/*****/train*****.py", line 263, in <module>
main(args, runtime_manager)
File "/*****/train*****.py", line 189, in main
seg_pred, trans_feat = classifier(*local_module._neighbours(local_module, (points,)))
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-13/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-13/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-13/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/*****/intel/oneapi/intelpython/latest/envs/*****_pytorch1-13/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Vectorized dim has to be from a contiguous inner most position: T36_l[ iblockIdx.y226{T0.size[1]}, bS204{( ceilDiv(1, gridDim.z) )}, iS212{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(i0, 8) ), blockDim.x) ), 1) ), gridDim.x) )}, iS105{i4}, bblockIdx.z203{gridDim.z}, iblockIdx.x211{gridDim.x}, ithreadIdx.x208{blockDim.x}_p, iUS210{1}, iV206{8} ] ca_pos( 8 )
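I am not sure this is relevant, but since the error complains about vectorization on a non-contiguous innermost dimension, one generic thing I have been checking is whether the tensors fed to the scripted module are contiguous. A sketch of that check (a guess on my part, not a confirmed cause):

```python
import torch

# Hypothetical input resembling a point-cloud batch; the transpose makes it
# non-contiguous, which is the kind of layout the fuser error mentions.
points = torch.randn(2, 3, 1024).transpose(1, 2)
assert not points.is_contiguous()

# Force a contiguous innermost dimension before the scripted forward pass.
points = points.contiguous()
```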
Note that running with the fallback allowed, or in forwards-only mode, at best gives no performance improvement over standard PyTorch and at worst hinders it, so it is actually better to run without torch.jit and nvFuser.
This is running on Windows 11 + WSL2 (Ubuntu 20.04 LTS) with an AMD Threadripper CPU and an NVIDIA RTX A6000 GPU. Automatic mixed precision is enabled in these runs, but from memory it doesn't make a difference either way.
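For completeness, the AMP training step follows the standard GradScaler pattern, matching the scaler.scale(loss).backward() line in the traceback. A minimal sketch; model, optimizer, and the tensors are illustrative stand-ins:

```python
import torch
from torch import nn

# Hypothetical stand-ins for the real training objects.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model.to(device)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

points = torch.randn(8, 16, device=device)
target = torch.randn(8, 4, device=device)

# Autocast the forward pass (disabled on CPU so this sketch runs anywhere).
with torch.autocast(device_type=device, enabled=use_cuda):
    loss = nn.functional.mse_loss(model(points), target)

# Same call pattern as the failing line in the traceback above.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```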
I would be grateful for any insights anyone may have.