Manually set cudnn convolution algorithm

From other threads I found that:
> `cudnn.benchmark=True` will try different convolution algorithms for each input shape.

So I believe that PyTorch can set the algorithm for each layer individually.

Is there a possibility to set this manually?
From https://core.ac.uk/download/pdf/224976536.pdf I see that we have GEMM, FFT and Winograd.


No, you cannot set the algorithm manually and the benchmark mode will select the fastest found algo.
What’s your use case for wanting to specify the algorithms manually?
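For context, benchmark mode can be pictured as a per-input-shape cache: the first time a new shape is seen, every candidate algorithm is timed and the winner is stored. A rough pure-Python sketch of that idea (the candidate functions here are stand-ins, not real cuDNN algorithms):

```python
import time

def autotune(candidates):
    """Per input shape, time every candidate once and cache the
    fastest one (a toy model of what cudnn.benchmark=True does)."""
    cache = {}  # shape key -> fastest candidate

    def run(x):
        key = len(x)  # stand-in for a tensor's shape
        if key not in cache:
            timed = []
            for fn in candidates:
                start = time.perf_counter()
                fn(x)
                timed.append((time.perf_counter() - start, fn))
            cache[key] = min(timed, key=lambda t: t[0])[1]
        return cache[key](x)

    return run

# Two stand-in "algorithms" that compute the same result
conv_a = sum
conv_b = lambda x: sum(reversed(x))

run = autotune([conv_a, conv_b])
print(run([1, 2, 3]))  # first call for this shape: benchmarks both, prints 6
print(run([4, 5, 6]))  # same shape: reuses the cached winner, prints 15
```

Note the cache is keyed by shape, which is why a model with many distinct input shapes keeps re-benchmarking.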


Thank you for the quick answer.

My use case is as follows:

I have a model that needs to act on variable input sizes, so benchmark mode cannot settle on a single best option.

I know that the vast majority of cases lie within a small range of input sizes. So I could run the benchmark on the mode of the distribution and thereby get the fastest algorithm for most of the cases.

I would then manually fix the algorithm choice before going to variable-sized inputs.
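In pseudo-Python, the control flow I have in mind looks like this (the candidate functions are hypothetical stand-ins for the cuDNN algorithms; only the tune-once-then-freeze logic matters):

```python
import time

def pick_fastest(candidates, representative_input):
    """Benchmark every candidate on the modal input size and return
    the fastest; that single choice is then frozen for all sizes."""
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        start = time.perf_counter()
        fn(representative_input)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# Hypothetical stand-ins for the GEMM / FFT / Winograd kernels
algos = [sum, lambda x: sum(reversed(x))]

# Tune once on the most common input size...
conv = pick_fastest(algos, list(range(32)))
# ...then reuse the frozen choice on variable-sized inputs.
print(conv([1, 2, 3]))  # prints 6 with either winner
```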

Thanks for this information.
At the moment this would be possible by writing a custom CUDA extension and specifying the algo there.
We are currently working on enabling the cudnnV8 API, so feel free to post a feature request on GitHub for it so that we can discuss it there further.

Would you have an example of how to call cuDNN functions from a custom CUDA extension? I was looking at how `aten::cudnn_convolution` is implemented and I couldn’t find anything under the aten folder.

A general cuDNN extension example can be found here and this post might also be useful.

Thank you for sharing these references. The second link calls the following function:

at::Tensor convolution(
    const at::Tensor& input,
    const at::Tensor& weight,
    const at::Tensor& bias,
    c10::ArrayRef<int64_t> stride,
    c10::ArrayRef<int64_t> padding,
    c10::ArrayRef<int64_t> dilation,
    int64_t groups,
    bool benchmark,
    bool deterministic) {

    return at::cudnn_convolution(
        input,
        weight,
        bias,
        padding,
        stride,
        dilation,
        groups,
        benchmark,
        deterministic); 
}

I am still trying to identify how to select the algorithm used by the cuDNN convolution; none of these arguments seems to do so.

I want to be able to select among (a) GEMM, (b) FFT, and (c) Winograd.

For the algo selection, you would need to go down a layer deeper and call the raw cuDNN functions, as seen here.


Thank you @ptrblck. I am trying to write a wrapper around the cuDNN functions so that I can set the algorithm. So far I have implemented the forward pass, as in this file. It seems to work fine for a given algorithm.

However, when I try to find the fastest algorithm by calling `cudnnFindConvolutionForwardAlgorithm`, about half of the time the program crashes with Segmentation fault (core dumped).

Any tips on how to investigate the cause?


(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ python example.py 
Using /home/eduardoj/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/build.ninja...
Building extension module cudnn_convolution...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cudnn_convolution...
Compiled and Loaded!
Trying all
Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM
	Status: CUDNN_STATUS_SUCCESS
	Time: 0.26624
	Memory: 0
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_GEMM
	Status: CUDNN_STATUS_SUCCESS
	Time: 0.535424
	Memory: 55795728
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
	Status: CUDNN_STATUS_SUCCESS
	Time: 92.4836
	Memory: 1592080
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_DIRECT
	Status: CUDNN_STATUS_NOT_SUPPORTED
	Time: -1
	Memory: 0
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_FFT
	Status: CUDNN_STATUS_NOT_SUPPORTED
	Time: -1
	Memory: 0
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING
	Status: CUDNN_STATUS_NOT_SUPPORTED
	Time: -1
	Memory: 0
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD
	Status: CUDNN_STATUS_NOT_SUPPORTED
	Time: -1
	Memory: 0
	MathType: CUDNN_DEFAULT_MATH

Algorithm: CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED
	Status: CUDNN_STATUS_INTERNAL_ERROR
	Time: -1
	Memory: 92887808
	MathType: CUDNN_DEFAULT_MATH

Allocating Workspace
Workspace size: 0MB
Elapsed Time: 0.278528 ms
Done!
(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ python example.py 
Using /home/eduardoj/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/build.ninja...
Building extension module cudnn_convolution...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cudnn_convolution...
Compiled and Loaded!
Trying all
Segmentation fault (core dumped)
(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ 

Could you check the backtrace of the seg fault and see what exactly is causing it?

Here is the backtrace output.

(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ gdb python
gdb: /home/eduardoj/miniconda3/lib/libtinfo.so.6: no version information available (required by gdb)
gdb: /home/eduardoj/miniconda3/lib/libncursesw.so.6: no version information available (required by gdb)
gdb: /home/eduardoj/miniconda3/lib/libncursesw.so.6: no version information available (required by gdb)
gdb: /home/eduardoj/miniconda3/lib/libncursesw.so.6: no version information available (required by gdb)
GNU gdb (Ubuntu 12.0.90-0ubuntu1) 12.0.90
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(gdb) run example.py
Starting program: /home/eduardoj/miniconda3/bin/python example.py
/bin/bash: /home/eduardoj/miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 8480]
[New Thread 0x7fff17cdd640 (LWP 8486)]
Using /home/eduardoj/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
[Detaching after fork from child process 8487]
[Detaching after fork from child process 8488]
[Detaching after fork from child process 8489]
Detected CUDA files, patching ldflags
Emitting ninja build file /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/build.ninja...
Building extension module cudnn_convolution...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[Detaching after fork from child process 8490]
ninja: no work to do.
Loading extension module cudnn_convolution...
Compiled and Loaded!
[New Thread 0x7fff161fdac0 (LWP 8491)]
[New Thread 0x7fff159fbb40 (LWP 8492)]
[New Thread 0x7fff151f9bc0 (LWP 8493)]
[New Thread 0x7fff149f8640 (LWP 8494)]
[New Thread 0x7fff11fff640 (LWP 8495)]
Trying all

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffd73c7eb1 in cublasGetMathMode () from /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/lib/../../../../libcublas.so.11
(gdb) backtrace
#0  0x00007fffd73c7eb1 in cublasGetMathMode () from /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/lib/../../../../libcublas.so.11
#1  0x00007ffe2c0b9014 in ?? () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#2  0x00007ffe2c0beff3 in cudnn::winograd_nonfused::conv2dForward(cudnnContext*, CUstream_st*, void const*, cudnnTensor4dStruct const*, void const*, cudnnFilter4dStruct const*, void const*, cudnnConvolutionStruct const*, void*, unsigned long, bool, void const*, void const*, void const*, cudnnTensor4dStruct const*, void*, bool) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#3  0x00007ffe2bde96b9 in cudnn::cnn::WinogradNonfusedEngine<true>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#4  0x00007ffe2b9d3e71 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#5  0x00007ffe2bb46b1e in cudnn::cnn::AutoTransformationExecutor::execute_pipeline(cudnn::cnn::EngineInterface&, cudnn::backend::VariantPack const&, CUstream_st*) const () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#6  0x00007ffe2bb46c23 in cudnn::cnn::BatchPartitionExecutor::operator()(cudnn::cnn::EngineInterface&, cudnn::cnn::EngineInterface*, cudnn::backend::VariantPack const&, CUstream_st*) const () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#7  0x00007ffe2bb62466 in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::WinogradNonfusedEngine<true> >::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#8  0x00007ffe2b9d3e71 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#9  0x00007ffe2b9ef9e8 in cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#10 0x00007ffe2bcd3a19 in cudnn::backend::EnginesAlgoMap<cudnnConvolutionFwdAlgo_t, 8>::execute_wrapper(cudnnContext*, cudnnConvolutionFwdAlgo_t, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#11 0x00007ffe2bcc6aa8 in cudnn::backend::convolutionForward(cudnnContext*, void const*, cudnnTensorStruct const*, void const*, cudnnFilterStruct const*, void const*, cudnnConvolutionStruct const*, cudnnConvolutionFwdAlgo_t, void*, unsigned long, bool, void const*, void const*, void const*, cudnnActivationStruct const*, cudnnTensorStruct const*, void*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#12 0x00007ffe2be2d839 in cudnn::cnn::convolutionForward(cudnnContext*, void const*, cudnnTensorStruct const*, void const*, cudnnFilterStruct const*, void const*, cudnnConvolutionStruct const*, cudnnConvolutionFwdAlgo_t, void*, unsigned long, void const*, cudnnTensorStruct const*, void*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#13 0x00007ffe2be3f426 in ?? () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#14 0x00007ffe2be3fe34 in cudnnStatus_t findAlgorithm<find_get_conv_params, cudnnConvolutionStruct, cudnnConvolutionFwdAlgo_t, cudnnConvolutionFwdAlgoPerfStruct, 8, true>(find_get_conv_params, int, int*, cudnnConvolutionFwdAlgoPerfStruct*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#15 0x00007ffe2be2dbc9 in cudnnFindConvolutionForwardAlgorithm () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#16 0x00007fff17480e87 in convolution(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#17 0x00007fff174abce8 in at::Tensor pybind11::detail::argument_loader<int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool>::call_impl<at::Tensor, at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul, pybind11::detail::void_type>(at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul>, pybind11::detail::void_type&&) && () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
--Type <RET> for more, q to quit, c to continue without paging--c
#18 0x00007fff174a7213 in std::enable_if<!std::is_void<at::Tensor>::value, at::Tensor>::type pybind11::detail::argument_loader<int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool>::call<at::Tensor, pybind11::detail::void_type, at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)>(at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool)) && () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#19 0x00007fff174a1bac in pybind11::cpp_function::initialize<at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), at::Tensor, int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [12]>(at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), at::Tensor (*)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [12])::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#20 0x00007fff174a2238 in pybind11::cpp_function::initialize<at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), at::Tensor, int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool, pybind11::name, pybind11::scope, pybind11::sibling, char [12]>(at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), at::Tensor (*)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [12])::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#21 0x00007fff1748faae in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#22 0x00005555556a290c in cfunction_call (func=0x7fff1df514f0, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.9.13/Objects/methodobject.c:543
#23 0x0000555555688fa7 in _PyObject_MakeTpCall (tstate=0x55555590fd50, callable=0x7fff1df514f0, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at /usr/local/src/conda/python-3.9.13/Objects/call.c:191
#24 0x0000555555684d5f in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=<optimized out>, callable=0x7fff1df514f0, tstate=<optimized out>) at /usr/local/src/conda/python-3.9.13/Include/cpython/abstract.h:116
#25 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x55555596a8c8, callable=0x7fff1df514f0, tstate=<optimized out>) at /usr/local/src/conda/python-3.9.13/Include/cpython/abstract.h:103
#26 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x55555596a8c8, callable=0x7fff1df514f0) at /usr/local/src/conda/python-3.9.13/Include/cpython/abstract.h:127
#27 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55555590fd50) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:5077
#28 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:3489
#29 0x000055555567ea17 in _PyEval_EvalFrame (throwflag=0, f=0x55555596a750, tstate=0x55555590fd50) at /usr/local/src/conda/python-3.9.13/Include/internal/pycore_ceval.h:40
#30 _PyEval_EvalCode (tstate=<optimized out>, _co=0x7ffff6dd2870, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>, name=<optimized out>, qualname=<optimized out>) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:4329
#31 0x000055555567e6d7 in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=0x0, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:4361
#32 0x000055555567e689 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:4377
#33 0x0000555555739e3b in PyEval_EvalCode (co=co@entry=0x7ffff6dd2870, globals=globals@entry=0x7ffff6dc7b00, locals=locals@entry=0x7ffff6dc7b00) at /usr/local/src/conda/python-3.9.13/Python/ceval.c:828
#34 0x00005555557684a9 in run_eval_code_obj (tstate=0x55555590fd50, co=0x7ffff6dd2870, globals=0x7ffff6dc7b00, locals=0x7ffff6dc7b00) at /usr/local/src/conda/python-3.9.13/Python/pythonrun.c:1221
#35 0x0000555555764694 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff6dc7b00, locals=0x7ffff6dc7b00, flags=<optimized out>, arena=<optimized out>) at /usr/local/src/conda/python-3.9.13/Python/pythonrun.c:1242
#36 0x00005555555e96d2 in pyrun_file (fp=0x55555590d4f0, filename=0x7ffff6d69eb0, start=<optimized out>, globals=0x7ffff6dc7b00, locals=0x7ffff6dc7b00, closeit=1, flags=0x7fffffffdf38) at /usr/local/src/conda/python-3.9.13/Python/pythonrun.c:1140
#37 0x000055555575e1f2 in pyrun_simple_file (flags=0x7fffffffdf38, closeit=1, filename=0x7ffff6d69eb0, fp=0x55555590d4f0) at /usr/local/src/conda/python-3.9.13/Python/pythonrun.c:450
#38 PyRun_SimpleFileExFlags (fp=0x55555590d4f0, filename=<optimized out>, closeit=1, flags=0x7fffffffdf38) at /usr/local/src/conda/python-3.9.13/Python/pythonrun.c:483
#39 0x000055555575b533 in pymain_run_file (cf=0x7fffffffdf38, config=0x55555590e490) at /usr/local/src/conda/python-3.9.13/Modules/main.c:377
#40 pymain_run_python (exitcode=0x7fffffffdf30) at /usr/local/src/conda/python-3.9.13/Modules/main.c:602
#41 Py_RunMain () at /usr/local/src/conda/python-3.9.13/Modules/main.c:681
#42 0x000055555572db79 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.9.13/Modules/main.c:1101
#43 0x00007ffff7cabd90 in __libc_start_call_main (main=main@entry=0x55555572db30 <main>, argc=argc@entry=2, argv=argv@entry=0x7fffffffe168) at ../sysdeps/nptl/libc_start_call_main.h:58
#44 0x00007ffff7cabe40 in __libc_start_main_impl (main=0x55555572db30 <main>, argc=2, argv=0x7fffffffe168, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe158) at ../csu/libc-start.c:392
#45 0x000055555572da81 in _start ()

Thanks for sharing it. It seems cuBLAS is failing when called through cuDNN.
Could you post executable code to reproduce the issue as well as the output of python -m torch.utils.collect_env?

Thank you for taking a look at this.

Here is the output of python -m torch.utils.collect_env:

(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 LTS (x86_64)
GCC version: (Ubuntu 10.3.0-15ubuntu1) 10.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-37-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 515.48.07
cuDNN version: Probably one of the following:
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.4.1
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.3.1               h2bc3f7f_2  
[conda] libblas                   3.9.0            12_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            12_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            12_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            12_linux64_mkl    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] numpy                     1.22.3           py39he7a7128_0  
[conda] numpy-base                1.22.3           py39hf524024_0  
[conda] pytorch                   1.11.0          py3.9_cuda11.3_cudnn8.2.0_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.11.0               py39_cu113    pytorch
[conda] torchvision               0.12.0               py39_cu113    pytorch
(base) eduardoj@Worksmart:~/Repo/eduardo4jesus/PyTorch-cuDNN-Convolution$ 

You can find the code I am executing here, but I am also posting parts of it below:

example.py

import torch
from torch.utils.cpp_extension import load

# load the PyTorch extension
cudnn_convolution = load(
  name="cudnn_convolution",
  sources=["cudnn_convolution.cpp"],
  extra_ldflags = ["-lcudnn"],
  with_cuda=True,
  verbose=True
)
print("Compiled and Loaded!")


# create dummy input, convolutional weights and bias
B, F, C = 8, 32, 3
N, K, O = 32, 5, 28
input  = torch.zeros(B, C, N, N).to('cuda')
weight = torch.zeros(F, C, K, K).to('cuda')
output = torch.zeros(B, F, O, O).to('cuda')

stride   = (1, 1)
padding  = (0, 0)
dilation = (1, 1)
groups   = 1

output = cudnn_convolution.convolution(
  -1, # FWD_ALGO. (-1: Tries all methods and pick the fastest)
  input, weight, output, stride, padding, dilation, groups, True
  )

print("Done!")

cudnn_convolution.cpp

/**
 * The #include<ATen/cudnn/*.h> needs guards as pointed in
 * https://github.com/pytorch/pytorch/tree/master/aten/src/ATen/cudnn
 */
#include <ATen/cuda/CUDAConfig.h> // for the definition of AT_CUDNN_ENABLED
#if AT_CUDNN_ENABLED()

#include <cudnn.h>
#include <torch/extension.h>
#include <ATen/cudnn/Handle.h> // for getCudnnHandle
#include <ATen/cudnn/Types.h>      // for getCudnnDataType

#define checkCUDNN(expression)                               \
  {                                                          \
    cudnnStatus_t status = (expression);                     \
    if (status != CUDNN_STATUS_SUCCESS)                      \
    {                                                        \
      std::cerr << "Error on line " << __LINE__ << ": "      \
                << cudnnGetErrorString(status) << std::endl; \
      std::exit(EXIT_FAILURE);                               \
    }                                                        \
  }

std::ostream& operator<<(std::ostream &out, const cudnnConvolutionFwdAlgoPerf_t &fwdAlgoPert);
std::ostream& operator<<(std::ostream &out, const cudnnConvolutionBwdFilterAlgoPerf_t &bwdFilterAlgoPerf);
std::ostream& operator<<(std::ostream &out, const cudnnConvolutionBwdDataAlgoPerf_t &bwdDataAlgoPerf);

typedef struct _cudnnDescriptors_t_
{
  cudnnTensorDescriptor_t input, output;
  cudnnFilterDescriptor_t weight;
  cudnnConvolutionDescriptor_t convolution;

  virtual ~_cudnnDescriptors_t_()
  {
    cudnnDestroyTensorDescriptor(input);
    cudnnDestroyTensorDescriptor(output);
    cudnnDestroyFilterDescriptor(weight);
    cudnnDestroyConvolutionDescriptor(convolution);
  }
} cudnnDescriptors_t;

void initialize_descriptors(const at::Tensor &input, const at::Tensor &weight, const at::Tensor &output,
                                          c10::ArrayRef<int64_t> &stride,
                                          c10::ArrayRef<int64_t> &padding,
                                          c10::ArrayRef<int64_t> &dilation,
                                          cudnnDescriptors_t &descriptors);


at::Tensor convolution(const int fwdAlgo,
                       const at::Tensor &input, const at::Tensor &weight, const at::Tensor &output,
                       c10::ArrayRef<int64_t> stride, c10::ArrayRef<int64_t> padding,
                       c10::ArrayRef<int64_t> dilation, int64_t groups, bool verbose)
{
  const cudnnHandle_t cudnn = at::native::getCudnnHandle();
  cudnnDescriptors_t desc;
  initialize_descriptors(input, weight, output, stride, padding, dilation, desc);

  /*****************************************************************************
   * 2. Setting FWD Convolution Algo
   ****************************************************************************/
  cudnnConvolutionFwdAlgoPerf_t convolution_algorithm[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  int returnedAlgoCount;

  /**
   * TODO: I frequently get segmentation fault when finding the convolution
   * algorithms. I am not sure how to fix it.
   */
  if (fwdAlgo == -1)
  {
    std::cout << "Trying all" << std::endl;
    checkCUDNN(
        cudnnFindConvolutionForwardAlgorithm(/*handle*/ cudnn,
                                             /*xDesc*/ desc.input,
                                             /*wDesc*/ desc.weight,
                                             /*convDesc*/ desc.convolution,
                                             /*yDesc*/ desc.output,
                                             /*requestedAlgoCount*/ CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                             /*returnedAlgoCount*/ &returnedAlgoCount,
                                             /*perfResults*/ convolution_algorithm));
    if (verbose)
      for (int i = 0; i < returnedAlgoCount; i++)
        std::cout << convolution_algorithm[i] << std::endl;
  }
  else
  {
    convolution_algorithm[0].algo = static_cast<cudnnConvolutionFwdAlgo_t>(fwdAlgo);
    convolution_algorithm[0].status = static_cast<cudnnStatus_t>(0);
    convolution_algorithm[0].time = -1;
    convolution_algorithm[0].memory = 0;
    convolution_algorithm[0].determinism = static_cast<cudnnDeterminism_t>(-1);
    convolution_algorithm[0].mathType = static_cast<cudnnMathType_t>(0);
    if (verbose)
    {
      std::cout << "Attempt with defined Algo:" << std::endl;
      std::cout << convolution_algorithm[0] << std::endl;
    }
  }

  /*****************************************************************************
   * 3. Get and Allocate Memory for Workspace
   ****************************************************************************/
  if (verbose)
    std::cout << "Allocating Workspace" << std::endl;

  size_t workspace_bytes{0};
  checkCUDNN(cudnnGetConvolutionForwardWorkspaceSize(cudnn,
                                                     /*xDesc*/ desc.input,
                                                     /*wDesc*/ desc.weight,
                                                     /*convDesc*/ desc.convolution,
                                                     /*yDesc*/ desc.output,
                                                     /*algo*/ convolution_algorithm[0].algo,
                                                     /*sizeInBytes*/ &workspace_bytes));

  if (verbose)
    std::cout << "Workspace size: " << (workspace_bytes) << " Bytes" << std::endl;

  void *d_workspace{nullptr};
  cudaMalloc(&d_workspace, workspace_bytes);

  /*****************************************************************************
   * 4. Run and Time the Forward Convolution
   ****************************************************************************/
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  const float alpha = 1.0f, beta = 0.0f;
  checkCUDNN(cudnnConvolutionForward(cudnn,
                                     /*alpha*/ &alpha,
                                     /*xDesc*/ desc.input,
                                     /*x*/ input.data_ptr(),
                                     /*wDesc*/ desc.weight,
                                     /*w*/ weight.data_ptr(),
                                     /*convDesc*/ desc.convolution,
                                     /*algo*/ convolution_algorithm[0].algo,
                                     /*workSpace*/ d_workspace,
                                     /*workSpaceSizeInBytes*/ workspace_bytes,
                                     /*beta*/ &beta,
                                     /*yDesc*/ desc.output,
                                     /*y*/ output.data_ptr()));
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float milliseconds{0};
  cudaEventElapsedTime(&milliseconds, start, stop);
  if (verbose)
    std::cout << "Elapsed Time: " << milliseconds << " ms" << std::endl;

  /*****************************************************************************
   * 5. Freeing variables
   ****************************************************************************/
  cudaFree(d_workspace);
  return output;
}


std::ostream &operator<<(std::ostream &out, const cudnnConvolutionFwdAlgo_t &algo)
{
  out << "FWD Algorithm: ";
  switch (algo)
  {
  case CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_GEMM:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_GEMM";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_DIRECT:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_DIRECT";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_FFT:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_FFT";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED";
    break;
  case CUDNN_CONVOLUTION_FWD_ALGO_COUNT:
    out << "CUDNN_CONVOLUTION_FWD_ALGO_COUNT";
    break;
  default:
    std::cerr << "Invalid FWD Algorithm value" << std::endl;
    exit(1);
  }
  return out;
}

std::ostream &operator<<(std::ostream &out, const cudnnStatus_t &status) {
  out << "Status: ";
  switch (status)
  {
    case CUDNN_STATUS_SUCCESS:
      out << "CUDNN_STATUS_SUCCESS";
      break;
    case CUDNN_STATUS_NOT_INITIALIZED:
      out << "CUDNN_STATUS_NOT_INITIALIZED";
      break;
    case CUDNN_STATUS_ALLOC_FAILED:
      out << "CUDNN_STATUS_ALLOC_FAILED";
      break;
    case CUDNN_STATUS_BAD_PARAM:
      out << "CUDNN_STATUS_BAD_PARAM";
      break;
    case CUDNN_STATUS_INTERNAL_ERROR:
      out << "CUDNN_STATUS_INTERNAL_ERROR";
      break;
    case CUDNN_STATUS_INVALID_VALUE:
      out << "CUDNN_STATUS_INVALID_VALUE";
      break;
    case CUDNN_STATUS_ARCH_MISMATCH:
      out << "CUDNN_STATUS_ARCH_MISMATCH";
      break;
    case CUDNN_STATUS_MAPPING_ERROR:
      out << "CUDNN_STATUS_MAPPING_ERROR";
      break;
    case CUDNN_STATUS_EXECUTION_FAILED:
      out << "CUDNN_STATUS_EXECUTION_FAILED";
      break;
    case CUDNN_STATUS_NOT_SUPPORTED:
      out << "CUDNN_STATUS_NOT_SUPPORTED";
      break;
    case CUDNN_STATUS_LICENSE_ERROR:
      out << "CUDNN_STATUS_LICENSE_ERROR";
      break;
    case CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING:
      out << "CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING";
      break;
    case CUDNN_STATUS_RUNTIME_IN_PROGRESS:
      out << "CUDNN_STATUS_RUNTIME_IN_PROGRESS";
      break;
    case CUDNN_STATUS_RUNTIME_FP_OVERFLOW:
      out << "CUDNN_STATUS_RUNTIME_FP_OVERFLOW";
      break;
    case CUDNN_STATUS_VERSION_MISMATCH:
      out << "CUDNN_STATUS_VERSION_MISMATCH";
      break;
    default:
      std::cerr << "Invalid Status value" << std::endl;
      exit(1);
  }
  return out;
}

std::ostream &operator<<(std::ostream &out, const cudnnDeterminism_t &determinism) {
  out << "Determinism: ";
  switch (determinism)
  {
    case CUDNN_NON_DETERMINISTIC:
      out << "CUDNN_NON_DETERMINISTIC";
      break;
    case CUDNN_DETERMINISTIC:
      out << "CUDNN_DETERMINISTIC";
      break;
    default:
      std::cerr << "Undefined Value: " << static_cast<int>(determinism) << std::endl;
  }
  return out;
}

std::ostream &operator<<(std::ostream &out, const cudnnMathType_t &mathType) {
  out << "MathType: ";
  switch (mathType)
  {
  case CUDNN_DEFAULT_MATH:
    out << "CUDNN_DEFAULT_MATH";
    break;
  case CUDNN_TENSOR_OP_MATH:
    out << "CUDNN_TENSOR_OP_MATH";
    break;
  case CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION:
    out << "CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION";
    break;
  case CUDNN_FMA_MATH:
    out << "CUDNN_FMA_MATH";
    break;
  default:
    // Cast to int: streaming the enum itself would recurse into this operator.
    std::cerr << "Invalid value (" << static_cast<int>(mathType)
              << ") for MathType" << std::endl;
    exit(1);
  }
  return out;
}

std::ostream& operator<<(std::ostream &out, const cudnnConvolutionFwdAlgoPerf_t &fwdAlgoPerf) {
  out << fwdAlgoPerf.algo;
  out << "\n\t" << fwdAlgoPerf.status;
  out << "\n\tTime: " << fwdAlgoPerf.time;
  out << "\n\tMemory: " << fwdAlgoPerf.memory;
  out << "\n\t" << fwdAlgoPerf.determinism;
  out << "\n\t" << fwdAlgoPerf.mathType;
  out << std::endl;
  return out;
}

std::ostream& operator<<(std::ostream &out, const cudnnConvolutionBwdFilterAlgoPerf_t &bwdFilterAlgoPerf) {
  out << bwdFilterAlgoPerf.algo;
  out << "\n\t" << bwdFilterAlgoPerf.status;
  out << "\n\tTime: " << bwdFilterAlgoPerf.time;
  out << "\n\tMemory: " << bwdFilterAlgoPerf.memory;
  out << "\n\t" << bwdFilterAlgoPerf.determinism;
  out << "\n\t" << bwdFilterAlgoPerf.mathType;
  out << std::endl;
  return out;
}

std::ostream& operator<<(std::ostream &out, const cudnnConvolutionBwdDataAlgoPerf_t &bwdDataAlgoPerf) {
  out << bwdDataAlgoPerf.algo;
  out << "\n\t" << bwdDataAlgoPerf.status;
  out << "\n\tTime: " << bwdDataAlgoPerf.time;
  out << "\n\tMemory: " << bwdDataAlgoPerf.memory;
  out << "\n\t" << bwdDataAlgoPerf.determinism;
  out << "\n\t" << bwdDataAlgoPerf.mathType;
  out << std::endl;
  return out;
}

void initialize_descriptors(const at::Tensor &input, const at::Tensor &weight, const at::Tensor &output,
                            c10::ArrayRef<int64_t> &stride,
                            c10::ArrayRef<int64_t> &padding,
                            c10::ArrayRef<int64_t> &dilation,
                            cudnnDescriptors_t &desc)
{
  /*****************************************************************************
   * 1. Initializing Descriptors
   ****************************************************************************/
  assert(input.dim() == 4);
  checkCUDNN(cudnnCreateTensorDescriptor(&desc.input));
  checkCUDNN(cudnnSetTensor4dDescriptor(desc.input,
                                        /*format=*/CUDNN_TENSOR_NCHW,
                                        /*dataType=*/at::native::getCudnnDataTypeFromScalarType(input.scalar_type()),
                                        /*batch_size=*/input.size(0),
                                        /*channels=*/input.size(1),
                                        /*image_height=*/input.size(2),
                                        /*image_width=*/input.size(3)));

  assert(weight.dim() == 4);
  checkCUDNN(cudnnCreateFilterDescriptor(&desc.weight));
  checkCUDNN(cudnnSetFilter4dDescriptor(desc.weight,
                                        /*dataType=*/at::native::getCudnnDataTypeFromScalarType(weight.scalar_type()),
                                        /*format=*/CUDNN_TENSOR_NCHW,
                                        /*out_channels=*/weight.size(0),
                                        /*in_channels=*/weight.size(1),
                                        /*kernel_height=*/weight.size(2),
                                        /*kernel_width=*/weight.size(3)));

  checkCUDNN(cudnnCreateConvolutionDescriptor(&desc.convolution));
  checkCUDNN(cudnnSetConvolution2dDescriptor(desc.convolution,
                                             /*pad_height=*/padding[0],
                                             /*pad_width=*/padding[1],
                                             /*vertical_stride=*/stride[0],
                                             /*horizontal_stride=*/stride[1],
                                             /*dilation_height=*/dilation[0],
                                             /*dilation_width=*/dilation[1],
                                             /*mode=*/CUDNN_CROSS_CORRELATION,
                                             /*computeType=*/at::native::getCudnnDataTypeFromScalarType(output.scalar_type())));

  int batch_size{0}, channels{0}, height{0}, width{0};
  checkCUDNN(cudnnGetConvolution2dForwardOutputDim(desc.convolution,
                                                   desc.input,
                                                   desc.weight,
                                                   &batch_size,
                                                   &channels,
                                                   &height,
                                                   &width));

  assert(batch_size == output.size(0) && channels == output.size(1) &&
    height == output.size(2) && width == output.size(3));

  checkCUDNN(cudnnCreateTensorDescriptor(&desc.output));
  checkCUDNN(cudnnSetTensor4dDescriptor(desc.output,
                                        /*format=*/CUDNN_TENSOR_NCHW,
                                        /*dataType=*/at::native::getCudnnDataTypeFromScalarType(output.scalar_type()),
                                        /*batch_size=*/batch_size,
                                        /*channels=*/channels,
                                        /*image_height=*/height,
                                        /*image_width=*/width));
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
  m.def("convolution", &convolution, "convolution");
}

#endif

I noticed that I had the wrong format for some of the tensors. Now all of them use CUDNN_TENSOR_NCHW. With that change, CUDNN_CONVOLUTION_FWD_ALGO_FFT is reported as supported; however, I am still getting a Segmentation fault (core dumped) error.

The backtrace is now different, though:

gdb backtrace:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
___pthread_mutex_lock (mutex=0x0) at ./nptl/pthread_mutex_lock.c:80
80	./nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) br
Breakpoint 1 at 0x7ffff7d19f74: file ./nptl/pthread_mutex_lock.c, line 80.
(gdb) bt
#0  ___pthread_mutex_lock (mutex=0x0) at ./nptl/pthread_mutex_lock.c:80
#1  0x00007ffe6065cb1e in cudnn::ops::GetInternalStreams(cudnnContext*, int, CUstream_st**) () from /usr/local/cuda/lib64/libcudnn_ops_infer.so.8
#2  0x00007ffe35df22c5 in cudnn::cnn::Fft_2d_Non_Tiled<float>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#3  0x00007ffe359d3e71 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#4  0x00007ffe359fcb75 in cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)3, 4096ul>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#5  0x00007ffe359d3e71 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#6  0x00007ffe35b46b1e in cudnn::cnn::AutoTransformationExecutor::execute_pipeline(cudnn::cnn::EngineInterface&, cudnn::backend::VariantPack const&, CUstream_st*) const () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#7  0x00007ffe35b46c23 in cudnn::cnn::BatchPartitionExecutor::operator()(cudnn::cnn::EngineInterface&, cudnn::cnn::EngineInterface*, cudnn::backend::VariantPack const&, CUstream_st*) const () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#8  0x00007ffe35b59d66 in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)3, 4096ul> >::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#9  0x00007ffe359d3e71 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#10 0x00007ffe359ef9e8 in cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#11 0x00007ffe35cd3a19 in cudnn::backend::EnginesAlgoMap<cudnnConvolutionFwdAlgo_t, 8>::execute_wrapper(cudnnContext*, cudnnConvolutionFwdAlgo_t, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#12 0x00007ffe35cc6aa8 in cudnn::backend::convolutionForward(cudnnContext*, void const*, cudnnTensorStruct const*, void const*, cudnnFilterStruct const*, void const*, cudnnConvolutionStruct const*, cudnnConvolutionFwdAlgo_t, void*, unsigned long, bool, void const*, void const*, void const*, cudnnActivationStruct const*, cudnnTensorStruct const*, void*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#13 0x00007ffe35e2d839 in cudnn::cnn::convolutionForward(cudnnContext*, void const*, cudnnTensorStruct const*, void const*, cudnnFilterStruct const*, void const*, cudnnConvolutionStruct const*, cudnnConvolutionFwdAlgo_t, void*, unsigned long, void const*, cudnnTensorStruct const*, void*) ()
   from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#14 0x00007ffe35e3f426 in ?? () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#15 0x00007ffe35e3ff1c in cudnnStatus_t findAlgorithm<find_get_conv_params, cudnnConvolutionStruct, cudnnConvolutionFwdAlgo_t, cudnnConvolutionFwdAlgoPerfStruct, 8, true>(find_get_conv_params, int, int*, cudnnConvolutionFwdAlgoPerfStruct*) () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#16 0x00007ffe35e2dbc9 in cudnnFindConvolutionForwardAlgorithm () from /usr/local/cuda/lib64/libcudnn_cnn_infer.so.8
#17 0x00007fff17488d75 in convolution(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool) () from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#18 0x00007fff174b3b26 in at::Tensor pybind11::detail::argument_loader<int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool>::call_impl<at::Tensor, at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool), 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, pybind11::detail::void_type>(at::Tensor (*&)(int, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool), std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul>, pybind11::detail::void_type&&) && ()
   from /home/eduardoj/.cache/torch_extensions/py39_cu113/cudnn_convolution/cudnn_convolution.so
#19 0x00007fff174af5bf in std::enable_if<!std::is_void<at::Tensor>::value, at::Tensor>::type pybind11::detail::argument_loader<int, at::Tensor const&, at::Te--Type <RET> for more, q to quit, c to continue without paging--

Hi @ptrblck,

I am still facing this segmentation fault (same backtrace) when using CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED.

I noticed that I have two different versions of libcublas. It appears cuDNN is loading the one shipped with torch, instead of the one installed alongside cuDNN.

Is there a way to have cudnn pointing to /usr/local/cuda/lib64/libcublas.so.11 instead of ~/miniconda3/lib/libcublas.so.11?
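One way to narrow this down is to check which copy of each library the process actually mapped at runtime. A minimal Linux-only sketch (the helper name `loaded_libs` is mine, not any PyTorch or cuDNN API), to be run after `import torch` and after loading the extension:

```python
# Hypothetical diagnostic (Linux-only): scan /proc/self/maps to see which
# copy of each CUDA library the current process actually mapped.
def loaded_libs(substrings):
    """Return the sorted set of mapped file paths containing any substring."""
    hits = set()
    with open("/proc/self/maps") as f:
        for line in f:
            parts = line.split(maxsplit=5)  # 6th field (if present) is the path
            if len(parts) == 6 and any(s in parts[5] for s in substrings):
                hits.add(parts[5].strip())
    return sorted(hits)

if __name__ == "__main__":
    # e.g. run after `import torch` to see whether libcublas resolves to
    # ~/miniconda3/lib or /usr/local/cuda/lib64
    print(loaded_libs(["libcublas", "libcudnn"]))
```

If both paths show up, or only the conda one does, that would confirm the mismatch suspicion.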

I think it might depend on how you are building the extension and which libraries are found/used.
Given that you seem to be hitting a cuDNN issue, it might be easier to rip out the code and write a standalone cuDNN version first, using their docs and examples, to make sure you are not mixing up different issues.

I did have standalone cuDNN code (in here) that works just fine, including for CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD.

At this point I am looking for a way to change this:

cudnn_convolution = load(
  name="cudnn_convolution",
  sources=["cudnn_convolution.cpp", "cudnn_utils.cpp"],
  extra_ldflags = ["-lcudnn"],
  with_cuda=True,
  verbose=True
)
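One option, assuming `torch.utils.cpp_extension.load` is used as above, might be to pass an explicit `-L` entry in `extra_ldflags` so that /usr/local/cuda/lib64 is searched before conda's lib directory at link time. A sketch (the paths are assumptions about this particular machine, and the `load` call is left commented since it requires the sources and a CUDA toolchain):

```python
# Sketch: prepend an explicit library search path so the linker resolves
# -lcudnn/-lcublas against /usr/local/cuda/lib64 before miniconda3/lib.
# The paths below are assumptions about this machine, not verified defaults.
extra_ldflags = [
    "-L/usr/local/cuda/lib64",  # searched before the default conda paths
    "-lcudnn",
    "-lcublas",
]

# cudnn_convolution = load(
#     name="cudnn_convolution",
#     sources=["cudnn_convolution.cpp", "cudnn_utils.cpp"],
#     extra_ldflags=extra_ldflags,
#     with_cuda=True,
#     verbose=True,
# )
```

Note that `-L` only affects link time; which copy is loaded at run time still depends on the runtime loader (rpath / `LD_LIBRARY_PATH`), so both may need adjusting.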

So that, when linking, I wouldn’t have miniconda/lib preceding /usr/local/cuda/lib64.

[1/3] c++ -MMD -MF cudnn_utils.o.d -DTORCH_EXTENSION_NAME=cudnn_convolution -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/eduardoj/miniconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/eduardoj/Repo/eduardo4jesus/PyTorch-cuFFT-Convolution/cudnn_utils.cpp -o cudnn_utils.o 
[2/3] c++ -MMD -MF cudnn_convolution.o.d -DTORCH_EXTENSION_NAME=cudnn_convolution -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/eduardoj/miniconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/eduardoj/Repo/eduardo4jesus/PyTorch-cuFFT-Convolution/cudnn_convolution.cpp -o cudnn_convolution.o 
[3/3] c++ cudnn_convolution.o cudnn_utils.o -shared -lcudnn -L/home/eduardoj/miniconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cudnn_convolution.so

I am hoping that changing the order would fix the issue.

Could you run a quick check by rebuilding PyTorch from source using your system CUDA toolkit and cuDNN? This would then dynamically link to these libraries and the extension shouldn’t be able to use any libraries shipped in the binaries (as they are not installed).
Maybe a docker container could work as it would already come with the properly installed libraries and you could then rebuild PyTorch there.
If this works, it could indeed point to a library mismatch.

Despite not being a quick build, that solved my issue! Many thanks. :pray:

The only issue I am having now is why CUDNN_CONVOLUTION_FWD_ALGO_FFT yields CUDNN_STATUS_NOT_SUPPORTED for the following configuration:

B, F, C = 256, 512, 512
N, K, O = 32, 5, 32
padding  = (2, 2)

# create dummy input, convolutional weights and bias
input  = torch.zeros(B, C, N, N).to('cuda')
weight = torch.zeros(F, C, K, K).to('cuda')
output = torch.zeros(B, F, O, O).to('cuda')
stride   = (1, 1)
dilation = (1, 1)
groups   = 1

According to Table 22 of the documentation, it should work just fine.

:ballot_box_with_check: xDesc (height and width) + 2*padding < 256

:ballot_box_with_check: stride == 1

:ballot_box_with_check: wDesc (height and width) > padding
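For reference, the three Table 22 constraints above can be checked mechanically for this configuration (variable names mirror the snippet earlier in the thread; this only re-verifies the checklist, it does not explain the NOT_SUPPORTED status):

```python
# Check the documented FFT-forward constraints for the configuration above.
N, K = 32, 5            # input height/width and kernel height/width
pad_h, pad_w = 2, 2
stride = (1, 1)

checks = {
    "xDesc + 2*padding < 256": N + 2 * pad_h < 256 and N + 2 * pad_w < 256,
    "stride == 1": stride == (1, 1),
    "wDesc > padding": K > pad_h and K > pad_w,
}
assert all(checks.values())  # all three constraints hold for this shape
```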

OK, great to hear it’s working with the matching libraries.

Could you forward me the cuDNN logs for the CUDNN_STATUS_NOT_SUPPORTED use case so that I could take a look at them (with our cuDNN team), please?
You can find the instructions here with the needed env variables.

1 Like

Thank you for the link.

Here is the log file.

$cat cuDNN.log


I! CuDNN (v8401) function cudnnCreate() called:
i!     handle: location=host; addr=0x55bcb07b1950;
i! Time: 2022-06-16T13:10:01.736287 (0d+0h+0m+0s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetStream() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     streamId: type=cudaStream_t; streamId=(nil) (defaultStream);
i! Time: 2022-06-16T13:10:02.406407 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=0; Handle=0x55bce02ec740; StreamId=(nil) (defaultStream).


I! CuDNN (v8401) function cudnnCreateTensorDescriptor() called:
i! Time: 2022-06-16T13:10:02.406455 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetTensor4dDescriptor() called:
i!     format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!     n: type=int; val=256;
i!     c: type=int; val=512;
i!     h: type=int; val=32;
i!     w: type=int; val=32;
i! Time: 2022-06-16T13:10:02.406468 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetTensor4dDescriptorEx() called:
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!     n: type=int; val=256;
i!     c: type=int; val=512;
i!     h: type=int; val=32;
i!     w: type=int; val=32;
i!     nStride: type=int; val=524288;
i!     cStride: type=int; val=1024;
i!     hStride: type=int; val=32;
i!     wStride: type=int; val=1;
i! Time: 2022-06-16T13:10:02.406473 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnCreateFilterDescriptor() called:
i! Time: 2022-06-16T13:10:02.406485 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetFilter4dDescriptor() called:
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!     format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     k: type=int; val=512;
i!     c: type=int; val=512;
i!     h: type=int; val=5;
i!     w: type=int; val=5;
i! Time: 2022-06-16T13:10:02.406490 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnCreateConvolutionDescriptor() called:
i!     convDesc: location=host; addr=0x7ffdc81c24d0;
i! Time: 2022-06-16T13:10:02.426075 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetConvolution2dDescriptor() called:
i!     convDesc: location=host; addr=0x55bd1e659f50;
i!     pad_h: type=int; val=2;
i!     pad_w: type=int; val=2;
i!     u: type=int; val=1;
i!     v: type=int; val=1;
i!     dilation_h: type=int; val=1;
i!     dilation_w: type=int; val=1;
i!     mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i! Time: 2022-06-16T13:10:02.426160 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnGetConvolution2dForwardOutputDim() called:
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_DEFAULT_MATH (0);
i!         reorderType: type=int; val=0;
i!         arrayLength: type=int; val=2;
i!         padA: type=int; val=[2,2];
i!         strideA: type=int; val=[1,1];
i!         dilationA: type=int; val=[1,1];
i!         groupCount: type=int; val=1;
i!     inputTensorDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[256,512,32,32];
i!         strideA: type=int; val=[524288,1024,32,1];
i!     filterDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,512,5,5];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     n: location=host; addr=0x7ffdc81c2384;
i!     c: location=host; addr=0x7ffdc81c2388;
i!     h: location=host; addr=0x7ffdc81c238c;
i!     w: location=host; addr=0x7ffdc81c2390;
i! Time: 2022-06-16T13:10:02.426175 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnCreateTensorDescriptor() called:
i! Time: 2022-06-16T13:10:02.426191 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetTensor4dDescriptor() called:
i!     format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!     n: type=int; val=256;
i!     c: type=int; val=512;
i!     h: type=int; val=32;
i!     w: type=int; val=32;
i! Time: 2022-06-16T13:10:02.426197 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnSetTensor4dDescriptorEx() called:
i!     dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!     n: type=int; val=256;
i!     c: type=int; val=512;
i!     h: type=int; val=32;
i!     w: type=int; val=32;
i!     nStride: type=int; val=524288;
i!     cStride: type=int; val=1024;
i!     hStride: type=int; val=32;
i!     wStride: type=int; val=1;
i! Time: 2022-06-16T13:10:02.426202 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v8401) function cudnnGetConvolutionForwardWorkspaceSize() called:
i!     handle: type=cudnnHandle_t; streamId=(nil) (defaultStream);
i!     xDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[256,512,32,32];
i!         strideA: type=int; val=[524288,1024,32,1];
i!     wDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[512,512,5,5];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_DEFAULT_MATH (0);
i!         reorderType: type=int; val=0;
i!         arrayLength: type=int; val=2;
i!         padA: type=int; val=[2,2];
i!         strideA: type=int; val=[1,1];
i!         dilationA: type=int; val=[1,1];
i!         groupCount: type=int; val=1;
i!     yDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         nbDims: type=int; val=4;
i!         dimA: type=int; val=[256,512,32,32];
i!         strideA: type=int; val=[524288,1024,32,1];
i!     algo: type=cudnnConvolutionFwdAlgo_t; val=CUDNN_CONVOLUTION_FWD_ALGO_FFT (4);
i!     sizeInBytes: location=host; addr=0x7ffdc81c2488;
i! Time: 2022-06-16T13:10:02.426253 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=0; Handle=0x55bce02ec740; StreamId=(nil) (defaultStream).


I! CuDNN (v8401) function cudnnGetErrorString() called:
i!     status: type=int; val=9;
i! Time: 2022-06-16T13:10:02.426424 (0d+0h+0m+1s since start)
i! Process=4107; Thread=4107; GPU=NULL; Handle=NULL; StreamId=NULL.


Following the instructions you sent, the log above was generated after running:

export CUDNN_LOGINFO_DBG=1 CUDNN_LOGWARN_DBG=1 CUDNN_LOGERR_DBG=1 CUDNN_LOGDEST_DBG=cuDNN.log
python example.py