I’m writing a PyTorch operator as a CUDA/C++ extension, and I need more than 48 KB of shared memory per thread block. The target GPU is a V100.
According to this post on Stack Overflow, when the shared memory size exceeds 48 KB, I need to call
cudaFuncSetAttribute in the host function to set the maximum shared memory size before launching the kernel.
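For reference, here is a minimal standalone sketch of the pattern I mean, written against the plain CUDA runtime rather than PyTorch. The kernel name my_kernel, the helper check, and the 64 KB size are placeholders I made up for illustration; as far as I understand, V100 allows up to 96 KB of dynamic shared memory per block once the attribute is set.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real operator's kernel.
__global__ void my_kernel(float* out, int n) {
    extern __shared__ float smem[];  // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    if (tid < n) smem[tid] = static_cast<float>(tid);
    __syncthreads();                 // reached by every thread in the block
    if (tid < n) out[tid] = smem[tid];
}

// Hypothetical helper: report a CUDA error and signal failure to the caller.
static bool check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    const int n = 256;
    const int smem_bytes = 64 * 1024;  // assumed size, above the 48 KB default

    float* d_out = nullptr;
    if (!check(cudaMalloc(&d_out, n * sizeof(float)), "cudaMalloc")) return 1;

    // Opt in to more than 48 KB of dynamic shared memory for this kernel.
    if (!check(cudaFuncSetAttribute(my_kernel,
                                    cudaFuncAttributeMaxDynamicSharedMemorySize,
                                    smem_bytes),
               "cudaFuncSetAttribute")) return 1;

    // The third launch parameter is the dynamic shared memory size in bytes.
    my_kernel<<<1, n, smem_bytes>>>(d_out, n);
    if (!check(cudaGetLastError(), "kernel launch")) return 1;
    if (!check(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return 1;

    cudaFree(d_out);
    return 0;
}
```

Inside the extension I would presumably keep the same cudaFuncSetAttribute call and launch on the stream returned by at::cuda::getCurrentCUDAStream(), but that part is my assumption and not something the post confirms.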
My question is: is there a PyTorch API that can do this for me? Something like
at::cuda::getCurrentCUDABlasHandle(), which returns the current cuBLAS handle instead of creating a new one.