Recently I have had some computations that require specialization on the GPU. I can still leverage PyTorch's existing modules; however, performance is severely impacted by intermediate steps and sequential processing in the implementation, which increase memory usage and runtime. I feel everything could be done better if I moved it to CUDA C and somehow bound it to Python, similar to how PyTorch operates.
However, this seems to be a rather uncommon topic, and although I’ve searched online, I haven’t found an answer that satisfies me.
So, to write a kernel like Conv2D in CUDA C and use it as an API in PyTorch, where should I start? What knowledge do I need to acquire? Could you point me to some accompanying resources?
Custom extensions can be implemented as described in this tutorial. Before going down this path you might want to check if e.g. torch.compile could speed up your use case.
I find many parts of that tutorial to be an unfriendly starting point. For example, take a look at this statement:
For the “ahead of time” flavor, we build our C++ extension by writing a setup.py script that uses setuptools to compile our C++ code. For the LLTM, it looks as simple as this:
from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name='lltm_cpp',
      ext_modules=[cpp_extension.CppExtension('lltm_cpp', ['lltm.cpp'])],
      cmdclass={'build_ext': cpp_extension.BuildExtension})
Terms like "ahead of time" and the arguments of setup() and their purpose are not fully explained, yet they seem to be prerequisites for reading further. Is there a better place to start?
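For what it's worth, here is my own reading of that snippet, with the arguments annotated ("ahead of time" simply means the extension is compiled once, by running the script, before you ever import it in Python — as opposed to the just-in-time flavor, where torch.utils.cpp_extension.load compiles the sources on first use). The comments are my interpretation, not from the tutorial:

```python
from setuptools import setup
from torch.utils import cpp_extension

setup(
    # Name of the installable package, used by pip/setuptools.
    name='lltm_cpp',
    # CppExtension is a thin convenience wrapper around
    # setuptools.Extension that adds the include paths and compiler
    # flags needed to build against PyTorch's C++ headers. The first
    # argument is the importable module name, the second the list of
    # C++ source files to compile.
    ext_modules=[cpp_extension.CppExtension('lltm_cpp', ['lltm.cpp'])],
    # BuildExtension replaces the default build_ext command and takes
    # care of compiler selection, including mixed C++/CUDA builds.
    cmdclass={'build_ext': cpp_extension.BuildExtension},
)
```

Running `python setup.py install` (or `pip install .`) in that directory then builds and installs the module so it can be imported like any other Python package.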