The potential performance improvement of custom configurations for CUDA kernels

It appears to me that the Torch source code, written in C++/CUDA, generalizes its kernels to suit a wide range of GPU specifications. Suppose I intend to train a large model, such as LLaMA. In that case, I could customize the launch configuration of kernel calls within Torch based on my GPU's specifications, possibly even altering the algorithms, with the aim of maximizing throughput for my model on that specific GPU. I'm curious whether such practices are common. I've noticed that the CUDA kernels within Torch sometimes fail to fully utilize the GPU's hardware resources, and by modifying the kernel algorithms I've achieved a 50% speedup.

I acknowledge that my question may be somewhat vague. Still, given the cost of GPUs and of training time, customizing Torch seems to offer significant acceleration. I wonder whether this idea is feasible or unrealistic within AI technology companies.

Your description is indeed too vague: it's unclear what exactly was changed, what kind of speedup was achieved, in which part of the code, and for what kind of model.
With that being said, it's unrealistic to try to change e.g. the launch parameters of some kernels and rebuild PyTorch for a single use case. You should let DL compilers (e.g. torch.compile) perform these optimizations at runtime.
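To give a sense of why launch parameters are GPU-specific in the first place, here is a rough sketch of a simplified occupancy estimate. The per-SM limits below are illustrative assumptions (roughly A100-like: 2048 resident threads, 32 resident blocks, 65536 registers, 164 KB shared memory per SM); real occupancy rules involve allocation granularities and warp-level limits that this ignores. The point is only that the "best" block size depends on the kernel's resource usage and the target GPU's limits, which is exactly what a compiler or autotuner re-derives per device.

```python
def occupancy(block_size, regs_per_thread, smem_per_block,
              max_threads=2048, max_blocks=32,
              regs_per_sm=65536, smem_per_sm=164 * 1024):
    """Rough fraction of an SM's thread slots a launch configuration can fill.

    The default per-SM limits are illustrative (A100-like), not authoritative.
    """
    limits = [
        max_threads // block_size,                       # thread-slot limit
        max_blocks,                                      # resident-block limit
        regs_per_sm // (regs_per_thread * block_size),   # register-file limit
        smem_per_sm // smem_per_block if smem_per_block else max_blocks,  # shared-memory limit
    ]
    resident_blocks = min(limits)
    return resident_blocks * block_size / max_threads

# A register-heavy kernel is capped by the register file regardless of block size:
print(occupancy(block_size=256, regs_per_thread=64, smem_per_block=0))   # 0.5
# Halving register pressure lets the same SM reach full thread occupancy:
print(occupancy(block_size=128, regs_per_thread=32, smem_per_block=0))   # 1.0
```

Note that higher occupancy is not automatically higher throughput (memory-bound kernels can saturate bandwidth at low occupancy), which is another reason hand-tuning one build of PyTorch rarely generalizes.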
