We would like to know whether the current GPU AOT (Ahead-of-Time) Inductor is suitable for compiling kernels on the server and then deploying them to edge devices under the following conditions, and what performance impacts might arise:
- The driver on the server is different from the driver on the deployment device.
- The GPU model on the server is different from the GPU model on the deployment device.
We understand that for CUDA, the AOTI-compiled kernel is emitted in the .cubin format, which is an architecture-specific native binary. The same .cubin file is not guaranteed to be compatible across different GPU architectures.
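To make the .cubin constraint concrete, here is a minimal sketch of the compatibility rule as we understand it from the CUDA programming guide: a .cubin built for compute capability X.y is only guaranteed to run on devices with the same major version X and a minor version >= y. The function names and the simple `sm_XY` parsing below are our own illustration, not part of any AOTI API.

```python
# Hypothetical sketch (not a real AOTI/CUDA API): model the documented
# .cubin binary-compatibility rule. Assumes simple two-digit arch strings
# like 'sm_80'; variants such as 'sm_90a' are ignored for brevity.
def _parse_arch(arch: str) -> tuple[int, int]:
    digits = arch.split("_")[1]          # 'sm_80' -> '80'
    return int(digits[:-1]), int(digits[-1])

def cubin_runs_on(built_for: str, device_arch: str) -> bool:
    """True if a .cubin built for `built_for` is guaranteed to load on `device_arch`."""
    b_major, b_minor = _parse_arch(built_for)
    d_major, d_minor = _parse_arch(device_arch)
    # Same major architecture required; minor revisions are forward-compatible.
    return b_major == d_major and d_minor >= b_minor
```

Under this rule, a server compiling for `sm_80` would produce a kernel usable on an `sm_86` edge device, but not the other way around, and never across major generations, which is exactly why the server/deployment GPU mismatch in the question matters for the .cubin path.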
For XPU, the AOTI-compiled kernel is in the SPIR-V format, an intermediate representation (IR) that is further compiled on the deployment device into a binary tailored to that device's driver and GPU architecture. This introduces additional compilation time on the deployment side, although the compiled binary can be cached to avoid repeated compilation.
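The caching idea above can be sketched as follows: key the JIT-compiled binary on the SPIR-V module plus the device and driver identity, so a driver update or a different GPU triggers a fresh compile instead of reusing a stale binary. All names here (`binary_cache_key`, `get_or_compile`, the `compile_fn` callback) are illustrative assumptions, not a real AOTI interface.

```python
import hashlib

# Hypothetical on-device cache for JIT-compiled SPIR-V modules. The cache
# key covers the SPIR-V bytes AND the device/driver identity, so the first
# run on a given device pays the JIT cost and later runs hit the cache.
def binary_cache_key(spirv_bytes: bytes, gpu_name: str, driver_version: str) -> str:
    h = hashlib.sha256()
    h.update(spirv_bytes)
    h.update(gpu_name.encode())
    h.update(driver_version.encode())
    return h.hexdigest()

_cache: dict[str, bytes] = {}

def get_or_compile(spirv_bytes, gpu_name, driver_version, compile_fn):
    key = binary_cache_key(spirv_bytes, gpu_name, driver_version)
    if key not in _cache:
        _cache[key] = compile_fn(spirv_bytes)  # JIT cost paid only once per key
    return _cache[key]
```

With a scheme like this, the deployment-side compilation cost is a one-time warm-up per (module, GPU, driver) combination rather than a per-run overhead.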
Should AOTI account for the situation where the GPU and/or driver differ between the compilation end and the deployment end?
Given the two compiled kernel formats described above, we would like to hear your thoughts on which approach is better.