Very slow CUDA initialization with libtorch built with Bazel


We are trying to add PyTorch as a dependency of our Bazel repo.
When I compile even very simple programs (adding two tensors on CUDA) with the Bazel build system,
the first call to a CUDA function takes a very long time (about 10 minutes); after that, everything works normally. The behavior is the same with more complex models executed with TorchScript. During initialization, the GPUs are not under heavy load and VRAM usage is about 300 MB.

When CUDA is not involved, execution is normal. To avoid any interference, I tested my program simply by adding a binary directly to the PyTorch repo.

Here is my environment:
CUDA 11.3
PyTorch (commit e35bf564611ac00886f6d745b0406d759e054fef)
Bazel 5.0.0 / GCC-8

Tested on an RTX 3090 and a GTX 1650: same behavior

Pierre Falez

I guess CUDA is JIT-compiling the kernels because the binary wasn’t built for the needed architectures.
Make sure to build for all the devices you are using if you don’t want the code to be JIT-compiled on the first call.

Thank you for your help!
The issue was solved by adding further -gencode entries to the nvcc options.
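
For anyone landing here later, this is a minimal sketch of the kind of flags involved, assuming the two cards mentioned above: the RTX 3090 has compute capability 8.6 and the GTX 1650 has 7.5. The helper `gencode_flags` is hypothetical and only builds the strings; how they reach nvcc depends on the CUDA rules used in the Bazel setup, which are not shown in this thread.

```python
def gencode_flags(capabilities, ptx_fallback=True):
    """Build nvcc -gencode options for a list of (major, minor) compute
    capabilities. Hypothetical helper for illustration only.

    One machine-code target is emitted per capability. With `ptx_fallback`,
    PTX for the newest capability is embedded as well, so that future GPUs
    can still run the binary (at the cost of JIT compilation on those GPUs).
    """
    caps = sorted(set(capabilities))
    flags = []
    for major, minor in caps:
        cc = f"{major}{minor}"
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    if ptx_fallback and caps:
        major, minor = caps[-1]
        cc = f"{major}{minor}"
        flags.append(f"-gencode=arch=compute_{cc},code=compute_{cc}")
    return flags

# RTX 3090 (8.6) and GTX 1650 (7.5)
for flag in gencode_flags([(8, 6), (7, 5)]):
    print(flag)
```

With both targets compiled in, neither card needs to JIT-compile the kernels on the first CUDA call. Each extra target increases build time and binary size, so it is usually worth listing only the architectures that are actually deployed.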