Is it possible to somehow trace/log CUDA context initialization (ideally including a simple way of tracing the CUDA memory allocations that would trigger context initialization too)?
Basically, with large frameworks the CUDA context sometimes gets initialized unexpectedly, and it would be nice to see what triggers it (usually a call to torch.cuda.is_available() or torch.cuda.get_device_count() somewhere, or a sloppy tensor allocation on a CUDA device).
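One lightweight way to catch this from Python is to wrap PyTorch's lazy-initialization hook and dump a stack trace the first time anything initializes the context. A minimal sketch, assuming torch.cuda._lazy_init exists as the internal hook (it is a private API and may change between versions):

```python
import traceback

import torch

# Private API: torch.cuda._lazy_init is the internal hook that performs the
# actual (lazy) CUDA context initialization; it may change between versions.
_original_lazy_init = torch.cuda._lazy_init

def _traced_lazy_init():
    # Only report the very first, real initialization.
    if not torch.cuda.is_initialized():
        print("CUDA context initialization triggered by:")
        traceback.print_stack()
    return _original_lazy_init()

torch.cuda._lazy_init = _traced_lazy_init

# Example trigger: a "sloppy" tensor allocation on a CUDA device.
x = torch.zeros(1, device="cuda")
```

Installing the wrapper as early as possible matters (right after importing torch, or via a sitecustomize module), since the whole point is to catch whichever import or call initializes the context first.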
I guess that for this, and to get stack traces, one would need a manually compiled Python with debug symbols and a manually compiled PyTorch with debug symbols - both are quite nasty to produce for a one-time tracing job.
I wonder if we could get these cuInit calls from the autograd profiler, or from the text-based output of nsys?
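For the nsys route, a sketch of what I would try (assuming a recent Nsight Systems; my_script.py and the report name are placeholders): the CUDA trace records driver API calls, so cuInit should show up in the text summaries, though without the Python call site:

```
nsys profile --trace=cuda -o cuda_trace python my_script.py
nsys stats cuda_trace.nsys-rep
```

That would at least confirm whether and when cuInit happens, but pinpointing the triggering Python line still needs something stack-based.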
No, not necessarily: gdb will also work on release builds, though of course optimizations were performed, and you are right that some symbols might not be shown (e.g. if they were optimized away).
I would try this approach first before starting a debug build.
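A minimal session sketch (my_script.py is a placeholder; cuInit is a public symbol exported by the CUDA driver library, so no debug build is needed to break on it):

```
gdb --args python my_script.py
(gdb) set breakpoint pending on   # libcuda.so is not loaded yet at startup
(gdb) break cuInit
(gdb) run
# once the breakpoint fires:
(gdb) bt       # native C/C++ backtrace into PyTorch
(gdb) py-bt    # Python-level backtrace, if the python-gdb extensions are installed
```

Even with some frames missing symbols, the native backtrace usually narrows down the entry point, and py-bt (where available) maps it to the Python line without rebuilding anything.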