Trace CUDA context initialization

Is it possible to somehow trace/log CUDA context initialization? (including a simple way of tracing CUDA memory allocations, which would trigger context initialization as well)

Basically, with large frameworks the CUDA context sometimes gets initialized unexpectedly, and it would be nice to see what triggered it (usually a call to torch.cuda.is_available() or torch.cuda.get_device_count() somewhere, or some sloppy tensor allocation on a CUDA device).
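
The closest I can come up with in pure Python is checking torch.cuda.is_initialized() at a few checkpoints, or wrapping the internal torch.cuda._lazy_init helper to print a stack trace - but that only catches PyTorch's own lazy-init path, not a third-party library calling cuInit directly, and _lazy_init is a private name that may change between releases. A rough sketch:

```python
import traceback

import torch

# Public API: reports whether PyTorch's lazy CUDA state has been set up yet.
print("CUDA initialized?", torch.cuda.is_initialized())

# Wrap the internal lazy-init helper to see the Python stack that triggers it.
# NOTE: _lazy_init is private and may change between releases; this also won't
# see cuInit calls made by other libraries outside of PyTorch.
_orig_lazy_init = torch.cuda._lazy_init

def _traced_lazy_init():
    if not torch.cuda.is_initialized():
        print("CUDA context is being initialized here:")
        traceback.print_stack()
    return _orig_lazy_init()

torch.cuda._lazy_init = _traced_lazy_init

# Anything below that needs the CUDA state will now print a stack trace first,
# e.g. a "sloppy" tensor allocation on the GPU:
x = torch.empty(10, device="cuda")
```

But that feels fragile, so I'm hoping there is a better way.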

Can I use a profiler for this?

I would probably run the workload in gdb and break at cuInit, which should then show which module wrongfully tries to initialize a new context.
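
Something along these lines, with workload.py as a placeholder for whatever actually triggers the issue (the breakpoint has to be set as pending, since libcuda.so is usually loaded only later):

```
gdb --args python workload.py
(gdb) set breakpoint pending on
(gdb) break cuInit
(gdb) run
(gdb) bt
```

Once the breakpoint is hit, bt shows the native frames of whatever called into the driver; if the CPython gdb helpers are available, py-bt additionally shows the Python-level frames.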

I guess for this, and to get stack traces, one would need a manually compiled version of Python with debug symbols and a manually compiled version of PyTorch with debug symbols - both are quite nasty to get for a one-time tracing job :frowning:

I wonder if we could get these cuInit calls from the autograd profiler, or from the text-based output of nsys?
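
For the nsys route I was imagining something like this (workload.py is a placeholder) and then looking for cuInit in the CUDA API summary - though I'm not sure the driver-API init call actually shows up there:

```
nsys profile --trace=cuda -o trace python workload.py
nsys stats trace.nsys-rep
```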

No, not necessarily. gdb will also work with release builds; of course optimizations were performed, and you are right that some symbols might not be shown (e.g. if they were optimized away).
I would try this approach first before starting a debug build.


I guess the same task is demonstrated in [`import torch` results in `cuInit` call · Issue #116276 · pytorch/pytorch · GitHub](https://github.com/pytorch/pytorch/issues/116276)

Yes, the proposed breakpoint approach is a great way to detect unwanted initializations landing in cuInit, as was seen after the faulty PR landed.

Maybe an LD_PRELOAD shim or ltrace could also be used for similar tracing…
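
For the ltrace route, maybe something like this would already do it (completely untested; the idea is to stop only at the cuInit symbol itself via -x instead of dumping every library call, which -L suppresses):

```
ltrace -L -x cuInit python workload.py
```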