Is it possible to somehow trace/log CUDA context initialization (ideally including a simple way of tracing the CUDA memory allocations that would trigger context initialization too)?
Basically, with large frameworks the CUDA context sometimes gets initialized unexpectedly, and it would be nice to see what triggers it (usually a call to torch.cuda.is_available() or torch.cuda.get_device_count() somewhere, or a sloppy tensor allocation on a CUDA device).
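One lightweight way to catch this from Python is to wrap PyTorch's lazy-initialization hook and dump a stack trace the first time anything initializes the context. A minimal sketch, assuming torch.cuda._lazy_init exists as the internal hook (it is a private API and may change between versions):

```python
import traceback

import torch

# Private API: torch.cuda._lazy_init is the internal hook that performs the
# actual (lazy) CUDA context initialization; it may change between versions.
_original_lazy_init = torch.cuda._lazy_init

def _traced_lazy_init():
    # Only report the very first, real initialization.
    if not torch.cuda.is_initialized():
        print("CUDA context initialization triggered by:")
        traceback.print_stack()
    return _original_lazy_init()

torch.cuda._lazy_init = _traced_lazy_init

# Example trigger: a "sloppy" tensor allocation on a CUDA device.
x = torch.zeros(1, device="cuda")
```

Installing the wrapper as early as possible matters (right after importing torch, or via a sitecustomize module), since the whole point is to catch whichever import or call initializes the context first.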
I guess that for this, and to get stack traces, one would need a manually compiled Python with debug symbols and a manually compiled PyTorch with debug symbols - both are quite nasty to produce for a one-time tracing job.
I wonder if we could get these cuInit calls from the autograd profiler, or from the text-based output of nsys?
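For the nsys route, a sketch of what I would try (assuming a recent Nsight Systems; my_script.py and the report name are placeholders): the CUDA trace records driver API calls, so cuInit should show up in the text summaries, though without the Python call site:

```
nsys profile --trace=cuda -o cuda_trace python my_script.py
nsys stats cuda_trace.nsys-rep
```

That would at least confirm whether and when cuInit happens, but pinpointing the triggering Python line still needs something stack-based.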
No, not necessarily: gdb will also work on release builds, though of course optimizations were performed, and you are right that some symbols might not be shown (e.g. if they were optimized away).
I would try this approach first before starting a debug build.
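A minimal session sketch (my_script.py is a placeholder; cuInit is a public symbol exported by the CUDA driver library, so no debug build is needed to break on it):

```
gdb --args python my_script.py
(gdb) set breakpoint pending on   # libcuda.so is not loaded yet at startup
(gdb) break cuInit
(gdb) run
# once the breakpoint fires:
(gdb) bt       # native C/C++ backtrace into PyTorch
(gdb) py-bt    # Python-level backtrace, if the python-gdb extensions are installed
```

Even with some frames missing symbols, the native backtrace usually narrows down the entry point, and py-bt (where available) maps it to the Python line without rebuilding anything.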