Hi I have an implementation of a kernel in CUDA C++.
I am calling it from my Python code through the torch library API.
Everything works great (the kernel runs on the GPU on regular calls), but when I try to capture a CUDA graph with this code, I get a message that the graph is empty (and replay does not run the kernel, of course).
I would appreciate any help understanding why this is happening, and any advice on how to capture a kernel in PyTorch that is implemented in CUDA C++.
Thanks.
I remember seeing a similar issue here recently where a user forgot to pass the current CUDAStream to their custom kernel, but I cannot find the thread right now. Could this be the case here, too?
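For reference, here is a minimal sketch of what "passing the stream" looks like in a PyTorch C++ extension. The kernel and launcher names (`my_kernel`, `my_kernel_launcher`) are placeholders for your own code; the key part is fetching the stream with `at::cuda::getCurrentCUDAStream()` instead of launching on the default stream:

```cuda
// Sketch of a C++ launcher that respects PyTorch's current stream.
// Assumes a standard PyTorch C++ extension setup.
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

__global__ void my_kernel(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

void my_kernel_launcher(torch::Tensor out, torch::Tensor in) {
  int n = static_cast<int>(in.numel());
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  // Fetch the stream PyTorch currently considers "current". During graph
  // capture this is the capturing side stream, NOT the legacy default
  // stream, so launching here lets the kernel be recorded into the graph.
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  my_kernel<<<blocks, threads, 0, stream>>>(
      out.data_ptr<float>(), in.data_ptr<float>(), n);
}
```

A launcher that omits the fourth launch parameter (or hard-codes stream 0) works fine for eager calls, which would explain why everything ran correctly outside of capture.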
Wow, thanks!!! This works now. Why is that the case, though?
From the docs:
"Capture must occur on a non-default stream"
which will be set as the current stream inside the context manager and has to be passed to custom functions as well.
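The underlying reason is visible in the raw CUDA API: stream capture is per-stream, so only work issued to the capturing stream (or streams forked from it) gets recorded. A hedged standalone illustration (no torch, hypothetical `dummy` kernel):

```cuda
// Minimal raw-CUDA sketch of why the stream matters for graph capture.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy() {}

int main() {
  cudaStream_t s;
  cudaStreamCreate(&s);  // capture requires a non-default stream

  cudaGraph_t graph;
  cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
  // Only work issued to `s` is recorded into the graph. A kernel that
  // hard-codes stream 0 bypasses the capture, which (depending on the
  // capture mode) yields either an empty graph or a capture error.
  dummy<<<1, 1, 0, s>>>();
  cudaStreamEndCapture(s, &graph);

  size_t numNodes = 0;
  cudaGraphGetNodes(graph, nullptr, &numNodes);
  printf("captured %zu node(s)\n", numNodes);
  return 0;
}
```

That is why `torch.cuda.graph` both sets a side stream as current and requires custom kernels to actually launch on it: a kernel launched on the default stream simply never enters the graph.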