How to synchronize CUDA kernels from another package with PyTorch?

Dear Community,

I’m working on a project where PyTorch works correctly but is just too slow, so I rewrote the slower parts of the algorithm with numba.cuda, a JIT CUDA programming package. In unit tests, each part works fine, but when combined the results become erratic. In Python’s debug mode, everything miraculously works again.

I think the cause is that PyTorch launches a new kernel before numba.cuda’s kernel has finished, since PyTorch has no way of knowing when the numba.cuda kernel completes. In my experiment, immediately after the numba.cuda kernel launch I cloned the results into tensors twice and found that the two tensors cloned from the same buffer are not equal, which means the contents of that memory changed between the two clones. In debug mode everything runs more slowly, giving numba.cuda time to finish its computation before PyTorch launches the next kernel.
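For reference, here is a minimal sketch of that experiment (the kernel, sizes, and names are hypothetical stand-ins, not the actual project code):

```python
import torch
from numba import cuda

@cuda.jit
def my_kernel(out):  # hypothetical stand-in for the real numba.cuda kernel
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = i * 2.0

out = torch.zeros(1 << 20, device='cuda')
# Launch on Numba's stream; PyTorch is unaware of this work.
my_kernel[4096, 256](cuda.as_cuda_array(out))

# These clones are enqueued on PyTorch's stream and may read the buffer
# while the Numba kernel is still writing to it:
a = out.clone()
b = out.clone()
print(torch.equal(a, b))  # can print False when the race occurs
```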

I tried adding numba.cuda.synchronize() to the code, but it didn’t help. Is there any possible solution?

You might want to use record_stream as explained in the CUDA Streams section of the docs.
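For example, a minimal sketch of record_stream in pure PyTorch, independent of the Numba case:

```python
import torch

side = torch.cuda.Stream()
x = torch.randn(1024, device='cuda')  # allocated on the default stream

with torch.cuda.stream(side):
    y = x * 2  # x is consumed on a different stream than it was allocated on

# Tell the caching allocator that x is in use on `side`, so its memory is
# not reused by another allocation before the work on `side` has finished:
x.record_stream(side)

# Order subsequent default-stream work after the side stream:
torch.cuda.current_stream().wait_stream(side)
```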

Thank you very much, that solves the problem. I used PyTorch to create a stream and passed it to Numba, so both libraries run on the same stream and can synchronize on it.
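For anyone finding this later, a sketch of that approach (my_kernel is again a hypothetical placeholder); it relies on torch.cuda.Stream.cuda_stream exposing the raw stream pointer and numba.cuda.external_stream wrapping it:

```python
import torch
from numba import cuda

@cuda.jit
def my_kernel(out):  # hypothetical placeholder kernel
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] += 1.0

s = torch.cuda.Stream()                          # stream created by PyTorch
nb_stream = cuda.external_stream(s.cuda_stream)  # same raw stream, wrapped for Numba

out = torch.zeros(1 << 20, device='cuda')

with torch.cuda.stream(s):
    out += 1  # PyTorch work enqueued on s

# Launch config is [blocks, threads, stream]: the kernel is enqueued on the
# same underlying stream, so it is ordered after the PyTorch work above.
my_kernel[4096, 256, nb_stream](cuda.as_cuda_array(out))

with torch.cuda.stream(s):
    result = out.clone()  # ordered after the Numba kernel

s.synchronize()
```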

This brings me to a question: previously I was using the default stream (stream address 0) in both PyTorch and Numba, so why wouldn’t the kernels execute sequentially as if they were on one stream?

I’m unsure if both libraries are actually using the same stream; you should check with a visual profiler, e.g. Nsight Systems, to see which streams are used for the execution.
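As a starting point, one way to make the two libraries easy to tell apart in the timeline is to wrap the suspect regions in NVTX ranges and profile under nsys (a sketch; the range names are arbitrary):

```python
import torch

torch.cuda.nvtx.range_push("pytorch_part")
# ... PyTorch ops ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("numba_part")
# ... numba.cuda kernel launches ...
torch.cuda.nvtx.range_pop()
```

Run it under Nsight Systems, e.g. `nsys profile -t cuda,nvtx python script.py`, and check which stream each kernel lands on.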