Dear Community,
I’m working on a project where PyTorch produces correct results but is just too slow, so I rewrote the slower parts of the algorithm with numba.cuda, a JIT CUDA programming package. In unit tests, each part works fine on its own, but when they are combined the results go wrong. And in Python’s debug mode, everything miraculously works fine again.
I think the reason is that PyTorch launches its next kernel before numba.cuda’s kernel has finished, since PyTorch has no way of knowing when the numba.cuda kernel is done. In my experiment, right after the numba.cuda kernel launch I clone the result buffer into a tensor twice, and the two tensors cloned from the same buffer are not equal, which means the contents of that memory changed between the two clones. In debug mode everything runs slower, which gives numba.cuda time to finish its calculations before PyTorch starts the next kernel.
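To make this concrete, here is a minimal sketch of the pattern (the kernel body, names, and sizes are placeholders, not my actual code):

```python
import torch
from numba import cuda

# Placeholder kernel standing in for the numba.cuda rewrite of the slow part;
# the real kernel is more involved, but the launch/readback pattern is the same.
@cuda.jit
def my_kernel(out, x):
    i = cuda.grid(1)
    if i < out.shape[0]:
        out[i] = x[i] * 2.0

x = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)

threads = 256
blocks = (x.numel() + threads - 1) // threads
# the torch CUDA tensors are handed to the kernel via __cuda_array_interface__
my_kernel[blocks, threads](out, x)

# Immediately clone the result twice and compare the clones.
a = out.clone()
b = out.clone()
print(torch.equal(a, b))  # comes out False, so the buffer changed between the two clones
```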
I tried adding numba.cuda.synchronize() to the code (roughly as in the snippet below), but it didn’t help. I would like to know if there is any possible solution.
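The placement I mean is roughly this, reusing the hypothetical names from the sketch above:

```python
my_kernel[blocks, threads](out, x)
cuda.synchronize()  # i.e. numba.cuda.synchronize(); the two clones still come out unequal
a = out.clone()
b = out.clone()
```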