I’m working with a C++ extension for PyTorch, and I have a question: PyTorch launches CUDA kernels asynchronously. Suppose we have this code:
#include <torch/extension.h>

__global__ void XXXKernel(float *a, float *b) {
    // some computational workload on a, b
}

at::Tensor XXX(at::Tensor a, at::Tensor b) {
    XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());
    // No cudaStreamSynchronize or cudaDeviceSynchronize here
    return b;
}
// use pybind11 to bind XXX to a Python function ext.XXX
Python code:

import torch
import ext

def test() -> torch.Tensor:
    a = torch.randn(10000, device="cuda")  # must live on the GPU for the kernel
    b = torch.randn(10000, device="cuda")
    return ext.XXX(a, b)

test()
Because the kernel is launched asynchronously, the test function may return before XXXKernel has finished, is that right? A further question: is it possible that the variables a and b are released by the GC before XXXKernel finishes? Could the internal buffers of a and b be freed before XXXKernel finishes? If so, how do I guarantee that the resources are held?
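For context, here is a small standalone check that should make the asynchrony visible (a sketch; busyKernel and the iteration count are just my illustration):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void busyKernel() {
    // spin long enough that completion clearly happens after the launch returns
    for (volatile int i = 0; i < 100000000; ++i) {}
}

int main() {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    busyKernel<<<1, 1>>>();      // enqueues the kernel and returns almost immediately
    auto t1 = clock::now();
    cudaDeviceSynchronize();     // blocks until the kernel has actually finished
    auto t2 = clock::now();
    auto us = [](auto d) {
        return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    std::printf("launch: %lld us, sync: %lld us\n",
                (long long)us(t1 - t0), (long long)us(t2 - t1));
}

If the launch really is asynchronous, the first number should be tiny and the second one large.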
OK, thanks. So I should keep the reference counts of a and b up until the kernel has finished executing.
But I still don’t understand the third question. When the C++ XXX function returns, a and b are destructed and their reference counts decrease. If the Python function test has returned and the Python variables a and b have just been deleted by the GC (I’m not sure about this; I think a and b may be collected by the GC after test returns, or is this hypothesis impossible?), then at that point the two tensors would have zero references and free their buffers while XXXKernel is still running. Couldn’t this cause a warp "illegal address" error?
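To check my understanding of the reference counting, here is a small sketch of at::Tensor’s shared-handle semantics (my own illustration; as far as I know use_count() reports the TensorImpl refcount):

#include <ATen/ATen.h>
#include <iostream>

int main() {
    at::Tensor a = at::randn({4});
    std::cout << a.use_count() << "\n";      // 1: a single handle owns the storage
    {
        at::Tensor copy = a;                 // copying the handle bumps the refcount
        std::cout << a.use_count() << "\n";  // 2
    }                                        // copy destructs here
    std::cout << a.use_count() << "\n";      // back to 1; storage is freed at 0
}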
If I want to keep references to tensors a and b alive until the kernel has finished executing, should I copy the at::Tensor variables a and b internally in the C++ code and hold them until the kernel finishes? Like this (BTW: I don’t know how to implement ‘isNotFinished’):
#include <torch/extension.h>
#include <thread>

at::Tensor XXX(at::Tensor a, at::Tensor b) {
    XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());
    // No cudaStreamSynchronize or cudaDeviceSynchronize here
    // guard_a and guard_b are copies, so they keep the storage alive
    std::thread([guard_a = a, guard_b = b]() {
        while (isNotFinished(XXXKernel))
            std::this_thread::yield();
        // guard_a and guard_b destruct when the lambda exits, dropping the
        // references (explicitly calling ~Tensor() here would destroy them twice)
    }).detach();
    return b;
}
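(For the polling placeholder, something like a cudaEvent recorded right after the launch might work; this is a sketch and the event wiring is my guess, assuming the kernel went to the default stream:)

#include <torch/extension.h>
#include <cuda_runtime.h>
#include <thread>

at::Tensor XXX(at::Tensor a, at::Tensor b) {
    XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, /*stream=*/0);  // same default stream as the launch

    std::thread([guard_a = a, guard_b = b, done]() {
        // cudaEventQuery returns cudaErrorNotReady until all work recorded
        // before the event (i.e. XXXKernel) has completed
        while (cudaEventQuery(done) == cudaErrorNotReady)
            std::this_thread::yield();
        cudaEventDestroy(done);
        // guard_a and guard_b are released when the lambda exits
    }).detach();
    return b;
}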
Oh, I just learned a cleaner way to implement ‘isNotFinished’: I can use cudaLaunchHostFunc to enqueue a callback that releases guard_a and guard_b after XXXKernel finishes execution.
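A sketch of what I have in mind (the KeepAlive struct and its wiring are my own illustration; note the CUDA docs say the host function must not itself make CUDA API calls):

#include <torch/extension.h>
#include <cuda_runtime.h>

namespace {
struct KeepAlive {
    at::Tensor a, b;  // copies, so the underlying storage stays alive
};

// Runs on a CUDA-owned thread once everything enqueued before it on the
// stream (including XXXKernel) has completed. Must not call CUDA APIs.
void releaseGuards(void *userData) {
    delete static_cast<KeepAlive *>(userData);  // drops both references
}
}  // namespace

at::Tensor XXX(at::Tensor a, at::Tensor b) {
    XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());
    // Enqueue the release behind the kernel on the same (default) stream;
    // a real extension would likely use at::cuda::getCurrentCUDAStream().
    cudaLaunchHostFunc(/*stream=*/0, releaseGuards, new KeepAlive{a, b});
    return b;
}

Compared with the detached polling thread this burns no CPU, and the ordering guarantee comes from the stream itself. My one remaining doubt: releasing the tensors in the callback returns their blocks to PyTorch’s caching allocator, and I am not certain that path never touches a CUDA API, which the host-function rule forbids.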