How does PyTorch guarantee resources are held when asynchronously launching kernels?

I’m working on a C++ extension for PyTorch, and I have a question: PyTorch launches CUDA kernels asynchronously. Suppose we have this code:

__global__ void XXXKernel(float *a, float *b) {
        //some computational workload with a, b
}

at::Tensor XXX(at::Tensor a, at::Tensor b) {
        XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());
        //No cudaStreamSynchronize or cudaDeviceSynchronize here
        return a;
}

//use pybind11 to bind XXX to a python function ext.XXX

Python code:

def test() -> torch.Tensor:
    a = torch.randn(10000, device="cuda")  # the CUDA kernel needs device memory
    b = torch.randn(10000, device="cuda")
    return ext.XXX(a, b)

test()

Because the kernel is launched asynchronously, the test function may return before XXXKernel has finished, right? A further question: is it possible that the variables a and b are released by the GC before XXXKernel finishes? Could the internal buffers of a and b be freed before XXXKernel finishes? So, how does PyTorch guarantee the resources are held?

It’s not the kernel launches that are asynchronous, but the kernel execution. The launches themselves are executed by the host, in stream order.

Yes, the host can return and continue its execution before the GPU work is done.
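
You can see this with a minimal standalone sketch (the spin kernel and its duration are made up for illustration): right after an asynchronous launch, cudaStreamQuery on the same stream typically still reports the work as pending.

#include <cuda_runtime.h>
#include <cstdio>

//busy-spin on the device long enough that the kernel is still running
//when the host queries the stream
__global__ void spin(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { }
}

int main() {
        spin<<<1, 1>>>(1000000000LL);        //async launch; the host does not wait
        cudaError_t s = cudaStreamQuery(0);  //host gets here immediately
        printf("kernel finished yet? %s\n", s == cudaSuccess ? "yes" : "no");
        cudaDeviceSynchronize();             //now actually block until the GPU is done
        return 0;
}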

No, this shouldn’t be possible as the reference count should be increased while the object is used.
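
For example, copying an at::Tensor is cheap and just bumps the refcount of the shared TensorImpl, so the storage stays alive as long as any copy does. A minimal sketch:

#include <torch/torch.h>
#include <iostream>

int main() {
        at::Tensor a = torch::randn({10000});
        std::cout << a.use_count() << std::endl;      //1: a is the only reference
        {
                at::Tensor guard = a;                 //shallow copy, shares the same storage
                std::cout << a.use_count() << std::endl;  //2: refcount bumped
        }
        std::cout << a.use_count() << std::endl;      //1 again: guard was destroyed
        return 0;
}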

OK, thanks. So I should keep references to a and b alive until the kernel execution finishes.

But I still don’t understand the 3rd point: when the C++ XXX function returns, a and b are destructed and decrement the reference counts. If the Python function test has returned and the Python variables a and b have just been deleted by the GC (I’m not sure about this; I think a and b may be collected by the GC after the test function returns, or is this hypothesis impossible?), then the two tensors should have zero references and free their buffers. If XXXKernel is not finished at that point, could this cause a warp illegal address error?

If I want to keep references to the tensors a and b until the kernel execution finishes, should I copy the at::Tensor variables a and b internally in the C++ code and keep them until the kernel execution finishes? Like this (BTW: I don’t know how to implement ‘isNotFinished’):



at::Tensor XXX(at::Tensor a, at::Tensor b) {
        XXXKernel<<<...>>>(a.data_ptr<float>(), b.data_ptr<float>());
        //No cudaStreamSynchronize or cudaDeviceSynchronize here

        //capture copies of a and b by value: the copies bump the refcounts
        //and keep the storage alive for as long as the lambda is alive
        std::thread([guard_a = a, guard_b = b]() {
                while (isNotFinished(XXXKernel))  //placeholder, not a real API
                        std::this_thread::yield();
                //guard_a and guard_b are destroyed automatically when the
                //lambda returns; calling ~Tensor() on them explicitly would
                //destroy them twice
        }).detach();

        return a;
}

Really appreciate your help!

Oh, I just learned how to implement ‘isNotFinished’: I can use cudaLaunchHostFunc to add a callback that releases guard_a and guard_b after XXXKernel finishes execution.
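
Something like this, if I understand the API correctly (a minimal sketch: grid and block are placeholder launch parameters, and since the CUDA docs say the host function must not make CUDA API calls, I’m assuming that just dropping the tensor references inside the callback is safe):

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime.h>
#include <utility>

__global__ void XXXKernel(float *a, float *b);  //defined as above

//host callback run by the CUDA runtime once all prior work on the stream
//(including XXXKernel) has completed; deleting the pair drops the two
//tensor refcounts
static void releaseGuards(void *userData) {
        delete static_cast<std::pair<at::Tensor, at::Tensor> *>(userData);
}

at::Tensor XXX(at::Tensor a, at::Tensor b) {
        cudaStream_t stream = at::cuda::getCurrentCUDAStream();
        XXXKernel<<<grid, block, 0, stream>>>(  //grid/block: placeholder launch config
                a.data_ptr<float>(), b.data_ptr<float>());

        //heap-allocate copies of a and b so their refcounts stay bumped
        //until the callback fires after the kernel finishes
        auto *guards = new std::pair<at::Tensor, at::Tensor>(a, b);
        cudaLaunchHostFunc(stream, releaseGuards, guards);

        return a;
}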
