Hi all,
I have a data processor written in CUDA C, and I use PyCUDA as the API to call it. Essentially, the kernel creates the data on the GPU:
__global__ void create_data(float* data)
{
    // process the data inside the CUDA kernel
}
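
For reference, here is roughly how I launch it from PyCUDA (a simplified sketch; the kernel body is just a placeholder for my real processing):

import numpy as np
import pycuda.autoinit  # sets up a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void create_data(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = 2.0f * i;  // placeholder for the real processing
}
""")

n = 1024
# float* sitting in GPU global memory
dev_ptr = cuda.mem_alloc(n * np.dtype(np.float32).itemsize)
create_data = mod.get_function("create_data")
create_data(dev_ptr, block=(256, 1, 1), grid=(n // 256, 1))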
The data is processed inside the CUDA kernel and finally written to this float* data, which is sitting, I assume, in GPU global memory. Now, the straightforward approach is to copy this data back to the CPU host, which is easy with CUDA. Then I have PyTorch send the data on the CPU back to the GPU with something like data.to("cuda"), which is also easy.
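
Concretely, the roundtrip I am doing today looks like this (continuing from the sketch above, where dev_ptr and n are defined):

import numpy as np
import torch

host = np.empty(n, dtype=np.float32)
cuda.memcpy_dtoh(host, dev_ptr)        # GPU -> CPU copy
t = torch.from_numpy(host).to("cuda")  # CPU -> GPU copy again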
But this is obviously redundant if PyTorch has some API that can directly access the float* data in GPU memory, so that the program does not need to do this unnecessary GPU->CPU->GPU data transfer. Is there such an API?
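
The closest thing I have found is the __cuda_array_interface__ protocol, which torch.as_tensor can consume. A minimal sketch of what I am hoping works (DevicePtrWrapper is my own hypothetical helper, and dev_ptr/n come from the PyCUDA sketch above):

import torch

class DevicePtrWrapper:
    # hypothetical helper: exposes a raw device pointer via __cuda_array_interface__
    def __init__(self, dev_ptr, shape, typestr="<f4"):
        self.__cuda_array_interface__ = {
            "shape": shape,                  # e.g. (n,)
            "typestr": typestr,              # little-endian float32
            "data": (int(dev_ptr), False),   # (pointer, read-only flag)
            "strides": None,                 # C-contiguous
            "version": 2,
        }

t = torch.as_tensor(DevicePtrWrapper(dev_ptr, (n,)), device="cuda")  # zero-copy view
# note: t does not own the memory, so the PyCUDA allocation must stay alive

One thing I am unsure about: pycuda.autoinit creates its own CUDA context rather than retaining the primary context that PyTorch uses, so I believe the allocation may need to come from the same context (newer PyCUDA versions have pycuda.autoprimaryctx for this). Is this the right approach, or is there a more direct PyTorch API?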