I have a CUDA application that I want to interface with PyTorch, under the following conditions:
1. I do not want to add PyTorch as a dependency of the C++ application, for multiple reasons (e.g., CUDA version conflicts).
2. I want to send/receive GPU data between PyTorch in Python and this C++ code. I am thinking of passing tensor.data_ptr() (after making the tensor contiguous) to the C++ code, along with a pointer to a pre-allocated output tensor that the C++ code will fill. This basically follows the approach from: Constructing PyTorch's CUDA tensor from C++ with image data already on GPU
Does 2. seem to be the best way to achieve 1.? Are there any pitfalls I should take into consideration?
Your approach sounds valid, and you could take a look at this tutorial to see how a custom CUDA extension can be written (in particular, the CUDA kernel).
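As a minimal sketch of the pointer-passing pattern, here is the ctypes plumbing on the Python side. Since neither PyTorch nor CUDA can be assumed in this snippet, libc's memcpy stands in for the (hypothetical) C++ entry point and plain host buffers stand in for tensors; with your real library you would ctypes.CDLL your own .so and pass the integers returned by tensor.data_ptr() the same way. Keep in mind that the tensors must be contiguous, must stay alive for the duration of the call, and that kernel launches are asynchronous, so a torch.cuda.synchronize() may be needed before reading the output.

```python
import ctypes
import ctypes.util

# In the real setup you would load your own shared library, e.g.
#   lib = ctypes.CDLL("./libmykernels.so")   # hypothetical name
# whose exported function launches a CUDA kernel on raw device pointers.
# Here libc's memcpy stands in, so the sketch runs on plain host memory.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.memcpy.restype = ctypes.c_void_p
libc.memcpy.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]

# Input buffer (stand-in for input_tensor.contiguous().data_ptr()).
src = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
# Pre-allocated output buffer that the C/C++ side fills
# (stand-in for output_tensor.data_ptr()).
dst = (ctypes.c_float * 4)()

# Pass raw addresses across the language boundary, exactly as you would
# pass the integer addresses returned by data_ptr().
libc.memcpy(ctypes.addressof(dst), ctypes.addressof(src), ctypes.sizeof(src))

print(list(dst))  # -> [1.0, 2.0, 3.0, 4.0]
```

The key point the sketch illustrates: ownership stays on the Python/PyTorch side (allocation and freeing happen there), and the C++ code only reads from and writes into memory it was handed, which is what avoids the PyTorch dependency in the C++ build.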