I am currently running into a problem where forward appears to be synchronous when using LibTorch in C++. I converted my model (RCAN) with a TorchScript trace and loaded it into my C++ program using LibTorch. Before calling a number of tensor operations and forward, I set the stream using a CUDAStreamGuard:
#include <c10/cuda/CUDAStream.h>  // at::cuda::getStreamFromPool
#include <c10/cuda/CUDAGuard.h>   // at::cuda::CUDAStreamGuard

streams.push_back(at::cuda::getStreamFromPool());
at::cuda::CUDAStreamGuard guard(streams[i]);
For the tensor operations, data-type conversions and CPU-to-GPU transfers are done with non_blocking=true; the other CUDA operations should be asynchronous per the LibTorch documentation.
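For reference, the transfers look roughly like this (a sketch; the shape and dtype are illustrative, not my actual values). My understanding is that non_blocking=true only returns immediately when the host tensor is in pinned memory, so I allocate it that way:

```cpp
#include <torch/torch.h>

// Host-side staging tensor in pinned memory so the H2D copy can be async.
torch::Tensor cpu_input = torch::empty(
    {1, 3, 256, 256},
    torch::TensorOptions().dtype(torch::kUInt8).pinned_memory(true));

// Enqueue transfer + dtype conversion on the current stream; both calls
// return immediately as expected.
torch::Tensor gpu_input = cpu_input.to(
    torch::Device(torch::kCUDA), torch::kFloat32,
    /*non_blocking=*/true);
```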
Without fail, all of these operations return immediately, except for forward:
torch::Tensor result = rcan.forward(inputs).toTensor();
As my GPU utilization stays below 30% according to nvidia-smi, I would like to run multiple forward passes on independent data in parallel using streams. If anyone has insight into whether forward can be made asynchronous, or whether I am approaching this incorrectly, I would greatly appreciate it. Thank you.
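To make the goal concrete, here is a minimal sketch of what I am trying to do (assuming rcan is the loaded torch::jit::script::Module, that calling forward concurrently from several threads is safe for inference, and that run_parallel and batches are illustrative names):

```cpp
#include <torch/script.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <thread>
#include <vector>

void run_parallel(torch::jit::script::Module& rcan,
                  std::vector<torch::Tensor>& batches) {
  std::vector<std::thread> workers;
  for (auto& batch : batches) {
    workers.emplace_back([&rcan, &batch] {
      // Each thread enqueues its work on its own stream so that the
      // forward passes can overlap on the GPU.
      at::cuda::CUDAStream stream = at::cuda::getStreamFromPool();
      at::cuda::CUDAStreamGuard guard(stream);

      std::vector<torch::jit::IValue> inputs;
      inputs.emplace_back(batch.to(
          torch::TensorOptions().device(torch::kCUDA),
          /*non_blocking=*/true));

      torch::Tensor result = rcan.forward(inputs).toTensor();

      // Wait only for this stream's work before using the result.
      stream.synchronize();
    });
  }
  for (auto& w : workers) w.join();
}
```

Is this the right pattern, or does forward itself block in a way that defeats the per-stream overlap?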