Inference/forward passes not asynchronous?

I am currently running into a problem where the forward operation appears to be synchronous when using LibTorch in C++. I've taken this model, converted it to TorchScript via tracing, and loaded it into my C++ program using LibTorch. Before calling a number of tensor operations and forward, I set the stream using CUDAStreamGuard:

torch::cuda::CUDAStreamGuard guard(streams[i]);

For the tensor operations, data type conversions and transfers from CPU to GPU are done using non_blocking=true. Other operations are assumed to be asynchronous per the LibTorch documentation.

Without fail, all of these operations return immediately, except for forward, which blocks until the computation completes.

torch::Tensor result = rcan.forward(inputs).toTensor();

As my GPU utilization is below 30% according to nvidia-smi, I would like to run multiple forward passes on independent data in parallel using streams. If anyone has insight into whether forward can be called asynchronously, or whether I am approaching this incorrectly, I would greatly appreciate it. Thank you.

Could you try using some random input data and check the GPU utilization again, please?
This would indicate whether the data-loading pipeline is the bottleneck.