Inference/forward passes not asynchronous?

The forward operation appears to be synchronous when I use LibTorch in C++. I’ve taken this model, converted it with a TorchScript trace, and loaded it into my C++ program using LibTorch. Before calling a number of tensor operations and forward, I set the current stream using CUDAStreamGuard:

streams.push_back(torch::cuda::getStreamFromPool());
torch::cuda::CUDAStreamGuard guard(streams[i]);

For the tensor operations, data type conversions and transfers from CPU to GPU are done using non_blocking=true. I am assuming the other operations are asynchronous, per the LibTorch documentation.

Without fail, all of the operations return immediately, except for forward.

torch::Tensor result = rcan.forward(inputs).toTensor();
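
One way to check whether a call really blocks is to time two things separately: how long the call takes to return, and how long until the queued work has finished. Below is a minimal sketch of that measurement using only the standard library. The names AsyncTiming, time_async, and simulated_async_call are hypothetical helpers made up for illustration; they are not part of LibTorch. The demo uses a background thread as a stand-in for GPU work. In the real program, launch would wrap rcan.forward(inputs) and wait would wrap a synchronize on the stream in use (for example streams[i].synchronize(), assuming the stream type exposes that method).

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Wall-clock time split into "call returned" and "work finished".
struct AsyncTiming {
    double return_ms;  // time until launch() returned to the caller
    double total_ms;   // time until wait() had also returned
};

// Runs launch(), then wait(), and reports both durations from the same start.
// Hypothetical helper for illustration only.
template <typename Launch, typename Wait>
AsyncTiming time_async(Launch&& launch, Wait&& wait) {
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;
    const auto start = clock::now();
    std::forward<Launch>(launch)();   // e.g. the forward call
    const auto returned = clock::now();
    std::forward<Wait>(wait)();       // e.g. a stream synchronize
    const auto done = clock::now();
    return {ms(returned - start).count(), ms(done - start).count()};
}

// Stand-in demo with no GPU involved: 50 ms of work handed to another thread.
inline AsyncTiming simulated_async_call() {
    std::future<void> pending;
    return time_async(
        [&] {
            pending = std::async(std::launch::async, [] {
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
            });
        },
        [&] { pending.wait(); });
}
```

If return_ms for forward comes out close to total_ms, the host thread is waiting somewhere inside the call. If return_ms is much smaller than total_ms, the work was only queued and the waiting happens at the synchronize.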

As my GPU utilization is less than 30% according to nvidia-smi, I would like to run multiple forward passes on independent data in parallel using streams. If anyone has some insight into whether forward can be called asynchronously, or whether I am approaching this incorrectly, I would greatly appreciate it. Thank you.