Multi-threaded, Single GPU libtorch usage

I’m running multi-threaded inference with libtorch, using something like the following (pseudocode for simplicity):

in_tensor = in_tensor.to(torch::kCUDA, dtype, /*non_blocking=*/false, /*copy=*/false, memory_format);
semaphore.acquire();                                 // only one thread runs forward() at a time
out_tensor = model.forward({in_tensor}).toTensor();  // IValue wrapping shown explicitly here
torch::cuda::synchronize();                          // Does this also sync any thread-parallel data transfers?
semaphore.release();
out_tensor = out_tensor.to(torch::kCPU, dtype, /*non_blocking=*/false, /*copy=*/false, memory_format);

So multiple threads may be calling in_tensor.to(cuda...) / out_tensor.to(cpu...) in parallel, but only a single thread at a time calls model.forward(); that is what the semaphore guards.

I want to ensure that my CPU ↔ GPU data transfers really do run in parallel with the model.forward() call and don’t interfere with the computation on the GPU. But I’m unclear on the precise meaning of non_blocking=false: it blocks the calling thread (great!), but does it also block the CUDA stream, thereby potentially stalling an ongoing model.forward()?
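
As far as I can tell, the asynchronous path additionally requires pinned (page-locked) host memory; with pageable memory the copy is synchronous regardless of the flag. A minimal sketch of what I mean (my understanding, not verified):

// My mental model (please correct me): the copy is enqueued on the current
// CUDA stream, so it is ordered after kernels already queued there rather
// than interrupting them; non_blocking only controls whether the host waits.
auto pinned = in_tensor.pin_memory();                          // page-locked host buffer
auto gpu_in = pinned.to(torch::kCUDA, /*non_blocking=*/true);  // async H2D copy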

And when I call torch::cuda::synchronize() to make sure model.forward() has actually completed (forward() itself returns right away, i.e. it is asynchronous), does that also wait for tensor transfers that other threads still have in flight? That would be very undesirable.
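
Related: if torch::cuda::synchronize() really is device-wide, maybe a per-stream synchronization is what I want instead, since it should only wait for work enqueued on that particular stream. A rough sketch using the c10 stream API (assuming I’m reading c10/cuda/CUDAStream.h correctly):

#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

// Give each worker thread its own stream, so its copies/kernels are
// ordered independently of the other threads' work.
c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();
{
  c10::cuda::CUDAStreamGuard guard(stream);  // make it the current stream
  // ... enqueue copies / forward() here ...
}
stream.synchronize();  // waits only for this stream, not the whole device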

I did some performance testing of various permutations, and here are the results:

to - sem.acq - forward - from - sem.release: 2,474 images processed
to - sem.acq - forward - sem.release - from: 2,491 images processed
sem.acq - to - forward - from - sem.release: 2,484 images processed
to.nonblock - sem.acq - forward - from - sem.release: 2,404 images processed
to.nonblock - forward - from: 2,271 images processed

(to = CPU → GPU, from = GPU → CPU)

So there doesn’t seem to be any real difference between the first three orderings, while the last two show a measurable performance degradation.

That tells me that data transfer and kernel computation aren’t actually overlapping, and that using non_blocking=true or skipping the synchronization introduces some contention that causes delays.
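
To confirm that on a timeline rather than from throughput alone, I’m planning to mark the phases with NVTX ranges and look at them in Nsight Systems (nvtxRangePushA/nvtxRangePop come from NVIDIA’s nvToolsExt, which I believe ships with the CUDA toolkit and is linked with -lnvToolsExt; the range names are just labels I made up):

#include <nvToolsExt.h>

nvtxRangePushA("h2d_copy");  // shows up as a named range on the timeline
in_tensor = in_tensor.to(torch::kCUDA, dtype, /*non_blocking=*/false, /*copy=*/false, memory_format);
nvtxRangePop();

nvtxRangePushA("forward");
out_tensor = model.forward({in_tensor}).toTensor();
nvtxRangePop();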

What is the recommended recipe for doing data transfer in parallel with kernel computation?
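
In case it helps to have something concrete to correct: the pattern I imagine the answer looks like is pinned host buffers, non_blocking copies on a dedicated copy stream, and a CUDA event that makes the compute stream wait for the upload on the device instead of blocking the host. A rough sketch (assuming at::cuda::CUDAEvent from ATen/cuda/CUDAEvent.h; cpu_in and model are stand-ins for my actual input tensor and module, and the details may well be wrong):

#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

c10::cuda::CUDAStream copy_stream    = c10::cuda::getStreamFromPool();
c10::cuda::CUDAStream compute_stream = c10::cuda::getStreamFromPool();
at::cuda::CUDAEvent uploaded;  // marks completion of the H2D copy

torch::Tensor gpu_in, out_tensor;
{
  // 1) Upload on the copy stream; pin_memory() enables the async copy
  //    (ideally the pinned buffer would be allocated once and reused).
  c10::cuda::CUDAStreamGuard guard(copy_stream);
  gpu_in = cpu_in.pin_memory().to(torch::kCUDA, /*non_blocking=*/true);
  uploaded.record(copy_stream);
}
{
  // 2) The compute stream waits for the upload on the device, not on the
  //    host, then runs forward() concurrently with other threads' copies.
  c10::cuda::CUDAStreamGuard guard(compute_stream);
  uploaded.block(compute_stream);
  out_tensor = model.forward({gpu_in}).toTensor();
}
// 3) Synchronize only the stream we actually need the result from.
compute_stream.synchronize();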

CC: @ptrblck