I’m running multi-threaded inference with libtorch, using something like the following (pseudocode for simplicity):
```cpp
in_tensor = in_tensor.to(torch::kCUDA, dtype, /*non_blocking=*/false, /*copy=*/false, memory_format);
semaphore.acquire();
out_tensor = model.forward(in_tensor); // actually wrapped in an IValue, eliding that
torch::cuda::synchronize(); // Does this also sync any thread-parallel data transfers?
semaphore.release();
out_tensor = out_tensor.to(torch::kCPU, dtype, /*non_blocking=*/false, /*copy=*/false, memory_format);
```
So multiple threads may be calling `in_tensor.to(torch::kCUDA, ...)` / `out_tensor.to(torch::kCPU, ...)` in parallel, but only a single thread is calling `model.forward()` at a time.
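For reference, here is a slightly more complete sketch of what each worker thread runs. The module loading, semaphore type, and dtype are placeholders I made up for illustration, not my actual setup:

```cpp
#include <torch/script.h>
#include <torch/cuda.h>
#include <semaphore> // C++20

// Placeholders for illustration.
std::counting_semaphore<1> gpu_semaphore{1}; // serializes forward() calls
torch::jit::script::Module model = torch::jit::load("model.pt");

torch::Tensor infer(torch::Tensor in_tensor) {
  // Host-to-device copy; other threads may be doing this concurrently
  // with the single thread that currently runs forward().
  in_tensor = in_tensor.to(torch::kCUDA, torch::kFloat32,
                           /*non_blocking=*/false, /*copy=*/false);

  gpu_semaphore.acquire();
  // forward() enqueues GPU work and returns before the kernels finish.
  torch::Tensor out_tensor = model.forward({in_tensor}).toTensor();
  torch::cuda::synchronize(); // device-wide sync; see my question below
  gpu_semaphore.release();

  // Device-to-host copy of the result.
  return out_tensor.to(torch::kCPU, torch::kFloat32,
                       /*non_blocking=*/false, /*copy=*/false);
}
```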
I want to ensure that my CPU ↔ GPU data transfers do indeed run in parallel with the `model.forward()` call and don’t interfere with the computations on the GPU, but I’m unclear on the precise meaning of `non_blocking=false`: it blocks the calling thread (great!), but does it also block the CUDA stream, thereby potentially interrupting any ongoing `model.forward()`?
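For context, my current understanding is that guaranteed overlap usually requires pinned host memory plus `non_blocking=true` on a dedicated stream. This is a sketch of what I would try in that case, assuming `c10::cuda::getStreamFromPool` and `CUDAStreamGuard` behave the way I think they do:

```cpp
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/torch.h>

// Sketch: issue the upload on a per-thread side stream so it cannot
// serialize against forward() running on the default stream.
torch::Tensor upload(const torch::Tensor& cpu_tensor) {
  c10::cuda::CUDAStream copy_stream = c10::cuda::getStreamFromPool();
  c10::cuda::CUDAStreamGuard guard(copy_stream); // current stream for this scope

  // non_blocking=true only helps when the source is in pinned memory.
  torch::Tensor pinned = cpu_tensor.pin_memory();
  torch::Tensor gpu = pinned.to(torch::kCUDA, /*non_blocking=*/true);

  copy_stream.synchronize(); // wait for this copy only, not the whole device
  return gpu;
}
```

(I’m aware there are caching-allocator subtleties when memory allocated on one stream is later used on another, so this is a sketch, not a complete solution.)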
And when I call `torch::cuda::synchronize()` to ensure `model.forward()` has actually completed (forward itself returns right away, i.e. the GPU work is enqueued asynchronously), does that then wait for any ongoing tensor transfers to complete as well? That would be very undesirable.