How do I send to and receive from CUDA simultaneously from the same process?

I'm new to PyTorch / CUDA and trying to figure this out. Say I have two GPU tensors, x_gpu and y_gpu, with corresponding CPU tensors x_cpu and y_cpu. I'd like to execute these two transfers:
```python
x_gpu = x_cpu.cuda()  # PCIe downlink (host to device)
y_cpu = y_gpu.cpu()   # PCIe uplink (device to host)
```

It'd be great to have them non-blocking so they can run in parallel. So I passed the non_blocking=True argument: x_cpu.cuda(non_blocking=True). That made no difference to the measured bandwidth, which is pretty low (around 192 GB/sec). If I copy in only one direction, I get 288 GB/sec (which also seems really low; that's another mystery at this point). Thanks!
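For context, here is a minimal sketch of the kind of measurement I have in mind (tensor names, sizes, and the stream setup are my own choices, not anything from a tutorial). It assumes pinned host buffers and one stream per direction, since my understanding is that non_blocking=True only gives a truly asynchronous copy when the host memory is page-locked:

```python
import torch

def gb_per_s(nbytes: int, ms: float) -> float:
    """Convert bytes moved in `ms` milliseconds to GB/s (1 GB = 1e9 bytes)."""
    return nbytes / ms / 1e6

if torch.cuda.is_available():
    n = 64 * 1024 * 1024  # 64M float32 elements = 256 MB per tensor

    # Pinned (page-locked) host memory is required for an async copy;
    # with ordinary pageable memory, non_blocking=True silently falls
    # back to a synchronous transfer.
    x_cpu = torch.empty(n, pin_memory=True)
    y_cpu = torch.empty(n, pin_memory=True)
    y_gpu = torch.empty(n, device="cuda")

    # One stream per direction, so the two DMA engines can overlap the
    # downlink and uplink transfers.
    s_h2d = torch.cuda.Stream()
    s_d2h = torch.cuda.Stream()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()
    start.record()
    with torch.cuda.stream(s_h2d):
        x_gpu = x_cpu.to("cuda", non_blocking=True)  # PCIe downlink
    with torch.cuda.stream(s_d2h):
        y_cpu.copy_(y_gpu, non_blocking=True)        # PCIe uplink
    # Make the default stream wait for both copies before timing ends.
    torch.cuda.current_stream().wait_stream(s_h2d)
    torch.cuda.current_stream().wait_stream(s_d2h)
    end.record()
    torch.cuda.synchronize()

    total_bytes = 2 * n * 4  # both directions, float32
    print(f"aggregate: {gb_per_s(total_bytes, start.elapsed_time(end)):.1f} GB/s")
```

Is this roughly the right way to set up and time the bidirectional transfer, or am I missing something?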