Pinned memory can't provide any speedup

I'm experiencing the same thing with a single-GPU setup. Passing pin_memory=True to DataLoader does not seem to improve performance in any way. From cProfile it looks like torch._C.CudaFloatTensorBase._copy() consumes one third of all batch processing time, which is a lot, both with and without pin_memory=True. Thank you!
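For reference, pin_memory=True is typically combined with non_blocking=True on the device copy, since pinned (page-locked) memory mainly enables asynchronous host-to-device transfers. A minimal sketch of that pattern; the toy dataset, batch size, and device handling are placeholders, not taken from this thread:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data standing in for the real dataset.
dataset = TensorDataset(torch.randn(10000, 3, 32, 32),
                        torch.randint(0, 10, (10000,)))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)
device = torch.device("cuda")

for inputs, targets in loader:
    # With pinned host memory, non_blocking=True lets the host-to-device copy
    # be issued asynchronously; with the default non_blocking=False the copy
    # still synchronizes with the host thread.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...
```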

So the majority of the time was spent moving variables to the GPU and doing the forward pass (i.e. the GPU-facing work); the backward pass took surprisingly little time.
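For context, here is roughly how a per-call breakdown like that can be produced with cProfile; the tiny model, optimizer, and data below are hypothetical stand-ins for whatever was actually being trained:

```python
import cProfile
import pstats

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the real model and data.
device = torch.device("cuda")
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(
    TensorDataset(torch.randn(2048, 3, 32, 32), torch.randint(0, 10, (2048,))),
    batch_size=64, pin_memory=True,
)

def train_one_epoch():
    for inputs, targets in loader:
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

profiler = cProfile.Profile()
profiler.enable()
train_one_epoch()
profiler.disable()

# Sort by cumulative time to see where each batch spends its time,
# e.g. the _copy() call mentioned above.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```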

UPDATE: It turns out that if one sets CUDA_LAUNCH_BLOCKING=1 when running the script, the profiling results are much more meaningful. Here, for example, the majority of the time is spent in backwards_run and forward, which makes sense.
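In case it helps others: CUDA_LAUNCH_BLOCKING is an environment variable read when the CUDA runtime initializes, so it has to be set before any CUDA work happens. The usual way is on the command line (e.g. CUDA_LAUNCH_BLOCKING=1 python train.py, with train.py as a placeholder name), but setting it at the very top of the script works too:

```python
import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# which is why it is usually exported in the shell before launching the
# script; setting it here, before importing torch, is the safe equivalent.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# With kernel launches made synchronous, GPU time is charged to the Python
# call that issued the kernel, so the profile attributes time to forward and
# backward instead of to whichever later operation happened to synchronize.
```

The trade-off is that the run itself gets slower, since launches no longer overlap with host code, so this is a profiling aid rather than something to leave enabled.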