Pinned memory doesn't provide any speedup

I am training a network with two GPUs.
I found many topics saying that pinned memory can improve training speed a lot, but when I use pinned memory I don't see any speedup. Following is my code.

1. Make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.

data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=self.opt.batchSize,        # batchSize is 6
    shuffle=bool(self.opt.shuffle_data),
    num_workers=int(self.opt.nThreads),
    pin_memory=True)
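For reference, here is a minimal, self-contained sketch of step 1. The dataset is a toy stand-in for your own, and pinning is guarded by a CUDA check so the snippet also runs on a CPU-only machine (pinning host memory requires CUDA):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset: 24 samples of features and targets
dataset = TensorDataset(torch.randn(24, 3), torch.randn(24, 1))

# Pinning only has an effect (and is only possible) when CUDA is present
use_cuda = torch.cuda.is_available()
data_loader = DataLoader(
    dataset,
    batch_size=6,
    shuffle=True,
    num_workers=0,        # keep 0 so the sketch runs anywhere
    pin_memory=use_cuda)

x, y = next(iter(data_loader))
# Batches come out pinned exactly when pinning was requested
print(x.is_pinned() == use_cuda)
```

Note that pinning by itself only speeds up the host-to-device copy; you only see overlap with compute if the copy is also asynchronous (step 2 below).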

2. Call the pin_memory() method on the batch and pass an additional async=True argument to the cuda()/copy_() call.

# data is a torch.FloatTensor coming from the DataLoader
# self.input_A and self.input_B are torch.cuda.FloatTensor
input_A = data.pin_memory()
input_B = data.pin_memory()
self.input_A.resize_(input_A.size()).copy_(input_A, async=True)
self.input_B.resize_(input_B.size()).copy_(input_B, async=True)
self.real_A = Variable(self.input_A)
self.fake_B = self.netG.forward(self.real_A)
self.real_B = Variable(self.input_B)
--------start training D and G network-------------
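As a sketch of the transfer in step 2: in current PyTorch the async=True keyword has been renamed non_blocking=True, and an asynchronous host-to-device copy only actually overlaps with compute when the source tensor is pinned. The tensor shapes here are hypothetical, and the snippet is guarded so it also runs without a GPU:

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

data = torch.randn(6, 3, 8, 8)                  # stand-in for a DataLoader batch
host = data.pin_memory() if use_cuda else data  # pinning requires CUDA

# non_blocking=True is the modern spelling of async=True; the call returns
# immediately, and the copy overlaps with compute only if `host` is pinned
batch = host.to(device, non_blocking=True)
print(batch.device.type == device.type)
```

If the DataLoader was constructed with pin_memory=True, the batches it yields are already pinned, so the explicit pin_memory() call above is redundant in that case.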

Is my code wrong? Does anyone have any ideas, or could you give me an example of how to use pinned memory? Thanks a lot.

I experience the same thing with a single-GPU setup. Passing pin_memory=True to DataLoader does not seem to improve performance in any way. From cProfile it seems that torch._C.CudaFloatTensorBase._copy() consumes one third of all batch-processing time, which is a lot! This holds both with and without pin_memory. Thank you!

So the majority of the time was spent moving variables to the GPU and doing the forward pass (this is GPU work); the backward pass took surprisingly little time.

UPDATE: It turns out that if one sets CUDA_LAUNCH_BLOCKING=1 when running the script, the profiling results are much more meaningful. Here, for example, the majority of the time is spent in backwards_run and forward, which makes sense.
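For completeness, a sketch of that profiling setup. CUDA kernel launches are asynchronous, so without CUDA_LAUNCH_BLOCKING=1 the elapsed GPU time gets attributed to whichever call happens to synchronize (often a copy); the variable must be set before CUDA is first initialized. The training step below is a hypothetical stand-in for the real forward/backward pass:

```python
import cProfile
import io
import os
import pstats

# Must be set before the first CUDA call so kernel launches block
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def train_step():
    # Hypothetical stand-in for the real model's forward/backward pass
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
train_step()
profiler.disable()

# Print the top entries sorted by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print("train_step" in buf.getvalue())
```

Equivalently, one can set the variable from the shell, e.g. `CUDA_LAUNCH_BLOCKING=1 python -m cProfile train.py`, which avoids the import-order concern entirely.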