Weird problem - backward() slows array access?

As part of a computation, I need to access an element of a variable after running a backward pass. The weird thing is that on the GPU this access takes much longer than on the CPU. In the snippet below, I is an image of size 3x224x224 and net is resnet50 from torchvision.models:

x = Variable(I.unsqueeze_(0).cuda(), requires_grad=True)
output = net(x)
output.backward(...)  # backward pass described in the text; exact call not shown in the original post
ff = ...              # assignment truncated in the original post

alp = ff[100]

Now, normally I would expect the last line to take less than a millisecond, and that is the case if I do not use the GPU (i.e., if I remove the .cuda() calls). On the GPU, however, it takes around 100 ms. Is there a problem? Is it copying alp from GPU memory, and is that causing the slowdown? Or am I doing something wrong?


The CUDA API is asynchronous, so the backward call returns before everything is actually computed.
Your last line accesses data that lives on the GPU, which forces a synchronization and therefore waits for all pending computation to finish.
If you add a torch.cuda.synchronize() before your last line, the indexing itself will become instant again (the wait simply moves into the synchronize call).
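You can see this effect with a small timing sketch (a hypothetical example, not the original code; it assumes a recent PyTorch and falls back to the CPU, where ops run synchronously and the effect disappears):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)

t0 = time.time()
b = a @ a                      # on CUDA this only enqueues the kernel
launch_time = time.time() - t0

t0 = time.time()
if device == "cuda":
    torch.cuda.synchronize()   # block until the matmul really finishes
wait_time = time.time() - t0

t0 = time.time()
val = b[123, 45].item()        # data is now ready, so this access is cheap
access_time = time.time() - t0

print(launch_time, wait_time, access_time)
```

On a GPU, launch_time is tiny, the real cost shows up in wait_time, and the subsequent element access is fast; without the explicit synchronize, that cost would instead appear at the first line that reads the result back to the CPU.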

Ok, I see now. Thanks for your reply.