Copying a tensor from CUDA to CPU is too slow

I ran into a problem when copying a tensor from CUDA to the CPU.

If I copy it directly, it is very fast:

# b shape: (1, 3, 32, 32)
b = Variable(torch.randn(1, 3, 32, 32).cuda())
t1 = time.time()
c = b.cpu().data.numpy()
t2 = time.time()
# time cost is about 0.0005s

However, if I forward some input through a net and then copy the output to the CPU, it is extremely slow:

a = Variable(torch.FloatTensor(1, 3, 512, 512).cuda())
# output shape: (1, 3, 32, 32)
output = net(a)
t1 = time.time()
c = output.cpu().data.numpy()
t2 = time.time()
# time cost is about 0.02s

Does anyone have any ideas?

You have to add torch.cuda.synchronize() to your benchmark, since the GPU operations are executed asynchronously (see here).

Your model has probably not finished executing yet, so the transfer of output has to wait for it.
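To illustrate the suggestion, here is a minimal sketch of the corrected benchmark; the nn.Conv2d is just a stand-in for the poster's net, and the synchronize calls only apply when a GPU is present:

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
net = nn.Conv2d(3, 3, kernel_size=3, padding=1).to(device)  # stand-in for the real net

a = torch.randn(1, 3, 512, 512, device=device)
with torch.no_grad():
    output = net(a)

if device == "cuda":
    torch.cuda.synchronize()  # wait until the forward pass has actually finished
t1 = time.time()
c = output.cpu().numpy()
if device == "cuda":
    torch.cuda.synchronize()  # make sure the copy itself is done
t2 = time.time()
print(f"transfer time: {t2 - t1:.6f}s")
```

Without the first synchronize, the timer starts while the forward pass is still queued on the GPU, so its runtime gets attributed to the cpu() call.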


Got it! Thank you so much!

@ptrblck Hello~ torch.cuda.synchronize() guarantees that we measure the “real time” of each operation when analyzing time consumption. Will it also slow down the training process when used in training code?

Yes, it will slow down your training code and should therefore only be used for debugging or profiling.
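If synchronizing by hand is too intrusive, the autograd profiler is one alternative, since it accounts for CUDA time itself; a minimal sketch with a stand-in nn.Linear model:

```python
import torch
import torch.nn as nn

net = nn.Linear(128, 64)
x = torch.randn(32, 128)

# the profiler inserts the necessary synchronizations itself
with torch.autograd.profiler.profile(use_cuda=torch.cuda.is_available()) as prof:
    y = net(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The printed table breaks the run down per operator, which avoids sprinkling timers through the training loop.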

I got the same problem. I tested the same step on a V100 and a P100 card.
On the V100 machine, the .cpu() call costs less than 0.01s.
But on the P100 machine, this step costs up to 5 seconds!
@ptrblck is this only due to the GPU? The PyTorch and CUDA versions are the same on both machines: CUDA 9 and PyTorch 1.0.0.

Did you add torch.cuda.synchronize() before the cpu() call? If not, could you do it and profile the code again?

Yes, adding synchronize() didn’t change the result.

Here is the code showing how I test:

Sorry for not being clear enough. If you just want to time the CPU transfer time, call torch.cuda.synchronize() before starting and stopping the timer.

In your current code snippet you are starting the timer (while potentially some asynchronous CUDA calls are still being processed) and then synchronizing, which adds the time of the pending CUDA operations to the cpu() call.

To time the transfer alone, use this:

# some operations
torch.cuda.synchronize()
start = time.time()
audio = audio.cpu()
torch.cuda.synchronize()
end = time.time()

Oh sorry, I didn’t notice that.
Thanks for your patient explanation!
After adding synchronize() before starting the timer,
I found it’s indeed the synchronization step that consumes the time.

I just want to make sure: is it normal that the V100 is so much faster than the P100 for FP16 inference?
After accounting for the synchronization time, the V100 is about 20x faster than the P100.
If that is not normal, what could be the problem?

Yes, a speedup for FP16 is expected, as the V100 uses TensorCores for the computation of FP16 data (if possible). Usually you could expect a 2x speedup for some operations like GEMMs and convolutions. The 20x speedup is quite high, but might come from a combination of factors (e.g. accelerated ops, a stronger CPU, faster data access, etc.).
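A rough way to check this on a given card is to time the same GEMM in FP32 and FP16, using a size that is a multiple of 8; a sketch (the 1024x1024 shape and iteration count are arbitrary choices):

```python
import time

import torch

def time_matmul(a, b, iters=10):
    """Time `iters` matmuls, synchronizing around the loop when on the GPU."""
    if a.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        c = a @ b
    if a.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / iters, c

n = 1024  # a multiple of 8, so FP16 GEMMs can be mapped onto TensorCores
a32 = torch.randn(n, n, device="cuda" if torch.cuda.is_available() else "cpu")
t32, c32 = time_matmul(a32, a32)
print(f"fp32: {t32:.6f}s per matmul")

if torch.cuda.is_available():
    a16 = a32.half()
    t16, c16 = time_matmul(a16, a16)
    print(f"fp16: {t16:.6f}s per matmul")
```

Comparing the two timings on each card separates the TensorCore effect from everything else (CPU, RAM, data loading).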


Got it! Thanks again for the quick and helpful reply!

I’ve checked the 2 machines (P100 vs. V100): the number of CPU cores is 16 vs. 56, and the memory is 120GB vs. 500GB.
I am not familiar with hardware effects.
Do you think this can explain the large speed difference?

It also depends on the CPU, RAM type, etc., but based on this information it looks like the second machine (with the V100) is quite a bit beefier than the first one, so I would expect to see some speedup.
Note that the speedup depends on a lot of factors (e.g. where the bottleneck in the computation is), but 20x seems reasonable.

Thanks for the quick reply. I left out one important piece of information:
I’ve also tested the speed of another model (NVIDIA’s Tacotron 2, not an FP16 model) on these 2 machines.
The speeds on the 2 machines are 0.12 vs. 0.15 (real-time factor), so the V100 machine is only a little faster there.
That’s why the large speed difference for the WaveGlow model inference seemed weird to me.

That’s what I was referring to.
You can expect bigger speedups if you use the TensorCores on your V100 (FP16 ops, shapes that are multiples of 8 for GEMMs), while other use cases might give a lower speedup, or in the worst case none at all, if the bottleneck is e.g. the data loading.

If you are using some heavy preprocessing and data loading, this might create the major bottleneck in your code. While the V100 could still be faster than the P100 in the training loop, both would have to wait for the next batch to be ready, so the performance benefit of your V100 would be hidden in the waiting time.
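One way to see whether data loading is the bottleneck is to time the wait for each batch separately from the compute; a sketch with a toy TensorDataset standing in for the real pipeline:

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for the real preprocessing pipeline
ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(ds, batch_size=32, num_workers=0)

data_time = compute_time = 0.0
end = time.time()
for x, y in loader:
    data_time += time.time() - end   # time spent waiting for the next batch
    t0 = time.time()
    out = x.mean()                   # stand-in for the forward/backward pass
    compute_time += time.time() - t0
    end = time.time()

print(f"data: {data_time:.4f}s  compute: {compute_time:.4f}s")
```

If data_time dominates, a faster GPU cannot help; raising num_workers or simplifying the preprocessing would be the first things to try.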


Hi ptrblck,

I ran my code on the CPU and then converted it to run on the GPU. On the GPU I got some errors because of the variable types; for example, an op expected a double but received a float. I solved the errors, but why do they happen? On the CPU I never saw them.

You should see the same errors on the CPU as well.
If that’s not the case, could you check the dtypes explicitly, and if the mismatch doesn’t cause an error on the CPU, post a reproducible code snippet so that we can have a look, please?

Hello, I am trying to run my model on a Xavier NX. It seems to take about 17 seconds to copy a torch tensor of size (320 x 480), dtype torch.uint8, to host memory. Any suggestion for dealing with this bottleneck would be helpful. Code snippet:

with torch.no_grad():
    a = model(img)  # a['out'] >> torch.float32; size: [1 x 1 x 320 x 480]; no grad; on CUDA
result_gpu = (a['out'][0][0] > 0.1).type(torch.uint8)  # result_gpu >> torch.uint8; size: [320 x 480]; no grad; on CUDA
start = time.time()
result_cpu = result_gpu.cpu()
end = time.time()
print(end - start)  # 17.0668951