Converting a tensor to NumPy is very slow

I am trying to convert a tensor to a NumPy array using the numpy() function.

It is very slow (it takes 50 ms!):

semantic = semantic.cpu().numpy()

semantic is a tensor of size “torch.Size([512, 1024])” and its device is cuda:0.


I think the slow part is the .cpu() here, not the .numpy().
Sending the Tensor to the CPU requires syncing with the GPU and copying the memory to RAM. If there are outstanding computations, that will be extra slow, so make sure to call torch.cuda.synchronize() before timing.
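
A minimal sketch of how the timing could be done, assuming a tensor of the same shape as in the question. The helper name timed_to_numpy is made up for illustration, and the script falls back to the CPU when no GPU is present so that it still runs:

```python
import time

import torch


def timed_to_numpy(tensor):
    """Time tensor.cpu().numpy() without billing pending GPU work to it.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    if tensor.is_cuda:
        # Wait for kernels that were queued earlier, so the measurement
        # below only covers the device-to-host copy and the conversion.
        torch.cuda.synchronize()
    start = time.perf_counter()
    array = tensor.cpu().numpy()
    elapsed = time.perf_counter() - start
    return array, elapsed


# Same shape as in the question; uses the GPU only if one is available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
semantic = torch.rand(512, 1024, device=device)

array, elapsed = timed_to_numpy(semantic)
print(f"copy + conversion took {elapsed * 1e3:.3f} ms, shape {array.shape}")
```

If the number reported here is much smaller than what you measured before, the difference was time spent waiting for earlier GPU work.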

The NumPy conversion itself should be very fast, though, since no memory copy occurs: the resulting array shares memory with the CPU tensor.
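
A small sketch to check the "no memory copy" part on a CPU tensor (the variable names are just for illustration): a write made through the tensor is visible through the array.

```python
import torch

cpu_tensor = torch.zeros(3)
array = cpu_tensor.numpy()  # no copy: the array views the tensor's memory

# An in-place write through the tensor shows up in the NumPy array too.
cpu_tensor[0] = 1.0
print(array[0] == 1.0)
```

This also means that modifying the array in place modifies the tensor, so copy the array if you need an independent one.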

It is solved! Thanks :smiley:
Can you explain what the synchronization did?

The short story is that most of the CUDA API is asynchronous, so a call just queues up the work to be done and returns immediately. Only when you try to access the value (here, by asking to copy it to the CPU) will it actually wait for the computations to finish.
torch.cuda.synchronize() forces this wait to happen, which is very convenient when measuring CPU time: it tells you whether a call is slow because of outstanding ops still running asynchronously on the GPU or because of the call itself.
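
A rough sketch of how one could observe this, assuming a matrix multiply as the queued work. The function name launch_then_wait and the sizes are made up for illustration. On a GPU, the time to launch is usually much smaller than the time until the result is ready; on a CPU-only machine the op runs synchronously and the two are nearly equal.

```python
import time

import torch


def launch_then_wait(size=1024):
    """Return (launch_time, total_time) in seconds for one matmul.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    a = torch.rand(size, size, device=device)

    # Warm-up run so one-time initialization is not measured.
    _ = a @ a
    if use_cuda:
        torch.cuda.synchronize()  # start from an idle GPU

    start = time.perf_counter()
    b = a @ a  # on CUDA this only queues the kernel and returns
    launch_time = time.perf_counter() - start

    if use_cuda:
        torch.cuda.synchronize()  # block until the queued kernel is done
    total_time = time.perf_counter() - start
    return launch_time, total_time


launch_time, total_time = launch_then_wait()
print(f"launch: {launch_time * 1e3:.3f} ms, until done: {total_time * 1e3:.3f} ms")
```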