Converting from pytorch.tensor() to numpy array is too slow!

I’m converting a PyTorch tensor to a NumPy array with tensor_data.cpu().detach().numpy().


But it takes approximately 0.33 seconds. Is that normal?


It's most likely the tensor_data.cpu() call that is slow, not the .detach().numpy() part, right?

Yes, tensor_data.cpu() is what slows the operation down. Do you know how I can convert to a NumPy array without calling cpu()?

If your data is on the GPU, you have to transfer it to host RAM first via .cpu() and then call numpy() on it.
Note that (as @albanD mentioned) the numpy() call should be really cheap, as the underlying data will be shared and no copy will be involved.
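To see that numpy() really is cheap, here is a minimal sketch. The name cpu_tensor is only for illustration and stands in for a tensor that already lives on the CPU. It checks that the tensor and the array point at the same memory:

```python
import torch

# A CPU tensor standing in for the result of tensor_data.cpu().detach()
cpu_tensor = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# numpy() returns an array backed by the same buffer; no data is copied
array = cpu_tensor.numpy()

# A write through the tensor is visible in the array
cpu_tensor[0, 0] = 42.0
write_visible = bool(array[0, 0] == 42.0)

# Both objects report the same starting address for their data
same_address = array.ctypes.data == cpu_tensor.data_ptr()

print(write_visible, same_address)
```

Since the buffer is shared, modifying the array later will also modify the tensor, so copy it if you need an independent array.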

Since CUDA operations are asynchronous, the .cpu() call creates a synchronization point: all currently executing and queued ops on the GPU have to finish before the tensor is copied to the host.
The data transfer itself might not be what takes most of the time. It could be other CUDA operations still running in the background (e.g. your model’s forward/backward pass).
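If you want to check where the time actually goes, a rough sketch like the following could help. It calls torch.cuda.synchronize() before starting the timer for the transfer, so pending GPU work is not charged to .cpu(). The matrix size and all variable names are made up for illustration, and the script falls back to the CPU when no GPU is available (in which case the sync step does nothing).

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # Block until all queued CUDA work is done; does nothing on CPU-only machines
    if device.type == "cuda":
        torch.cuda.synchronize()

x = torch.randn(1024, 1024, device=device)

# Queue some work; on CUDA this call can return before the kernel has finished
y = x @ x

# Wait for the queued work first, so it is not counted as transfer time
t0 = time.perf_counter()
sync()
t1 = time.perf_counter()

# Now time only the device-to-host copy plus the numpy() conversion
result = y.cpu().detach().numpy()
t2 = time.perf_counter()

print(f"waiting for pending GPU work: {t1 - t0:.4f}s")
print(f"cpu() + detach() + numpy():   {t2 - t1:.4f}s")
```

If the first number is large and the second one small, the transfer is not the bottleneck and the time is being spent in the preceding GPU computation.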


So does this mean that if .cpu() waits a very long time for synchronization, my GPU is not powerful enough to compute all the results I need to transfer to the CPU? Am I understanding it correctly?

If you don't synchronize the code explicitly, the next sync point will simply absorb the accumulated runtime of all the work queued before it.

No, the result will be calculated and pushed to the CPU once it’s finished. Since CUDA operations are executed asynchronously, the Python script moves on to the next line of code right after launching the CUDA kernel. The calculation on the GPU takes “some” time, so if the next line of code is a sync point, it has to wait for that calculation to finish.
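A small sketch of this behaviour, assuming a CUDA device is available. On a CPU-only machine it still runs, but the work is executed synchronously there, so the first number will not be small. The workload and the names are arbitrary placeholders:

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024, device=device)

# Warm-up, so one-time initialization cost does not distort the numbers
_ = (x @ x).cpu()

# Launch a batch of kernels; on CUDA these calls only enqueue the work
t0 = time.perf_counter()
for _ in range(10):
    y = x @ x
t1 = time.perf_counter()

# .cpu() is a sync point: it waits for everything queued above to finish
host = y.cpu()
t2 = time.perf_counter()

print(f"launching the kernels: {t1 - t0:.4f}s")
print(f"waiting inside .cpu(): {t2 - t1:.4f}s")
```

On a GPU, the time reported for .cpu() includes the runtime of the queued matrix multiplications, not just the copy.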