Converting from pytorch.tensor() to numpy array is too slow!

soaxeus · January 8, 2020, 11:46am

I’m converting a PyTorch tensor (torch.Tensor) to a NumPy array with the code below.

tensor_data.cpu().detach().numpy()

But it takes approximately 0.33 seconds. Is that normal?

albanD · January 8, 2020, 3:07pm

Hi,

It is most likely the tensor_data.cpu() call that is slow, not the .detach().numpy(), right?

soaxeus · January 9, 2020, 6:11am

Yes, tensor_data.cpu() is what slows the operation down. Do you know how I can convert to a NumPy array without calling .cpu()?

ptrblck · January 9, 2020, 7:41am

If your data is on the GPU, you have to transfer it to host RAM first via .cpu() and then call numpy() on it.
Note that (as @albanD mentioned) the numpy() call should be really cheap, as the underlying data is shared and no copy is involved.
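
A minimal sketch of the memory sharing, assuming a small made-up CPU tensor in place of the result of tensor_data.cpu().detach():

```python
import torch

# Hypothetical CPU tensor standing in for tensor_data.cpu().detach().
cpu_tensor = torch.zeros(3)

# .numpy() returns an array backed by the same memory; nothing is copied.
array = cpu_tensor.numpy()

# An in-place change to the tensor is therefore visible through the array.
cpu_tensor[0] = 1.0
shares_memory = bool(array[0] == 1.0)
print(shares_memory)
```

Because the two objects alias each other, call .copy() on the array if you need to modify it independently of the tensor.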

Since CUDA operations are asynchronous, the .cpu() call creates a synchronization point: all currently executing and queued ops on the GPU have to finish before the tensor is copied to the host.
So the data transfer itself might not be what takes the majority of the time; it could be other CUDA operations still running in the background (e.g. your model’s forward/backward pass).
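
One way to check this is to synchronize explicitly before the transfer and time the two parts separately. The following is only a sketch: time_transfer and the matrix multiply workload are made up for illustration, and on a machine without CUDA it falls back to the CPU, where the wait is effectively zero.

```python
import time

import torch


def time_transfer(tensor_data):
    """Time the wait for pending GPU work separately from the host copy."""
    start = time.perf_counter()
    if tensor_data.is_cuda:
        # Block until every queued CUDA op has finished.
        torch.cuda.synchronize()
    wait_s = time.perf_counter() - start

    start = time.perf_counter()
    array = tensor_data.cpu().detach().numpy()
    copy_s = time.perf_counter() - start
    return wait_s, copy_s, array


# Hypothetical workload standing in for a model's forward pass.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
tensor_data = x @ x  # on CUDA this only queues the kernel

wait_s, copy_s, array = time_transfer(tensor_data)
print(f"waiting for queued ops: {wait_s:.6f} s, copy to host: {copy_s:.6f} s")
```

If wait_s dominates, the time is being spent in the preceding GPU work, not in the conversion.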

Hao_Hao_Tan · September 1, 2020, 9:58am

So does this mean that if .cpu() waits a very long time for synchronization, my GPU is not powerful enough to compute all the results I need to transfer to the CPU? Am I understanding this correctly?

ptrblck · September 1, 2020, 10:30am

If you are not synchronizing the code, the next sync point will just accumulate the runtime of all the work queued before it.

No, the result will be calculated and pushed to the CPU once it’s finished. Since the CUDA operation is executed asynchronously, the Python script executes the next line of code right after launching the CUDA kernel. Since the calculation on the GPU will take “some” time, the next line of code has to wait if it’s a sync point.
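
A rough way to see this, assuming a made-up loop of matrix multiplies as the workload: time the launches, then time a sync point right after them (here .item(), which has to copy a value to the host). On a CUDA device the launches should typically return quickly and the sync point should absorb most of the runtime; on a CPU-only machine everything runs synchronously, so the split is not meaningful.

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)

start = time.perf_counter()
for _ in range(20):
    y = x @ x  # on CUDA, each call returns once the kernel is queued
launch_s = time.perf_counter() - start

start = time.perf_counter()
total = y.sum().item()  # .item() needs the value on the host, so it waits
sync_s = time.perf_counter() - start

print(f"launching: {launch_s:.6f} s, sync point: {sync_s:.6f} s")
```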