.cpu() calls being a severe bottleneck

Hey folks,

Been trying to optimize some code recently. It turns out a lot of time is spent on .cpu() calls (or, more specifically, .data.cpu().numpy()-style calls in general), which are becoming a huge computational bottleneck for me.

For instance, the per-single-image forward pass of my (RL) model takes on the order of ~0.002 seconds on the GPU. However, when I need to return this action for reward-dependent numpy calculations, detaching from the graph takes ~0.02 seconds.

Is there any way to optimize this? I feel like I’ve tried a lot of different things, but it all ends up coming down to detaching from the graph/GPU.


detach is fast. the reason why copying to the CPU is slow is twofold:

  1. the copy itself is slow
  2. it requires a synchronization, so it appears even slower than it is.

what operations do you need in numpy?
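To see the difference concretely, here is a minimal sketch (assuming PyTorch; the tensor size is illustrative, and it falls back to CPU when no GPU is present) that times detach separately from the device-to-host copy:

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

# detach() only drops the autograd graph reference -- no data is moved.
t0 = time.perf_counter()
y = x.detach()
detach_time = time.perf_counter() - t0

# .cpu().numpy() copies the tensor to host memory and, on CUDA, forces a
# synchronization: it must wait for all pending kernels to finish first.
t0 = time.perf_counter()
arr = y.cpu().numpy()
copy_time = time.perf_counter() - t0

print(f"detach: {detach_time:.6f}s, cpu+numpy: {copy_time:.6f}s")
```

On a GPU the second number dominates; the first is essentially free.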

This is spot on. If you are timing your model using CPU timers (like whatever is built into Python), then your timings will not reflect GPU times accurately, since kernels are launched and executed asynchronously on the GPU: the Python call returns before the kernel has finished.
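A hedged sketch of what that means in practice: call torch.cuda.synchronize() before reading the CPU clock, otherwise you mostly measure the kernel launch rather than the kernel itself. The op and sizes here are illustrative, and the code falls back to CPU when no GPU is available:

```python
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

# Naive timing: on CUDA this mostly measures the asynchronous kernel launch.
t0 = time.perf_counter()
c = a @ b
naive = time.perf_counter() - t0

# Correct timing: synchronize so all pending work is done before starting
# the clock, and again so the kernel has actually finished before stopping it.
if device == "cuda":
    torch.cuda.synchronize()
t0 = time.perf_counter()
c = a @ b
if device == "cuda":
    torch.cuda.synchronize()
synced = time.perf_counter() - t0

print(f"naive: {naive:.6f}s, synchronized: {synced:.6f}s")
```

This is why the .cpu() call above looks so expensive: it is the first point that forces the CPU to wait for the GPU, so it absorbs the cost of everything still in flight.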
