Hey folks,
Been trying to optimize some code recently. Turns out that a lot of time is spent on .cpu() calls (or, more specifically, .data.cpu().numpy()-style calls), which have become a huge computational bottleneck for me.
For instance, the per-single-image forward pass of my (RL) model takes on the order of ~0.002 seconds on GPU. However, when I need to pull the resulting action back to the host for reward-dependent numpy calculations, detaching from the graph and copying to CPU takes ~0.02 seconds.
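To make that concrete, here's roughly what the hot path looks like (a minimal sketch; the network, input shape, and action count are placeholders, not my real model):

```python
import time

import torch
import torch.nn as nn

# Placeholder policy net and observation, just to illustrate the measurement.
device = torch.device("cuda")
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(4),  # e.g. 4 discrete actions
).to(device)
obs = torch.randn(1, 3, 84, 84, device=device)

with torch.no_grad():
    t0 = time.time()
    logits = model(obs)                          # ~0.002 s for me
    t1 = time.time()
    action = logits.argmax(dim=1).cpu().numpy()  # ~0.02 s -- the bottleneck
    t2 = time.time()

print(f"forward: {t1 - t0:.4f}s, to numpy: {t2 - t1:.4f}s")
# (CUDA ops are asynchronous, so without torch.cuda.synchronize() the
# "forward" number is mostly kernel launch time; the .cpu() call then
# blocks until the GPU actually finishes.)
```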
Is there any way to optimize this? I feel like I've tried a lot of different things, but everything ultimately bottlenecks on detaching from the graph and transferring off the GPU.
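For example, one of the variants I've tried is an async copy into a pinned host buffer (again just a sketch with placeholder shapes):

```python
import torch

device = torch.device("cuda")
logits = torch.randn(1, 4, device=device)  # stand-in for the model output

# Pre-allocate a pinned (page-locked) host buffer once, outside the loop.
pinned = torch.empty(logits.shape, pin_memory=True)

# Kick off an asynchronous device-to-host copy...
pinned.copy_(logits, non_blocking=True)
# ...but before reading the values as numpy I still have to wait for the
# GPU, so the wall-clock cost doesn't really go away.
torch.cuda.synchronize()
action_np = pinned.numpy()
```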
Thanks