Whenever I do a .to(device) on any tensor in my code, it takes too much time and makes my code slow. I am not able to understand why this is happening.
Your tensors might be huge and the bandwidth to your GPU not high enough, which would slow down the transfer.
You can reduce your batch size and accumulate the gradient (by not calling optimizer.zero_grad() at every backward() call). This will most likely reduce the bandwidth bottleneck.
If you reduce your batch size by, let's say, 4, you can use:

```python
if counter % 4 == 0:
    optimizer.zero_grad()
# compute your loss and call loss.backward() here
if counter % 4 == 3:   # step once every 4 micro-batches
    optimizer.step()
counter += 1
```
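A runnable sketch of that accumulation loop might look like the following; the tiny linear model, random data, and accumulation factor of 4 are illustrative assumptions, not details from the thread:

```python
import torch

# Illustrative stand-ins for the real model and inputs (not shown in the thread).
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4          # batch size reduced by 4, gradient accumulated over 4 micro-batches
step_count = 0

for counter in range(16):              # 16 micro-batches -> 4 optimizer steps
    if counter % accum_steps == 0:
        optimizer.zero_grad()
    x = torch.randn(2, 8)              # micro-batch of size 2
    y = torch.randn(2, 1)
    # Scale the loss so the accumulated gradient matches the full-batch gradient.
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                    # gradients accumulate into .grad
    if counter % accum_steps == accum_steps - 1:
        optimizer.step()               # one parameter update per 4 micro-batches
        step_count += 1

print(step_count)  # → 4
```

Scaling the loss by the accumulation factor keeps the effective learning rate the same as with the original, larger batch.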
Note as well that the .to() functions are among the few blocking operations in the CUDA API, which is completely asynchronous otherwise. So if you try to time your code without using torch.cuda.synchronize() properly, it is expected that the .to() functions are the only ones that appear to take a significant amount of time! But this is only because these functions wait for the GPU to finish processing all the other queued operations before returning.
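To illustrate the point about timing, here is a minimal sketch; the matrix sizes and the matmul workload are made up for the example, and it falls back to the CPU when no GPU is present:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)

# Naive timing: on CUDA the matmul kernel is launched asynchronously,
# so this interval can be near zero regardless of the real cost.
t0 = time.perf_counter()
c = a @ b
naive = time.perf_counter() - t0

# Correct timing: synchronize so the GPU has actually finished the work
# before the clock is read.
t0 = time.perf_counter()
c = a @ b
if device == "cuda":
    torch.cuda.synchronize()
synced = time.perf_counter() - t0

print(f"naive: {naive:.6f}s, synchronized: {synced:.6f}s")
```

Without the synchronize, the cost of the matmul would instead show up in whatever blocking call happens next, such as a .to("cpu") transfer.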
Actually, I was using .to(device) in a different piece of code (not for computing the loss), where I need to create a torch tensor, copy my inputs into it, and do some operations on that tensor.
With torch.cuda.synchronize() I get the same timing as well; that is, .to(device) still consumes too much time in my profiling results.