Sorry to bother you, but I have a very strange issue when using PyTorch with CUDA.
The code predicts the depth of multiple RGB images in a for loop. During testing, each iteration (1) runs the network to predict the depth and then (2) transfers the resulting tensor to the CPU. Either the first step or the second step takes a long time (almost 0.5 seconds), but they never both take a long time in the same iteration. Stranger still, if I comment out the transfer code, some other step starts taking a long time instead (the bottleneck seems to move around from run to run).
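Here is a minimal sketch of how I time the two steps. The network and the transfer are replaced with hypothetical placeholder functions (`fake_predict`, `fake_to_cpu`) just to show the measurement structure; the real code calls the model and `tensor.cpu()`:

```python
import time

# Hypothetical stand-ins for the real GPU work, used only to
# illustrate the timing structure of my loop.
def fake_predict(image):
    time.sleep(0.01)  # placeholder for model(image) on the GPU
    return image

def fake_to_cpu(tensor):
    time.sleep(0.01)  # placeholder for tensor.cpu()
    return tensor

def timed_loop(images):
    timings = []
    for img in images:
        t0 = time.perf_counter()
        depth = fake_predict(img)   # step 1: depth prediction
        t1 = time.perf_counter()
        depth = fake_to_cpu(depth)  # step 2: transfer to CPU
        t2 = time.perf_counter()
        timings.append((t1 - t0, t2 - t1))  # (predict time, transfer time)
    return timings

timings = timed_loop(range(3))
print(timings)
```

In the real loop, `t1 - t0` is what jumps to ~0.5 s, or `t2 - t1` does, but never both at once.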
It doesn't seem to be a problem with my code; it might be an issue with PyTorch itself. I have also tried calling torch.cuda.synchronize(), which didn't help. Do you have any idea about this weird situation?