I have been working on an object detection framework in PyTorch, which I enjoy, but I am having an execution-time issue. My entire object detection pipeline runs at ~0.034 s; however, when I sum up the times of all the functions it contains, the total comes to only around half of that.
The times I printed out:
Can you re-run your script with
CUDA_LAUNCH_BLOCKING=1 and see if the times make more sense?
CUDA kernels are asynchronous so the timing information won’t be 100% correct if you don’t take this into account.
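As an aside, one way to measure GPU time without globally serializing the program the way CUDA_LAUNCH_BLOCKING=1 does is to use CUDA events. This is only a sketch (the matrix multiply is a stand-in for your forward pass), and it only prints a timing when a GPU is actually available:

```python
# Hedged sketch: event-based GPU timing as an alternative to
# CUDA_LAUNCH_BLOCKING=1. The matmul below is a placeholder workload.
try:
    import torch
except ImportError:  # let the sketch load even without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    x = torch.randn(1024, 1024, device="cuda")

    start.record()            # enqueue a start marker on the CUDA stream
    y = x @ x                 # asynchronous kernel launch
    end.record()              # enqueue an end marker after the kernel
    torch.cuda.synchronize()  # wait so elapsed_time() is valid to read
    print(f"forward: {start.elapsed_time(end):.3f} ms")  # GPU-side time
```

Because the events are recorded on the stream itself, they bracket exactly the kernels launched between them, regardless of when the CPU reads the result.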
Actually, I did take the asynchronous computation into account, and the times above were printed while running my code with CUDA_LAUNCH_BLOCKING=1.
How exactly did you time the functions above?
I actually did call torch.cuda.synchronize() in my code, but I was missing the synchronization that comes after the function, before the second timestamp, e.g. for a call such as the following:
outputs = self.net(x)
When I either call torch.cuda.synchronize() before taking end_time or run with CUDA_LAUNCH_BLOCKING=1, the per-function times do sum up to the Entire_Detection time. However, the Entire_Detection time increases by around 30%, and the Forward function consumes most of it: measured asynchronously it takes the 0.008 seconds mentioned above, while synchronized it is around 4-5 times slower.
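The pattern discussed in this thread can be sketched as a small helper that synchronizes both before and after the timed call, so asynchronously launched kernels are fully counted. This is only a sketch: `timed` is a hypothetical name, and the `sum` call at the bottom is a CPU stand-in for something like `self.net(x)`:

```python
# Sketch of the timing pattern from the thread: synchronize before reading
# BOTH timestamps so pending and newly launched CUDA kernels are included.
import time

try:
    import torch
    _use_cuda = torch.cuda.is_available()
except ImportError:  # allow the sketch to run without PyTorch installed
    torch, _use_cuda = None, False

def timed(fn, *args):
    """Return (fn(*args), elapsed seconds), synchronizing around the call."""
    if _use_cuda:
        torch.cuda.synchronize()  # drain kernels queued before start_time
    start = time.perf_counter()
    out = fn(*args)
    if _use_cuda:
        torch.cuda.synchronize()  # wait for the async kernels fn launched
    return out, time.perf_counter() - start

# Usage with a CPU stand-in workload:
result, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.6f} s")
```

Note that, as observed above, adding the synchronization points makes the measured total larger than the wall-clock time of the unsynchronized run: the per-function numbers now include GPU work that previously overlapped with the CPU.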