I have been working on an object detection framework and enjoy working with PyTorch, but I am running into an execution-time issue. My entire object detection pipeline runs in ~0.034 s; however, when I sum the times of all the functions inside the framework, the total comes to only around half of that.
I actually ran my code with torch.cuda.synchronize(), but I was missing the calls that come after the function and before the second time measurement, such as the following:
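A minimal sketch of that timing pattern, with a synchronization both before the start time and before the end time. The helper names `gpu_sync` and `timed` are hypothetical (they are not from my framework), and the sketch assumes that falling back to a no-op is acceptable when PyTorch or CUDA is not available:

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAS_CUDA = False


def gpu_sync():
    """Wait for all queued CUDA kernels to finish (no-op without CUDA)."""
    if _HAS_CUDA:
        torch.cuda.synchronize()


def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for fn, bracketed by synchronization."""
    gpu_sync()                      # drain work queued by earlier code
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    gpu_sync()                      # wait for fn's own kernels before stopping the clock
    return result, time.perf_counter() - start
```

Usage would look like `output, forward_time = timed(model, images)`, where `model` and `images` stand in for whatever the Forward function takes. Without the second synchronization, the measured time covers mostly the kernel launches, not the kernels themselves.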
When I either call torch.cuda.synchronize() before reading end_time or set CUDA_LAUNCH_BLOCKING=1, the individual times do sum to the Entire_Detection time. However, the Entire_Detection time increases by around 30%, and the Forward function consumes most of it. When the Forward function runs asynchronously it takes, as mentioned above, 0.008 seconds; when synchronized, it is around 4-5 times slower.
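The gap between the asynchronous and synchronized Forward times can be reproduced with a toy model. This is only an illustration under stated assumptions: a single worker thread stands in for the CUDA stream, `time.sleep` stands in for the kernels, and `forward` and `measure` are hypothetical names. It is not how PyTorch is implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a CUDA stream: one worker runs queued "kernels" in order.
stream = ThreadPoolExecutor(max_workers=1)


def forward(work_seconds=0.05):
    """Queue the work and return immediately, like an asynchronous kernel launch."""
    return stream.submit(time.sleep, work_seconds)


def measure(synchronize):
    """Time forward() with or without waiting for the queued work to finish."""
    start = time.perf_counter()
    pending = forward()
    if synchronize:
        pending.result()            # plays the role of torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    pending.result()                # drain the queue so runs do not overlap
    return elapsed


async_time = measure(synchronize=False)   # only the launch cost is seen
sync_time = measure(synchronize=True)     # includes the work itself
print(f"async: {async_time:.4f}s  sync: {sync_time:.4f}s")
```

In this model the unsynchronized measurement is not a faster Forward pass. The work still has to run, and its cost shows up at whichever later call first waits for the queue, which would be consistent with the per-function times summing to less than the total.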