Execution time issue


I have been working on an object detection framework and enjoy working with PyTorch, but I'm having an execution-time issue. My entire object detection pipeline runs in ~0.034 s; however, when I sum up the times of all the functions contained within the framework, the total comes to only about half of that.

The times I printed out:

cuda_copy: 0.004942s
Forward: 0.008s
entire_nms_time: 0.00197s
nms_+_post_process: 0.00228s

Entire_Detection: 0.034s

Can you re-run your script with CUDA_LAUNCH_BLOCKING=1 and see if the times make more sense?

CUDA kernels are launched asynchronously, so the timing information won't be accurate unless you take this into account.
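A minimal sketch of the usual workaround (the model and helper names here are illustrative, not from the original code): call torch.cuda.synchronize() both before starting the timer and before reading it, so the measured interval covers all queued kernels. The CUDA calls are guarded so the sketch also runs on a CPU-only machine:

```python
import time
import torch

def timed_forward(net, x):
    # Drain any previously queued kernels before starting the clock.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()

    out = net(x)

    # Without this synchronize, the timer can stop while the forward
    # kernels are still running asynchronously on the GPU.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Illustrative stand-in for an actual detection network.
net = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)
out, elapsed = timed_forward(net, x)
print('Forward: {:.3f}s'.format(elapsed))
```

With this pattern the per-function times should add up to the end-to-end time, since no kernel is left running past its timer.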


Actually, I did take the asynchronous computation into account, and the times above were printed while running my code with CUDA_LAUNCH_BLOCKING=1.


How exactly did you time the functions above?

I did call torch.cuda.synchronize(), but I was missing the synchronize that has to come after the function and before taking the second timestamp, as in:

init_time = time.time()
outputs = self.net(x)
end_time = time.time()  # missing torch.cuda.synchronize() before this line
print('Forward: {:.3f}s'.format(end_time - init_time))

When I either call torch.cuda.synchronize() before taking end_time or run with CUDA_LAUNCH_BLOCKING=1, the per-function times do sum up to the Entire_Detection time. However, the Entire_Detection time increases by around 30%, and the Forward function consumes most of the time. When the forward pass is timed asynchronously it appears to take the 0.008 seconds mentioned above; with proper synchronization it is around 4-5 times slower.
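If the overhead of a global CUDA_LAUNCH_BLOCKING=1 or of repeated host-side synchronization is a concern, one alternative (a sketch, assuming a CUDA device is present; the helper name is illustrative) is to time with torch.cuda.Event, which records timestamps on the GPU stream itself and only synchronizes once at the end:

```python
import torch

def cuda_timed(fn, *args):
    # Events are recorded on the current CUDA stream; elapsed_time
    # measures GPU time between them without blocking each launch.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait once so elapsed_time can be read
    return out, start.elapsed_time(end) / 1000.0  # ms -> seconds

if torch.cuda.is_available():
    # Illustrative stand-in for the detection network's forward pass.
    net = torch.nn.Linear(128, 64).cuda()
    x = torch.randn(32, 128, device='cuda')
    out, seconds = cuda_timed(net, x)
    print('Forward: {:.3f}s'.format(seconds))
```

This measures what the kernels actually cost on the device, without slowing every launch the way CUDA_LAUNCH_BLOCKING=1 does.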