I have a network with a rather complicated forward pass that includes several for-loops, and one forward pass takes about 7 seconds to complete. These 7 seconds were measured with:
import time

start = time.process_time()
output = network(input_batch)
end = time.process_time()
However, if I take the time inside the forward method of the model (starting before the first statement and stopping just before the return), it comes out to only around 0.03 seconds.
Where does this difference come from, and can I do something to reduce it?
If you are executing the forward pass on the GPU, you should call torch.cuda.synchronize() before starting and before stopping the timer, because CUDA operations are executed asynchronously.
Without synchronization, you might only be timing the kernel launches, while the full execution cost shows up later at some other operation that creates a synchronization point.
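A minimal sketch of the synchronized timing described above; the nn.Linear model and the input shapes here are placeholders for your own network and batch. It also uses time.perf_counter(), which measures wall-clock time and so includes time spent waiting for the GPU, and it guards the synchronize calls so the snippet also runs on a CPU-only machine:

```python
import time

import torch
import torch.nn as nn

# Hypothetical stand-in for your network; replace with your own model.
model = nn.Linear(128, 128)
input_batch = torch.randn(32, 128)

if torch.cuda.is_available():
    model = model.cuda()
    input_batch = input_batch.cuda()

def timed_forward(model, batch):
    # Drain any pending CUDA work before starting the clock,
    # so earlier asynchronous kernels are not charged to this pass.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    output = model(batch)
    # Wait for the forward kernels to actually finish before stopping,
    # otherwise you only measure the (cheap) kernel launches.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end = time.perf_counter()
    return output, end - start

output, elapsed = timed_forward(model, input_batch)
print(f"forward pass took {elapsed:.6f} s")
```

With both synchronize calls in place, the time measured around network(input_batch) and the time measured inside forward should agree much more closely.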