Timing code using cProfile or kernprof

I have two functions (let’s say f1 and f2) that use built-in PyTorch functions such as mul and div. The two functions are called one after the other. I am using kernprof to time my code, and I call torch.cuda.synchronize() in it.
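A minimal sketch of this kind of per-function timing, for reference. The helper `timed` and the bodies of `f1` and `f2` are hypothetical stand-ins, not the real code, and the `sync` hook is an assumption about where the synchronization sits: it is called right before the clock starts and right before it stops, so that any work still pending is charged to the function that launched it.

```python
import time


def timed(fn, *args, sync=None, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds).

    sync is an optional zero-argument callable (hypothetical hook).
    It is called before the clock starts and again before it stops,
    so pending asynchronous work is counted inside this measurement.
    """
    if sync is not None:
        sync()  # drain work queued by earlier calls
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()  # wait for work launched by fn itself
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical stand-ins for the two functions in the question.
def f1(x):
    return [v * 2.0 for v in x]


def f2(y):
    return [v / 4.0 for v in y]


out1, t1 = timed(f1, [1.0, 2.0, 3.0])
out2, t2 = timed(f2, out1)  # f1's output is the input to f2
print(f"f1: {t1:.6f} s, f2: {t2:.6f} s")
```

With PyTorch on a GPU, `sync=torch.cuda.synchronize` would be the natural argument, since CUDA kernels are launched asynchronously and the Python call can return before the GPU has finished.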

If I time my code with kernprof (or cProfile), the output says f1 takes 90 sec and f2 takes 0.5 sec. So I tried making f1 faster; now f1 takes 24 sec, but f2 takes 50 sec.

Any idea why this would happen? f1 and f2 are independent, except that f1’s output is used as the input to f2. I don’t understand why making f1 faster would make f2 slower.