Accurate performance estimation for CUDA platforms

Is there any procedure to find the CUDA platform-agnostic execution time of a certain inference operation? I want to compare the execution times of two different inference operations, but the measured execution time depends on the existing load on the GPU and many other factors. I tried the profiler example given here, but it shows different CUDA times for each independent trial on the same GPU, and the CUDA times are completely different if I change the GPU, so it is not an accurate estimate. In computer science, we count CPU cycles to compare the performance of two algorithms. Is there a similar analogy for the GPU using PyTorch?

torch.utils.benchmark would probably give you the most accurate estimate, as it will add warmup iterations and synchronize for you.
Different performance on different devices is to be expected, especially if you are using libraries such as cuDNN to accelerate the workload.
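
A minimal sketch of how such a comparison could look with torch.utils.benchmark (the two conv layers and the input shape are just placeholders for the ops you actually want to compare):

```python
import torch
import torch.utils.benchmark as benchmark

# Placeholder ops and input; swap in your own inference operations.
x = torch.randn(16, 3, 224, 224, device="cuda")
conv_a = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()
conv_b = torch.nn.Conv2d(3, 64, kernel_size=7, padding=3).cuda().eval()

def run(module, inp):
    # Inference only, so disable autograd overhead.
    with torch.no_grad():
        return module(inp)

# Timer performs warmup iterations and CUDA synchronization internally,
# so you don't have to call torch.cuda.synchronize() yourself.
t_a = benchmark.Timer(
    stmt="run(conv_a, x)",
    globals={"run": run, "conv_a": conv_a, "x": x},
    label="conv_a",
)
t_b = benchmark.Timer(
    stmt="run(conv_b, x)",
    globals={"run": run, "conv_b": conv_b, "x": x},
    label="conv_b",
)

# blocked_autorange picks the number of iterations automatically and
# reports mean/median times plus the number of measurements taken.
print(t_a.blocked_autorange())
print(t_b.blocked_autorange())
```

Note that even with warmup and synchronization, the absolute numbers will still differ between GPUs; the point is that the relative comparison between the two ops on the same device becomes stable and repeatable.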