High variance in timings across executions

I’ve been trying to evaluate the performance of a few networks, and have been unable to get consistent timings across executions, even for very small networks or single convolution blocks. I am aware of the need to call torch.cuda.synchronize, and for good measure I have set torch.backends.cudnn.benchmark=True. In case it is relevant, I am using an NVIDIA 1080 Ti under Ubuntu 18.04 with driver version 396.37.

Consider this simple code, which corresponds to timing a single convolution:

import torch
import time
import matplotlib.pyplot as plt
import os 

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
torch.backends.cudnn.benchmark = True

seed=0
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True

if __name__=="__main__":
   
    in_im=torch.rand(1,3,416,416).cuda()
   
    cnv= torch.nn.Conv2d(
            in_channels=3,
            out_channels=128,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=True).cuda()
 
    skip=10
    num_runs=1000
    timings=[]
    for i in range(num_runs):
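        # Wait for pending GPU work so it is not counted in this run
        torch.cuda.synchronize()
        t1 = time.perf_counter()  # monotonic, high-resolution clock
        _=cnv(in_im)
        # Wait for the convolution kernel to finish before stopping the clock
        torch.cuda.synchronize()
        t2 = time.perf_counter()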
        if i>=skip:
            timings.append(t2-t1)

    # Plot against the actual execution number (the first `skip` runs are dropped)
    plt.plot(range(skip, num_runs), timings)
    plt.xlabel("execution number")
    plt.ylabel("time (s)")
    plt.xlim([skip, num_runs])
    plt.title("Timing for single convolution block")
    plt.show()

I’m skipping the first 10 executions because the first few forward passes always seem to be slower (I assume because of memory transfers?). Here is the plot of the times I’m getting:

As you can see, the standard deviation is relatively high (about 3.63e-05 s). That might not seem like much, but it accumulates, so that in real-world networks (e.g. YOLOv2) I’m getting a much higher variance, which makes it difficult to report a single FPS value for those networks. So my question is: is this normal behavior, or am I missing a step for correct timing?
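To make the "single FPS value" problem concrete, here is a minimal sketch of how the spread in a list like `timings` could be summarized. The helper name `summarize_timings` and the sample values are hypothetical and purely illustrative; they are not measurements from my setup.

```python
import statistics


def summarize_timings(timings):
    """Summarize per-run timings (in seconds) and derive FPS estimates."""
    mean = statistics.mean(timings)
    std = statistics.stdev(timings)      # sample standard deviation
    median = statistics.median(timings)  # less sensitive to slow outliers
    return {
        "mean_s": mean,
        "std_s": std,
        "median_s": median,
        "fps_from_mean": 1.0 / mean,
        "fps_from_median": 1.0 / median,
    }


# Hypothetical sample values (NOT measured), including one slow outlier
sample = [0.0010, 0.0011, 0.0010, 0.0012, 0.0030]
stats = summarize_timings(sample)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

With an outlier in the list, the FPS derived from the mean and from the median can differ noticeably, which is exactly why I am unsure which single number to report.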

Thank you very much for your help, I appreciate it.