Counterintuitive matrix-vector multiplication running time

I have a counterintuitive result when measuring the running time of the following matrix-vector multiplication.

In the first case, I do a normal matrix-vector multiplication:

import time
import torch

h = 1e-4  # scalar scale factor; a placeholder here, it is defined elsewhere in my script

Khh = torch.rand((5, 1, 10001, 10001)).cuda()
uh = torch.rand((5, 1, 10001)).cuda()

niter = 1000
times = []

for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    wh = torch.einsum('bcmn,bcn->bcm', Khh, uh) * h 
    torch.cuda.synchronize()
    end_time = time.time()
    elapsed = end_time - start_time 
    times.append(elapsed)

print(sum(times) / niter)

The average time on my machine is 0.003968036651611328 seconds.
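
As an aside, the same loop can also be timed with CUDA events instead of time.time(); a minimal sketch using the tensors defined above:

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

evt_times = []
for i in range(niter):
    start_evt.record()
    wh = torch.einsum('bcmn,bcn->bcm', Khh, uh) * h
    end_evt.record()
    torch.cuda.synchronize()  # wait for end_evt before reading it
    evt_times.append(start_evt.elapsed_time(end_evt) / 1000.0)  # ms -> s

print(sum(evt_times) / niter)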

In the second case, I downsampled Khh along one axis:

H = 2 * h  # coarse-grid scale factor; again a placeholder for a value defined elsewhere

niter = 1000
times = []

for i in range(niter):
    start_time = time.time()
    KhH = Khh[...,::2]
    uH = uh[...,::2]
    wh = torch.einsum('bcmn,bcn->bcm', KhH, uH) * H
    end_time = time.time()
    elapsed = end_time - start_time 
    times.append(elapsed)

print(sum(times) / niter)

The average time of the second case is 0.0053208649158477785 seconds.

The result is confusing, since the second case does fewer multiplications and additions.
I don't understand why it costs more time compared to the first one.
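
One guess I have not verified: slicing with a step of 2 returns a non-contiguous view, so einsum may have to copy the data into a dense buffer before calling the matmul kernel. A quick check of the layout:

KhH = Khh[..., ::2]
print(KhH.is_contiguous())  # False: slicing with step 2 gives a strided view
print(KhH.stride())         # last-dimension stride is 2, not 1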

Could you describe why you removed the needed synchronizations in the second example? Also, it would be interesting to profile the actual matrix multiplication without the slicing kernel.
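
A minimal sketch of that profiling idea: materialize the strided slice with .contiguous() outside the timed region, so the loop measures only the einsum itself.

KhH = Khh[..., ::2].contiguous()  # materialize the strided view up front
uH = uh[..., ::2].contiguous()

times = []
for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    wh = torch.einsum('bcmn,bcn->bcm', KhH, uH) * H
    torch.cuda.synchronize()
    end_time = time.time()
    times.append(end_time - start_time)

print(sum(times) / niter)

If the einsum on the contiguous, half-width tensors is faster than the full-size case, the extra time in your second benchmark would point at the gather/copy kernel for the strided view rather than the matmul itself.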

Sorry, I forgot that line, but after adding the synchronizations the result is still the same.

h = 1e-4   # scale factors as above; placeholders for values defined elsewhere
H = 2 * h

Khh = torch.rand((5, 1, 10001, 10001)).cuda()
uh = torch.rand((5, 1, 10001)).cuda()

niter = 1000
times = []

for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    wh = torch.einsum('bcmn,bcn->bcm', Khh, uh) * h 
    torch.cuda.synchronize()
    end_time = time.time()
    elapsed = end_time - start_time 
    times.append(elapsed)

print(sum(times) / niter)

niter = 1000
times = []

for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    KhH = Khh[...,::2]
    uH = uh[...,::2]
    wh = torch.einsum('bcmn,bcn->bcm', KhH, uH) * H
    torch.cuda.synchronize()
    end_time = time.time()
    elapsed = end_time - start_time 
    times.append(elapsed)

print(sum(times) / niter)

The results are as follows:

0.0036851165294647216
0.0076900506019592285

Thank you