I have a counterintuitive result when measuring the running time of the following matrix-vector multiplication.

In the first case, I do a normal matrix-vector multiplication:

```
import time
import torch

h = 1.0  # scalar step size; placeholder for the value used in my actual code

Khh = torch.rand((5, 1, 10001, 10001)).cuda()
uh = torch.rand((5, 1, 10001)).cuda()

niter = 1000
times = []
for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    wh = torch.einsum('bcmn,bcn->bcm', Khh, uh) * h
    torch.cuda.synchronize()
    end_time = time.time()
    elapsed = end_time - start_time
    times.append(elapsed)
print(sum(times) / niter)
```

The average time on my machine is 0.003968036651611328 seconds.
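For reference, the same loop could also be timed with `torch.cuda.Event`, which records timestamps on the device and avoids host-side clock noise; a sketch reusing `Khh`, `uh`, `h`, and `niter` from above:

```
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

times = []
for i in range(niter):
    start.record()
    wh = torch.einsum('bcmn,bcn->bcm', Khh, uh) * h
    end.record()
    torch.cuda.synchronize()  # wait for both events to complete
    times.append(start.elapsed_time(end) / 1000.0)  # elapsed_time is in ms
print(sum(times) / niter)
```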

In the second case, I downsample Khh (and uh) along the last axis:

```
H = 2 * h  # coarse-grid step; placeholder, presumably twice the fine step h

niter = 1000
times = []
for i in range(niter):
    torch.cuda.synchronize()  # same synchronization as above, for a fair comparison
    start_time = time.time()
    KhH = Khh[..., ::2]
    uH = uh[..., ::2]
    wh = torch.einsum('bcmn,bcn->bcm', KhH, uH) * H
    torch.cuda.synchronize()
    end_time = time.time()
    elapsed = end_time - start_time
    times.append(elapsed)
print(sum(times) / niter)
```

The average time for the second case is 0.0053208649158477785 seconds.

The result is confusing, since the second case does half as many multiplications and additions.
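The only structural difference I can spot is that slicing with a step returns a strided, non-contiguous view rather than a fresh tensor; a quick check (using the tensors defined above):

```
KhH = Khh[..., ::2]
print(KhH.is_contiguous())  # False: step slicing yields a strided view
print(KhH.stride())         # last-dim stride is 2 instead of 1
```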

I don't know why the second case costs more time than the first.
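One follow-up experiment (an untested sketch, reusing `Khh`, `uh`, `H`, and `niter` from above) would be to copy the slices into contiguous memory before the einsum, to see whether memory layout rather than FLOP count dominates:

```
times = []
for i in range(niter):
    torch.cuda.synchronize()
    start_time = time.time()
    KhH = Khh[..., ::2].contiguous()  # dense copy of the downsampled operands
    uH = uh[..., ::2].contiguous()
    wh = torch.einsum('bcmn,bcn->bcm', KhH, uH) * H
    torch.cuda.synchronize()
    end_time = time.time()
    times.append(end_time - start_time)
print(sum(times) / niter)
```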