Torch.bmm perf question

Does this 2x perf difference make sense?

In [4]: %timeit torch.mm(torch.Tensor(512, 20).cuda(), torch.Tensor(512, 20).cuda().t())
The slowest run took 31073.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 78.9 µs per loop

In [5]: %timeit torch.bmm(torch.Tensor(20, 512, 1).cuda(), torch.Tensor(20, 512, 1).cuda().transpose(-1, -2))
The slowest run took 64.09 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 128 µs per loop

Thanks!

If you want to benchmark CUDA time, add `; torch.cuda.synchronize()` at the end of the command. Also, it seems that this first call also had to initialize CUDA (you can see there was only 1 iteration in the loop, because timeit thought it would take forever to run it every time). This means the timing is not reliable.

@apaszke Thanks for the tip. Updated timings below, hopefully correct now:

In [1]: import torch; torch.Tensor(1).cuda() # for cuda init
Out[1]:
1.00000e-36 *
  5.4583
[torch.cuda.FloatTensor of size 1 (GPU 0)]

In [4]: %timeit torch.mm(torch.Tensor(512, 20).cuda(), torch.Tensor(512, 20).cuda().t()); torch.cuda.synchronize()
The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 72.7 µs per loop

In [5]: %timeit torch.bmm(torch.Tensor(20, 512, 1).cuda(), torch.Tensor(20, 512, 1).cuda().transpose(-1, -2)); torch.cuda.synchronize()
10000 loops, best of 3: 135 µs per loop
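Note that both commands above also time the `torch.Tensor(...).cuda()` allocations and host-to-device copies, not just the matmul. A minimal sketch of isolating the kernels themselves (shapes taken from the thread; the CPU fallback is just so the snippet runs anywhere):

```python
import torch

# Pre-create the tensors once, outside the timed region, so the
# benchmark measures only the matmul kernels.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(512, 20, device=device)       # for torch.mm
b = torch.randn(20, 512, 1, device=device)    # for torch.bmm

out_mm = torch.mm(a, a.t())                   # -> (512, 512)
out_bmm = torch.bmm(b, b.transpose(-1, -2))   # -> (20, 512, 512)

if device == "cuda":
    # CUDA launches are asynchronous; block until the kernels finish
    # before reading the clock.
    torch.cuda.synchronize()
```

With this setup, `%timeit torch.mm(a, a.t()); torch.cuda.synchronize()` times only the GEMM plus the sync, which makes the mm-vs-bmm comparison cleaner.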