Slow Batch Matrix Multiplication on GPU

I am using batch matrix multiplication on two 3D tensors of size (100, 128, 128) each.

import torch
a = torch.randn(100,128,128)
b = torch.randn(100,128,128)

import time

t0 = time.time()
torch.bmm(a,b)
print(time.time() - t0)

0.03233695030212402

Now if I do the same thing on the GPU, it takes a lot longer:

a = a.cuda()
b = b.cuda()
t0 = time.time()
torch.bmm(a,b)
print(time.time() - t0)

30.574532985687256

Why does it take so long on the GPU?
I have a GTX 1050 Ti 4GB and a Core i3-6100 @ 3.7GHz.

torch.cuda.synchronize()
t0 = time.time()
torch.bmm(a,b)
torch.cuda.synchronize()
print(time.time() - t0)

Without the synchronize calls, your timing includes the lazy CUDA initialization that happens on the first CUDA call, and since CUDA operations are asynchronous the timer also does not wait for the bmm result to actually be ready.
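
For completeness, here is a minimal timing sketch with an explicit warm-up call, assuming a CUDA device is available: the warm-up absorbs the one-time CUDA context initialization, and synchronizing around the timed call makes the measurement reflect only the kernel itself.

import time
import torch

a = torch.randn(100, 128, 128, device="cuda")
b = torch.randn(100, 128, 128, device="cuda")

torch.bmm(a, b)            # warm-up: triggers lazy CUDA initialization
torch.cuda.synchronize()   # wait until the warm-up has finished

t0 = time.time()
torch.bmm(a, b)
torch.cuda.synchronize()   # wait for the asynchronous kernel before stopping the clock
print(time.time() - t0)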


This does explain the humongous time the GPU seemed to take.
I'm getting better results with the GPU now, even if only marginally better.
Thank you.