I am using batch matrix multiplication (torch.bmm) on two 3D tensors, each of size (100, 128, 128). On the CPU:
```python
import time
import torch

# Two batches of 100 random 128x128 matrices, on the CPU
a = torch.randn(100, 128, 128)
b = torch.randn(100, 128, 128)

t0 = time.time()
torch.bmm(a, b)
print(time.time() - t0)
```

Output:

```
0.03233695030212402
```
If I do the same thing on the GPU, it takes much longer:
```python
# Move both tensors to the GPU
a = a.cuda()
b = b.cuda()

t0 = time.time()
torch.bmm(a, b)
print(time.time() - t0)
```

Output:

```
30.574532985687256
```
Why does it take so much longer on the GPU?

My hardware: a GTX 1050 Ti (4 GB) GPU and a Core i3-6100 (3.7 GHz) CPU.
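A sketch for narrowing this down, assuming (not established by the timings above) that the 30 s comes mostly from one-time GPU setup on the first call rather than from the multiplication itself. It times the first torch.bmm call separately from repeated calls. Because CUDA kernel launches are asynchronous, it also calls torch.cuda.synchronize() before reading the clock, so that queued GPU work is counted. The helper name time_bmm and its n_iters parameter are hypothetical.

```python
import time
import torch


def time_bmm(device, n_iters=10):
    """Hypothetical helper: return (first_call_s, steady_state_s, result_shape)."""
    a = torch.randn(100, 128, 128, device=device)
    b = torch.randn(100, 128, 128, device=device)

    def sync():
        # GPU kernels run asynchronously; wait for them before reading the clock
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    sync()
    t0 = time.perf_counter()
    torch.bmm(a, b)  # first call: on a GPU this may include one-time setup
    sync()
    first = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n_iters):
        c = torch.bmm(a, b)
    sync()
    steady = (time.perf_counter() - t0) / n_iters
    return first, steady, tuple(c.shape)


if __name__ == "__main__":
    # Falls back to the CPU when no CUDA device is available
    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    first, steady, shape = time_bmm(dev)
    print(f"{dev}: first call {first:.4f} s, then {steady:.6f} s per bmm")
```

If the first-call time is large but the per-call steady-state time is small, the cost is a startup effect rather than the matrix multiplication itself.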