Unreasonably long computation time on GPU compared to CPU

I ran some experiments with the following toy code, but I have a question about the results. Can anyone help me? Thank you very much.

import time
import torch as tc

# dtype = tc.FloatTensor        # uncomment to run on CPU
dtype = tc.cuda.FloatTensor     # run on GPU

x = tc.randn(100000, 50).type(dtype)
w1 = tc.randn(50, 30).type(dtype)

# Single matrix multiplication over all rows at once
s = time.time()
l = x.mm(w1)
e = time.time()
print(e - s)

# Row-by-row multiplication in a Python loop
s = time.time()
l = []
for i in range(x.size(0)):
    l.append(x[i].unsqueeze(0).mm(w1).squeeze(0))
l = tc.stack(l, dim=0)          # stack once, after the loop
e = time.time()
print(e - s)

If I run this on the CPU, the elapsed times are 0.017 sec and 0.696 sec.
However, when I run it on the GPU, something strange happens: the times increase to 0.314 sec and 1.592 sec.
I thought computation on the GPU should take less time, right?

Besides that, the for loop takes more time than the single matrix multiplication, which is reasonable.

Can someone explain this to me? Thank you.

I think this is expected. There is some overhead in launching CUDA kernels, and in the loop you are launching many of them; on top of that, your matrices are very small, so there is no benefit to using the GPU over the CPU here.
Note that you can use torch.bmm to perform batched matrix multiplication, and it will be orders of magnitude faster than a Python for loop; see the sketch below.
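As a rough illustration (a minimal sketch, not code from the original post, assuming the same 100000 x 50 input and 50 x 30 weight as above), the per-row loop can be rewritten as a single torch.bmm call by expanding w1 along a batch dimension. The torch.cuda.synchronize() calls are there because CUDA kernels are launched asynchronously, so timing without synchronizing mostly measures launch overhead rather than the actual computation:

import time
import torch as tc

dtype = tc.cuda.FloatTensor if tc.cuda.is_available() else tc.FloatTensor

x = tc.randn(100000, 50).type(dtype)
w1 = tc.randn(50, 30).type(dtype)

# Treat every row of x as a (1, 50) matrix and multiply all of them by w1 in one call.
x_batched = x.unsqueeze(1)                                # (100000, 1, 50)
w1_batched = w1.unsqueeze(0).expand(x.size(0), 50, 30)    # (100000, 50, 30), no data copy

if tc.cuda.is_available():
    tc.cuda.synchronize()          # start timing from an idle GPU
s = time.time()
out = tc.bmm(x_batched, w1_batched).squeeze(1)            # (100000, 30), same result as x.mm(w1)
if tc.cuda.is_available():
    tc.cuda.synchronize()          # wait for the kernel to finish before stopping the clock
e = time.time()
print(e - s)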


Thank you for replying …