I have run some experiments with the following toy code, but I have a question about the results. Can anyone help me? Thank you very much.
import time
import torch as tc

# dtype = tc.FloatTensor        # CPU version
dtype = tc.cuda.FloatTensor     # GPU version

x = tc.randn(100000, 50).type(dtype)
w1 = tc.randn(50, 30).type(dtype)

# time one big matrix multiplication
s = time.time()
l = x.mm(w1)
e = time.time()
print(e - s)

# time the same computation done row by row in a Python loop
s = time.time()
l = []
for i in range(x.size(0)):
    l.append(x[i].unsqueeze(0).mm(w1).squeeze(0))
l = tc.stack(l, dim=0)
e = time.time()
print(e - s)
If I run this on the CPU, the measured times are 0.017 sec and 0.696 sec.
However, when I run it on the GPU, something strange happens: the times increase to 0.314 sec and 1.592 sec.
I would expect the computation on the GPU to take less time, right?
The for loop taking more time than the single matrix multiplication is reasonable.
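One thing I am not sure about is whether my timing accounts for CUDA kernel launches being asynchronous. Below is a minimal sketch of how the GPU timing could be done with an explicit torch.cuda.synchronize() and a warm-up call; the warm-up and the synchronize() calls are my assumptions about a fairer measurement, not part of my original experiment.

import time
import torch as tc

# Assumption: synchronizing before and after the timed region gives a
# fairer GPU measurement, since CUDA kernel launches are asynchronous.
x = tc.randn(100000, 50).cuda()
w1 = tc.randn(50, 30).cuda()

_ = x.mm(w1)             # warm-up call (hypothetical fix, not in my original code)
tc.cuda.synchronize()    # wait for pending GPU work before starting the clock

s = time.time()
l = x.mm(w1)
tc.cuda.synchronize()    # wait for the kernel to finish before stopping the clock
e = time.time()
print(e - s)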
Could someone explain this to me? Thank you.