Abnormal memory copy cost

I'm facing a strange problem. I have two models, model1 and model2, where model1 is bigger than model2.
When I test the forward time of the two models with the same input, I need to copy the input to GPU memory,
like this:
if self.args.cuda:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
input = input.to(device)

When I measure the time cost of "input = input.to(device)", I find that for the same input the two models take different amounts of time in the GPU case:
For model1:
input = input.to(device) prepareData Time: 0.12544s
For model2:
input = input.to(device) prepareData Time: 0.01899s

When I switch to the CPU, they are the same:
For model1:
input = input.to(device) prepareData Time: 0.00001s
For model2:
input = input.to(device) prepareData Time: 0.00001s

As far as I know, this is just a copy to GPU memory, so I would expect the cost to be the same for both models. This time difference puzzles me.
Does anyone know why?

CUDA operations are called asynchronously in PyTorch.
If you would like to time certain operations, make sure to synchronize before starting and stopping the timer:

torch.cuda.synchronize()
t0 = time.time()
# CUDA op
torch.cuda.synchronize()
t1 = time.time()

Currently your code might accumulate the time of other, still-running operations into the host-to-device copy and thus yield wrong timing results.
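
For example, here is a minimal sketch of timing just the copy in isolation (assuming a CUDA-capable machine; the input shape is made up purely for illustration):

import time
import torch

input = torch.randn(64, 3, 224, 224)   # placeholder input, shape chosen arbitrarily
device = torch.device("cuda")

torch.cuda.synchronize()                # wait for any previously queued GPU work
t0 = time.time()
input = input.to(device)                # host-to-device copy
torch.cuda.synchronize()                # wait until the copy has actually finished
t1 = time.time()
print("prepareData Time: {:.5f}s".format(t1 - t0))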

Yes, you are right. After adding torch.cuda.synchronize(), the printed times are normal.
The largest time cost is now:
time of torch.matmul: 0.02192 s
One more question: is torch.matmul much slower than conv2d ops?
My test code:
import time
import numpy as np
import torch

a = torch.randn(1, 153000, 1, 288)
b = torch.randn(153000, 288, 3)
#a = a.half()
#b = b.half()
if True:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
a = a.to(device)
b = b.to(device)
torch.cuda.synchronize()  # make sure the copies are done before timing
start = time.process_time()
torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the matmul to finish
end = time.process_time()
print("Elapsed time of {}: {} s".format('torch.matmul', np.round(end - start, decimals=5)))
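
As a side note, a single timed call like this also includes one-time setup costs (e.g. the first matmul triggers cuBLAS initialization), and time.process_time() measures CPU time rather than wall-clock time, so the number may not reflect the real duration. Below is a minimal sketch of a more robust measurement using warm-up iterations and torch.cuda.Event; the shapes match the test code above, and the iteration counts are arbitrary:

import torch

device = torch.device("cuda")
a = torch.randn(1, 153000, 1, 288, device=device)
b = torch.randn(153000, 288, 3, device=device)

# warm-up: the first calls include one-time setup costs
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

# time on the GPU with CUDA events, averaged over several iterations
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 50
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print("torch.matmul: {:.5f} ms per call".format(start.elapsed_time(end) / iters))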