Abnormal memory copy cost

I'm facing a strange problem. I have two models, model1 and model2, where model1 is bigger than model2.
When I test the forward time of the two models with the same input, I need to copy the input to GPU memory,
like this:
if self.args.cuda:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
input = input.to(device)

When I measure the time cost of "input = input.to(device)", I find that for the same input the two models take different amounts of time in the GPU case:
For model1:
input = input.to(device) prepareData Time: 0.12544s
For model2:
input = input.to(device) prepareData Time: 0.01899s

When I switch to the CPU, they are the same:
For model1:
input = input.to(device) prepareData Time: 0.00001s
For model2:
input = input.to(device) prepareData Time: 0.00001s

As far as I know, this is just a copy to GPU memory, so I would expect the cost to be the same for both models. This time difference puzzles me.
Does anyone know why?

CUDA operations are called asynchronously in PyTorch.
If you would like to time certain operations, make sure to synchronize before starting and stopping the timer:

torch.cuda.synchronize()
t0 = time.time()
# CUDA op
torch.cuda.synchronize()
t1 = time.time()

Currently your code might accumulate the time of other, still-running operations into the host-to-device copy and thus yield wrong timing results.
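
For example, here is a minimal sketch of timing just the copy in isolation (assuming a CUDA-capable machine; the input shape is made up purely for illustration):

import time
import torch

input = torch.randn(64, 3, 224, 224)   # placeholder input, shape chosen arbitrarily
device = torch.device("cuda")

torch.cuda.synchronize()                # wait for any previously queued GPU work
t0 = time.time()
input = input.to(device)                # host-to-device copy
torch.cuda.synchronize()                # wait until the copy has actually finished
t1 = time.time()
print("prepareData Time: {:.5f}s".format(t1 - t0))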

Yes, you are right. After adding torch.cuda.synchronize(), the printed times are normal.
The largest time cost is now:
time of torch.matmul: 0.02192 s
One more question: is torch.matmul much slower than conv2d ops?
My test code:
import time
import numpy as np
import torch

a = torch.randn(1, 153000, 1, 288)
b = torch.randn(153000, 288, 3)
#a = a.half()
#b = b.half()
if True:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
a = a.to(device)
b = b.to(device)
torch.cuda.synchronize()  # make sure the copies are done before timing
start = time.process_time()
torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the matmul to finish
end = time.process_time()
print("Elapsed time of {}: {} s".format('torch.matmul', np.round(end - start, decimals=5)))
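
As a side note, a single timed call like this also includes one-time setup costs (e.g. the first matmul triggers cuBLAS initialization), and time.process_time() measures CPU time rather than wall-clock time, so the number may not reflect the real duration. Below is a minimal sketch of a more robust measurement using warm-up iterations and torch.cuda.Event; the shapes match the test code above, and the iteration counts are arbitrary:

import torch

device = torch.device("cuda")
a = torch.randn(1, 153000, 1, 288, device=device)
b = torch.randn(153000, 288, 3, device=device)

# warm-up: the first calls include one-time setup costs
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

# time on the GPU with CUDA events, averaged over several iterations
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 50
start.record()
for _ in range(iters):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print("torch.matmul: {:.5f} ms per call".format(start.elapsed_time(end) / iters))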