Yes, the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct. Also, if you're using Python 3, I'd recommend using time.perf_counter() instead of time.time(). Here's a corrected script:
import time

import torch

x = torch.cuda.FloatTensor(10000, 500).normal_()
w = torch.cuda.FloatTensor(200, 500).normal_()

# ensure that context initialization and normal_() operations
# finish before you start measuring time
torch.cuda.synchronize()

for _ in range(5):
    a = time.perf_counter()
    y = x.mm(w.t())
    torch.cuda.synchronize()  # wait for mm to finish
    b = time.perf_counter()
    print('batch GPU {:.02e}s'.format(b - a))
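As an aside, you can also time this on the device with CUDA events instead of a host-side clock. A minimal sketch, reusing x and w from the script above (note that elapsed_time() reports milliseconds):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.mm(w.t())
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded
print('batch GPU {:.02e}s'.format(start.elapsed_time(end) / 1000))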
That said, it still gives me some weird results: even with proper synchronization, the timing loop above prints:
batch GPU 1.64e-01s
batch GPU 1.25e-03s
batch GPU 7.01e-04s
batch GPU 6.96e-04s
batch GPU 6.94e-04s
@ngimel any ideas what might be causing it?
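My first guess would be some one-time setup cost, though that's just an assumption on my part. A minimal check, continuing the script above: run one untimed mm before the timed loop and see whether the 1.64e-01s outlier disappears.

# hypothetical warm-up pass: absorb one-time costs (e.g. library
# handle creation) so they don't land in the first timed iteration
y = x.mm(w.t())
torch.cuda.synchronize()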