Measuring GPU tensor operation speed

Yes, the GPU executes all operations asynchronously, so you need to insert proper synchronization barriers for your benchmarks to be correct. Also, if you're using Python 3, I'd recommend time.perf_counter() over time.time(), since it's a monotonic clock with higher resolution. Here's a corrected script:

import time
import torch

x = torch.cuda.FloatTensor(10000, 500).normal_()
w = torch.cuda.FloatTensor(200, 500).normal_()

# ensure that context initialization and normal_() operations
# finish before you start measuring time
torch.cuda.synchronize()

a = time.perf_counter()
y = x.mm(w.t())
torch.cuda.synchronize() # wait for mm to finish
b = time.perf_counter()
print('batch GPU {:.02e}s'.format(b - a))

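To see how the timing evolves across repeated calls, you can wrap the measurement in a loop (a minimal sketch reusing the same x and w as above; the iteration count is arbitrary):

for _ in range(5):
    a = time.perf_counter()
    y = x.mm(w.t())
    torch.cuda.synchronize()  # wait for mm to finish
    b = time.perf_counter()
    print('batch GPU {:.02e}s'.format(b - a))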

That said, I still get some weird results. Even with proper synchronization, running the timing block in a loop gives me:

batch GPU 1.64e-01s
batch GPU 1.25e-03s
batch GPU 7.01e-04s
batch GPU 6.96e-04s
batch GPU 6.94e-04s

@ngimel any ideas what might be causing it?
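
For comparison, the same measurement can also be done with CUDA events instead of a host timer; here's a minimal sketch (again reusing x and w from above; note that Event.elapsed_time() returns milliseconds):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.mm(w.t())
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded
print('batch GPU {:.02e}s'.format(start.elapsed_time(end) / 1000))  # ms -> s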
