As the title says, the following two code snippets give different results (and not a minor difference):
torch profiler : CPU time total : 446.860ms, CUDA time total : 122.922ms
time.perf_counter : 0.3481s ~ 348.1ms (total elapsed time of self.model(model_input))
Code snippets I used to profile:
# torch profiler
import torch
from torch.profiler import profile, ProfilerActivity

self.model.eval()
# warm-up so CUDA kernels are compiled/cached before measuring
for _ in range(30):
    _, _, _ = self.model(model_input)
print('warm up done!')
with torch.no_grad():
    torch.cuda.current_stream(self.device).synchronize()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 with_stack=True, with_flops=True) as prof:
        pred_speed_wps, pred_route, language = self.model(model_input)
        prof.step()
print(prof.key_averages(group_by_stack_n=5).table(sort_by="flops"))
vs.
# time.perf_counter
import time
import torch

class Tch_prof(object):
    def __init__(self, device):
        self.device = device
        self.hw_type = 'gpu'
        self.tlt_time = {
            'cpu': 0,
            'gpu': 0,
        }

    def __enter__(self):
        # drain any queued CUDA work before starting the clock
        torch.cuda.current_stream(self.device).synchronize()
        self.s = time.perf_counter()
        return self  # so `with Tch_prof(...) as prof:` also works

    def __exit__(self, *exc):
        # wait for the kernels launched inside the block to finish
        torch.cuda.current_stream(self.device).synchronize()
        self.tlt_time[self.hw_type] += time.perf_counter() - self.s

    def get_profile(self, hw_type='all'):
        if hw_type == 'all':
            return self.tlt_time
        elif hw_type in self.tlt_time:
            return self.tlt_time[hw_type]
        else:
            raise RuntimeError(f"No such hardware type {hw_type}")
# run_step code snippet
self.model.eval()
for _ in range(30):
    _, _, _ = self.model(model_input)
print('warm up done!')
with torch.no_grad():
    prof = Tch_prof(device=self.device)
    with prof:
        pred_speed_wps, pred_route, language = self.model(model_input)
print(prof.get_profile())
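For reference, I also considered cross-checking with CUDA events instead of host-side timing. A minimal sketch (assuming the forward pass runs entirely on the current default stream of self.device):

# hypothetical cross-check: time the same forward pass with CUDA events
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    start.record()
    pred_speed_wps, pred_route, language = self.model(model_input)
    end.record()
torch.cuda.synchronize(self.device)
print(f'CUDA event elapsed: {start.elapsed_time(end):.3f} ms')  # GPU wall time in ms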
When using time.perf_counter, I had already applied torch.cuda.current_stream(self.device).synchronize() before and after the measured region, and expected this to be the correct way to measure elapsed time, but the result still differs from the torch profiler...
If both are valid, which one is the correct way to do time benchmarking of a torch model in this case?
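(Or should I just use torch.utils.benchmark, which handles warm-up and CUDA synchronization internally? A minimal sketch, assuming self.model and model_input are already on self.device:)

from torch.utils import benchmark

timer = benchmark.Timer(
    stmt='model(model_input)',
    globals={'model': self.model, 'model_input': model_input},
)
print(timer.timeit(30))  # per-call wall time over 30 runs, with automatic CUDA sync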
Any suggestions would be helpful ~