Why does torch.profiler give a slower exec time than time.perf_counter with torch.cuda.current_stream(self.device).synchronize()?

As the title describes, the following code snippets give noticeably different results (not a minor difference):

torch profile : CPU time total : 446.860ms, CUDA time total : 122.922ms

time.perf_counter : 0.3481s ≈ 348.1ms (total elapsed time of self.model(model_input))

Code snippets I used to profile:

# torch profiler
from torch.profiler import profile, ProfilerActivity

self.model.eval()

# warm-up
for _ in range(30):
    _, _, _ = self.model(model_input)
print('warm up done!')

with torch.no_grad():
    torch.cuda.current_stream(self.device).synchronize()
    with profile(activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
                 with_stack=True, with_flops=True) as prof:
        pred_speed_wps, pred_route, language = self.model(model_input)
        prof.step()

print(prof.key_averages(group_by_stack_n=5).table(sort_by="flops"))
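As a side note, a minimal self-contained sketch of the same profiling pattern (the `Linear` stand-in model is mine, since `self.model` / `model_input` aren't reproducible here) also lets one print `self_cpu_time_total`, which sums only each op's own time and so does not double-count nested parent/child ops:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# stand-in model and input, not the original self.model / model_input
model = torch.nn.Linear(128, 128).eval()
x = torch.randn(32, 128)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    with profile(activities=activities) as prof:
        model(x)

events = prof.key_averages()
# self_cpu_time_total is in microseconds and excludes child-op time
print(f"self CPU total: {events.self_cpu_time_total / 1000:.3f} ms")
print(events.table(sort_by="self_cpu_time_total", row_limit=5))
```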

vs.

# time.perf_counter
import time
import torch

class Tch_prof(object):
    def __init__(self, device):
        self.device = device
        self.hw_type = 'gpu'
        self.tlt_time = {
            'cpu': 0.0,
            'gpu': 0.0,
        }

    def __enter__(self):
        # make sure all queued GPU work is done before starting the clock
        torch.cuda.current_stream(self.device).synchronize()
        self.s = time.perf_counter()
        return self

    def __exit__(self, *exc):
        # wait for the timed GPU work to finish before stopping the clock
        torch.cuda.current_stream(self.device).synchronize()
        self.tlt_time[self.hw_type] += time.perf_counter() - self.s

    def get_profile(self, hw_type='all'):
        if hw_type == 'all':
            return self.tlt_time
        elif hw_type in self.tlt_time:
            return self.tlt_time[hw_type]
        else:
            raise RuntimeError(f"No such hardware type {hw_type}")

# run_step code snippet
self.model.eval()

# warm-up
for _ in range(30):
    _, _, _ = self.model(model_input)
print('warm up done!')

with torch.no_grad():
    prof = Tch_prof(device=self.device)
    with prof:
        pred_speed_wps, pred_route, language = self.model(model_input)

print(prof.get_profile())

For the time.perf_counter path, I already call torch.cuda.current_stream(self.device).synchronize() before and after the timed region, which I expected to be the correct way to measure the elapsed time, but it still differs from the torch profiler… orz
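For cross-checking, timing with torch.cuda.Event is another common approach; a minimal sketch (the helper name `event_time_ms` and the `Linear` stand-in model are illustrative, not from my actual code):

```python
import time
import torch

def event_time_ms(fn, *args, warmup=10, iters=30):
    """Average latency of fn(*args) in ms, using CUDA events on GPU
    or a time.perf_counter fallback on CPU."""
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn(*args)
        end.record()
        torch.cuda.synchronize()  # elapsed_time is only valid after sync
        return start.elapsed_time(end) / iters
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) * 1000.0 / iters

# stand-in model and input, since self.model is not reproducible here
model = torch.nn.Linear(128, 128).eval()
x = torch.randn(32, 128)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
with torch.no_grad():
    print(f"avg latency: {event_time_ms(model, x):.3f} ms")
```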

So which one is the correct way to benchmark the runtime of a torch model in this case?

Any suggestions would be helpful ~