Hi,
I am comparing the performance of eager vs. torch.compile'd convolutions, timing them with both a regular timer and the torch profiler.
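For context, the surrounding setup looks roughly like this (the shapes, channel counts, and DEVICE value below are placeholders, not my exact configuration):

import time
import torch
import torch.nn as nn

DEVICE = "cuda"  # everything runs on the GPU
in_channels, out_channels = 64, 128  # placeholder channel counts
input = torch.randn(8, in_channels, 224, 224, device=DEVICE)  # placeholder input batch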
# a variety of (kernel, stride, padding) configurations are used
conv = nn.Conv2d(in_channels, out_channels, kernel, stride=stride, padding=padding, bias=False, device=DEVICE)
convComp = torch.compile(conv)
# Regular timer
st = time.perf_counter_ns()
output = conv(input)
et = time.perf_counter_ns()
eager_ns = et - st

st = time.perf_counter_ns()
output = convComp(input)
et = time.perf_counter_ns()
compiled_ns = et - st
# With profiler
with torch.autograd.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as prof_eager:
    output = conv(input)
with torch.autograd.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as prof_compiled:
    output = convComp(input)
print(prof_eager.key_averages().table(sort_by="cuda_time_total"))
print(prof_compiled.key_averages().table(sort_by="cuda_time_total"))
All experiments are run on the GPU, so the kernels are CUDA (and Triton for the compiled path). I have tried with and without torch.cuda.synchronize() around the timed calls, as well as running under torch.no_grad().
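For reference, the synchronized variant of the timing places the barriers like this (a sketch of what I mean by "with cuda.synchronize()"):

torch.cuda.synchronize()  # make sure prior GPU work has finished
st = time.perf_counter_ns()
output = convComp(input)
torch.cuda.synchronize()  # wait for the conv kernel itself to complete
et = time.perf_counter_ns()
compiled_sync_ns = et - st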
Per the torch profiler, the compiled version shows massive speedups (up to 8x in some configurations, around 1.5x for most). However, per the regular timer, every configuration is slower with torch.compile (down to a 0.2x "speedup", around 0.6x for most).
I am curious what accounts for this large discrepancy between the two measurements. Any ideas would be greatly appreciated.
Thanks!