Convolution speedup (slowdown) with torch.compile

Hi,

I am comparing the performance of convolutions using both a regular timer and the torch profiler.

import time
import torch
import torch.nn as nn

# a variety of (kernel, stride, padding) configurations are used
conv = nn.Conv2d(in_channels, out_channels, kernel, stride=stride, padding=padding, bias=False, device=DEVICE)
convComp = torch.compile(conv)

# Regular timer
st = time.perf_counter_ns()
output = conv(input)
et = time.perf_counter_ns()

st = time.perf_counter_ns()
output = convComp(input)
et = time.perf_counter_ns()

# With profiler
with torch.autograd.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as prof:
    output = conv(input)
with torch.autograd.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as prof:
    output = convComp(input)

All experiments are run on the GPU, so the kernels are CUDA or Triton. I have tried with and without torch.cuda.synchronize(), as well as with torch.no_grad().
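For reference, the synchronized version of the host-side timing looks roughly like this (a simplified sketch; input and the modules are defined as above):

st = time.perf_counter_ns()
output = convComp(input)
torch.cuda.synchronize()  # wait for the GPU work to finish before reading the timer
et = time.perf_counter_ns()
elapsed_ms = (et - st) / 1e6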

Per the torch profiler, there are massive speedups (up to 8x in some configurations, most around 1.5x). However, per the regular timer, all configurations are slower (down to a 0.2x speedup, most around 0.6x).

I am curious what accounts for this large discrepancy between the two measurements. Any ideas would be greatly appreciated.

Thanks!

Profiling CUDA kernels without synchronization is invalid, as the kernels are executed asynchronously and you would profile only the host side (including the dispatching, compiling, etc.).
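As a sketch, one way to avoid the host-timer pitfall is to time on the device with torch.cuda.Event (assuming a CUDA device; the module and input names are taken from your snippet):

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
output = convComp(input)
end.record()
torch.cuda.synchronize()              # make sure both events have been recorded
elapsed_ms = start.elapsed_time(end)  # elapsed time in milliseconds, measured on the GPU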

Good to know. However, I have run the experiment with synchronization as well, and the results are the same.

torch.compile() is a JIT compiler: the first compilation happens when it sees the first input, so in your first measurement you are measuring both the compilation time and the inference time, whereas your second measurement with the profiler captures only the inference time.
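Roughly what I mean, as a sketch:

_ = convComp(input)       # first call triggers compilation and should be excluded
torch.cuda.synchronize()  # make sure the compilation/warmup work has finished
# time only the calls after this point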

I ran each convolution 10 times and took the median time for each configuration. The two methods of profiling were also not run sequentially; I only wrote them that way above to keep the presentation simple for this post.
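A rough sketch of the per-configuration measurement loop I mean (10 timed runs, median taken afterwards; this is the synchronized variant):

import statistics

times_ns = []
for _ in range(10):
    torch.cuda.synchronize()           # start from an idle GPU
    st = time.perf_counter_ns()
    output = convComp(input)
    torch.cuda.synchronize()           # wait for the kernel to finish
    et = time.perf_counter_ns()
    times_ns.append(et - st)

median_ns = statistics.median(times_ns)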