Understanding the Time Performance of F.conv2d

In my project, I am trying to compare two techniques:

  • spatial convolution
  • filtering via FFT

At this point I am only focused on the number of operations. Optimizing parallelism is not part of my scope.
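
Here is a minimal sketch of what I mean by the two techniques (the sizes are just for illustration; note that the FFT route computes circular convolution, so its output differs from F.conv2d at the borders unless that is handled separately):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 128, 128)   # input: N, C, H, W
w = torch.randn(1, 1, 15, 15)     # kernel

# Technique 1: spatial convolution
# (F.conv2d actually computes cross-correlation)
y_spatial = F.conv2d(x, w, padding=7)

# Technique 2: filtering via FFT -- zero-pad, transform,
# multiply pointwise in the frequency domain, invert.
H, W = x.shape[-2:]
X = torch.fft.rfft2(x, s=(H, W))
K = torch.fft.rfft2(w, s=(H, W))
y_fft = torch.fft.irfft2(X * K, s=(H, W))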

So, in theory, the technique I am comparing against the standard one should be faster, since it performs fewer operations. But that hasn’t been the case when I time both executions. I am assuming that, even though everything runs on the CPU, there is still a lot of parallelism going on.

So, how do I turn that off?

From the PyTorch notes on CPU threading, I assume it would be something like this:

torch.set_num_interop_threads(1)
torch.set_num_threads(1)
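
One caveat: set_num_interop_threads must be called before any inter-op parallel work starts (otherwise PyTorch raises a RuntimeError), and both settings can be verified afterwards:

import torch

# Must run before any parallel work, ideally right after import
torch.set_num_interop_threads(1)
torch.set_num_threads(1)

print(torch.get_num_threads())          # expect 1
print(torch.get_num_interop_threads())  # expect 1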

But when timing both techniques with those settings, I still don’t get the results I am looking for: spatial convolution still takes less time than filtering via FFT.


PS: These are the accumulated timings over 128 runs for:

  • standard convolution
  • a custom implementation of convolution, based on the GEMM method
  • FFT filtering, Python implementation
  • FFT filtering, C++ extension implementation

# time_default, time_custom, time_python, time_cpp
0.8500628471374512, 5.313140630722046, 2.029076337814331, 1.942748785018921

# time_default, time_custom, time_python, time_cpp (number of threads set to 1)
0.894277811050415, 5.291566371917725, 1.9720547199249268, 1.8894264698028564
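
Since the actual benchmark script is not shown, here is the kind of harness I assume would produce numbers like these (the shapes and the F.conv2d call are placeholders for the four implementations):

import time
import torch
import torch.nn.functional as F

torch.set_num_threads(1)

x = torch.randn(1, 1, 256, 256)
w = torch.randn(1, 1, 15, 15)

def bench(fn, runs=128):
    # Accumulate wall-clock time over `runs` calls
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start

time_default = bench(lambda: F.conv2d(x, w, padding=7))
print(f"time_default: {time_default}")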

How come the standard conv2d is way faster when it is supposed to be slower?

What does “standard conv2d” mean in this case? Are you referring to just calling F.conv2d as-is? Even with a single thread, this will typically dispatch to an MKL-DNN (oneDNN) implementation on many systems, which means it will use a highly optimized implementation that is vectorized and memory-hierarchy-aware. You are correct that there will still be implicit parallelism here due to the use of vectorized SIMD instructions. On the other hand, even without parallelism, performance is highly dependent on how well the computation aligns with the CPU’s caches (which, again, is highly optimized here).
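
You can check what your build actually ships with (the exact output varies from build to build):

import torch

# True if this build of PyTorch was compiled with MKL-DNN (oneDNN)
print(torch.backends.mkldnn.is_available())

# Full build configuration, including BLAS backend and SIMD capabilities
print(torch.__config__.show())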

So in practice, even when there is a theoretical complexity improvement, that improvement can be difficult to realize, because different algorithms have different memory access patterns and different degrees of vectorization friendliness. The comparison becomes even harder when the baseline is a highly tuned library implementation (as F.conv2d will use in this case).
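
To make the operation-count side of this concrete, here is a rough back-of-the-envelope comparison (textbook-style estimates, not measurements; the constants below are assumptions): direct convolution costs about n²k² multiply-accumulates, while FFT filtering costs roughly three 2D FFTs plus a pointwise product.

import math

def direct_conv_ops(n, k):
    # ~k*k multiply-accumulates per output pixel, n*n output pixels
    return n * n * k * k

def fft_conv_ops(n, k):
    # Rough estimate: two forward 2D FFTs plus one inverse,
    # each ~5 * n^2 * log2(n^2) real ops, plus n^2 pointwise products
    one_fft = 5 * n * n * math.log2(n * n)
    return 3 * one_fft + n * n

for k in (3, 7, 15, 31):
    ratio = direct_conv_ops(224, k) / fft_conv_ops(224, k)
    print(f"k={k:2d}: direct/FFT op ratio ~{ratio:.2f}")

Under this crude model, a 3×3 kernel makes the direct method a few percent of the FFT approach’s cost, and the crossover only arrives around k ≈ 15 for a 224×224 input. Even past that crossover, constant factors, cache behavior, and vectorization decide which one wins in wall-clock time.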