Why is convolution even faster than torch.sort?

As far as I know, sort is an O(N log N) algorithm, while convolution has a higher asymptotic cost than that. So why is convolution so much faster on the CPU?
Here is the code:

import torch
from torchvision.models import mobilenet_v2
import time

input = torch.rand([64, 3, 224, 224])
model = mobilenet_v2(pretrained=True).features[0:1]
time_sort_avg = 0.0
time_conv_avg = 0.0
for i in range(50):
    start = time.time()
    output = model(input)
    time_conv = time.time() - start
    output = output.view(-1)
    start = time.time()
    sorted_output, index = torch.sort(output)
    time_sort = time.time() - start
    time_sort_avg += time_sort
    time_conv_avg += time_conv
print("sort time avg:", time_sort_avg / 50)
print("conv time avg:", time_conv_avg / 50)

And here is the result:

sort time avg: 2.178560104370117
conv time avg: 0.07962306022644043

There are going to be many factors in play here, but most of them boil down to computer architecture and optimization effort. A typical modern CPU has fairly wide execution units (consider 256/512-bit instructions via AVX2/AVX-512), and the effective width grows further once you add multithreading. A library implementation of conv will often contain hand-crafted kernels that exploit these wide vector extensions to extract as much parallelism as the CPU supports. On the other hand, I believe torch.sort dispatches to something close to std::sort under the hood, which may not use the fancy vector extensions available on modern CPUs and likely doesn't multithread. A lot of engineering time has been spent making conv go as fast as possible for DL applications.
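
If you want to see how much of the gap comes from threading rather than anything algorithmic, one rough experiment (a sketch, not a rigorous benchmark) is to pin PyTorch to a single intra-op thread with torch.set_num_threads and re-time both operations on the same shapes as your snippet:

import time
import torch
from torchvision.models import mobilenet_v2

# Restrict intra-op parallelism to one thread; if conv slows down much more
# than sort, a large part of its advantage was multithreading.
torch.set_num_threads(1)

x = torch.rand(64, 3, 224, 224)
conv_block = mobilenet_v2(pretrained=True).features[0:1].eval()

with torch.no_grad():
    out = conv_block(x)          # warm-up so one-time setup is not timed
flat = out.view(-1)

start = time.time()
with torch.no_grad():
    conv_block(x)
time_conv = time.time() - start

start = time.time()
torch.sort(flat)
time_sort = time.time() - start

print("1 thread -> conv: %.3fs, sort: %.3fs" % (time_conv, time_sort))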

Another perspective is that convolution is an operation with very high arithmetic intensity: a weight or input value is read once and may be multiplied and accumulated hundreds of times (e.g., once per output channel AND per sliding-window position). By contrast, each element in a sort is only compared/moved about log(N) times. This kind of repeated access to the same element is very cheap on modern hardware thanks to the cache hierarchy.
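
To put rough numbers on that for your case: the first block of torchvision's MobileNetV2 is a 3x3, stride-2 conv from 3 to 32 channels (I'm ignoring the BatchNorm/ReLU6 that follow). The figures below are back-of-the-envelope estimates, not measurements:

import math

# Rough work/traffic estimate for a 3x3, stride-2, 3->32 channel conv
# on a [64, 3, 224, 224] float32 input (output is [64, 32, 112, 112]).
N, C_in, H, W = 64, 3, 224, 224
C_out, K = 32, 3
H_out = W_out = 112                               # stride 2, padding 1

macs = N * C_out * H_out * W_out * C_in * K * K   # multiply-accumulates
out_elems = N * C_out * H_out * W_out             # elements later handed to sort

in_bytes = N * C_in * H * W * 4                   # float32 activations read
out_bytes = out_elems * 4                         # float32 activations written
weight_bytes = C_out * C_in * K * K * 4           # tiny, reused for every window

flops = 2 * macs                                  # one multiply + one add per MAC
intensity = flops / (in_bytes + out_bytes + weight_bytes)

print("conv: ~%.2f GFLOPs, ~%.1f FLOPs per byte moved" % (flops / 1e9, intensity))
print("sort: ~%.2f G comparisons (N log N), mostly branchy, data-dependent work"
      % (out_elems * math.log2(out_elems) / 1e9))

The raw operation counts end up in the same ballpark; the difference is that the conv's work is dense, SIMD-friendly fused multiply-adds with lots of reuse per byte, while the sort's work is data-dependent compares and moves.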

There are even more things that I glossed over, but they all add up to make the “constant factor” in the complexity wildly different between something like torch.sort and a highly specialized convolution implementation like the ones you would find in oneDNN/MKL-DNN.
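
As a side note on the measurement itself: time.time() around a single call is noisy and folds one-time setup into the first iteration. torch.utils.benchmark.Timer handles warm-up and averaging for you; here is a sketch of how you could re-run your comparison with it (same model and input as in your snippet):

import torch
from torch.utils import benchmark
from torchvision.models import mobilenet_v2

x = torch.rand(64, 3, 224, 224)
model = mobilenet_v2(pretrained=True).features[0:1].eval()
flat = model(x).detach().view(-1)

t_conv = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
)
t_sort = benchmark.Timer(
    stmt="torch.sort(flat)",
    globals={"torch": torch, "flat": flat},
)

print(t_conv.timeit(20))   # averages over 20 runs, reports per-call time
print(t_sort.timeit(20))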

Wow, really specific! Thank you very much.