Conv2d is 91x slower than regular C implementation

Hello!

I’m trying to understand why I’m seeing such a performance gap between my plain C conv2d implementation and torch.conv2d on CPU.

A single torch.conv2d call on a 512x512 input with a 256x256 kernel takes 59.159 seconds to execute, whereas a very basic C implementation runs in only 0.650 seconds!

I expected it to be the other way around, but no: the plain C implementation is about 91 times faster, despite running on a single core and having no particular optimization (I only rearranged the for loops to improve locality and compiled with -O2 -ffast-math).

What are the causes of such a performance gap on such a small example?


For reference, here’s the code:

  1. Code used to benchmark torch.conv2d
import torch
import time
 
x = torch.rand(1, 1, 512, 512)
k = torch.rand(1, 1, 256, 256)
with torch.inference_mode():
    start = time.time()
    torch.conv2d(x, k)
    end = time.time()
print(end - start)
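As an aside on methodology, a single time.time() measurement also captures one-time costs (dispatcher warm-up, memory allocation, thread-pool spin-up), so it may overstate steady-state time. A hedged sketch of a repeated-timing harness, shown here with smaller tensors than the original 512/256 pair purely so it finishes quickly, could look like:

```python
import time

import torch

def bench(fn, warmup=3, repeats=10):
    """Return the best wall-clock time over several runs, after warm-up."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Smaller sizes than in the question, just to illustrate the harness.
x = torch.rand(1, 1, 64, 64)
k = torch.rand(1, 1, 16, 16)
with torch.inference_mode():
    best = bench(lambda: torch.conv2d(x, k))
print(f"best of 10 runs: {best:.6f} s")
```

Taking the minimum over several runs filters out one-off overhead; the original single-shot number and a best-of-N number can differ substantially.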
  2. Code used to benchmark my implementation (compiled with -O0)
#include <stdio.h>
#include <time.h>
/* tensor_init_random, conv2d and tensor_free come from my tensor library
 * (declarations omitted here for brevity) */
 
int         main(void) {
    struct Tensor   *a = tensor_init_random(512, 512);
    struct Tensor   *b = tensor_init_random(256, 256);
    struct Tensor   *c;
    clock_t     start, end;
 
    start = clock();
    c = conv2d(a, b, 0, 1);
    end = clock();
 
    tensor_free(c);
    /* difftime() expects time_t arguments; clock_t ticks are converted
     * by dividing the difference by CLOCKS_PER_SEC directly */
    printf("%f\n", (double)(end - start) / CLOCKS_PER_SEC);
    return (0);
}
  3. The C implementation of conv2d (compiled with -O2 -ffast-math)
struct Tensor       *conv2d(struct Tensor *tensor, struct Tensor *kernel, 
                            float const bias, uint32_t const stride) {
    __assert_non_zero_stride(stride);
    __assert_conv_kernel_size(tensor, kernel);
    uint32_t const  h = ((tensor->shape[0] - (kernel->shape[0] - 1) - 1) / stride) + 1;
    uint32_t const  w = ((tensor->shape[1] - (kernel->shape[1] - 1) - 1) / stride) + 1;
    struct Tensor   *out = tensor_init_constant(h, w, bias);
    for (uint32_t i = 0; i < h; i++)
        for (uint32_t k = 0; k < kernel->shape[0]; k++)
            for (uint32_t l = 0; l < kernel->shape[1]; l++)
                for (uint32_t j = 0; j < w; j++)
                    /* input row/col must be scaled by stride (equivalent to the
                     * original indexing when stride == 1) */
                    out->data[i * w + j] += kernel->data[k * kernel->shape[1] + l]
                        * tensor->data[(i * stride + k) * tensor->shape[1] + j * stride + l];
    return (out);
}
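For scale, the output-size formula used above, and the raw multiply-add count it implies for the 512x512 input / 256x256 kernel case, can be checked with a few lines of arithmetic (no assumptions beyond the formula in the C code):

```python
def conv2d_out_dim(n, k, stride=1):
    # Same formula as in the C code: ((n - (k - 1) - 1) / stride) + 1
    return (n - (k - 1) - 1) // stride + 1

h = conv2d_out_dim(512, 256)   # 257 output rows
w = conv2d_out_dim(512, 256)   # 257 output columns
macs = h * w * 256 * 256       # one multiply-add per kernel tap per output pixel
print(h, w, macs)              # 257 257 4328587264
```

So this "small example" is roughly 4.3 billion multiply-adds, i.e. on the order of 8.7 GFLOP; an unusually large kernel relative to the input, which may matter for how each library chooses its convolution algorithm.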
  4. Some environment information
PyTorch version: 1.10.0
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.0.1 (arm64)
Clang version: 12.0.5 (clang-1205.0.22.9)
Python platform: macOS-12.0.1-arm64-arm-64bit
Python version: 3.9.7
CUDA runtime version: No CUDA

The same operation runs in around 1.663 sec using TensorFlow.
Is PyTorch less optimized for ARM architectures?