Hello!

I’m trying to understand why I’m seeing such a performance gap between my regular conv2d C implementation and torch.conv2d on CPU.

A single torch.conv2d call with a 512x512 matrix and a 256x256 kernel as arguments takes 59.159 s to execute, whereas a very basic C implementation runs in only 0.650 s!

I expected it to be the other way around, but no: the plain C implementation is about 91 times faster, despite running on a single core and having no particular optimization (I only rearranged the `for` loops to improve locality and compiled with `-O2 -ffast-math`).

What are the causes of such a performance gap on such a small example?

For reference, here’s the code:

- used to benchmark `torch.conv2d`
```
import torch
import time

x = torch.rand(1, 1, 512, 512)
k = torch.rand(1, 1, 256, 256)
with torch.inference_mode():
    start = time.time()
    torch.conv2d(x, k)
    end = time.time()
print(end - start)
```
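In case the single-call number is skewed by one-time setup costs (thread-pool spin-up, dispatcher initialization), here is a sketch of a warm-up plus best-of-N harness I could use instead; the `bench_conv2d` helper name is mine, not part of the original benchmark:

```python
import time
import torch

def bench_conv2d(size, ksize, repeats=3):
    # Best-of-`repeats` wall-clock timing, with one discarded warm-up call.
    x = torch.rand(1, 1, size, size)
    k = torch.rand(1, 1, ksize, ksize)
    with torch.inference_mode():
        torch.conv2d(x, k)  # warm-up: first call pays one-time setup costs
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            torch.conv2d(x, k)
            times.append(time.perf_counter() - start)
    return min(times)

print(bench_conv2d(64, 16))  # small smoke test; the post's sizes are (512, 256)
```

Taking the minimum over a few runs reduces noise from the OS scheduler without hiding the slow path itself.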

- used to benchmark my implementation (compiled with `-O0`)

```
#include <stdio.h>
#include <time.h>

int main(void) {
    struct Tensor *a = tensor_init_random(512, 512);
    struct Tensor *b = tensor_init_random(256, 256);
    struct Tensor *c;
    clock_t start, end;

    start = clock();
    c = conv2d(a, b, 0, 1);
    end = clock();
    tensor_free(a);
    tensor_free(b);
    tensor_free(c);
    /* clock() returns clock_t ticks, not time_t seconds,
       so take the difference directly instead of using difftime() */
    printf("%f\n", (double)(end - start) / CLOCKS_PER_SEC);
    return (0);
}
```

- The C implementation of conv2d (compiled with `-O2 -ffast-math`)

```
struct Tensor *conv2d(struct Tensor *tensor, struct Tensor *kernel,
                      float const bias, uint32_t const stride) {
    __assert_non_zero_stride(stride);
    __assert_conv_kernel_size(tensor, kernel);
    uint32_t const h = ((tensor->shape[0] - (kernel->shape[0] - 1) - 1) / stride) + 1;
    uint32_t const w = ((tensor->shape[1] - (kernel->shape[1] - 1) - 1) / stride) + 1;
    struct Tensor *out = tensor_init_constant(h, w, bias);

    for (uint32_t i = 0; i < h; i++)
        for (uint32_t k = 0; k < kernel->shape[0]; k++)
            for (uint32_t l = 0; l < kernel->shape[1]; l++)
                for (uint32_t j = 0; j < w; j++)
                    /* the stride must scale the output coordinates when
                       indexing the input (identical to before at stride 1) */
                    out->data[i * w + j] += kernel->data[k * kernel->shape[1] + l]
                        * tensor->data[(i * stride + k) * tensor->shape[1] + j * stride + l];
    return (out);
}
```
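To sanity-check that both sides compute the same thing (torch.conv2d is a cross-correlation, like the loop nest above, not a flipped convolution), here is a small NumPy transcription of the same loops compared against `torch.conv2d` on a tiny input; `conv2d_ref` is just an illustrative name:

```python
import numpy as np
import torch

def conv2d_ref(x, kern, bias=0.0, stride=1):
    # NumPy transcription of the C loop nest: valid cross-correlation plus bias.
    h = (x.shape[0] - kern.shape[0]) // stride + 1
    w = (x.shape[1] - kern.shape[1]) // stride + 1
    out = np.full((h, w), bias, dtype=x.dtype)
    for i in range(h):
        for kk in range(kern.shape[0]):
            for l in range(kern.shape[1]):
                for j in range(w):
                    out[i, j] += kern[kk, l] * x[i * stride + kk, j * stride + l]
    return out

x = np.random.rand(16, 16).astype(np.float32)
kern = np.random.rand(5, 5).astype(np.float32)
ours = conv2d_ref(x, kern)
ref = torch.conv2d(torch.from_numpy(x)[None, None],
                   torch.from_numpy(kern)[None, None])[0, 0].numpy()
print(np.allclose(ours, ref, atol=1e-4))  # → True
```

So the two implementations agree numerically; the question is purely about the speed difference.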

- Some environment information

```
PyTorch version: 1.10.0
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.0.1 (arm64)
Clang version: 12.0.5 (clang-1205.0.22.9)
Python platform: macOS-12.0.1-arm64-arm-64bit
Python version: 3.9.7
CUDA runtime version: No CUDA
```