Hello!
I’m trying to understand why I’m seeing such a performance gap between my regular conv2d C implementation and torch.conv2d on CPU.
A single torch.conv2d call with a 512x512 input and a 256x256 kernel as arguments takes 59.159 s to execute, whereas a very basic C implementation runs in only 0.650 s!
I expected it to be the other way around, but no: the plain C implementation is about 91 times faster, despite running on a single core and having no particular optimizations (I only rearranged the for loops to improve locality and compiled with `-O2 -ffast-math`).
What are the causes of such a performance gap on such a small example?
For reference, here’s the code:
- used to benchmark `torch.conv2d`:
```python
import torch
import time

x = torch.rand(1, 1, 512, 512)
k = torch.rand(1, 1, 256, 256)

with torch.inference_mode():
    start = time.time()
    torch.conv2d(x, k)
    end = time.time()

print(end - start)
```
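As a methodological aside (it cannot explain a 91x gap, but it is worth ruling out): `time.time()` has coarse resolution, and the first call can pay one-time dispatch and allocation costs, so a warm-up run plus `time.perf_counter()` gives a cleaner number. A sketch of that pattern, using smaller illustrative sizes than in the question so it runs quickly:

```python
import time
import torch

# Illustrative sizes only, smaller than the 512x512 / 256x256 case above
x = torch.rand(1, 1, 64, 64)
k = torch.rand(1, 1, 16, 16)

with torch.inference_mode():
    torch.conv2d(x, k)           # warm-up: absorbs one-time setup costs
    start = time.perf_counter()  # monotonic, high-resolution timer
    y = torch.conv2d(x, k)
    elapsed = time.perf_counter() - start

print(f"{elapsed:.6f} s, output shape {tuple(y.shape)}")
```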
- used to benchmark my implementation (compiled with `-O0`):
```c
#include <stdio.h>
#include <time.h>

int main(void) {
    struct Tensor *a = tensor_init_random(512, 512);
    struct Tensor *b = tensor_init_random(256, 256);
    struct Tensor *c;
    clock_t start, end;

    start = clock();
    c = conv2d(a, b, 0, 1);
    end = clock();
    tensor_free(a);
    tensor_free(b);
    tensor_free(c);
    /* difftime() takes time_t, not clock_t; a cast gives the tick delta directly */
    printf("%f\n", (double)(end - start) / CLOCKS_PER_SEC);
    return (0);
}
```
- The C implementation of conv2d (compiled with `-O2 -ffast-math`):
```c
struct Tensor *conv2d(struct Tensor *tensor, struct Tensor *kernel,
                      float const bias, uint32_t const stride) {
    __assert_non_zero_stride(stride);
    __assert_conv_kernel_size(tensor, kernel);
    uint32_t const h = ((tensor->shape[0] - (kernel->shape[0] - 1) - 1) / stride) + 1;
    uint32_t const w = ((tensor->shape[1] - (kernel->shape[1] - 1) - 1) / stride) + 1;
    struct Tensor *out = tensor_init_constant(h, w, bias);

    for (uint32_t i = 0; i < h; i++)
        for (uint32_t k = 0; k < kernel->shape[0]; k++)
            for (uint32_t l = 0; l < kernel->shape[1]; l++)
                for (uint32_t j = 0; j < w; j++)
                    out->data[i * w + j] += kernel->data[k * kernel->shape[1] + l]
                                          * tensor->data[(i + k) * tensor->shape[1] + l + j];
    return (out);
}
```
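For what it's worth, the loop nest above computes a cross-correlation (no kernel flip), which is also what `torch.conv2d` computes, so the two should agree numerically. A quick sanity check of that equivalence on tiny tensors (sizes here are illustrative, not the ones from the benchmark):

```python
import torch

x = torch.rand(1, 1, 8, 8)
k = torch.rand(1, 1, 3, 3)

# Naive loops mirroring the C implementation (stride 1, no padding, zero bias)
kh, kw = k.shape[2], k.shape[3]
h = x.shape[2] - kh + 1
w = x.shape[3] - kw + 1
out = torch.zeros(h, w)
for i in range(h):
    for j in range(w):
        out[i, j] = (x[0, 0, i:i+kh, j:j+kw] * k[0, 0]).sum()

ref = torch.conv2d(x, k)[0, 0]
print(torch.allclose(out, ref, atol=1e-5))  # expected: True
```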
- Some environment information:
```
PyTorch version: 1.10.0
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.0.1 (arm64)
Clang version: 12.0.5 (clang-1205.0.22.9)
Python platform: macOS-12.0.1-arm64-arm-64bit
Python version: 3.9.7
CUDA runtime version: No CUDA
```