PyTorch matmul >3x slower than TensorFlow on AMD CPU

Hi, I recently noticed that matrix multiplication on my AMD Ryzen CPU is significantly faster in TensorFlow than in PyTorch. Is there any way to fix this, e.g. by switching to a different BLAS backend? I installed both frameworks with pip (torch==1.13.1 and tensorflow==2.11.0).
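
In case it's relevant, the PyTorch build configuration (including which BLAS library the wheel links against) and both frameworks' default thread settings can be printed like this; I'm not sure whether the pip wheel ends up using MKL or OpenBLAS on AMD:

import tensorflow as tf
import torch

# Which BLAS/OpenMP libraries the PyTorch wheel was built against
print(torch.__config__.show())
print(torch.__config__.parallel_info())
print('torch threads:', torch.get_num_threads())

# TensorFlow thread settings (0 means TF picks the defaults itself)
print('tf intra-op threads:', tf.config.threading.get_intra_op_parallelism_threads())
print('tf inter-op threads:', tf.config.threading.get_inter_op_parallelism_threads())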

Code:

import tensorflow as tf
import torch
from timeit import default_timer as timer

#tf.config.threading.set_inter_op_parallelism_threads(1)
#tf.config.threading.set_intra_op_parallelism_threads(1)
shape = (8192, 8192)
with tf.device('/CPU:0'):
    x = tf.zeros(shape)
    tf.matmul(x, x)  # warmup
    start = timer()
    tf.matmul(x, x)
    end = timer()
    print(f'Tensorflow time: {end-start:g} s')

#torch.set_num_threads(1)
#torch.set_num_interop_threads(1)
x = torch.zeros(*shape)
torch.matmul(x, x)  # warmup
start = timer()
torch.matmul(x, x)
end = timer()
print(f'PyTorch time: {end-start:g} s')

Output:

Tensorflow time: 2.02568 s
PyTorch time: 6.78812 s

The problem also exists for batch matrix multiplication, and it also shows up on an M1 Mac (to a lesser extent), but apparently not on an Intel CPU. I also noticed that TensorFlow fully uses all 16 logical cores while PyTorch only uses the 8 physical ones, but the problem persists when limiting both frameworks to a single thread (12.7 s vs 34.5 s).
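
Setting the usual OpenMP/MKL environment variables before the imports should have the same effect as the commented-out thread-limiting calls above, in case that makes a difference (this assumes the wheels use OpenMP/MKL internally):

import os

# Must be set before torch/tensorflow are imported, otherwise they can be ignored
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import tensorflow as tf
import torch

# In-framework equivalents of the environment variables above
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

print('torch threads:', torch.get_num_threads())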
