Hi, I recently noticed that matrix multiplication on my AMD Ryzen CPU is significantly faster in TensorFlow than in PyTorch. Is there any way to fix this, e.g. by switching to a different BLAS backend? I installed both frameworks using pip (torch==1.13.1 and tensorflow==2.11.0).
Code:
import tensorflow as tf
import torch
from timeit import default_timer as timer
#tf.config.threading.set_inter_op_parallelism_threads(1)
#tf.config.threading.set_intra_op_parallelism_threads(1)
shape = (8192, 8192)
with tf.device('/CPU:0'):
    x = tf.zeros(shape)
    tf.matmul(x, x)  # warmup
    start = timer()
    tf.matmul(x, x)
    end = timer()
print(f'Tensorflow time: {end-start:g} s')
#torch.set_num_threads(1)
#torch.set_num_interop_threads(1)
x = torch.zeros(*shape)
torch.matmul(x, x) # warmup
start = timer()
torch.matmul(x, x)
end = timer()
print(f'PyTorch time: {end-start:g} s')
Output:
Tensorflow time: 2.02568 s
PyTorch time: 6.78812 s
The problem also occurs for batch matrix multiplication, and on an M1 Mac as well (to a lesser extent), but apparently not on Intel CPUs. I also noticed that TensorFlow fully uses all 16 logical cores (SMT threads) while PyTorch only uses the 8 physical ones, but the gap persists when limiting both frameworks to a single thread (12.7 s vs. 34.5 s).
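In case it helps with diagnosing this, here is what I used to check which BLAS library and threading setup the PyTorch wheel was built with (a minimal inspection sketch; the exact output depends on your installed wheel — the pip builds typically link against MKL, which is tuned primarily for Intel CPUs):

```python
import torch

# Full build configuration, including which BLAS library
# (MKL, OpenBLAS, ...) the wheel was compiled against.
print(torch.__config__.show())

# Intra-op/inter-op threading details (OpenMP, MKL thread settings).
print(torch.__config__.parallel_info())

# Current intra-op thread count used for ops like matmul.
print(torch.get_num_threads())
```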