PyTorch matmul on GPU is slower than on CPU

I am comparing how much faster matmul is on the GPU than on the CPU. Surprisingly, my test shows that running on the GPU is slower than running on the CPU.

import torch

'''
Tensor shape = (batch,
                attention heads,
                features per head,
                height,
                width,
                attention window
                )
Goal: we want to apply the dot product
over the last dimension only
'''
# softmax score for the Query and Key
QK = torch.randn([64, 8, 4, 28, 28, 9])
# Value Tensor 
V = torch.randn([64, 8, 4, 28, 28, 9])

def method1(QK, V):
    """matmul way"""
    
    # Prepare the right shape for the dot product over the last dimension
    out1 = torch.matmul(QK.unsqueeze(-2), V.unsqueeze(-1))
    # Reshape it back to the original shape
    out1 = out1.squeeze(-1).squeeze(-1)
    return out1

def method2(QK, V):
    """Einstein summation"""
    return torch.einsum('bnchwk,bnchwk -> bnchw', QK, V)

# torch CPU

%timeit -n 500 method1(QK, V)
%timeit -n 500 method2(QK, V)

# torch GPU

QK = QK.cuda()
V = V.cuda()

%timeit -n 500 method1(QK, V)
%timeit -n 500 method2(QK, V)

Torch CPU
Method1: 2.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Method2: 2.64 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

Torch GPU
Method1: 3.34 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Method2: 3.43 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

Am I doing anything wrong? The result doesn’t make sense to me.

CUDA operations are executed asynchronously, so you would need to synchronize your code before starting and stopping the timers via torch.cuda.synchronize() to get proper timings. Also, the operations could be executed in a loop to calculate the average time and reduce the noise a bit.
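
For example, a minimal timing sketch with explicit synchronization (assuming a CUDA device is available; it reuses the method2 defined above) could look like this:

import time
import torch

QK = torch.randn([64, 8, 4, 28, 28, 9], device='cuda')
V = torch.randn([64, 8, 4, 28, 28, 9], device='cuda')

# warmup iterations to exclude one-time CUDA initialization costs
for _ in range(10):
    method2(QK, V)

torch.cuda.synchronize()  # wait for pending kernels before starting the timer
start = time.perf_counter()
for _ in range(500):
    method2(QK, V)
torch.cuda.synchronize()  # wait for the queued kernels before stopping the timer
elapsed = (time.perf_counter() - start) / 500
print(f'{elapsed * 1e3:.3f} ms per iteration')

(torch.utils.benchmark.Timer would also take care of the synchronization and warmup for you.)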

That being said, if your workload is small you won't be able to saturate the GPU (and you would also pay the kernel launch overhead), so the CPU might indeed be faster.
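
One way to check this is to grow the workload and compare both devices. A rough sketch (the batch sizes here are arbitrary, just for illustration):

import time
import torch

def bench(device, batch, iters=100):
    QK = torch.randn([batch, 8, 4, 28, 28, 9], device=device)
    V = torch.randn([batch, 8, 4, 28, 28, 9], device=device)
    # warmup
    for _ in range(10):
        torch.einsum('bnchwk,bnchwk -> bnchw', QK, V)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.einsum('bnchwk,bnchwk -> bnchw', QK, V)
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for batch in [64, 256, 1024]:
    cpu_t = bench('cpu', batch)
    gpu_t = bench('cuda', batch)
    print(f'batch={batch}: CPU {cpu_t * 1e3:.2f} ms, GPU {gpu_t * 1e3:.2f} ms')

As the batch grows, you would expect the GPU timings to scale much better than the CPU ones once the device is actually saturated.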