I am comparing how much faster matmul is on the GPU than on the CPU. Surprisingly, my test shows that running on the GPU is slower than running on the CPU.
import torch
import numpy as np
'''
Tensor shape = (batch,
attention heads,
features per head,
height,
width,
attention window
)
Goal: we want to apply the dot product
only along the last dimension
'''
# softmax score for the Query and Key
QK = torch.randn([64, 8, 4, 28, 28, 9])
# Value Tensor
V = torch.randn([64, 8, 4, 28, 28, 9])
def method1(QK, V):
    """matmul way"""
    # Prepare the right shapes for a dot product over the last dimension
    out1 = torch.matmul(QK.unsqueeze(-2), V.unsqueeze(-1))
    # Reshape back to the original shape
    out1 = out1.squeeze(-1).squeeze(-1)
    return out1
def method2(QK, V):
    """Einstein summation"""
    return torch.einsum('bnchwk,bnchwk -> bnchw', QK, V)
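Both methods should compute the same thing (a dot product over the last axis, giving shape (64, 8, 4, 28, 28)); a quick sanity check along these lines should confirm they agree (this check is not part of the timings below):
# Sanity check (not timed): both methods should match numerically
out1 = method1(QK, V)
out2 = method2(QK, V)
print(out1.shape)                             # torch.Size([64, 8, 4, 28, 28])
print(torch.allclose(out1, out2, atol=1e-5))  # expected: True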
# torch CPU
%timeit -n 500 method1(QK,V)
%timeit -n 500 method2(QK,V)
# torch GPU
QK = QK.cuda()
V = V.cuda()
%timeit -n 500 method1(QK,V)
%timeit -n 500 method2(QK,V)
Torch CPU
Method1: 2.7 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Method2: 2.64 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Torch GPU
Method1: 3.34 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Method2: 3.43 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
Am I doing anything wrong? The result doesn’t make sense to me.
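For reference, this is roughly how I would re-time the GPU calls with a warm-up pass and an explicit torch.cuda.synchronize(), in case the asynchronous kernel launches are skewing %timeit. The time_gpu helper below is just a sketch I have not used for the numbers above:
import time

def time_gpu(fn, QK, V, iters=500):
    """Rough GPU timing helper: warm up once, then synchronize before and after."""
    fn(QK, V)                      # warm-up call
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(QK, V)
    torch.cuda.synchronize()       # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

print(time_gpu(method1, QK, V))    # average seconds per call
print(time_gpu(method2, QK, V))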