I’m using basic matrix multiplication on two 2D tensors, but the GPU version is much slower than the CPU version. Is this expected? And why is there so much CPU time when calling torch.mm on CUDA tensors?
import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

# CPU: (3 x 1000000) @ (1000000 x 3) -> 3 x 3
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

# GPU: same product; the .cuda() copies happen inside the profiled region
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.cuda().T, b.cuda())
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
The profiler output is below. The first table is the CPU run and the second is the CUDA run:
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Name     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
mm       98.78%            12.050ms        98.78%       12.050ms   12.050ms      97.98%        12.057ms    12.057ms       1
permute  0.83%             101.190us       0.83%        101.190us  101.190us     0.82%         100.992us   100.992us      1
numpy_T  0.39%             47.549us        1.22%        148.739us  148.739us     1.20%         147.360us   147.360us      1
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 12.199ms
CUDA time total: 12.305ms
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Name     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
mm       98.91%            331.972ms       98.91%       331.972ms  331.972ms     98.69%        332.807ms   332.807ms      1
to       0.87%             2.905ms         1.08%        3.639ms    1.819ms       1.08%         3.638ms     1.819ms        2
empty    0.22%             733.251us       0.22%        733.251us  366.625us     0.22%         739.072us   369.536us      2
permute  0.01%             19.357us        0.01%        19.357us   19.357us      0.01%         19.552us    19.552us       1
numpy_T  0.00%             11.390us        0.01%        30.747us   30.747us      0.01%         30.464us    30.464us       1
-------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 335.642ms
CUDA time total: 337.234ms