Big Matrix Multiplication Slower on GPU

I’m doing a basic matrix multiplication of two 2D tensors, but the GPU version is much slower than the CPU one. Is this expected performance? And why is so much CPU time reported when calling torch.mm on CUDA tensors?

import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.cuda().T, b.cuda())
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The profiler output is:

-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mm           98.78%           12.050ms         98.78%           12.050ms         12.050ms         97.98%           12.057ms         12.057ms         1                
permute      0.83%            101.190us        0.83%            101.190us        101.190us        0.82%            100.992us        100.992us        1                
numpy_T      0.39%            47.549us         1.22%            148.739us        148.739us        1.20%            147.360us        147.360us        1                
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 12.199ms
CUDA time total: 12.305ms

-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mm           98.91%           331.972ms        98.91%           331.972ms        331.972ms        98.69%           332.807ms        332.807ms        1                
to           0.87%            2.905ms          1.08%            3.639ms          1.819ms          1.08%            3.638ms          1.819ms          2                
empty        0.22%            733.251us        0.22%            733.251us        366.625us        0.22%            739.072us        369.536us        2                
permute      0.01%            19.357us         0.01%            19.357us         19.357us         0.01%            19.552us         19.552us         1                
numpy_T      0.00%            11.390us         0.01%            30.747us         30.747us         0.01%            30.464us         30.464us         1                
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 335.642ms
CUDA time total: 337.234ms

I think most of the time is spent sending the data to the GPU rather than on the multiplication itself. If the tensors are moved to the GPU before profiling, torch.mm is much faster:

import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Self CPU time total: 12.777ms
# CUDA time total: 12.951ms

a, b = a.cuda(), b.cuda()
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Self CPU time total: 1.389ms
# CUDA time total: 1.8451ms

I’m using an RTX 2080 Ti and a Xeon E5-2678 v3.
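
For what it’s worth, here is a minimal sketch (not part of the original reply) that times the host-to-device copy and the multiplication separately using torch.cuda.synchronize(). The exact numbers depend on the hardware, but it makes the split visible:

import time
import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

_ = torch.zeros(1, device="cuda")   # pay the one-time CUDA init cost up front
torch.cuda.synchronize()

# Host-to-device copies on their own.
t0 = time.perf_counter()
a_gpu, b_gpu = a.cuda(), b.cuda()
torch.cuda.synchronize()
t1 = time.perf_counter()

# Matrix multiplication on its own (after a warm-up launch).
torch.mm(a_gpu.T, b_gpu)
torch.cuda.synchronize()
t2 = time.perf_counter()
torch.mm(a_gpu.T, b_gpu)
torch.cuda.synchronize()
t3 = time.perf_counter()

print(f"copy to GPU: {(t1 - t0) * 1e3:.3f} ms")
print(f"mm on GPU:   {(t3 - t2) * 1e3:.3f} ms")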


Thanks for your reply. But why is there still 1.389 ms of CPU time for torch.mm?
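
One likely reason (not confirmed in this thread): CUDA kernels are launched asynchronously, so torch.mm still does host-side work (operator dispatch, argument checks, queuing the cuBLAS kernel) before returning, and that work shows up as CPU time. A rough sketch, assuming a and b already live on the GPU as in the reply above, that separates the launch cost from the full wall-clock time:

import time
import torch

a = torch.randn(1000000, 3, device="cuda")
b = torch.randn(1000000, 3, device="cuda")
torch.mm(a.T, b)              # warm-up (first launch also initializes cuBLAS)
torch.cuda.synchronize()

t0 = time.perf_counter()
c = torch.mm(a.T, b)          # returns once the kernel is queued
t1 = time.perf_counter()
torch.cuda.synchronize()      # wait for the GPU to actually finish
t2 = time.perf_counter()

print(f"host-side launch time: {(t1 - t0) * 1e3:.3f} ms")
print(f"launch + kernel time:  {(t2 - t0) * 1e3:.3f} ms")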