Big Matrix Multiplication Slower on GPU

I’m doing a basic matrix multiplication of two 2D tensors, but the GPU version is much slower than the CPU one. Is this expected performance? And why is so much CPU time reported when calling torch.mm on CUDA tensors?

import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.cuda().T, b.cuda())
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The profiler output is:

-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mm           98.78%           12.050ms         98.78%           12.050ms         12.050ms         97.98%           12.057ms         12.057ms         1                
permute      0.83%            101.190us        0.83%            101.190us        101.190us        0.82%            100.992us        100.992us        1                
numpy_T      0.39%            47.549us         1.22%            148.739us        148.739us        1.20%            147.360us        147.360us        1                
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 12.199ms
CUDA time total: 12.305ms

-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
mm           98.91%           331.972ms        98.91%           331.972ms        331.972ms        98.69%           332.807ms        332.807ms        1                
to           0.87%            2.905ms          1.08%            3.639ms          1.819ms          1.08%            3.638ms          1.819ms          2                
empty        0.22%            733.251us        0.22%            733.251us        366.625us        0.22%            739.072us        369.536us        2                
permute      0.01%            19.357us         0.01%            19.357us         19.357us         0.01%            19.552us         19.552us         1                
numpy_T      0.00%            11.390us         0.01%            30.747us         30.747us         0.01%            30.464us         30.464us         1                
-----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 335.642ms
CUDA time total: 337.234ms

I think most of the time is spent sending the data to the GPU rather than on the multiplication itself. If the tensors are moved to the GPU before profiling, torch.mm is much faster:

import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Self CPU time total: 12.777ms
# CUDA time total: 12.951ms

a, b = a.cuda(), b.cuda()
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.mm(a.T, b)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
# Self CPU time total: 1.389ms
# CUDA time total: 1.8451ms

I’m using an RTX 2080 Ti and a Xeon E5-2678 v3.
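
For what it’s worth, here is a minimal sketch (not part of the original reply) that times the host-to-device copy and the multiplication separately using torch.cuda.synchronize(). The exact numbers depend on the hardware, but it makes the split visible:

import time
import torch

a = torch.randn(1000000, 3)
b = torch.randn(1000000, 3)

_ = torch.zeros(1, device="cuda")   # pay the one-time CUDA init cost up front
torch.cuda.synchronize()

# Host-to-device copies on their own.
t0 = time.perf_counter()
a_gpu, b_gpu = a.cuda(), b.cuda()
torch.cuda.synchronize()
t1 = time.perf_counter()

# Matrix multiplication on its own (after a warm-up launch).
torch.mm(a_gpu.T, b_gpu)
torch.cuda.synchronize()
t2 = time.perf_counter()
torch.mm(a_gpu.T, b_gpu)
torch.cuda.synchronize()
t3 = time.perf_counter()

print(f"copy to GPU: {(t1 - t0) * 1e3:.3f} ms")
print(f"mm on GPU:   {(t3 - t2) * 1e3:.3f} ms")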


Thanks for your reply. But why is there still 1.389 ms of CPU time for torch.mm?
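
One likely reason (not confirmed in this thread): CUDA kernels are launched asynchronously, so torch.mm still does host-side work (operator dispatch, argument checks, queuing the cuBLAS kernel) before returning, and that work shows up as CPU time. A rough sketch, assuming a and b already live on the GPU as in the reply above, that separates the launch cost from the full wall-clock time:

import time
import torch

a = torch.randn(1000000, 3, device="cuda")
b = torch.randn(1000000, 3, device="cuda")
torch.mm(a.T, b)              # warm-up (first launch also initializes cuBLAS)
torch.cuda.synchronize()

t0 = time.perf_counter()
c = torch.mm(a.T, b)          # returns once the kernel is queued
t1 = time.perf_counter()
torch.cuda.synchronize()      # wait for the GPU to actually finish
t2 = time.perf_counter()

print(f"host-side launch time: {(t1 - t0) * 1e3:.3f} ms")
print(f"launch + kernel time:  {(t2 - t0) * 1e3:.3f} ms")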