Strange time profiling results on GPU

I wrote a simple program to profile some basic PyTorch functions:

import numpy as np
import torch
from torch.autograd import Variable

use_cuda = True

if use_cuda:
    dtype = torch.cuda.FloatTensor  # tensors live on the GPU
else:
    dtype = torch.FloatTensor  # tensors live on the CPU

x = Variable(torch.randn(2000, 2000).type(dtype))

@profile  # injected by kernprof (line_profiler) at runtime
def test():
    a = x + 0.123
    b = torch.mean(x)
    c = torch.sum(x)
    d = torch.max(x)


for i in range(100):
    test()

I run the program with kernprof -lv test.py. The result on GPU is:

Wrote profile results to test.py.lprof
Timer unit: 1e-06 s

Total time: 0.29263 s
File: test.py
Function: test at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def test():
    16       100         2648     26.5      0.9      a = x + 0.123
    17       100        36085    360.9     12.3      b = torch.mean(x)
    18       100        17267    172.7      5.9      c = torch.sum(x)
    19       100       236630   2366.3     80.9      d = torch.max(x)

My question is: why are there such big differences between the run times of these four functions?
As a comparison, the result on CPU looks much more reasonable to me:

Wrote profile results to test.py.lprof
Timer unit: 1e-06 s

Total time: 2.1066 s
File: test.py
Function: test at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def test():
    16       100       214073   2140.7     10.2      a = x + 0.123
    17       100       347015   3470.2     16.5      b = torch.mean(x)
    18       100       344269   3442.7     16.3      c = torch.sum(x)
    19       100      1201245  12012.5     57.0      d = torch.max(x)

The CUDA API is asynchronous in nature: kernel launches return control to Python immediately, so you are probably not measuring the real runtime of each operation.

Run your program with:

CUDA_LAUNCH_BLOCKING=1 python my_program.py

This will force the CUDA API to be synchronous for the lifetime of the program.
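If you would rather not serialize the whole program, you can also synchronize manually around the region you want to time. A minimal sketch (timed_op is just an illustrative helper, not a PyTorch function):

import time
import torch

x = torch.randn(2000, 2000).cuda()

def timed_op(fn, *args):
    # Flush all previously queued kernels before starting the clock
    torch.cuda.synchronize()
    start = time.time()
    result = fn(*args)
    # Wait for the kernel to actually finish, not just be launched
    torch.cuda.synchronize()
    return result, time.time() - start

_, elapsed = timed_op(torch.max, x)
print('torch.max: %.6f s' % elapsed)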


Thanks for the hint. I ran it again according to your suggestion and the result makes more sense now. However, max is still very slow compared to the other operations; do you know why? Here is the new result:

Wrote profile results to test.py.lprof
Timer unit: 1e-06 s

Total time: 0.306049 s
File: test.py
Function: test at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def test():
    16       100        23100    231.0      7.5      a = x + 0.123
    17       100        19004    190.0      6.2      b = torch.mean(x)
    18       100        18352    183.5      6.0      c = torch.sum(x)
    19       100       245593   2455.9     80.2      d = torch.max(x)
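As a side note on methodology: CUDA events can time individual operations on the device without making the whole program synchronous. A minimal sketch of that approach, assuming the same 2000x2000 tensor:

import torch

x = torch.randn(2000, 2000).cuda()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
d = torch.max(x)
end.record()

# elapsed_time() is only valid once the recorded events have completed
torch.cuda.synchronize()
print('torch.max: %.3f ms' % start.elapsed_time(end))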