Cannot reach theoretical performance with `torch.matmul`

Hi, I’m working with the following script to benchmark my RTX 3080 GPU. With torch.float16, torch.matmul reaches about 29 TFLOPS, but with torch.float32 it reaches only 9.8 TFLOPS. Since the theoretical performance of the RTX 3080 is 29.77 TFLOPS and the GPU has no dedicated half-precision units, something seems wrong with these results.

Do you have any information on this?

import torch
import time

torch.backends.cudnn.benchmark = True

size = 8192*2
repeat = 100

with torch.no_grad():

    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(repeat):
        y = torch.matmul(x, w)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(elapsed, " secs")
    print(size**3 / (elapsed / repeat) / 1e9, " GFLOPS")
// for torch.float16
15.150275945663452  secs
29029.38026442699  GFLOPS

//for torch.float32
44.60238480567932  secs
9860.550400878896  GFLOPS

I believe the stated number is the TF32 TFLOPS figure, and TF32 might be disabled by default depending on your setup.

Could you try running torch.set_float32_matmul_precision('high') before your benchmark and check if that changes the results?
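For example, something along these lines before creating the tensors (both knobs exist in reasonably recent PyTorch versions and should have the same effect for matmul):

import torch

# ask PyTorch to use TF32 tensor cores for float32 matmuls
torch.set_float32_matmul_precision('high')

# older-style flag that also enables TF32 for matmul
torch.backends.cuda.matmul.allow_tf32 = True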

Thank you for your reply.

I inserted the call you suggested, and the results changed as follows:

// for torch.float16
14.611926078796387  secs
30098.913633454387  GFLOPS

// for torch.float32
28.9222469329834  secs
15206.416530498134  GFLOPS

In addition, I changed the repeat variable from 100 to 10, and the results changed as follows:

// for torch.float16 (with repeat = 10)
1.8297147750854492  secs
24036.135977214086  GFLOPS

// for torch.float32 (with repeat = 10)
3.2416188716888428  secs
13567.226088349447  GFLOPS

The modified script is as follows:

import torch
import time

torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')

size = 8192*2
repeat = 10

with torch.no_grad():

    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(repeat):
        y = torch.matmul(x, w)
        torch.cuda.synchronize()
    elapsed = time.time() - start
    print(elapsed, " secs")
    print(size**3 / (elapsed / repeat) / 1e9, " GFLOPS")

I would also recommend moving the second torch.cuda.synchronize() out of the for loop (so you synchronize only once after running all iterations) and see if that helps somewhat.
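For concreteness, a minimal sketch of that timing pattern (assuming x, w, size, and repeat are set up as in your script):

torch.cuda.synchronize()
start = time.time()
for i in range(repeat):
    y = torch.matmul(x, w)
torch.cuda.synchronize()  # single synchronization after all iterations
elapsed = time.time() - start
print(elapsed, " secs")
print(size**3 / (elapsed / repeat) / 1e9, " GFLOPS")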

I changed the script as follows:

for i in range(repeat):
    torch.cuda.synchronize()
    start = time.time()
    y = torch.matmul(x, w)
    torch.cuda.synchronize()
    print(size**3 / (time.time() - start) / 1e12, " TFLOPS")

So the results changed like this:

// torch.float16
21.586235349878155  TFLOPS
28.278445566394296  TFLOPS
29.119723232835405  TFLOPS
29.780876695030035  TFLOPS
29.610455979112597  TFLOPS
29.58355453208967  TFLOPS
29.539698391619098  TFLOPS
29.567379058328427  TFLOPS
29.4742516322403  TFLOPS
29.494398091095096  TFLOPS

I think the first iteration is always slower because of memory transfers or some other one-time overhead (see the warm-up sketch after the numbers below). With these settings, the float32 results are as follows:

6.2574480417038565  TFLOPS
14.587065018163548  TFLOPS
15.773094195853083  TFLOPS
15.74772563022044  TFLOPS
15.447554939534173  TFLOPS
15.692985973957423  TFLOPS
15.671055399278538  TFLOPS
15.581757487920573  TFLOPS
15.683139624736594  TFLOPS
15.634119138360994  TFLOPS
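A minimal sketch of excluding that first, slower call by adding one untimed warm-up matmul before the timed region (same x and w as above):

y = torch.matmul(x, w)      # untimed warm-up call
torch.cuda.synchronize()    # make sure the warm-up has finished
start = time.time()         # start timing only from here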

As you can see from the results, the suggestion to move torch.cuda.synchronize() outside the loop did not make a difference. (In fact, when I checked the results with your suggestion before modifying the script, they were unchanged as well.)

I am able to get ~16 TFLOPS with a slightly modified script, but that is still only ~50% of “speed-of-light”.

import torch
import time

torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')

size = 8192*2
repeat = 20

with torch.no_grad():
    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
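    # warm-up matmul so one-time startup cost is excluded from the timed loop below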
    y = torch.matmul(x, w)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(repeat):
        y = torch.matmul(x, w)
    torch.cuda.synchronize()
    end = time.perf_counter()
    print(end - start, " secs")
    print(size**3 / ((end - start) / repeat) / 1e9, " GFLOPS")

print("now using CUDA Graphs...")
with torch.no_grad():
    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    # warm up on a side stream before graph capture
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for i in range(3):
            y = torch.matmul(x, w)
    torch.cuda.current_stream().wait_stream(s)
    # capture a single matmul into a CUDA graph, then replay it in the timed loop
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = torch.matmul(x, w)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(repeat):
        g.replay()
    torch.cuda.synchronize()
    end = time.perf_counter()
    print(end - start, " secs")
    print(size**3 / ((end - start) / repeat) / 1e9, " GFLOPS")
5.425337762571871  secs
16212.986927542423  GFLOPS
now using CUDA Graphs...
5.438446355983615  secs
16173.90785243318  GFLOPS

I will follow up to see if there is an explanation for the lower-than-expected performance.

Ah, I think we might be undercounting the number of operations; see, e.g., linear algebra - Proof of # of FLOPs in Matrix Multiplication - Mathematics Stack Exchange, which would suggest the correct count is closer to 2 * size * size * (size - 1), i.e. roughly 2 * size**3 rather than size**3. Thanks @ptrblck for the pointer.
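For example, correcting the throughput line in the last script would look roughly like this (a sketch reusing size, start, end, and repeat from above):

# each (size x size) @ (size x size) matmul is ~2 * size**3 floating-point operations
flops_per_matmul = 2 * size**3
print(flops_per_matmul / ((end - start) / repeat) / 1e9, " GFLOPS")

which roughly doubles the throughput figures reported above.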

That was a trivial mistake… I wasn’t aware of it at all lol.

Thank you for the patient discussion.