Hi, I’m working with the following script to benchmark my RTX 3080 GPU. With torch.float16, torch.matmul runs at 29 [TFLOPS], but with torch.float32 it reaches only 9.8 [TFLOPS]. Since the theoretical performance of the RTX 3080 is 29.77 [TFLOPS] and the GPU has no dedicated half-precision units, something about the float32 result seems wrong.
Do you have any information on this?
import torch
import time

torch.backends.cudnn.benchmark = True

size = 8192 * 2
repeat = 100

with torch.no_grad():
    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float32)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(repeat):
        y = torch.matmul(x, w)
        torch.cuda.synchronize()  # synchronizes after every iteration
    elapsed = time.time() - start
    print(elapsed, " secs")
    # size**3 counts one operation per multiply-accumulate
    print(size**3 / (elapsed / repeat) / 1e9, " GFLOPS")
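For the float16 number I use the same script with only the dtype changed:

x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float16)
w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float16)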
I would also recommend moving the second torch.cuda.synchronize() out of the for loop (so you synchronize only once, after all iterations have been queued) and seeing if that helps.
I also think the first iteration is always slow because of one-time costs (memory allocation, kernel loading, or something similar), so it is worth keeping it out of the measurement with a warmup run.
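You can see that first-iteration overhead directly by timing each call on its own, something like this (an illustrative sketch, assuming a CUDA device is available):

import torch
import time

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()

for i in range(5):
    start = time.perf_counter()
    y = torch.matmul(x, w)
    torch.cuda.synchronize()
    print(f"iter {i}: {time.perf_counter() - start:.4f} s")
# The first iteration is typically much slower (kernel loading, workspace
# allocation), so a warmup run before the timed loop keeps it out of the average.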
With these settings, the float32 results did not change: the suggestion to put torch.cuda.synchronize() outside the loop did not help. (In fact, I had already checked that variant before modifying the script, and the numbers were the same.)
I am able to get ~16 TFLOPS with a slightly modified script, but that is still only ~50% of “speed-of-light”:
import torch
import time

torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')  # allow TF32 for float32 matmuls

size = 8192 * 2
repeat = 20

with torch.no_grad():
    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    y = torch.matmul(x, w)  # warmup iteration, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(repeat):
        y = torch.matmul(x, w)
    torch.cuda.synchronize()  # single sync after all iterations
    end = time.perf_counter()
    print(end - start, " secs")
    print(size**3 / ((end - start) / repeat) / 1e9, " GFLOPS")

print("now using CUDA Graphs...")

with torch.no_grad():
    x = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    w = torch.randn(size, size, device=torch.device("cuda"), dtype=torch.float)
    # warmup on a side stream before capture
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for i in range(3):
            y = torch.matmul(x, w)
    torch.cuda.current_stream().wait_stream(s)
    # capture a single matmul into a graph
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = torch.matmul(x, w)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for i in range(repeat):
        g.replay()  # replay the captured matmul, skipping launch overhead
    torch.cuda.synchronize()
    end = time.perf_counter()
    print(end - start, " secs")
    print(size**3 / ((end - start) / repeat) / 1e9, " GFLOPS")
5.425337762571871 secs
16212.986927542423 GFLOPS
now using CUDA Graphs...
5.438446355983615 secs
16173.90785243318 GFLOPS
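Two observations on these numbers: each matmul takes roughly 5.43 s / 20 ≈ 0.27 s, so kernel launch overhead is negligible and CUDA Graphs (which mainly remove launch overhead) barely change the result. The jump from ~10 to ~16 TFLOPS comes from torch.set_float32_matmul_precision('high'), which, as far as I understand, allows TF32 to be used for float32 matmuls on Ampere. You can check what is in effect with:

import torch

print(torch.get_float32_matmul_precision())   # 'highest' (pure fp32), 'high', or 'medium'
print(torch.backends.cuda.matmul.allow_tf32)  # True when TF32 matmuls are allowed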
I will follow up to see if there is an explanation for the lower-than-expected performance.
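In the meantime, torch.utils.benchmark can serve as a cross-check, since it handles warmup and CUDA synchronization itself; a minimal sketch:

import torch
import torch.utils.benchmark as benchmark

size = 8192 * 2
x = torch.randn(size, size, device="cuda")
w = torch.randn(size, size, device="cuda")

t = benchmark.Timer(
    stmt="torch.matmul(x, w)",
    globals={"torch": torch, "x": x, "w": w},
)
m = t.timeit(20)  # returns a Measurement; CUDA is synchronized around the timed runs
print(m.median, " secs per matmul")
print(size**3 / m.median / 1e9, " GFLOPS")  # same size**3 convention as above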