I noticed that when using torch.profiler with with_flops=True, I get a strange outcome: many operations are counted as 0 FLOPs.
According to the official docs (torch.profiler — PyTorch 2.11 documentation) and the docstring in torch/profiler/profiler.py, with_flops is supposed to:
“use formula to estimate the FLOPS of specific operators (matrix multiplication and 2D convolution).”
This is somewhat disappointing, but also understandable, since in most models the bulk of the computation is concentrated in matrix multiplications or 2D convolutions.
However, my profiling results contradict this statement: basic element-wise ops like aten::add and aten::mul are also being counted (1 FLOP per element).
I partially checked the source code (torch/profiler/profiler.py), and it seems to follow the official documentation (i.e., formulas only for matrix multiplication and 2D convolution).
So I ran a test.
Here is the reproducible script and my environment info:
Environment Info:
Python version: 3.9.21
PyTorch version: 2.5.1
CUDA version: 11.8
GPU: NVIDIA GeForce RTX 3090
Experiment Code:
import torch
from torch.profiler import profile, ProfilerActivity
x = torch.rand(1000, 1000, device='cuda')
y = torch.rand(1000, 1000, device='cuda')
x_small = x * 0.9
# Warm-up
_ = torch.add(x, y); _ = torch.mul(x, y); _ = torch.tanh(x)
_ = torch.min(x, y); _ = torch.sqrt(x); _ = torch.div(x, y)
_ = torch.exp(x); _ = torch.sign(x); _ = torch.atanh(x_small)
_ = torch.sigmoid(x); _ = x > y; _ = torch.mm(x, y)
torch.cuda.synchronize()
# Profiling
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_flops=True,
    record_shapes=True
) as prof:
    torch.add(x, y)
    torch.mul(x, y)
    torch.tanh(x)
    torch.min(x, y)
    torch.sqrt(x)
    torch.div(x, y)
    torch.exp(x)
    torch.sign(x)
    torch.atanh(x_small)
    torch.sigmoid(x)
    _ = x > y
    torch.mm(x, y)
    torch.cuda.synchronize()

target_ops = [
    "aten::add", "aten::mul", "aten::tanh", "aten::minimum",
    "aten::sqrt", "aten::div", "aten::exp", "aten::sign",
    "aten::atanh", "aten::sigmoid", "aten::gt", "aten::mm"
]

print(f"{'ATen Operator':<15} | {'Recorded FLOPs':<20} | {'Execution Time':<20}")
print("-" * 65)
for evt in prof.key_averages():
    if evt.key in target_ops:
        cuda_time = getattr(evt, "cuda_time_total", getattr(evt, "self_cuda_time_total", 0))
        cpu_time = getattr(evt, "cpu_time_total", getattr(evt, "self_cpu_time_total", 0))
        time_str = f"{cuda_time:.2f} us (CUDA)" if cuda_time > 0 else f"{cpu_time:.2f} us (CPU)"
        print(f"{evt.key:<15} | {str(evt.flops):<20} | {time_str:<20}")
Output:
ATen Operator | Recorded FLOPs | Execution Time
-----------------------------------------------------------------
aten::add | 1000000 | 1009.84 us (CPU)
aten::mul | 1000000 | 34.76 us (CPU)
aten::tanh | 0 | 28.17 us (CPU)
aten::minimum | 0 | 28.92 us (CPU)
aten::sqrt | 0 | 26.14 us (CPU)
aten::div | 0 | 24.97 us (CPU)
aten::exp | 0 | 23.17 us (CPU)
aten::sign | 0 | 24.59 us (CPU)
aten::atanh | 0 | 22.94 us (CPU)
aten::sigmoid | 0 | 22.55 us (CPU)
aten::gt | 0 | 28.92 us (CPU)
aten::mm | 2000000000 | 110.78 us (CPU)
-----------------------------------------------------------------
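For what it's worth, the nonzero numbers in the table do match the usual conventions, which is my own back-of-the-envelope assumption rather than anything stated in the docs: 1 FLOP per output element for add/mul, and 2*M*K*N for a matmul.

```python
# Sanity-check the recorded FLOP counts against the usual conventions
# (assumption: 1 FLOP per element for add/mul, 2*M*K*N for mm).
M = K = N = 1000  # shapes used in the script above

elementwise_flops = M * N   # add / mul: one FLOP per output element
mm_flops = 2 * M * K * N    # matmul: one multiply + one add per K step

print(elementwise_flops)  # 1000000, matches aten::add / aten::mul
print(mm_flops)           # 2000000000, matches aten::mm
```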
My Questions:
- Why does the actual behavior contradict the official documentation? The profiler clearly counts basic ops like add and mul, which are neither matrix multiplications nor convolutions.
- Is the profiler fundamentally unreliable? What are the most authoritative and widely used alternative tools in the community (e.g., fvcore, ptflops, or Nsight Compute) for accurately measuring these theoretical and hardware operation costs?
- Could this be related to my version or environment?
Thank you.