On Counting FLOPs/Energy for the Model

How to rigorously justify computational complexity for sparse/non-standard models? (Manual FLOPs vs. Hardware Energy)

Hi everyone,

I’m currently working on a highly sparse, non-standard neural network architecture. Reports from standard profilers (thop, fvcore, torch.profiler) were all different (and mostly unhelpful), I suspect because my model relies entirely on tensor indexing and element-wise operations (e.g., +, *, tanh, exp) rather than standard dense GEMMs or convolutions.

Since I cannot rely on off-the-shelf tools, I need to manually justify my model’s computational cost for an upcoming paper submission. However, I am facing a dilemma about how to formalize this, and I am considering two approaches:

Approach A: Manual FLOPs Counting. I built a custom tracker that intercepts every element-wise operation and accumulates FLOPs based on tensor.numel() (a minimal sketch of the tracker follows this list).

  • The issue with references: In the existing literature, there seems to be no clear consensus for non-standard models. Some papers simply report “0 MACs” and ignore element-wise ops entirely (which feels like cheating, since those ops are the actual bottleneck here), while others only give Big-O complexity. Is there a “gold standard” reference or an accepted methodology for reporting manually counted FLOPs at top-tier ML venues?

  • Practical weight assignment dilemma: When manually assigning FLOP costs to specific operations, things get highly subjective. For example, in my current draft I pragmatically assigned costs as follows:

    • ADD / MUL / CMP = 1 FLOP

    • DIV = 4 FLOPs

    • TANH / EXP = 8 FLOPs (assuming a look-up table (LUT) or piecewise-linear (PWL) hardware approximation).

  • The transcendental function problem: Is a cost of 8 FLOPs for tanh widely accepted in the ML community? I worry that reviewers might push back, arguing that a true FP32 tanh or exp requires a Taylor series or a CORDIC algorithm, costing dozens or hundreds of FLOPs. Should I count the exact mathematical operations of a Taylor-series implementation, or is the LUT/PWL assumption (e.g., 1-8 FLOPs) the norm for hardware-aware neural network papers?
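For reference, here is a minimal sketch of the tracker idea, using PyTorch’s TorchDispatchMode (a semi-private API that may change across versions) to intercept ATen calls. The weight table encodes exactly the assumed costs above, which are my own assumptions rather than any standard:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

# My assumed per-element costs from the draft (not an established standard):
# ADD/MUL/CMP = 1, DIV = 4, TANH/EXP = 8 (LUT/PWL hardware assumption).
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "lt": 1, "div": 4, "tanh": 8, "exp": 8}

class FlopTracker(TorchDispatchMode):
    """Intercepts every ATen op and accumulates weighted per-element FLOPs."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.per_op = {}

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        # func._schema.name is e.g. "aten::add"; keep only the base op name.
        name = func._schema.name.split("::")[-1]
        if name in FLOP_WEIGHTS and isinstance(out, torch.Tensor):
            # One weighted op per output element (broadcast-aware by construction).
            flops = FLOP_WEIGHTS[name] * out.numel()
            self.total += flops
            self.per_op[name] = self.per_op.get(name, 0) + flops
        return out

x = torch.randn(1024, 1024)
with FlopTracker() as tracker:
    y = torch.tanh(x) * x + x.exp()
print(tracker.total, tracker.per_op)
```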

Approach B: Hardware Energy Estimation (pJ). Since treating an INT8 addition and an FP32 tanh as roughly equivalent “FLOPs” is fundamentally misleading in terms of actual hardware cost, I am considering bypassing FLOPs entirely.

  • The method: I count the exact occurrences of each operation type and multiply them by their respective energy costs in picojoules (pJ); see the sketch after this list.

  • The references: I plan to base the per-operation costs on established hardware measurements, such as the 45 nm / 7 nm figures from Horowitz (2014) or the EIE paper (Han et al., 2016). Is estimating total energy (pJ) from these classic references still considered a rigorous and well-accepted way to justify efficiency for non-standard architectures today?
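To make the bookkeeping concrete, here is a minimal sketch using the approximate 45 nm figures commonly cited from Horowitz (2014). The tanh entry is my own placeholder assumption (treated as roughly eight FP32 adds), since no transcendental figure is published there:

```python
# Approximate 45 nm energy costs in pJ per scalar op, as commonly cited from
# Horowitz (ISSCC 2014). The tanh entry is MY OWN placeholder assumption
# (~8 FP32 adds under a LUT/PWL model); Horowitz gives no transcendental figure.
ENERGY_PJ = {
    ("add", "int8"): 0.03,
    ("add", "fp32"): 0.9,
    ("mul", "int8"): 0.2,
    ("mul", "fp32"): 3.7,
    ("tanh", "fp32"): 8 * 0.9,  # placeholder assumption, justify separately
}

def total_energy_pj(op_counts: dict) -> float:
    """op_counts maps (op, dtype) -> number of scalar operations executed."""
    return sum(ENERGY_PJ[key] * n for key, n in op_counts.items())

# Example: 2M FP32 adds, 2M FP32 muls, 1M FP32 tanh evaluations.
counts = {
    ("add", "fp32"): 2_000_000,
    ("mul", "fp32"): 2_000_000,
    ("tanh", "fp32"): 1_000_000,
}
print(f"{total_energy_pj(counts) / 1e6:.2f} uJ")  # pJ -> microjoules
```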

My Questions for the Community:

  1. For Approach A, what is the community consensus on practical FLOPs counting? Are my assumed costs (DIV = 4, TANH = 8 via LUT) acceptable? Are there any good papers I can cite for this specific methodology and weight assignments?

  2. Between Approach A (Total FLOPs) and Approach B (Total Energy in pJ), which metric provides a stronger, more bulletproof justification for a paper introducing a highly sparse, element-wise heavy architecture?

Any insights, recommended references, or experiences with profiling such networks would be greatly appreciated!

Thanks in advance!

  1. The cost of higher-level operations like div and tanh depends heavily on the hardware and the data type. For example, you can look at the generated instructions for NVIDIA Hopper (SM 90) here: Compiler Explorer. DIV can be somewhat inexpensive (1 MUFU instruction, 6 floating-point instructions, 1 branching set), but there is a slow path for specific cases (usually denormals and the like) that is much more expensive than tanh. I think it would make total sense to simply provide something like the link above as a reference (or embed the instructions in an appendix of your paper) to justify how many FLOPs those operations take.
  2. Getting a good estimate of energy consumption per high-level operation is significantly more complicated than just counting the type and number of operations. Hardware vendors aren’t interested in publishing these numbers for competitive reasons, so you won’t get great estimates for modern hardware unless a vendor provides the numbers to you privately.

Thank you so much for your advice! It really helped. I can’t thank you enough!

One more thing that just came to mind, and it’s a shameless plug from me :smiley:
We recently re-did this entire table (Table 5 in the CUDA C++ Best Practices Guide, 13.2 documentation) with throughputs for various operations on currently supported CUDA hardware.
Now, this doesn’t give you very precise values for each operation (e.g., branching won’t appear), but it does give you some more ideas. For example, we list “32-bit approximate floating-point reciprocal, reciprocal square root”, which corresponds to the MUFU instructions I mentioned above. The fully precise versions of these operations require an additional correction step, which is what my link above shows for, e.g., division. But once you have counted the number of instructions in each category, you can weight the counts by throughput, and that should give you a fairly decent idea of overall throughput.
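Very roughly, the weighting could look like this (just a sketch; the throughput numbers below are placeholders, so take the real values for your architecture from Table 5):

```python
# Rough sketch: turn per-category instruction counts into an estimated cycle
# count using per-SM throughputs (results per clock cycle per SM). The numbers
# below are PLACEHOLDERS; look up the real values for your architecture in
# Table 5 of the CUDA C++ Best Practices Guide.
THROUGHPUT_PER_SM = {
    "fp32_add_mul_fma": 128,  # placeholder: full-rate FP32 pipe
    "mufu_approx": 16,        # placeholder: approximate reciprocal/rsqrt/etc.
}

def estimated_cycles(instr_counts: dict, num_sms: int) -> float:
    """Bottleneck estimate: the slowest pipe bounds execution from below.
    Summing the terms instead would give a fully serialized upper bound."""
    return max(n / (THROUGHPUT_PER_SM[cat] * num_sms)
               for cat, n in instr_counts.items())

# Example: 10M FMA-class instructions and 1M MUFU instructions on 100 SMs.
print(estimated_cycles(
    {"fp32_add_mul_fma": 10_000_000, "mufu_approx": 1_000_000}, num_sms=100))
```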
