How to rigorously justify computational complexity for sparse/non-standard models? (Manual FLOPs vs. Hardware Energy)
Hi everyone,
I’m currently working on a highly sparse, non-standard neural network architecture. Reports from standard profilers (thop, fvcore, torch.profiler) were all different (and mostly unhelpful), I suspect because my model relies entirely on tensor indexing and element-wise operations (e.g., +, *, tanh, exp) rather than standard dense GEMMs or convolutions.
Since I cannot rely on off-the-shelf tools, I need to manually justify my model’s computational cost for an upcoming paper submission. However, I am facing a dilemma on how to formalize this, and I am considering two approaches:
Approach A: Manual FLOPs Counting
I built a custom tracker that intercepts every element-wise operation and accumulates FLOPs based on tensor.numel().
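A minimal sketch of what I mean (using PyTorch’s semi-private TorchDispatchMode; the op coverage and the per-op weights are just my draft assumptions from the cost list below, not an established standard):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

# Draft per-element FLOP weights (ADD/MUL/CMP = 1, DIV = 4, TANH/EXP = 8).
# Only a few Tensor overloads are listed; in-place and .Scalar variants would
# need to be added for full coverage.
FLOP_WEIGHTS = {
    torch.ops.aten.add.Tensor: 1,
    torch.ops.aten.sub.Tensor: 1,
    torch.ops.aten.mul.Tensor: 1,
    torch.ops.aten.div.Tensor: 4,
    torch.ops.aten.tanh.default: 8,
    torch.ops.aten.exp.default: 8,
}

class FlopTracker(TorchDispatchMode):
    """Intercepts ATen calls and accumulates weighted FLOPs per output element."""

    def __init__(self):
        super().__init__()
        self.total_flops = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        weight = FLOP_WEIGHTS.get(func)
        if weight is not None and isinstance(out, torch.Tensor):
            # Element-wise ops: cost scales with the number of output elements.
            self.total_flops += weight * out.numel()
        return out

# Usage: run one forward pass under the mode, then read off the accumulated total.
# with FlopTracker() as tracker:
#     y = model(x)          # `model` and `x` are placeholders for your own network/input
# print(tracker.total_flops)
```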
- The issue with references: In the existing literature, there seems to be no clear consensus for non-standard models. Some papers simply report “0 MACs” and ignore element-wise ops (which feels like cheating, since they are the actual bottleneck here), while others manually calculate Big-O complexity. Is there a “gold standard” reference or accepted methodology for reporting manual FLOPs in top-tier ML venues?
- Practical weight assignment dilemma: When manually assigning FLOP costs to specific operations, things get highly subjective. For example, in my current draft I assigned costs like:
  - ADD/MUL/CMP = 1 FLOP
  - DIV = 4 FLOPs
  - TANH/EXP = 8 FLOPs (assuming look-up table (LUT) or piecewise-linear (PWL) hardware approximations)
- The transcendental function problem: Is a cost of 8 FLOPs for tanh widely accepted in the ML community? I worry that reviewers might attack this, arguing that a true FP32 tanh or exp requires Taylor series or CORDIC algorithms, costing dozens or hundreds of FLOPs. Should I count the exact mathematical operations of the Taylor series, or is the LUT/PWL assumption (e.g., 1–8 FLOPs) the norm for hardware-aware neural network papers?
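For context on where a small constant per-element cost could come from, here is a toy piecewise-linear tanh (the segment count and input range are arbitrary choices for illustration, not anything from the literature): the online cost per element is roughly one clamp/compare, one index computation, one gather, one multiply, and one add, which is the kind of accounting behind a “~4–8 FLOPs” figure.

```python
import torch

def pwl_tanh(x: torch.Tensor, n_segments: int = 16) -> torch.Tensor:
    # Offline: precompute breakpoints and per-segment slope/intercept.
    # This part is a one-time table build, so it is normally excluded
    # from the per-inference cost.
    xs = torch.linspace(-4.0, 4.0, n_segments + 1)
    ys = torch.tanh(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]

    # Online, per element: clamp (compare), index computation, table gather,
    # then one multiply and one add. A handful of cheap ops, not a Taylor series.
    xc = x.clamp(-4.0, 4.0)
    step = xs[1] - xs[0]
    idx = ((xc - xs[0]) / step).long().clamp(0, n_segments - 1)
    return slopes[idx] * xc + intercepts[idx]

# Example: compare against the exact tanh on random inputs.
# x = torch.randn(8); print(pwl_tanh(x) - torch.tanh(x))
```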
Approach B: Hardware Energy Consumption Estimation (pJ)
Since treating an INT8 addition and an FP32 tanh as roughly similar “FLOPs” is fundamentally misleading in terms of actual hardware cost, I am considering bypassing FLOPs entirely.
- The method: I count the exact occurrences of each operation type and multiply them by their respective energy costs in picojoules (pJ).
- The references: I plan to base this on established hardware measurements, such as the 45nm/7nm chip data from Horowitz (2014) or the EIE paper (Han et al., 2016). Is estimating total energy (pJ) from these classic references still considered a rigorous and widely accepted way to justify efficiency for non-standard architectures today?
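The bookkeeping itself would be trivial; a minimal sketch is below, assuming the per-op counts already exist (e.g., from a tracker like the one above). The pJ entries are placeholders in the spirit of the widely cited 45nm table from Horowitz (2014) and should be replaced with the exact published figures; the tanh entry in particular is my own assumption via a LUT/PWL cost model, not a number from that paper.

```python
# Placeholder energy table (pJ per operation, roughly in the spirit of the
# 45nm Horowitz 2014 figures; verify against the original before citing).
PJ_PER_OP_45NM = {
    "int8_add": 0.03,
    "int8_mul": 0.2,
    "fp32_add": 0.9,
    "fp32_mul": 3.7,
    "fp32_tanh": 10.0,  # assumption: derived from a LUT/PWL cost model, not from Horowitz
}

def estimate_energy_pj(op_counts: dict, table: dict = PJ_PER_OP_45NM) -> float:
    """Total energy in pJ = sum over op types of (count * per-op energy in pJ)."""
    return sum(count * table[op] for op, count in op_counts.items())

# Example: 2M FP32 adds, 1M FP32 muls, 0.5M tanh evaluations.
total_pj = estimate_energy_pj({"fp32_add": 2e6, "fp32_mul": 1e6, "fp32_tanh": 5e5})
print(f"{total_pj / 1e6:.2f} uJ")  # convert pJ to microjoules for readability
```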
My Questions for the Community:
- For Approach A, what is the community consensus on practical FLOPs counting? Are my assumed costs (DIV = 4, TANH = 8 via LUT) acceptable? Are there any good papers I can cite for this specific methodology and these weight assignments?
- Between Approach A (total FLOPs) and Approach B (total energy in pJ), which metric provides a stronger, more bulletproof justification for a paper introducing a highly sparse, element-wise-heavy architecture?
Any insights, recommended references, or experiences with profiling such networks would be greatly appreciated!
Thanks in advance!