Understanding GFLOPS vs number of parameters

I am working on a real-time scenario where inference time is extremely important. I am running the model on an NVIDIA GPU, and the models will be converted to TensorRT.

I am comparing:

Squeezenet 1.1

Mobilenet v3 small

It seems that SqueezeNet's size is approximately 48% of MobileNet's. However, I am not sure how to interpret the GFLOPS. I know that GFLOPS represents the number of floating point operations per second.
Since SqueezeNet has a greater GFLOPS, would this mean that its architecture can be executed faster? In other words, if the models were to have the same number of parameters, would the one with greater GFLOPS be the faster one?

Indeed. The number of parameters relates to the expressiveness of the model. Comparing two networks built on the same underlying technology, the more parameters a network has, the richer the distributions it can learn. While there is a correlation between the number of parameters and GFLOPs, the two are not strictly tied.

A typical example is dense residual connections:

Here, densenet and resnet have the same number of parameters but densenet requires more computations.

Or imagine you run the same ResNet twice, but in one case you add pooling (which would reduce the feature-map size and thus the GFLOPs).
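To make the pooling example concrete, here is a minimal sketch in plain Python that counts parameters and FLOPs for a single convolution layer. The helper names and the layer sizes (64 channels, 3x3 kernel, 56x56 vs 28x28 output maps) are made-up illustrative values, not taken from any specific model, and a multiply-add is counted as two operations.

```python
def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (hypothetical helper)."""
    return k * k * c_in * c_out + c_out


def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs for one forward pass, counting each multiply-add as 2 ops.

    Bias additions are ignored to keep the arithmetic simple.
    """
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs


# Same layer, so the parameter count is identical in both cases.
params = conv2d_params(c_in=64, c_out=64, k=3)

# Without pooling the layer produces a 56x56 map; with a 2x2 pooling
# placed before it, the map is 28x28 (assumed sizes).
flops_full = conv2d_flops(64, 64, 3, h_out=56, w_out=56)
flops_pooled = conv2d_flops(64, 64, 3, h_out=28, w_out=28)

print(f"parameters:            {params}")
print(f"FLOPs without pooling: {flops_full}")
print(f"FLOPs with pooling:    {flops_pooled}")
print(f"ratio:                 {flops_full / flops_pooled:.1f}x")
```

The parameter count does not depend on the spatial size at all, while the FLOPs scale with the number of output positions, which is why halving the resolution cuts the cost of this layer by four.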

In summary, there is a correlation between size and computational cost, but a fine design can effectively reduce the computations.

Thank you for your reply, @JuanFMontesinos .
Your example helped me a lot.
I still have a doubt. When speaking about GFLOPS on the PyTorch website, what exactly are they measuring? Are these the number of floating point operations needed to do inference on a single image?
Or do they count the number of floating point operations per second when running the model on a specific hardware?
Are GFLOPS and GFLOPs the same thing?
Thank you

I don’t look at pytorch’s examples so I cannot respond to that.

FLOPs are floating-point operations, but the term is a bit ambiguous, as described in this passage:

When dealing about computing effort and computing speed (hardware performance), terminology is usually confusing. For instance, the term ‘compute’ is used ambiguously, sometimes applied to the number of operations or the number of operations per second. However, it is important to clarify what kind of operations and the acronyms for them. In this regard, we will use the acronym FLOPS to measure hardware performance, by referring to the number of floating point operations per second, as standardised in the industry, while FLOPs will be applied to the amount of computation for a given task (e.g., a prediction or inference pass), by referring to the number of operations, counting a multiply-add operation pair as two operations. An extended discussion about this can be found in the appendix.

In deep learning these floating point ops are usually just multiply-add ops. As you can read, the terminology requires some clarification, but the general idea is to provide a hardware-agnostic way of showing computational cost.
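To show how the two quantities from the quote relate, here is a small sketch: dividing a model's FLOPs (amount of work) by the hardware's FLOPS (work per second) gives an idealised lower bound on latency. The function name and both numbers below are hypothetical, picked only for illustration and not measured on any real model or GPU.

```python
def ideal_latency_ms(model_gflops, hardware_tflops):
    """Idealised compute-bound latency in milliseconds.

    model_gflops:    work for one forward pass, in 1e9 operations (FLOPs).
    hardware_tflops: peak throughput, in 1e12 operations per second (FLOPS).
    """
    flops = model_gflops * 1e9
    flops_per_second = hardware_tflops * 1e12
    return flops / flops_per_second * 1e3


# Assumed values: a 0.35 GFLOPs model on a 10 TFLOPS accelerator.
latency = ideal_latency_ms(model_gflops=0.35, hardware_tflops=10.0)
print(f"ideal latency: {latency:.3f} ms")
```

Real latency is usually well above this bound, because memory bandwidth, kernel launch overhead and low utilisation on small layers are not captured by an operation count. That is also why a model with fewer FLOPs is not guaranteed to be faster on a given GPU.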

For example, some papers show inference time, but this number obviously depends on the hardware and the batch size (chosen so that the GPU is working at its maximum/optimal workload). Similarly, you can report how many images (in computer vision) you can process per unit of time. Yet again, this depends on image size, batch size, hardware…

So in the end the most agnostic way (although the hardest) of showing this info is with the required number of operations. If my algorithm can sum 2 numbers in 5 steps and yours in 7, mine is better.

In your case, the simplest approach would be to run your own tests. Also note that compiled models can benefit from further improvements (there is torch.compile now). For example, transformers now have flash attention in PyTorch, and so on…

My advice is:
Just take the metric that best fits your problem (inference time, images/second) and read a bit about how to make a fair comparison.
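As a starting point for your own tests, here is a minimal timing sketch in plain Python. The `benchmark` helper, its parameters and the dummy workload are all hypothetical. It only illustrates the usual ingredients of a fair comparison: warm-up runs, many repetitions, and an optional synchronisation hook.

```python
import statistics
import time


def benchmark(fn, warmup=10, iters=100, sync=None):
    """Time fn() and return latency statistics in milliseconds.

    sync is an optional callable that blocks until queued work is finished.
    It matters on GPUs, where calls may return before the computation ends.
    """
    # Warm-up runs are discarded (caches, lazy initialisation, autotuning).
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times_ms.append((time.perf_counter() - start) * 1e3)

    return {
        "median_ms": statistics.median(times_ms),
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
        "runs": len(times_ms),
    }


# Dummy CPU workload standing in for a model forward pass.
stats = benchmark(lambda: sum(range(10_000)), warmup=3, iters=20)
print(stats)
```

With PyTorch on a GPU you would typically pass something like `lambda: model(x)` as `fn` and `torch.cuda.synchronize` as `sync`, since CUDA kernels run asynchronously. Run both networks with the same input size, batch size and precision. Since you will deploy with TensorRT, time the converted engines rather than the eager PyTorch models.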